Talha Obaid, A Machine Learning approach for detecting a Malware

Talha Obaid is an AntiSpam engineer for Email Security.cloud at Symantec, which he joined from the MIT CENSAM research center. In his current role, he uses Data Science to fight spam and malware. He loves democratizing Machine Learning, and has recently given talks at the Google ML Experts Day (2017) and the Google GDG DevFest (2016). Within Symantec, Talha has conducted several sessions about Machine Learning. He has been recognized a number of times at Symantec: he has been named a Symantec Inventor twice, won Symantec STAR Innovation Day, and received numerous Symantec Applause Awards. Prior to Symantec, Talha worked at the MIT Center for Environmental Sensing and Modeling (CENSAM) research center. While working on Hydraulic Modelling and Simulation, he helped found a spinoff, Visenti, which was acquired by Xylem. Previously, Talha also held a technical leadership position at Mentor Graphics. Throughout his career, his contributions have resulted in four spinoffs, five patents, a trade secret, and a few publications. Talha holds a bachelor’s degree in Computer Science with Honors and a Master’s degree in Information Systems from the National University of Singapore, where he specialized in Business Analytics and Cluster computing in his dissertation. Outside working hours, Talha actively contributes to the Data Science community as a lead co-organizer for PyDataSG, a 3,000-strong group that holds regular monthly meet-ups. He also conducts TeachEdison workshops and is a certified First-Aider. You can follow him or send him a tweet at @obaidtal

A Machine Learning approach for detecting a Malware:
The project aims to improve the way we detect script-based malware using Machine Learning. Script-based malware has become one of the most active channels for delivering threats such as Banking Trojans and Ransomware. The talk is aimed at finding a new and effective way to detect this malware. We started by acquiring both malicious and clean samples. We then performed feature identification, building on top of the existing knowledge base of malware, followed by automated feature extraction. Once a feature set was obtained, we teased out the features that are categorical, interdependent, or composite. We applied a variety of machine learning models, producing both binary and categorical outcomes. We cross-validated our results and re-tuned our feature set and our model until we obtained satisfactory results with the fewest false positives. We concluded that not all of the extracted features are significant; in fact, some features are detrimental to the model's performance. Factoring out such features not only results in a better match, but also provides a significant gain in performance.
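The workflow described in the abstract (extract features, cross-validate, drop features that hurt) can be illustrated with a small Python sketch. This is not the method presented in the talk: the feature names, the `SUSPICIOUS_TOKENS` list, the nearest-centroid classifier, and the greedy elimination loop are all assumptions chosen to keep the example dependency-free.

```python
import math
from collections import Counter

# Hypothetical indicators, chosen only for illustration.
SUSPICIOUS_TOKENS = ("eval(", "unescape(", "ActiveXObject", "WScript.Shell")

def shannon_entropy(text):
    """Character-level Shannon entropy in bits."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())

def extract_features(script):
    """Map raw script text to a dict of numeric features (assumed feature set)."""
    lines = script.splitlines() or [""]
    return {
        "length": len(script),
        "entropy": shannon_entropy(script),
        "longest_line": max(len(line) for line in lines),
        "suspicious_tokens": sum(script.count(tok) for tok in SUSPICIOUS_TOKENS),
    }

def loo_accuracy(rows, labels, feature_names):
    """Leave-one-out accuracy of a nearest-centroid classifier.

    Assumes every class has at least two samples, so a centroid can still
    be computed when one sample is held out.
    """
    correct = 0
    for i, row in enumerate(rows):
        centroids = {}
        for label in set(labels):
            members = [r for j, r in enumerate(rows) if j != i and labels[j] == label]
            centroids[label] = {
                f: sum(r[f] for r in members) / len(members) for f in feature_names
            }

        def squared_distance(label):
            return sum((row[f] - centroids[label][f]) ** 2 for f in feature_names)

        # sorted() makes tie-breaking between classes deterministic.
        predicted = min(sorted(centroids), key=squared_distance)
        correct += predicted == labels[i]
    return correct / len(rows)

def prune_features(rows, labels, feature_names):
    """Greedy backward elimination: drop a feature whenever doing so raises accuracy."""
    kept = list(feature_names)
    best = loo_accuracy(rows, labels, kept)
    improved = True
    while improved and len(kept) > 1:
        improved = False
        for name in list(kept):
            trial = [k for k in kept if k != name]
            score = loo_accuracy(rows, labels, trial)
            if score > best:
                kept, best, improved = trial, score, True
                break
    return kept, best
```

In practice, `rows` would be built with `[extract_features(s) for s in scripts]` and `labels` would mark each sample as clean or malicious. A library implementation of cross-validation and feature selection would normally replace the hand-written loops; they are spelled out here only to make each step visible.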
