Python and Machine Learning: How to use algorithms to create YARA rules with a malware zoo for hunting

Machine learning can help analysts and reverse engineers. This presentation explains how to transform data so that machine-learning algorithms can categorize a malware zoo. To cluster a set of (numerical) objects is to group them into meaningful categories: objects in the same group should be closer (or more similar) to each other than to objects in other groups. Such groups of similar objects are called clusters. When the data is labeled, the task is a supervised one, usually called classification. It is a difficult problem, but easier than the unsupervised clustering problem we face when the data is not labeled. All our experiments were done with code written in Python, mainly using scikit-learn. With the Zoo dataset, we show how to run unsupervised algorithms on labeled data so that the labels can be used to validate the model. Once the model is finalized, the resulting clusters can be used to automatically generate YARA rules for hunting down the malware.
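The workflow described above can be sketched in a few lines of scikit-learn. This is only an illustration under stated assumptions, not the code used in the talk: the abstract does not name an algorithm, so KMeans and the adjusted Rand index are stand-ins, and the feature columns, family names, extracted strings, and the helper `yara_rule_for_cluster` are all invented for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features per sample (invented values):
# file size in KB, number of sections, byte entropy.
features = np.array([
    [120.0, 4.0, 6.1],
    [118.0, 4.0, 6.0],
    [125.0, 4.0, 6.2],
    [900.0, 9.0, 7.8],
    [910.0, 9.0, 7.9],
    [905.0, 8.0, 7.7],
])
# Known family labels, used only to validate the clustering afterwards.
families = ["famA", "famA", "famA", "famB", "famB", "famB"]
# Hypothetical strings extracted from each sample.
sample_strings = [
    {"cmd.exe", "mutex_alpha", "/gate.php"},
    {"cmd.exe", "mutex_alpha", "/gate.php", "debug.log"},
    {"cmd.exe", "mutex_alpha", "/gate.php"},
    {"cmd.exe", "mutex_beta", "/panel/login"},
    {"cmd.exe", "mutex_beta", "/panel/login"},
    {"cmd.exe", "mutex_beta", "/panel/login", "tmp.dat"},
]

# Put all features on the same scale, then cluster without using the labels.
scaled = StandardScaler().fit_transform(features)
cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)

# Validate: compare the unsupervised clusters with the known families.
# A score of 1.0 means the clusters match the labels exactly.
ari = adjusted_rand_score(families, cluster_ids)


def yara_rule_for_cluster(cluster_id, cluster_ids, sample_strings):
    """Build a YARA rule from strings shared by every sample in one cluster
    and absent from every sample outside it. Returns None if none exist."""
    inside = [s for s, c in zip(sample_strings, cluster_ids) if c == cluster_id]
    outside = [s for s, c in zip(sample_strings, cluster_ids) if c != cluster_id]
    shared = set.intersection(*inside)
    seen_elsewhere = set().union(*outside)
    unique = sorted(shared - seen_elsewhere)
    if not unique:
        return None
    lines = [f"rule cluster_{cluster_id}", "{", "    strings:"]
    for i, text in enumerate(unique):
        # Escape backslashes and double quotes for a YARA text string.
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'        $s{i} = "{escaped}"')
    lines += ["    condition:", "        all of them", "}"]
    return "\n".join(lines)


rules = [
    yara_rule_for_cluster(int(c), cluster_ids, sample_strings)
    for c in np.unique(cluster_ids)
]

print(f"adjusted Rand index: {ari:.2f}")
for rule in rules:
    print(rule)
```

Strings seen in both clusters, such as `cmd.exe` here, are deliberately left out of the rules because they would not discriminate between families. On a real zoo the features, the number of clusters, and the string selection would all need far more care than this toy data suggests.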

Sébastien Larinier (sebdraven), Senior Researcher

Sébastien Larinier is a freelance Senior Researcher and Incident Handler who created the CERT Sekoia, located in Paris. He is a member of the French chapter of the Honeynet Project and a co-organizer of Botconf. Sébastien focuses his work on botnet hunting, malware analysis, network forensics, early compromise detection, forensics, and incident response. As a Python addict, he supports various open-source projects such as FastIR, veri-sig, Oletools, pymisp, and malcom…