Malware analysis, detection, and classification


Goals

As Symantec's report shows, we are drowning in malware. Because numerous randomized obfuscation tools are freely available, signature-based methods find it increasingly hard to keep up with the flood. In this project, I therefore study behavioral detection methods, especially those based on automata inference. For a more detailed overview of the project, see the paper Malware Analysis with Tree Automata Inference.

Key Ideas

The key idea of the project is that system call data-flow dependency graphs, obtained through taint analysis (see the papers by Newsome and Song and by Clause et al.), can be expanded into trees, which in turn can be recognized by tree automata. Tree automata generalize ordinary finite-state automata over words: they accept trees rather than strings. In this project, I'm working on the automatic inference of tree automata and their application to malware recognition and classification.
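To make the two steps concrete, here is a minimal sketch of unfolding a small acyclic dependency graph into a tree and running a deterministic bottom-up tree automaton over it. It is not the project's inference engine. The graph, the state names (q_open, q_read, q_copy), the transition rules, and the choice to put the data-consuming call at the root with its producers as children are all assumptions made for illustration. The system call names are used only as labels.

```python
from typing import Dict, List, Tuple

# A tree is a (label, children) pair; labels stand in for system call names.
Tree = Tuple[str, tuple]


def expand(graph: Dict[str, List[str]], labels: Dict[str, str], node: str) -> Tree:
    """Unfold the acyclic graph below `node` into a tree.

    `graph` maps a node id to the ids of the nodes its inputs depend on
    (an assumed orientation). A node reachable along several paths is
    copied once per path, which is what turns the graph into a tree.
    """
    children = tuple(expand(graph, labels, dep) for dep in graph.get(node, []))
    return (labels[node], children)


class TreeAutomaton:
    """Deterministic bottom-up tree automaton."""

    def __init__(self, rules, accepting):
        # rules: (label, tuple of child states) -> state
        self.rules = rules
        self.accepting = accepting

    def state_of(self, tree):
        """Return the state reached at the root, or None if no rule applies."""
        label, children = tree
        child_states = []
        for child in children:
            state = self.state_of(child)
            if state is None:
                return None
            child_states.append(state)
        return self.rules.get((label, tuple(child_states)))

    def accepts(self, tree):
        state = self.state_of(tree)
        return state is not None and state in self.accepting


# Hypothetical graph: a write that depends on an open and on a read,
# where the read depends on the same open.
graph = {"w": ["o", "r"], "r": ["o"], "o": []}
labels = {"w": "NtWriteFile", "r": "NtReadFile", "o": "NtOpenFile"}
tree = expand(graph, labels, "w")

# Hypothetical hand-written rules; the project infers such rules automatically.
automaton = TreeAutomaton(
    rules={
        ("NtOpenFile", ()): "q_open",
        ("NtReadFile", ("q_open",)): "q_read",
        ("NtWriteFile", ("q_open", "q_read")): "q_copy",
    },
    accepting={"q_copy"},
)

print(tree)
print(automaton.accepts(tree))
```

The shared NtOpenFile node appears twice in the resulting tree, once under each parent. This duplication is the price of the expansion: a graph with much sharing can unfold into a considerably larger tree.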

Dependency Graphs

Here you can download system call dependency graphs generated for 2631 malware samples and 35 commonly used benign applications. The graphs represent data-flow dependencies among executed system calls. The traces were produced by a tool developed by Daniel Reynaud and the libwst library developed by Lorenzo Martignoni and Roberto Paleari.

Benchmarks                     Release   Date          Archive     Size
Benign apps, 120 sec timeout   1.0       Jan 1, 2010   [tar.bz2]   255 KB
Benign apps, 800 sec timeout   1.0       Jan 1, 2010   [tar.bz2]   758 KB
Malware, 120 sec timeout       1.0       Jan 1, 2010   [tar.bz2]   20 MB

Here's a brief description of the file format:

  • Lines beginning with '#' are comments
  • The N line specifies the total number of distinct nodes in the graph
  • Each V line specifies a node: its unique identifier, its name, and the number of its input and output parameters
  • Each E line specifies an edge in the form SourceNodeId:OutputParameterNumber,DestinationNodeId:InputParameterNumber
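A reader for this format might look roughly like the sketch below. The description above does not pin down the exact syntax, so the sketch assumes that each line starts with its letter (N, V, or E) followed by whitespace-separated fields, and that a V line lists the input count before the output count. The sample text is invented for illustration and is not taken from the released archives. Check these assumptions against the downloaded files before relying on them.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    name: str
    n_inputs: int
    n_outputs: int


@dataclass
class DependencyGraph:
    declared_nodes: int = 0
    nodes: Dict[str, Node] = field(default_factory=dict)
    # (source id, output parameter number, destination id, input parameter number)
    edges: List[Tuple[str, int, str, int]] = field(default_factory=list)


def parse_graph(text: str) -> DependencyGraph:
    """Parse a dependency graph, assuming 'N ...', 'V ...' and 'E ...' lines."""
    graph = DependencyGraph()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        kind = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
        if kind == "N":
            graph.declared_nodes = int(rest)
        elif kind == "V":
            # Assumed field order: id, name, inputs, outputs.
            node_id, name, n_in, n_out = rest.split()
            graph.nodes[node_id] = Node(name, int(n_in), int(n_out))
        elif kind == "E":
            # SourceNodeId:OutputParameterNumber,DestinationNodeId:InputParameterNumber
            src, dst = rest.split(",")
            src_id, out_no = src.strip().split(":")
            dst_id, in_no = dst.strip().split(":")
            graph.edges.append((src_id, int(out_no), dst_id, int(in_no)))
        else:
            raise ValueError(f"unrecognized line: {raw!r}")
    return graph


# Hypothetical sample; the node names and parameter counts are made up.
sample = """\
# two calls, one data-flow edge
N 2
V 0 NtOpenFile 1 1
V 1 NtReadFile 1 1
E 0:0,1:0
"""

parsed = parse_graph(sample)
print(parsed.declared_nodes, len(parsed.nodes), parsed.edges)
```

Node identifiers are kept as strings so the sketch does not depend on whether the real files use numeric or symbolic identifiers.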

Here you can find an example of a very small dependency graph obtained from executing a sample from the Hupigon malware family.

Implementation of the Tree-Automaton Inference

The source code (in C++) of the inference engine for analyzing dependency graphs and for tree-automata inference is available here: [tar.bz2]

Page last modified on April 20, 2011, at 10:17 PM