Building Semantic Lexicons for Source Code and Binary Analysis
From The Math Club
Princeton WordNet is an awesome resource for semantic relations between words. It's a human-built network of word relations. The Visual Thesaurus is a visually browsable interface to WordNet. After spending a few minutes navigating this network, you can begin to get an idea of how these lexicons can be used to analyze lexical cohesion within documents. A number of papers describe ways to automatically identify subject transitions and other semantic information in documents simply by looking at the proximity of similar words within them.
The basic idea I have here is that lexicons similar to WordNet can be built for analyzing program source. Two immediate sources of symbol connectivity are C include headers and function manpages. The premise with include files is that symbols co-occurring within an include header are functionally related; for example, all symbols within the socket.h header should be related to socket operations in some way. Manpages have a straightforward SEE ALSO section that explicitly lists neighboring terms. These two sources can easily be converted into a lexicon containing semantic relations between symbols.
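As a rough illustration of the conversion described above, here is a minimal sketch in Python. It assumes the header and manpage contents are already available as plain text, and it uses crude regular expressions rather than a real C parser or roff renderer. The function names (header_symbols, see_also_symbols, build_lexicon) and the relation labels are hypothetical, chosen only for this example.

```python
import re
from collections import defaultdict
from itertools import combinations

# Crude heuristics (assumptions for this sketch), not a real C parser.
DEFINE_RE = re.compile(r"^\s*#\s*define\s+([A-Za-z_]\w*)", re.MULTILINE)
PROTO_RE = re.compile(r"\b([A-Za-z_]\w*)\s*\([^;{}]*\)\s*;")
# Matches manpage cross references such as "bind(2)" or "socket(7)".
XREF_RE = re.compile(r"\b([A-Za-z_][\w.-]*)\s*\((\d[a-z]*)\)")


def header_symbols(header_text):
    """Collect macro names and prototype names found in one header."""
    return set(DEFINE_RE.findall(header_text)) | set(PROTO_RE.findall(header_text))


def see_also_symbols(man_text):
    """Collect names listed in the SEE ALSO section of a rendered manpage."""
    names, in_section = set(), False
    for line in man_text.splitlines():
        if line.strip() == "SEE ALSO":
            in_section = True
        elif in_section and line and not line[0].isspace():
            break  # the next unindented heading ends the section
        elif in_section:
            names.update(name for name, _section in XREF_RE.findall(line))
    return names


def build_lexicon(headers, manpages):
    """Map symbol -> neighbor -> set of relation labels.

    headers:  {header name: header text}
    manpages: {page name: rendered manpage text}
    """
    lexicon = defaultdict(lambda: defaultdict(set))
    for header_name, text in headers.items():
        # Every pair of symbols sharing a header is treated as related.
        for a, b in combinations(sorted(header_symbols(text)), 2):
            lexicon[a][b].add("header:" + header_name)
            lexicon[b][a].add("header:" + header_name)
    for page, text in manpages.items():
        for other in see_also_symbols(text) - {page}:
            lexicon[page][other].add("see-also")
            lexicon[other][page].add("see-also")
    return lexicon
```

Calling build_lexicon with a dictionary of header texts and a dictionary of manpage texts returns a nested mapping in which each edge records which source suggested the relation, so header co-occurrence and SEE ALSO links can be weighted differently later.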
No such analogue exists for program code itself, as there is no already-quantified lexicon source to draw on. I will present a method to extract statistical collocation information from binary program code using sequence motifs, and a method to tokenize and segment program code by common heuristic boundaries.
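To give a feel for what collocation statistics over code might look like, here is a generic sketch. It is not the method promised above. It assumes the binary has already been turned into a flat list of tokens (for example instruction mnemonics), treats a "ret" token as a segment boundary, and scores adjacent token pairs with pointwise mutual information (PMI). The boundary token, the min_count threshold, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter


def segment_on(tokens, boundary="ret"):
    """Split a token stream into segments that end at a boundary token."""
    segments, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok == boundary:
            segments.append(current)
            current = []
    if current:
        segments.append(current)  # trailing tokens with no boundary
    return segments


def bigram_pmi(segments, min_count=2):
    """Score adjacent token pairs by PMI, never pairing across segments."""
    unigrams, bigrams = Counter(), Counter()
    for seg in segments:
        unigrams.update(seg)
        bigrams.update(zip(seg, seg[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores
```

Pairs with a high positive score occur together more often than their individual frequencies would predict, which is the usual starting point for treating them as a collocation.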

