Spam Clustering
From The Math Club
This is a project that I worked on prior to my employment at Cloudmark. I took an interest in the idea that spam could be clustered and identified by its inherent grammar structure and n-gram frequency characteristics. I wrote a bunch of pretty crappy scripts and code to perform different types of analysis on the stuff, and I have a bunch of notes on it too. As far as anything visually presentable goes, all I have is this.
The graph might look interesting, but it really doesn't give you much information about what the hell is going on. I will try to explain.
n-gram fingerprinting and "sprint"
My research into fingerprinting and identifying human-readable information (cognitive data) started in 2001, when I was faced with the problem of identifying IRC users who used many different nicknames and hosts. My assumption was that an individual's grammar and mannerisms were unique enough to identify them. It was just a matter of devising a useful metric to do this.
This is when my friend Seth (minus) introduced me to n-grams. Apparently the NSA took an interest in them, so they must be useful! Anyhow, they are closely related to sliding windows: an n-gram is just a run of n consecutive symbols, and a digram is the n = 2 case. Digram analysis has been used for decades in basic cryptanalysis, and there are many parallels with Markov processes. Computationally, digrams are very easy to deal with.
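To make the sliding-window idea concrete, here is a minimal Python sketch of n-gram extraction and relative frequency counting. It is an illustration only, not the original scripts; the function names `ngrams` and `ngram_frequencies` are made up for this example, and it works on characters rather than words or bytes.

```python
from collections import Counter


def ngrams(text, n=2):
    """Slide a window of width n across text and return every n-gram.

    With n=2 this yields digrams. (Hypothetical helper, not from sprint.)
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_frequencies(text, n=2):
    """Return a dict mapping each n-gram to its relative frequency."""
    counts = Counter(ngrams(text, n))
    total = sum(counts.values())
    # An input shorter than n produces no n-grams, so the dict is empty.
    return {gram: count / total for gram, count in counts.items()}


if __name__ == "__main__":
    print(ngrams("spam"))
    print(ngram_frequencies("aaab"))
```

A frequency table like this is one possible fingerprint for a body of text: two texts from the same author or spam campaign would be expected to produce similar tables.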
One of my first cracks at using these to fingerprint information was sprint, a tool for fingerprinting arbitrary streams of data using digram dictionaries that would essentially pretend to compress the streams. The assumption was that the better a stream compressed with a given dictionary, the more that stream resembled the source used to build the dictionary.
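The "pretend to compress" idea might look roughly like the Python sketch below. This is a guess at the general shape, not sprint's actual algorithm. It assumes the dictionary is simply the most frequent digrams in the source, that matching is greedy, and that each dictionary hit or literal byte costs one output symbol. The names `build_dictionary` and `pretend_compress` and the default dictionary size of 128 are hypothetical.

```python
from collections import Counter


def build_dictionary(source: bytes, size: int = 128) -> set:
    """Collect the `size` most frequent digrams in source.

    Assumption: a plain top-k frequency cut. The real tool may have
    built its dictionaries differently.
    """
    counts = Counter(source[i:i + 2] for i in range(len(source) - 1))
    return {gram for gram, _ in counts.most_common(size)}


def pretend_compress(stream: bytes, dictionary: set) -> float:
    """Return output symbols / input bytes without emitting any output.

    Each digram found in the dictionary costs one symbol and consumes two
    bytes; any other byte costs one symbol as a literal. A lower ratio
    means the stream looks more like the dictionary's source.
    """
    if not stream:
        return 1.0
    i = 0
    symbols = 0
    while i < len(stream):
        # Greedy match: take a dictionary digram whenever one starts here.
        i += 2 if stream[i:i + 2] in dictionary else 1
        symbols += 1
    return symbols / len(stream)


if __name__ == "__main__":
    dictionary = build_dictionary(b"buy cheap pills buy cheap pills now")
    print(pretend_compress(b"cheap pills", dictionary))
    print(pretend_compress(b"QWXZJKVQ", dictionary))
```

Under these assumptions the ratio ranges from 0.5 (every byte pair is in the dictionary) to 1.0 (nothing matches), so scoring one stream against several dictionaries and picking the lowest ratio gives a crude nearest-source classifier.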
...MORE TO COME...


