Identifying Hierarchical Structure in Unordered Sets

From The Math Club

Jump to: navigation, search

This idea occurred to me as the natural progression of a work by Craig Nevill-Manning entitled Identifying Hierarchical Structure in Sequences: A linear-time algorithm and an associated tool, SEQUITUR. The basic premise of SEQUITUR is that grammar structure can be inferred by recursively replacing common two-byte sequences in a string with a new symbol and entering this replacing into a dictionary, and then any dictionary rules that end up being used once are expanded. This is essentially the same manner in which dictionary-based compression works.

The reason the SEQUITUR algorithm was interesting to me at the time was that I was working on finding better ways of string generation and document fingerprinting than using stochastic processes known as markov chains and n-grams as their ability to capture obvious structure outside of a given window.

The algorithm I came up with is a straight-forward extension on the SEQUITUR algorithm. The algorithm in plain language is as follows:

Given a collection of sets of distinguishable elements, iteratively replace the most common pair of items occuring across all of the sets with a new token set containing those two items, and stop when no more pairs of items occur in more than a single set. Finally, any token set contained within any set, including token sets, found only as a member of a single "token" set should be replaced within its containing set by its members.

And in pseudo-code:


structure_sets (W):

   let W be a collection of sets of distinguishable elements
   let H be an associative array indexable by sets
   S can be assigned sets
   a,b can be assigned sets and non-set elements
   K can be assigned sets, which are essentially keys of H

   R = {}

   beginning:

   delete all keys in H

   foreach S in W
      foreach a in S
         foreach b in S-{a}
            foreach T in W-{S}
            if {a,b} is a subset of T then
               if {a,b} is not a key in H
                  add {a,b} to the keys in H
                  H[{a,b}] = 0
               H[{a,b}]++

   max = 0
   M = {}

   foreach K in the keys of H
      if H[K] > max
         max = H[K]
         M = K

   if max < 2
      goto ending

   R = R union {M}

   foreach S in W
      if M is a subset of S
         S = S-M
         S = S union {M}

   foreach Q in R
      i = 0
      foreach S in (W union R-{Q})
         if Q is an element of S
            i++
      if i < 2
      foreach S in (W union R-{Q})
         if Q is an element of S
            S = S-{Q}
            S = S union Q             

   goto beginning

ending:

return R, W

In PERL, some code that is basically the above pseudo-code minus moving the unrolling part to the end can be found on the Small Code page under "I, Sushi" or directly here.

Here's an example of the results of this algorithm on randomly generated sets, although random sets are not the most interesting input.


Here's the unstructured sets

   S01 { A, G, H, L, S, V, Y }
   S02 { G, I, N, O, P, R, S, W, Y, Z }
   S03 { A, B, C, D, H, I, J, O }
   S04 { A, B, D, E, G, H, O, P, S, V, Y }
   S05 { B, C, E, I, L, N, O, P, T, V, W, X, Y, Z }
   S06 { C, I, L, N, O, S, T, U, V, W, X, Y, Z }
   S07 { A, B, D, G, H, L, R, T, V, W, X }
   S08 { D, F, G, I, K, L, U, W, X, Y }
   S09 { A, B, C, E, F, H, I, M, N, S, W, X, Z }
   S10 { B, C, D, I, J, K, O, P, U, V, W, Z }
   S11 { C, E, H, I, J, L, N, O, R, S, W, X, Z }
   S12 { C, D, F, J, K, N, O, R, T, V, W, Z }

Here's the resulting structuring after applying my algorithm

   S01 { L, T108, T447, V }
   S02 { N, P, T13, T447, T85, Z }
   S03 { C, I, J, O, T425 }
   S04 { E, T373, T425, T447 }
   S05 { T398, T409, T468, Y }
   S06 { T11, T170, T346, T398, U }
   S07 { G, T398, T425, T62, V }
   S08 { G, T526, T85, T99, U, Y }
   S09 { F, M, T323, T500, X }
   S10 { J, T311, T468, T58, U }
   S11 { H, J, T13, T500, T99 }
   S12 { J, N, T, T11, T49, T526, T62 }

With the following recursive expansion rules

   T108 { A, H }
   T11 { O, V }
   T13 { O, R }
   T170 { S, Y }
   T311 { T49, T85 }
   T323 { B, T108 }
   T346 { N, T311 }
   T373 { P, T11 }
   T398 { T, T99 }
   T409 { E, T346 }
   T425 { D, T323 }
   T447 { G, T170 }
   T468 { B, T373 }
   T49 { C, Z }
   T500 { S, T409 }
   T526 { F, T58 }
   T58 { D, K }
   T62 { R, W }
   T85 { I, W }
   T99 { L, X }

The major thing I currently need to do is optimize performance of this algorithm, because as-is it's pretty computationally expensive. The main application I've been testing the algorithm for are for building composite structured elements of sets used in Bayesian classification engines and applying that toward detecting gestalt characteristics of hosts during the post-scan phase of vulnerability assessment. Another application along the same lines is to help classify characteristics about bug tickets during remediation phase of vulnerability management to assist with prioritization. Basically, its most valuable application is helping improve expert classifiers by factoring in structural characteristics of what would normally be considered unordered "flat" data.

If I get to it before the conference presentations this year, I will be posting some graphical diagrams of structural characteristics of computer profiles on a network scanned with nmap and nessus and the quantization of what would normally be considered gestalt characteristics under practical analysis about the computers.

Personal tools