Try Similarity by Compression yourself

The zip.o.tron2.0beta is hungry for text

This version of the Zipotron is Ajaxified, so it needs Javascript enabled on your browser. If you'd rather not, try the Zipotron Web 1.0 edition.

The Zipotron uses compression (currently gzip) to compare two strings of text. The text can be anything, for instance canonicalised SMILES. You can find more information on our use of compression here.

Just enter some text into each form, hit submit, and the Zip-o-tron will measure the similarity between them.



String 1:
String 2:
Result: the similarity score will appear here!

It normally takes a few seconds for the zip-o-tron to come up with an answer, depending on how long the strings are. The similarity runs between 0.0 (completely dissimilar) to 1.0 (identical - at least to the palate of the zipotron). Happy zipping!

Example: try comparing the SMILES for vioxx: C1(=C(c2ccc(S(C)(=O)=O)cc2)COC1=O)c1ccccc1 with
viagra: c1(c(ccc(c1)S(N1CCN(C)CC1)(=O)=O)OCC)c1nc(c2c(c(CCC)nn2C)[nH]1)=O (they're not very similar).

How does this work?

The Kolmogorov-Chaitin algorithmic complexity of a sequence of symbols can be defined as the size of the program (that you would feed to a universal Turing machine) required to print the sequence and then terminate. We can extend this concept to similarity by concatenating one sequence to the end of another, then finding the shortest program that, when fed to the Turing machine, reproduces the concatenated sequence. We could then compare the length of this new program with the original program. If the program isn't much longer than the one that generated the original sequence, then we can say that there wasn't actually much extra information in the second sequence - the two sequences are similar. Conversely, if the program is long, then there's obviously not a lot of shared information in the sequences, so they're not very similar.

Before you get too excited, here's the bad news: all this talk of universal Turing machines may lead you to suspect that calculating the length of the program is not going to be very easy. And you'd be right. It's non-computable. But here's some good news. It turns out that compressing sequences is an approximation, and you can use standard compressors like gzip and bzip2 to do it.

That's the basic idea, anyway. If you want more details, you should check out the various papers by Ming Li and Paul Vitanyi. Emanuele Caglioti and Andrea Baronchelli have also published interesting work in this area. You could also read our paper on compressing for chemical similarity searching, which has a pretty good collection of references to other work (if we say so ourselves), although you'll need to be a subscriber to the Journal of Chemical Information and Modeling to read it (sorry).

load indicator courtesy of ajaxload.info