A Guide to Zippity

Last updated: September 1st 2006

Zippity is a set of Ruby scripts that wrap the bzip2 and gzip commands to calculate the Normalized Compression Distance (NCD), originally developed by Li, Vitanyi and co-workers. It’s designed to aid similarity searching in cheminformatics.

For more on the NCD, a good place to start is: "Clustering by Compression", Cilibrasi R and Vitanyi PMB. IEEE Trans. Inform. Theory. 2005, 51, 1523-1545.

Our experiments with the NCD in similarity searching are described in "Similarity by Compression", Melville JL, Riley JF and Hirst JD. J. Chem. Inf. Model. 2007, 47, 25-33.
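For reference, the NCD itself has a simple form. Here’s a minimal Ruby sketch using the built-in Zlib library (which implements the same DEFLATE algorithm gzip uses); Zippity itself shells out to the gzip and bzip2 binaries, so treat this as an illustration of the formula rather than Zippity’s code:

```ruby
require 'zlib'

# Compressed size of a string, using zlib's DEFLATE (the algorithm gzip uses).
def compressed_size(s)
  Zlib::Deflate.deflate(s).bytesize
end

# Normalized Compression Distance:
#   NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
# where C(s) is the compressed size of s. Similar strings share information,
# so compressing them together gains more, giving a smaller distance.
def ncd(x, y)
  cx = compressed_size(x)
  cy = compressed_size(y)
  (compressed_size(x + y) - [cx, cy].min) / [cx, cy].max.to_f
end
```

Two copies of the same string should give a distance near zero, while unrelated strings give a distance approaching one.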

Necessary software


Most of the work with Zippity was done using Ruby 1.8.4, so I recommend you use that.

Mac users - Ruby is installed by default on Mac OS X 10.4, but it’s not very well configured (it hasn’t been built with readline support, for example). You would be well advised to reinstall it using MacPorts.

Linux users - Ruby should be available for most distros, if it isn’t installed by default. However, sometimes the version isn’t very new, so you might want to consider building a new version from source, which is pretty easy. If you’re brave enough to use Linux, I assume compiling from source doesn’t frighten you.

Windows users - This should work with Cygwin.

Ruby Gems

This is the Ruby package management system.


Once again, it should be available through MacPorts or your Linux package manager.

Compression software

You need gzip and bzip2 installed on your computer. To test this, try typing their name at a prompt. For gzip, you should see something like:

			$ gzip
			gzip: compressed data not written to a terminal. Use -f to force compression.
			For help, type: gzip -h

For bzip2, you should see:

			$ bzip2
			bzip2: I won't write compressed data to a terminal.
			bzip2: For help, type: `bzip2 --help'.

You definitely need to be able to access gzip and bzip2 from the command line, as Zippity does its similarity searching by writing files to your temporary directory and then zipping them up via a call to the command line. You also need to be able to read from and write to the temporary directory on your computer.
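That mechanism looks roughly like this (a hypothetical sketch of the approach, not Zippity’s actual code; the helper name gzip_size is made up):

```ruby
require 'tempfile'

# Measure a string's compressed size by writing it to a temporary file
# and shelling out to gzip, sending the compressed stream to stdout.
def gzip_size(s)
  Tempfile.create("zippity") do |f|
    f.write(s)
    f.close
    `gzip -c #{f.path}`.bytesize  # -c: compress to standard output
  end
end
```

If gzip isn’t on your PATH, or the temporary directory isn’t writable, this kind of call fails, which is why both prerequisites matter.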

If you can’t use bzip2 and gzip, you can still use Zippity’s scripts with the euclidean distance or tanimoto coefficient options, but you need to have generated descriptors for that already; Zippity doesn’t generate fingerprints.

Canonicalising strings

Also, you need a source of canonicalised strings to represent your molecules. We used SMILES strings in our study, but you don’t have to. It’s probably a good idea to use canonicalised SMILES strings, as you may get very inconsistent results otherwise (e.g. the similarity between two different SMILES strings representing the same molecule will not be 1.0).

We’re not aware of many freely available tools that canonicalise SMILES strings. We’re currently working on a tool to generate canonicalised strings, but it’s not quite ready yet. Watch this space! Alternatively, PubChem is a source of canonicalised SMILES, and, in a continued Ruby vein, the ChemRuby project can canonicalise SMILES, but I don’t know much about that project.

The Zippity Gem

Once you have Ruby and Ruby Gems installed, you can install the gem with

gem install zippity

You may need superuser permissions to do this.

If you’ve installed the zippity gem correctly, it lives in your Ruby installation’s gem directory.

Ruby itself normally lives somewhere like /usr/local/lib/ruby (or /opt/local/lib/ruby if you installed it on Mac OS X using MacPorts), and zippity should then be found in the gems/1.8/gems/zippity-0.1 directory (unless, that is, you installed a different version number). Scripts live in the bin/ subdirectory; you will probably need to change their executable permissions the first time you install them, e.g. using Bash:

chmod +x /usr/local/lib/ruby/gems/1.8/gems/zippity-0.1/bin/*.rb

Again, you may need to be a superuser to do that.

Similarity searching with Zippity


First, let’s pretend you want to genuinely carry out some similarity searching, i.e. you have a small set of known actives, and then a larger set of molecules about which the activity is unknown. Of course, a lot of the time, you’re interested in carrying out a simulation of this situation, with a set of known actives and known inactives, but that complicates the terminology slightly, so we’ll get to that later.

The form of all data, whether active or inactive, should be:

			<string>    <name>

one per line, where <string> is the string you’re representing the molecule with, and <name> is an identifier (which must be unique). The SMILES format is a perfect example, e.g.:

			C           methane
			CC          ethane
			c1ccccc1    benzene

The plan then is to compare each molecule with unknown activity to each of the known actives. The ones that are similar to the known actives should also have a higher probability of being active themselves.

I’m going to call the known actives references. The molecules with unknown activities are queries. Trust me, this will make things easier later on.
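Conceptually, the search is just a nested loop over references and queries. A minimal Ruby sketch (hypothetical, not Zippity’s code; the similarity measure is passed in as a block):

```ruby
# Parse a file of "<string> <name>" lines into [string, name] pairs.
def read_molecules(path)
  File.readlines(path).map { |line| line.split }
end

# Score every query against every reference with the supplied similarity
# block, sorting by reference name, then by decreasing similarity.
def simsearch(references, queries)
  results = references.product(queries).map do |(rstr, rname), (qstr, qname)|
    [rname, qname, yield(rstr, qstr)]
  end
  results.sort_by { |rname, _, sim| [rname, -sim] }
end
```

The real script layers options, compressor calls and query categories on top of this skeleton.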


To carry out this similarity search use the simsearch.rb script:

			simsearch.rb <options> <references> <queries>

where <references> is the file containing the references, and <queries> is the file containing the queries. Results go to standard output, so you will want to redirect them to an appropriate file.

simsearch.rb options

The options available for simsearch are:

-n: normalise the similarities. This scales all the similarities by the self similarity of the reference, i.e. the similarity of the reference to itself. This is not necessarily 1.0!
-y: symmetrize the similarities. The official equation for the NCD involves compressing string X + string Y and string Y + string X separately (due to the vagaries of the various compression programs, this doesn’t necessarily give the same sized file). Work by Li, Vitanyi, and Cilibrasi only compresses string X + string Y to save time, as they found little difference in the similarities so produced. In our work, we found that there was a detectable difference for SMILES strings, so we used the ‘official’ version, which is used if you provide this option. This ensures that the similarity of string X relative to string Y is the same as that of string Y to string X. It doesn’t make a huge difference, but it may give you peace of mind. If you do set the -y option, you should also set the -n option.
-s similarity coefficient: the similarity coefficient to use. The current options are:
gzip: if you have gzip installed on your computer.
bzip2: if you have bzip2 installed on your computer.
tanimoto, cosine and hamming: if your string is binary, see below.
euclidean: if your string is real-valued, see below.

The tanimoto, cosine and hamming coefficients follow the standard definition for dichotomous variables. The binary string should just be a string of zeros and ones with no separator, e.g.:

			000100    methane
			001100    ethane
			101101    benzene

For the Euclidean distance, the actual value of the distance is the squared distance, to avoid an unnecessary square root computation. The Euclidean similarity is defined as 1 / (1 + distance), so that the maximum similarity is 1.0. The Euclidean distance can be applied to an integer or real-valued string, which requires a comma separator between each variable, e.g.:

			0.0,0.0,0.0,1.0 methane
			0.0,0.0,2.0,1.0 ethane
			6.0,0.0,1.0,1.0 benzene
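The Euclidean similarity just described can be sketched like so (a hypothetical helper, not Zippity’s code):

```ruby
# Squared Euclidean distance between two comma-separated value strings,
# turned into a similarity via 1 / (1 + distance), as defined above.
def euclidean_similarity(x, y)
  xs = x.split(",").map { |v| v.to_f }
  ys = y.split(",").map { |v| v.to_f }
  distance = xs.zip(ys).inject(0.0) { |sum, (a, b)| sum + (a - b)**2 }
  1.0 / (1.0 + distance)
end
```

For the methane and ethane rows above, the squared distance is 4.0, giving a similarity of 0.2.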

The bzip2 and gzip distances can be applied to any string-based representation. That includes the binary and real-valued representations above, but that’s probably not a very good idea. Ideally, the string-based representation should be alphabetic, and not rely on numbers, as general purpose compressors will consider 1 and 2 to be as different as 1 and 9.

-w padding: pads the string a number of times by concatenating the string to itself. There’s a certain overhead associated with a compressed file, such as storing the dictionary (for dictionary-based compressors). If your string is very small, then compressing it can result in a larger, not smaller, file. As a result, the sizes of the files are dominated by the book-keeping of the compression algorithm, not the relative information content of the strings. Given that SMILES strings are normally on the order of tens of characters, it seemed a possibility that this could have a major effect on the results. Therefore, we decided to see if we could bulk up the file a bit by repeating the text a few times. In our experiments, doubling the length of the string had the optimum effect for gzip. With bzip2 the effect was erratic. When padding, an extra line break ‘\n’ is added to mark the end of the molecule, so that, for example, the SMILES string for methane, ‘C’, becomes ‘C\nC’, not just ‘CC’, which would be indistinguishable from ethane. I suggest using padding with a value of 2.

The other options are safe to ignore, although you’re welcome to fiddle about with them. They represent stuff we tried and which didn’t pan out very well.
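The padding scheme described for -w is easy to picture in Ruby (a sketch; the helper name pad is made up):

```ruby
# Pad a string by joining w copies with line breaks, so methane ("C")
# padded with w = 2 becomes "C\nC" rather than "CC" (which would be
# indistinguishable from ethane).
def pad(s, w)
  ([s] * w).join("\n")
end
```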

To summarise the above, and assuming your references are in references.smi and your queries are in queries.smi, I recommend the following settings:

			simsearch.rb -s gzip -w 2 -y -n references.smi queries.smi


			simsearch.rb -s gzip -w 2 references.smi queries.smi

Both use gzip as the compressor and repeat the string an extra time. The first ensures that the similarities you get behave like a metric, but the latter is faster, and in our tests, the improvement from using the symmetrised value is not enormous.

HTS simulation

Having introduced the simsearch script for ‘real’ similarity searching, what about when you have a set of known actives and known inactives and want to run a simulation to see how well different searches perform?

Some new terminology is needed first. The normal way to go about this is to split your known actives into two sets. The first set are the references, just like in a ‘real’ similarity search. However, for the purposes of the simulation we then pretend that we don’t know that the other actives are active and that the inactives are really inactive. We call these actives that we’ve developed selective amnesia about, baits. We add the baits to the inactives, and the combined set of baits and inactives is now the query set.

Now that we have a set of references and queries, we can proceed as before. You could add the baits to the inactives manually, but that’s not only a bit tedious (especially if you repeat the process with different splits of references and baits several times); you also want to remember which queries were active and which were inactive after the similarity search is over, when it’s time to assess how well you’ve done. simsearch assumes that the second file you give it consists of molecules whose activities are all genuinely unknown (not just ones we’re temporarily pretending are unknown), so that’s no good.

Instead, you can provide simsearch with three files: the references, the baits, and the inactives. Here’s an example:

			simsearch.rb -s gzip -w 2 -y -n references.smi baits.smi inactives.smi

Now simsearch will merge the baits and inactives into a query set for you, but it will also keep track of which ones were from the inactives and which were actives. This is reflected in the output of simsearch.

simsearch output

The output of a simsearch run looks like:

			<reference name>    <query name>    <similarity>    <query category>

one per line. Output is sorted by reference name, then similarity. So if you had more than one reference in your references file, you will get the sorted list of similarities for the first reference (by name), followed by those of the second reference. For each reference, the similarities are sorted in non-increasing order, so the queries at the top are the most similar.

The <query category> is (currently) one of three things: active, inactive or query. If you gave simsearch two files, all the queries are tagged as query. If you gave it three files, all the molecules that came from the baits file are tagged as active, and those from the inactives file as inactive, so you will be able to tell how well the search has done in finding actives. An example is:

			a1a_inh38       a1a_inh42          0.64829357   active

Here, a1a_inh38 was the reference, a1a_inh42 was the query, their similarity was approximately 0.65, and a1a_inh42 came from the actives file. Note that the names of each molecule are given, but not the strings used in the similarity calculation (these can get quite long for unfolded fingerprints, for example).

If you’re using a set of genuine queries, this is pretty much as far as you can go without actually measuring some experimental activities, but if you’re carrying out a simulation, there are some other statistics you can run to help evaluate the similarity search.


searchstats.rb is easy to run:

			searchstats.rb <simsearch output file>

It outputs three statistics: the AUC, the enrichment factor at 1% and the enrichment factor at 5%.

For the enrichment factors, the 1% and 5% values may only be approximate, because it’s possible that at the cutoff, there are several molecules with the same similarity. All these molecules will be used in the calculation, so slightly more than 1% or 5% may be returned. Only in the most disastrous of circumstances should this have a major effect, but this is something to bear in mind when comparing different results.
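As a sketch of what an enrichment factor measures (hypothetical code, ignoring the tie handling just described): given the query categories sorted by decreasing similarity, it is the fraction of all actives retrieved in the top slice, divided by the size of that slice:

```ruby
# Enrichment factor at a given fraction of the ranked list. An EF of 10 at
# 1% means the top 1% contains ten times the actives that random picking
# would be expected to find.
def enrichment_factor(categories, fraction)
  n = (categories.size * fraction).ceil
  found = categories.first(n).count("active")
  total = categories.count("active")
  (found.to_f / total) / (n.to_f / categories.size)
end
```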

If you have more than one reference in the output file, the statistics are calculated for the results for each reference and then printed out sequentially.

You can’t currently add any extra statistics to the output without going into the source code and tinkering. I don’t currently recommend doing this.

makeconsensus.rb - Consensus scoring

It is common to attempt data fusion when you have more than one active (or different coefficients). For a given query with multiple results for different actives and coefficients, you can combine all the similarities into one value. The makeconsensus.rb script can help with this. Its usage is:

			makeconsensus.rb <options> file1 file2 file3...

You can fuse as many simsearch outputs as you want, but they must all have the same queries. Each output file can contain search results from one or more references.

The options available are:

-f method: the fusion method. This can be max or mean. For a given set of scores, either the mean value or their maximum will be used as the consensus score. The default is max.
-s method: the value to use as a score. This can be score or rank. If you choose score, then the similarity value is used as the score. If you choose rank, the rank of the molecule in the search is used as the score, with the most similar molecule being 1, the next most similar 2, and so on. If there are ties, the rank is the average of the ranks that would have been assigned if there weren’t any ties, e.g. if the top three molecules all have the same similarity, they each get the average rank of (1 + 2 + 3) / 3 = 2.0. The default is score.
-r: use this option to range scale the minimum and maximum similarities found for a given reference between 0.0 and 1.0. This is often recommended, but in our studies it generally didn’t help.
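The tied-rank rule for -s rank can be sketched as follows (a hypothetical helper, not the script’s actual code):

```ruby
# Ranks for scores taken in decreasing order; tied scores share the average
# of the ranks they would otherwise have received, e.g. a three-way tie at
# the top gives each molecule rank (1 + 2 + 3) / 3 = 2.0.
def average_ranks(scores)
  sorted = scores.sort.reverse
  rank_of = {}
  sorted.group_by { |s| s }.each do |score, group|
    first = sorted.index(score) + 1
    last = first + group.size - 1
    rank_of[score] = (first..last).inject(:+) / group.size.to_f
  end
  scores.map { |s| rank_of[s] }
end
```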

The following example assumes you have similarity search results in a file called search-results.out, and want to output the consensus score to a file called search-results.cons.out:

			makeconsensus.rb -f max -s score search-results.out > search-results.cons.out

The following command does the same (because -f defaults to max and -s defaults to score), but range scales the inputs with the -r flag:

			makeconsensus.rb -r search-results.out > search-results.cons.out

The output of makeconsensus is very similar to the output of simsearch, except that the reference name is now the names of all references used in the consensus, concatenated together and separated by colons, e.g.

			a1a_inh04:a1a_inh22:a1a_inh44    a1a_inh21    0.22544084    query

indicates that the consensus similarity for a1a_inh21 is approximately 0.23, the result of a consensus over a1a_inh04, a1a_inh22 and a1a_inh44.

The output of the consensus can be passed to searchstats, just like the output of simsearch.

Other useful (maybe) scripts

In our simulations, we repeated each simulation ten times, each time using a different five references from our actives file. makeref.rb can generate these splits for you. Again, assume that you have all your actives in actives.smi:

			makeref.rb -n 5 -s 10 actives.smi references.smi baits.smi

The -n option indicates the number of references required. The -s switch sets the random number seed (useful if you want to partition different representations of the same set of molecules in the same way).

The first file is the input filename. The last two indicate the name of the output file for the references and the baits, respectively.

After carrying out the simulations, you will probably want to know the average performance. collatestats.rb will do this for you. It can read in any number of output files from searchstats.rb and calculate the average and standard deviation for each statistic.

If your statistics from the different splits are in search.stat.1, search.stat.2 etc., you can find their averages with:

			collatestats.rb search.stat.*
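The averaging itself is just a mean and standard deviation over the splits; a sketch (whether collatestats.rb uses the sample or population standard deviation isn’t stated here; this uses the sample form):

```ruby
# Mean and sample standard deviation of one statistic across several splits.
def mean_and_sd(values)
  mean = values.inject(:+) / values.size.to_f
  variance = values.inject(0.0) { |s, v| s + (v - mean)**2 } / (values.size - 1)
  [mean, Math.sqrt(variance)]
end
```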

There are some other scripts in the bin/ directory, but they’re probably not very useful.