TMACC descriptor code
=====================

Requirements:

- Java 5.0 or later
- tmacc.jar
- nott-JOELib2-bin

nott-JOELib2 is a modified version of the JOELib2 API. Unfortunately, no
other version of JOELib2 can be used at the moment.

The classpath must include:

- the tmacc.jar file
- joelib2.jar and itext-0.94.jar, both in the JOELib2 lib directory

The program can then be run with:

    java -cp <classpath> ac.nott.tmacc.TmaccMaker <input SDF file> <response name> <output file>

The output type is determined by the extension of the output file: .csv for
CSV, .R for R and .arff for Weka.

Input requirements
------------------

- the dataset should be stored in one SDF file.
- the response variable you want to use should be included with each
  molecule, e.g. after the "M  END" line, but before the "$$$$" line, you
  should include something like the following (here the response is called
  ACTIVITY, as in the example further down):

      > <ACTIVITY>
      5.00

  for each molecule.
- each molecule should have a title on the first line of its entry (i.e.
  the first line after the $$$$ separator).

Output requirements
-------------------

- the program can output to one of:
    o ARFF format (Weka)
    o R format
    o CSV format
    o A special 'fgram' format for interpretation (see below)

An example using the Bash shell (Linux and Mac OS X users)
----------------------------------------------------------

Assuming that the JOELib2 lib directory lives at:

    /home/jamesm/nott-joe/lib

then something like:

    for a in /home/jamesm/nott-joe/lib/*.jar
    do
        TMACCCP=$TMACCCP:$a
    done

will add all jar files in the lib directory ($a already holds the full path
of each jar). Finally, you need to add tmacc.jar:

    TMACCCP=$TMACCCP:/home/jamesm/tmacc/tmacc.jar

Then, assuming you had a dataset called ace.sdf, with the responses called
ACTIVITY, and you wanted to output the descriptors in CSV format, you would
call:

    java -cp $TMACCCP ac.nott.tmacc.TmaccMaker ace.sdf ACTIVITY ace.csv

Presumably Linux users with a different shell are savvy enough to work out
how to adapt this for their own use.

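
As a rough sketch, the classpath-building steps above could be wrapped in a
small shell function. The name build_tmacc_classpath and its two arguments
are hypothetical and not part of TMACC; the paths in the usage comment are
the ones assumed in this example and will differ on your system.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with TMACC): print a classpath made of
# every .jar file in a lib directory, followed by the tmacc.jar path.
build_tmacc_classpath() {
    local lib_dir="$1"
    local tmacc_jar="$2"
    local cp=""
    local jar
    for jar in "$lib_dir"/*.jar; do
        # If the directory holds no jars, the pattern stays unexpanded; skip it.
        [ -e "$jar" ] || continue
        # Append with a ':' separator, but avoid a leading ':' on the first jar.
        cp="${cp:+$cp:}$jar"
    done
    cp="${cp:+$cp:}$tmacc_jar"
    printf '%s\n' "$cp"
}

# Usage, with the paths assumed in this README:
#   TMACCCP=$(build_tmacc_classpath /home/jamesm/nott-joe/lib /home/jamesm/tmacc/tmacc.jar)
#   java -cp "$TMACCCP" ac.nott.tmacc.TmaccMaker ace.sdf ACTIVITY ace.csv
```

Quoting "$TMACCCP" in the java call keeps the command working if any
directory name contains spaces.
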
Windows XP users should edit the CLASSPATH environment variable (or create
it if it does not exist) via: Start -> Control Panel -> System. Then click
the Advanced tab, followed by the Environment Variables button.

The fgram output
----------------

To see which atom pairs are contributing to which descriptor, provide an
output filename ending with 'fgram'. For each molecule and pair of
properties, there will be an entry for each distance, e.g.

    ScaledAtomPartialPositiveCharge:ScaledAtomPartialPositiveCharge:2

where the number at the end is the topological distance. The next number is
the actual value of the largest interaction, in this case the largest
positive partial charge-positive partial charge interaction separated by two
bonds. The following two numbers are the indexes (starting from one) of the
atoms involved in the interaction, in the same order as in the SDF file,
except that any non-polar hydrogens will have been removed, leading to
renumbering.

Where there is more than one maximum interaction in the molecule, all pairs
are stored. These are separated by semi-colons. An example is:

    LogP:LogP:3 1.5; 1 2 1.5; 6 8 1.5;

In this case, the interaction is the logP-logP interaction separated by 3
bonds. The largest recorded interaction for that distance was 1.5; two atom
pairs in the molecule had that value: atoms 1 & 2, and 6 & 8.

To help match atom indexes to atoms in the input file, the following utility
will generate the SDF file with the non-polar hydrogens removed:

    java -cp <classpath> ac.nott.tmacc.RemoveNonPolarH

Source files
------------

The source files are also available. They come with no documentation, so use
them at your own risk! Like JOELib2, all the source code is licensed under
the GPL.

Problems?
---------

e-mail jonathan.hirst@nottingham.ac.uk