Precompute command

Precomputing data means parsing a dataset, computing the signatures descriptors and generating a sparse data set. This speeds up future predictions, because CPSign does not have to compute signatures for every new prediction. If you are running ACP or CCP, you are likely better off running the train command directly to train your models (the computation of signatures descriptors is already done within train).


The full usage menu can be retrieved by running the command:

> java -jar cpsign-[version].jar precompute -h

   precompute [options]
   precompute @/tmp/runconfigs/parameters.txt [options]
   precompute @C:\Users\User\runconfigs\parameters.txt [options]

  The precompute command performs signature generation of SMILES and SDF files in TCP problems,
  producing a sparse data file and a signatures file. TCP, in contrast to ACP/CCP, will train new
  models in each prediction thus not making it possible to train models before hand. By precomputing
  the records and signatures before hand, the precomputed data can be used as input to predict, thus
  skipping the signature generation otherwise needed.

  Input options:
      -t, --trainfile  [URI] or [path]
         Training file in SDF or SMILES format
      -m, --modelfile  [URI] or [path]
         Model file with precomputed data
      -pt, --proper-trainfile  [URI] or [path]
         Training file for molecules that exclusively should be used for training the scoring
         algorithm. In SMILES, SDF or JSON format
      -ct, --calibration-trainfile  [URI] or [path]
         Training file for molecules that exclusively should be used for calibrating the predictions.
         In SMILES, SDF or JSON format
      -rn, --response-name  [text]
         (SDFile) Name of response value to model, should match a property in the train file
         (SMILES file) Name of the column to model, should match header of that column
      -l, --labels  [label1 label2] or [label1,label2]
         Label(s) for response values in classification mode. If a label is a negative numerical
         number, the minus sign must be escaped so that the command parser does not think it's a new
         option flag. E.g.: --labels [-1,1] (no blank-space permitted!) or --labels "\-1" 1
      -c, --cptype  [integer]
         Type: 1) classification, 2) regression
         Default: 1

   Signature generation options:
      -hs, --height-start  [integer]
         Signatures start height
         Default: 1
      -he, --height-end  [integer]
         Signatures end height
         Default: 3
      -sg, --signatures-generator  [text]
         Type of signatures that should be used, note that stereo-signatures take much longer time to
         compute. Options:
          normal (default)
          stereo (experimental mode)
         Default: default

   Output options:
    * -mo, --model-out  [path]
         Model file to generate. Either give a fully specified file including a valid file suffix
         (.cpsign, .osgi, .jar) or a directory where the model should be generated (cpsign will create
         a unique file name for you)
    * -mn, --model-name  [text]
         Model name for the OSGi plugin
      -mc, --model-category  [text]
         The category of the model, will end up as model-endpoint in the OSGi
      -mv, --model-version  [text]
         Optional model version in SemVer versioning format
         Default: 1.0.0_2017-10-09_15:15:57.861

   Encryption options:
      --encrypt  [path]
         Path to the license file that the model should be encrypted by (can be the same as passed to
         If two-factor encryption is used and key has a non-default PIN

   General options:
    * --license  [path]
         Path to license file
      --logfile  [path]
         Path to a user set logfile, will be specific for this run
         Silent mode (only print output to logfile)
         Default: false
         Echo the input arguments given to CPSign
         Default: false
      -h, --help
         Get help for this command
         Default: false
         Print wall-time for all individual steps in execution
         Default: false
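
The escaping rule for negative numerical labels, described under --labels in the menu above, can be sketched as follows. This is only an illustration: the file paths, the response name "Activity" and the model name are placeholders, and the script prints the assembled commands instead of running CPSign. Quoting the bracket form is an extra precaution on the shell side, since an unquoted [-1,1] is a filename pattern to most shells.

```shell
#!/bin/sh
# Sketch only: paths, response name and model name are placeholders.

# Form 1: bracket list with no blank space inside. The single quotes stop the
# shell from treating [-1,1] as a filename pattern before CPSign sees it.
set -- precompute --license /path/to/Standard-license.license \
  -t /path/to/datafile.sdf -rn "Activity" \
  --labels '[-1,1]' \
  --model-out /tmp/ --model-name numeric_labels
FORM_BRACKET="$*"

# Form 2: space-separated labels. The minus sign is escaped so that the
# command parser does not read -1 as a new option flag.
set -- precompute --license /path/to/Standard-license.license \
  -t /path/to/datafile.sdf -rn "Activity" \
  --labels "\-1" 1 \
  --model-out /tmp/ --model-name numeric_labels
FORM_ESCAPED="$*"

# Dry run: print both variants for review.
echo "java -jar cpsign-[version].jar $FORM_BRACKET"
echo "java -jar cpsign-[version].jar $FORM_ESCAPED"
```

Both variants are meant to pass the same two labels, -1 and 1; which one to use is a matter of taste.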

"Exclusive" datasets

Version 0.6.0 adds the flags --proper-trainfile and --calibration-trainfile. These two flags make it possible to use data exclusively for either proper training or calibration. For example, data from old assays can be used for training the scoring algorithm without also being used for calibration. Both flags parse their data in the same way as the normal training file.

Note that these parameters are not valid in TCP, as all data is always used and there is no division into proper training and calibration sets.
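
As a rough sketch of the old-assay scenario described above, the snippet below assembles a precompute call that adds an old assay through --proper-trainfile alongside the normal training file. All file paths and the model name are placeholders. Combining -t with -pt in a single call is an assumption based on the flag descriptions in the usage menu, not a verified recipe. The script only prints the command so it can be reviewed first.

```shell
#!/bin/sh
# Sketch only: every path and the model name below are placeholders.
# Assumption: -t and -pt can be combined in one precompute call.
set -- precompute \
  --license /path/to/Standard-license.license \
  -t /path/to/new_assay.sdf \
  -pt /path/to/old_assay.sdf \
  -rn "Ames test categorisation" \
  -l nonmutagen mutagen \
  --model-out /tmp/ \
  --model-name Ames_precomputed

# Dry run: print the assembled command for review. The printed form loses the
# quotes around the response name, so do not copy-paste it verbatim.
# To execute for real, replace the two lines below with a call to
# java -jar <your cpsign jar> "$@"
CPSIGN_CMD="java -jar cpsign-[version].jar $*"
echo "$CPSIGN_CMD"
```

Data that should only be used for calibration would be passed the same way with -ct (--calibration-trainfile) instead of -pt.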

Example usage

> java -jar cpsign-[version].jar precompute \
   --license /path/to/Standard-license.license \
   -t /path/to/datafile.sdf \
   -l nonmutagen mutagen \
   -rn "Ames test categorisation" \
   --model-out /tmp/ \
   --model-name Ames_precomputed

Running with Standard license: License registered to: [Name] [Company]. Expiry date is: [Date]

Reading precompute file and performing signature generation..
Parsed: 10 molecules from SDFile. Detected labels: 'nonmutagen'=4, 'mutagen'=6

Saving model to file..
Packaged model file: /tmp/Ames_precomputed.cpsign