Precomputing data means parsing a dataset, computing its signature descriptors, and generating a numerical dataset. This speeds up future predictions, since CPSign does not have to compute signatures for every new prediction. If you are running ACP or VAP, you are likely better off running the train command directly to train your models, since the signature descriptors are computed within the train program. If, on the other hand, you wish to distribute the training over several splits, you will likely reduce both runtime and computational resources by running this program before doing the split at the training step.


The full usage menu can be retrieved by running the command:

> java -jar cpsign-[version].jar precompute -h

  precompute [options]
  precompute @/tmp/runconfigs/parameters.txt [options]
  precompute @C:\Users\User\runconfigs\parameters.txt [options]

  The precompute program computes signature descriptors for chemical files, producing a
  numerical data file and a signatures descriptor file. The precomputed data can later be
  used as input to other programs like train or tcp-predict, greatly reducing overall
  runtime in case several programs are used. For instance if TCP is used and different
  predictor or modeling parameters are performed using the same data, the precomputed data
  can be re-used several times instead of being recomputed for every prediction

    -mt | --model-type                       [id | text]
       Modeling type:
         (1) classification
         (2) regression
       Default: classification
    -td | --train-data                       [URI | path]
       File with molecules in SMILES, SDF or JSON format
    -md | --model-data                       [URI | path]
       File with molecules that exclusively should be used for training the scoring
       algorithm. In SMILES, SDF or JSON format
    -cd | --calibration-data                 [URI | path]
       File with molecules that exclusively should be used for calibrating predictions. In
       SMILES, SDF or JSON format
    -e  | --endpoint                         [text]
       Endpoint property that should be used for modeling (the endpoint of the model)
    -l  | --labels                           [label label] | [label,label]
       Label(s) for response values in classification mode. If a label is a negative
       numerical number, the minus sign must be escaped so that the command parser does
       not think it's a new option flag. E.g.: --labels [-1,1] (no blank-space permitted!)
       or --labels "\-1" 1

  Signature generation:
    -hs | --height-start                     [integer]
       Signatures start height
       Default: 1
    -he | --height-end                       [integer]
       Signatures end height
       Default: 3
    -sg | --signatures-generator             [id | text]
       Type of signatures that should be used, note that stereo-signatures take much
       longer time to compute. Options:
         (1) default/normal
         (2) stereo (experimental mode)
       Default: 1

    -mo | --model-out                        [path]
       Model file to generate (--model-out or --model-out-dir are required to pass)
    --model-out-dir                          [path]
       Specify a directory where the model should be saved, leave naming to cpsign
       (--model-out or --model-out-dir are required to pass). Specify '.' if model should
       be generated in the current directory.
  * -mn | --model-name                       [text]
       Model name for the OSGi plugin
    -mc | --model-category                   [text]
       The category of the model, will end up as model-endpoint in the OSGi
    -mv | --model-version                    [text]
       Optional model version in SemVer versioning format
       Default: 1.0.0_{date-time-string}

    --encrypt                                [URI | path]
       Path to the license file that the model should be encrypted by (can be the same as
       passed to --license)
       If two-factor encryption is used and key has a non-default PIN

  * --license                                [URI | path]
       Path or URI to license file
    -h  | --help | man
       Get help text
       Use shorter help text (used together with the --help argument)
    --logfile                                [path]
       Path to a user-set logfile, will be specific for this run
       Silent mode (only print output to logfile)
       Echo the input arguments given to CPSign
       Print wall-time for all individual steps in execution
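
The --labels description in the menu above gives two ways to pass a negative label. The sketch below only prints the arguments as the shell would hand them to CPSign, so the quoting can be checked without a license or a data file. The function names are hypothetical helpers, and the single quotes around the bracket form are an addition not taken from the help text: an unquoted [-1,1] is a filename pattern to the shell and could be expanded if a matching file exists in the current directory.

```shell
#!/bin/sh
# Hypothetical helpers: each prints one argument per line, exactly as
# CPSign would receive them.

# Form 1: bracketed list with no blank space between the labels.
# Single quotes keep the shell from treating [-1,1] as a glob pattern.
labels_bracket_form() {
  printf '%s\n' --labels '[-1,1]'
}

# Form 2: separate labels, with the minus sign escaped by a backslash.
# Inside double quotes the backslash before '-' is passed on unchanged.
labels_escaped_form() {
  printf '%s\n' --labels "\-1" 1
}

labels_bracket_form
labels_escaped_form
```

Either argument list can be appended to a precompute command in place of a plain --labels pair.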


"Exclusive" datasets

Version 0.6.0 added the flags --model-data and --calibration-data. These two flags make it possible to use data exclusively for either proper training or calibration. In principle, this allows data from old assays to be used for training without also being used for calibration. Both flags parse their data in the same way as the normal training file.

Note that these parameters are not valid in TCP, as all data is always used and there is no division into proper training and calibration sets.
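
As a rough sketch of how the exclusive flags might be combined with the normal training file, the script below assembles a precompute command and prints it one argument per line as a dry run. It uses only flags listed in the usage menu. All file paths, the output file name and the model name are placeholders, and the function name is a hypothetical helper. Whether --train-data may be omitted when the exclusive flags are given is not stated here, so the sketch keeps it.

```shell
#!/bin/sh
# Dry-run sketch: prints the command instead of running it.
# Paths, output file and model name are placeholders (assumptions).
precompute_exclusive() {
  printf '%s\n' \
    java -jar "cpsign-[version].jar" precompute \
    --license /path/to/Standard-license.license \
    --train-data /path/to/datafile.sdf \
    --model-data /path/to/old_assay.sdf \
    --calibration-data /path/to/new_assay.sdf \
    --labels mutagen nonmutagen \
    --endpoint "Ames test categorisation" \
    --model-out /tmp/acp_exclusive.precomp \
    --model-name Ames_exclusive_precomputed
}

precompute_exclusive
```

To run it for real, replace the placeholders and execute the printed command. No output is shown because the sketch has not been run against CPSign.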

Example usage

> java -jar cpsign-[version].jar precompute \
   --license /path/to/Standard-license.license \
   --train-data /path/to/datafile.sdf \
   --labels mutagen nonmutagen \
   --endpoint "Ames test categorisation" \
   --model-out /tmp/acp_classification.precomp \
   --model-name Ames_precomputed

Running with Standard License registered to [Name] at [Company]. Expiry
date is [Date]

Reading precompute file and performing signature generation..
Successfully parsed 123 molecules. Detected labels: 'mutagen'=64, 'nonmutagen'=59.
Generated 1930 new signatures.

Saving model to file..
Finished model saved at:
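
A variation on the example above, again as a dry run that only prints the command: a regression dataset with a narrower signature height range, saved with --model-out-dir so that CPSign chooses the file name. Every flag comes from the usage menu. The data path, endpoint name and model name are placeholders, the function name is a hypothetical helper, and passing the text value "regression" to --model-type is an assumption based on the [id | text] hint in the menu.

```shell
#!/bin/sh
# Dry-run sketch: prints the command instead of running it.
# Data path, endpoint and model name are placeholders (assumptions).
precompute_regression() {
  printf '%s\n' \
    java -jar "cpsign-[version].jar" precompute \
    --license /path/to/Standard-license.license \
    --model-type regression \
    --train-data /path/to/datafile.sdf \
    --endpoint "Hypothetical property name" \
    --height-start 1 \
    --height-end 2 \
    --model-out-dir . \
    --model-name Regression_precomputed
}

precompute_regression
```

The --labels flag is left out because the menu describes it as applying to classification mode.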