Predict command

The predict command makes predictions on either individual molecules or files of molecules. ACP and CCP predictions use already trained models. TCP predictions use either precomputed data or molecule files that are converted into sparse data on the fly.


The full usage menu can be retrieved by running the command:

> java -jar cpsign-[version].jar predict -h

   predict [options]
   predict @/tmp/runconfigs/parameters.txt [options]
   predict @C:\Users\User\runconfigs\parameters.txt [options]

   Predict new examples given trained models (ACP/CCP), precomputed data (TCP) or training data in
   SDF/SMILES format (TCP).

   Input options:
      -m, --modelfile  [URI] or [path]
         (ACP/CCP) Existing model files (libsvm/liblinear format)
         (TCP) Existing precomputed data
      -t, --trainfile  [URI] or [path]
         (TCP) Training file in SDF or SMILES format
      -rn, --response-name  [text]
         Only required when running TCP classification without precomputed data.
         (TCP - SDFile) Name of response value to model, should match a property in the train file
         (TCP - SMILES file) Name of the column to model, should match header of that column
      -l, --labels  [label1 label2] or [label1,label2]
         Label(s) for response values in classification mode. If a label is a negative numerical
         number, the minus sign must be escaped so that the command parser does not think it's a new
         option flag. E.g.: --labels [-1,1] (no blank-space permitted!) or --labels "\-1" 1

   Modeling options:
    * -c, --cptype  [integer]
         Model type: 1) ACP classification, 2) ACP regression, 3) CCP classification, 4) CCP
         regression, 5) TCP classification

   (TCP only) modeling options:
      --cost  [number]
         User defined Cost value in SVM training
      --gamma  [number]
         User defined Gamma value in SVM training (only used in libsvm)
      --epsilon  [number]
         User defined Epsilon value in SVM training
      --epsilon-svr  [number]
         User defined epsilon in loss function of epsilon-SVR
         Default: 0.1

   (TCP only) Signature generation options:
      -hs, --height-start  [integer]
         Signatures start height
         Default: 1
      -he, --height-end  [integer]
         Signatures end height
         Default: 3
      -sg, --signatures-generator  [text]
         Type of signatures that should be used, note that stereo-signatures take much longer time to
         compute. Options:
          normal (default)
          stereo (experimental mode)
         Default: default

   (TCP only) Computing percentiles options:
      --percentiles  [integer]
         This option is solely for when running predict in tcp mode using chemical data (i.e. not run
         precompute or train before) and predicting gradients for molecules (either for images or the
         gradients themselves). This is an extremely time consuming step in predicting. Percentiles are
         only used for image rendering and calculating gradients for predictions, so save time and set
         this flag to 0 if neither of these will be used.
         Default: 0

   Prediction options:
      -sm, --smiles  [SMILES]
         SMILES string to predict, can optionally include a blank space and a molecule name/identifier
      -p, --predictfile  [URI] or [path]
         File to predict. Accepted formats are SMILES-file (one SMILES per line, optionally including
         tab-delimited data), SDFfile
      -co, --confidences  [confidence1 confidence2 .. ] or [confidence1, confidence2, .. ]
         Confidences for predictions (e.g. '0.5,0.7,0.9' or '0.5 0.7 0.9'). Should be in the range
      -di, --distances  [distance1 distance2 .. ] or [distance1, distance2, .. ]
         (ACP/CCP regression only) Distances from the predicted midpoint (e.g. '0.5,2,5' or '0.5 2 5')
      -cg, --calculate-gradient
         Calculate the Significant Signature of molecules
         Default: false

   Output options:
      -of, --output-format  [text value]
         output format of predictions, options:
         Default: json
         Generate InChI and InChIKey in the output
         Default: false
      -o, --output  [path]
         File to write prediction output to (default is printing to screen)
         If the outputfile should be compressed
         Default: false

   Encrypt/compress options:
         If two-factor encryption is used and key has a non-default PIN

   Image output options:
      -im, --images
         Create images for each predicted molecule. Image creation is turned on by passing this flag
         Default: false
      -if, --imagefile  [path]
         Path to where generate images should be saved, can either be a path to a specific folder or a
         full path include a file name (only .png file ending supported). Every image will be named
         '[name]-[count].png' or '[name]-[$cdk:title].png' where name is either a default name or the
         specified name to this flag (e.g. '.' - current folder using default file name,
         '/tmp/imgs/DefaultImageName.png' - use /tmp/imgs/ as directory and use 'DefaultImageName' as file name)
         Default: imgs/Image.png
         Colors only the atoms belonging to the significant signature of the prediction.
         (Classification) If the p-value for first class is highest, the atoms part of the significant
         signature will be colored as the lowest color of the set gradient, or vice versa (if p-value
         for second class is highest, atoms part of significant signature will be colored after the
         highest color of the set gradient). Note that color-scheme 'rainbow' is not recommended when
         using this option, as both classes will be colored red.
         (Regression) The atoms part of the significant signature will be colored to the highest color
         of the set gradient.
         Default: false
      -cs, --color-scheme  [text]
         The specified color-scheme (case in-sensitive), options:
           blue:red (default)
         Default: default
         Add a color legend at the bottom of the image
         Default: false
         Use a custom color scheme
         Depict atom numbers
         Default: false
      --atom-number-color  [color name] or [hex color]
         Color of the atom numbers
         Default: BLUE
      -ih, --image-height  [integer]
         The height of the generated images (in pixels)
         Default: 400
      -iw, --image-width  [integer]
         The width of the generated images (in pixels)
         Default: 400

   General options:
    * --license  [path]
         Path to license file
      --logfile  [path]
         Path to a user set logfile, will be specific for this run
         Silent mode (only print output to logfile)
         Default: false
         Echo the input arguments given to CPSign
         Default: false
      -h, --help
         Get help for this command
         Default: false
         Print wall-time for all individual steps in execution
         Default: false

The full list of prediction parameters is a bit overwhelming, so each main part is described in its own sub-section below. For Image options, please refer to the Image rendering page.

Input Options ACP/CCP

ACP and CCP predictions take already trained models as input; these should have been generated using the train command. The following flag is used in ACP/CCP:

Flag: -m, --modelfile
Description: The model generated by the train command.

Input Options TCP

Flag: -m, --modelfile
Description: If precomputed data has been generated using the precompute command, give the model generated there.

Flag: -t, --trainfile
Description: If no precomputed data exists, pass the SMILES/SDF file that should be used for training to this flag. CPSign will perform signature generation before predicting.

Flag: -rn, --response-name
Description: Used together with the --trainfile flag, either specifying which property should be modeled in an SDF file, or deciding which column in a SMILES file should be modeled. If the trainfile is in SMILES format and the first column after the SMILES column is the property that should be modeled, this flag does not need to be passed, as the second column is the default column to model on.

Flag: -l, --labels
Description: Used together with the --trainfile flag, as the labels need to be specified. If using the --modelfile flag, labels are saved within the model bundle and do not need to be passed again.
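
The three TCP flags above are typically combined in one call. The following is a minimal sketch, assuming a POSIX shell: the train file `data/train.smi`, the column header `activity` and the class labels -1 and 1 are made-up placeholders, not files or names shipped with CPSign. The negative label is escaped as shown in the usage menu. The command is only printed, one argument per line, so the quoting can be checked before anything is run.

```shell
# Hypothetical inputs, for illustration only
CPSIGN_JAR="cpsign-[version].jar"   # placeholder jar name, as in the examples on this page
TRAIN_FILE="data/train.smi"         # assumed SMILES file with a header line
RESPONSE="activity"                 # assumed header of the column to model

# Build the argument list; the minus sign of the first label is escaped
set -- java -jar "$CPSIGN_JAR" predict \
   --license /path/to/Standard-license.license \
   -c 5 \
   --trainfile "$TRAIN_FILE" \
   --response-name "$RESPONSE" \
   --labels "\-1" 1 \
   --smiles CCO \
   -co 0.7,0.9

# Print one argument per line; replace this line with "$@" to run the command
printf '%s\n' "$@"
```

Printing first makes it easy to confirm that the escaped label reaches CPSign as a single argument and is not mistaken for a new option flag.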

Prediction Options

Flag: -cg, --calculate-gradient
Description: Pass this flag to calculate the gradient and the significant signature of a molecule. This is rather computationally heavy and is therefore not computed by default.

Flag: -co, --confidences
Description: A list of confidences that should be used for prediction.

Flag: -di, --distances
Description: Given a distance d (given by the input to --distances), predict the confidence of the real value (y) being within an interval +/-d from the estimated value (ŷ) from the model (i.e. for y lying within [ ŷ-d, ŷ+d ]).
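
As a worked illustration of the interval, the block below restates the condition and plugs in numbers. The values ŷ = 3.2 and d = 0.5 are invented for the example and do not come from any CPSign model.

```latex
% The interval condition, written two equivalent ways
\[
  y \in [\hat{y} - d,\ \hat{y} + d]
  \quad\Longleftrightarrow\quad
  |y - \hat{y}| \le d
\]
% Made-up numbers: predicted value 3.2, requested distance 0.5
\[
  \hat{y} = 3.2,\quad d = 0.5
  \quad\Longrightarrow\quad
  [\hat{y} - d,\ \hat{y} + d] = [2.7,\ 3.7]
\]
```

In this hypothetical case the reported confidence would refer to the real value lying between 2.7 and 3.7.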

Flag: -p, --predictfile
Description: A SMILES or SDF file with molecules to make predictions on.

Flag: -sm, --smiles
Description: A single SMILES to predict on, can also optionally include a blank space after the SMILES string and include an identifier.
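
Because the SMILES and its identifier are separated by a blank space, the value presumably has to be quoted so that the shell hands it over as one argument. The sketch below shows the difference, assuming a POSIX shell. `count_args` is a helper written for this illustration only, and `ethanol` is a made-up identifier.

```shell
# Illustration-only helper: reports how many arguments the shell passes on
count_args() { echo "$#"; }

# Unquoted: the identifier is split off into an argument of its own
count_args --smiles CCO ethanol      # prints 3

# Quoted: the SMILES and the identifier stay together as one value
count_args --smiles "CCO ethanol"    # prints 2
```

The same quoting applies when the value is passed to the predict command, e.g. `--smiles "CCO ethanol"`.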

Output Options

Flag: -of, --output-format
Description: The output type of the prediction. Can be json, smiles/plain or sdf (in MDL v2000 or v3000 format), where json is the default option.

Flag: -o, --output
Description: Path to a file that the prediction output should be written to. By default the output is printed to the screen.

Flag: --output-inchi
Description: Generate InChI and InChIKey within CPSign and add it to the output

Flag: --compress
Description: If the output file should be gzipped. Note that compression can only be performed when writing to a file, or when using the --silent flag so that only the prediction result is printed; the prediction can then be piped to an output file.
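
The second variant, compressing while redirecting the result, could look roughly like the sketch below. It assumes a POSIX shell with `gzip` installed, and that --compress and --silent are plain on/off flags that take no value. The jar name, model path and output file name are placeholders. The command only runs if the jar is actually found.

```shell
CPSIGN_JAR="cpsign-[version].jar"     # placeholder, point this at your jar
OUT_FILE="predictions.json.gz"        # hypothetical output file name

if [ -f "$CPSIGN_JAR" ]; then
   # With --silent only the prediction result is printed, so the
   # redirected stream should contain nothing but the gzipped output
   java -jar "$CPSIGN_JAR" predict \
      --license /path/to/Standard-license.license \
      -c 5 \
      --modelfile models/Ames_precomputed.cpsign \
      --smiles CCO \
      -co 0.7,0.9 \
      --compress --silent > "$OUT_FILE"

   # Decompress to the screen to check the result
   gzip -dc "$OUT_FILE"
else
   echo "CPSign jar not found at '$CPSIGN_JAR'; nothing was run"
fi
```

When writing with --output instead, the redirection is not needed and --silent can be left out.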

Example Usage

Example (TCP classification with precomputed model):

> java -jar cpsign-[version].jar predict \
   --license /path/to/Standard-license.license \
   -c 5 \
   --smiles CC1=CC(=CC=C1C(=O)C2=CC=C(C=C2)C#N)N3N=CC(=O)NC3=O \
   --modelfile models/Ames_precomputed.cpsign \
   -co 0.7,0.9

Running with Standard license: License registered to: [Name] [Company] . Expiry date is: [Date]

Loading precomputed model..
Loaded model with 10 records

Starting to do predictions..
                             "nonmutagen": 0.448,
                             "mutagen": 1.0
                                     "confidence": 0.7,
                                     "confidence": 0.9,
                     "InChI": "InChI=1S\/C18H12N4O3\/c1-11-8-14(22-18(25)21-16(23)10-20-22)6-7-15(11)17(24)13-4-2-12(9-19)3-5-13\/h2-8,10H,1H3,(H,21,23,25)",
                     "SMILES": "CC1=CC(=CC=C1C(=O)C2=CC=C(C=C2)C#N)N3N=CC(=O)NC3=O",
                     "InChIKey": "ZJYUMURGSZQFMH-UHFFFAOYSA-N"
Successfully predicted 1 molecule

Example (TCP classification without precomputed model)

> java -jar cpsign-[version].jar predict \
   --license /path/to/Standard-license.license \
   -c 5 \
   --smiles CC1=CC(=CC=C1C(=O)C2=CC=C(C=C2)C#N)N3N=CC(=O)NC3=O \
   --trainfile data/Ames_mini.sdf \
   --labels nonmutagen mutagen \
   -rn "Ames test categorisation" \
   -co 0.7,0.9 \
   -of smiles

Running with Standard license: License registered to: [Name] [Company] . Expiry date is: [Date]

Reading train file and performing signature generation..
Parsed: 123 molecules from SDFile. Detected labels: 'mutagen'=64, 'nonmutagen'=59. Generated 1930 signatures.

Starting to do predictions..
SMILES       P-values        Predicted labels (confidence=0.9)       Predicted labels (confidence=0.7)
CC1=CC(=CC=C1C(=O)C2=CC=C(C=C2)C#N)N3N=CC(=O)NC3=O   {nonmutagen=0.172, mutagen=0.618}       [nonmutagen, mutagen]   [mutagen]
Successfully predicted 1 molecule

Example (ACP regression):

> java -jar cpsign-[version].jar predict \
   --license /path/to/Standard-license.license \
   --modelfile /tmp/datamodels/models.svm \
   -c 2 \
   -co 0.5,0.7,0.9 \
   --predictfile data/solubility_10.smi \
   -o output/prediction_out.sdf \
   --output-format sdf

Running with Standard license: License registered to: [Name] [Company] . Expiry date is: [Date]

Loading model from file..
 - Loaded model 1/5
 - Loaded model 2/5
 - Loaded model 3/5
 - Loaded model 4/5
 - Loaded model 5/5

Starting to do predictions..
Successfully predicted 10 molecules