Online-predict

The online-predict program performs predictions on individual molecules and/or files of molecules, just like the predict program. The difference is that no models need to be trained beforehand; instead, predictions are made in an online fashion, with models trained on the fly.

Parameters

The full usage menu can be retrieved by running the command:

> java -jar cpsign-[version].jar online-predict -h

                                      online-predict
SYNOPSIS
------------------------------------------------------------------------------------------
  online-predict [options]
  online-predict @/tmp/runconfigs/parameters.txt [options]
  online-predict @C:\Users\User\runconfigs\parameters.txt [options]


DESCRIPTION
------------------------------------------------------------------------------------------
  Train and predict new examples on the fly. Will not save the model. Currently only
  available for TCP.


OPTIONS
------------------------------------------------------------------------------------------
  Input:
    -mi | --model-in                         [URI | path]
       Precomputed CPSign classification model
    -td | --train-data                       [URI | path]
       File with molecules in SMILES, SDF or JSON format (used for deriving the predictive
       model)
    -sm | --smiles                           [SMILES]
       SMILES string to predict, can optionally include a blank space and a molecule
       name/identfier
    -p  | --predict-file                     [URI | path]
       File to predict. Accepted formats are SMILES, SDF or JSON
    -e  | --endpoint                         [text]
       Endpoint property that should be used for modeling (the endoint of the model)
    -l  | --labels                           [label label] | [label,label]
       Label(s) for response values in classification mode. If a label is a negative
       numerical number, the minus sign must be escaped so that the command parser does
       not think it's a new option flag. E.g.: --labels [-1,1] (no blank-space permitted!)
       or --labels "\-1" 1

  Modeling:
    -i  | --impl                             [id | text]
       Scoring algorithm (i.e. underlying machine learning implementation):
         (1) liblinear
         (2) libsvm
       Default: 1
    --cost                                   [number]
       User defined Cost value in SVM training
       Default: 50.0
    --gamma                                  [number]
       User defined Gamma value in SVM training (only used in libsvm)
       Default: 0.002
    --epsilon                                [number]
       User defined tolerance of termination criterion
       Default: 0.001
    --epsilon-svr                            [number]
       User defined epsilon in loss function of epsilon-SVR
       Default: 0.1
    --seed                                   [integer]
       Set this flag if an explicit seed should be used in randomization of training data,
       default is using a random seed
    --percentiles                            [integer]
       The maximum number of molecules used for calculating percentiles. This will only be
       used in case image-generation should performed.
       Default: 1000

  Signature generation:
    -hs | --height-start                     [integer]
       Signatures start height
       Default: 1
    -he | --height-end                       [integer]
       Signatures end height
       Default: 3
    -sg | --signatures-generator             [id | text]
       Type of signatures that should be used, note that stereo-signatures take much
       longer time to compute. Stereo signatures also requires input data to have stereo
       information explicitly given in the file. Options:
         (1) default | normal
         (2) stereo (experimental mode)
       Default: 1

  Filtration:
    --duplicates                             [id | text]
       Resolve/remove potential duplicates which can make it difficult for the SVM to find
       a good decision plane. Replace duplicates by a single record with a new label or
       remove all conflicting records. Regression options:
         (1) median
         (2) mean
         (3) min
         (4) max
         (5) remove:[maximum allowed difference]
       Classification options:
         (5) remove
         (6) vote
         (7) keep:[label]
    --filters                                [id | text]
       Filters to apply on the records, currently only filters records based on the
       endpoint value for regression. Options:
         (1) min:[min]
         (2) max:[max]
         (3) range:[min]:[max]

  Prediction:
    -co | --confidences           [confidence confidence .. ] | [confidence,confidence,..]
       Confidences for predictions (e.g. '0.5,0.7,0.9' or '0.5 0.7 0.9'). Should be in the
       range [0,1]
    -cg | --calculate-gradient
       Calculate the Significant Signature of molecules

  Output:
    -of | --output-format                    [id | text]
       Output format of predictions, options:
         (1) json
         (2) smiles | plain
         (3) sdf | sdf-v2000
         (4) sdf-v3000
       Default: 1
    -o  | --output                           [path]
       File to write output to (default is printing to screen)
    --output-inchi
       Generate InChI and InChIKey in the output
    --compress
       If the outputfile should be compressed (only possible when writing to file)

  Encryption:
    --two-factor-pin
       If two-factor encryption is used and key has a non-default PIN

  Gradient image output:
    -gi | --gradient-images
       Create a Gradient image for each predicted molecule.
    -if | --image-file                       [path]
       Path to where generated images should be saved, can either be a path to a specific
       folder or a full path including a file name (only .png file ending supported).
       Every image will be named '[name]-[count].png' or '[name]-[$cdk:title].png' where
       name is either a default name or the specified name to this parameter (e.g. '.' -
       current folder using default file name, '/tmp/imgs/DefaultImageName.png' - use
       /tmp/imgs/ as directory and use 'DefaultImageName' as file name)
       Default: imgs/GradientDepiction.png
    -cs | --color-scheme                     [text]
       The specified color-scheme (case in-sensitive), options:
         (1) blue:red
         (2) red:blue
         (3) red:blue:red
         (4) cyan:magenta
         (5) rainbow
             custom - contact GenettaSoft for custom requirements!
       Default: 1
    --color-legend
       Add a color legend at the bottom of the image
    --atom-numbers
       Depict atom numbers
    --atom-number-color                      [color name] | [hex color]
       Color of the atom numbers
       Default: BLUE
    -ih | --image-height                     [text]
       The height of the generated images (in pixels)
       Default: 400
    -iw | --image-width                      [integer]
       The width of the generated images (in pixels)
       Default: 400

  Significant Signature image output:
    -si | --signature-images
       Create a Significant Signature image for each predicted molecule
    -sf | --signature-image-file             [path]
       Path to where generated images should be saved, can either be a path to a specific
       folder or a full path including a file name (only .png file ending supported).
       Every image will be named '[name]-[count].png' or '[name]-[$cdk:title].png' where
       name is either a default name or the specified name to this parameter (e.g. '.' -
       current folder using default file name, '/tmp/imgs/DefaultImageName.png' - use
       /tmp/imgs/ as directory and use 'DefaultImageName' as file name)
       Default: imgs/SigificantSignatureDepiction.png
    -hc | --highlight-color                  [color name] | [hex color]
       The color that should be used for the highlighting of the significant signature
       Default: BLUE
    --signature-color-legend
       Add a color legend at the bottom of the image
    --signature-atom-numbers
       Depict atom numbers
    --signature-atom-number-color            [color name] | [hex color]
       Color of the atom numbers
       Default: BLUE
    -sh | --signature-image-height           [text]
       The height of the generated images (in pixels)
       Default: 400
    -sw | --signature-image-width            [integer]
       The width of the generated images (in pixels)
       Default: 400

  General:
  * --license                                [URI | path]
       Path or URI to license file
    -h  | --help | man
       Get help text
    --short
       Use shorter help text (used together with the --help argument)
    --logfile                                [path]
       Path to a user-set logfile, will be specific for this run
    --silent
       Silent mode (only print output to logfile)
    --echo
       Echo the input arguments given to CPSign
    --progress-bar
       Add a Progress bar in the system error output
    --progress-bar-ascii
       Add a Progress bar in ASCII in the system error output
    --time
       Print wall-time for all individual steps in execution

------------------------------------------------------------------------------------------

The list of parameters is even longer than the one for predict, as there are additional input options as well as options for signature generation and modeling. Once again, the parameters can be retrieved one section at a time, for instance:

> java -jar cpsign-[version].jar online-predict input -h

                                      online-predict
------------------------------------------------------------------------------------------
  Input:
    -mi | --model-in                         [URI | path]
       Precomputed CPSign classification model
    -td | --train-data                       [URI | path]
       File with molecules in SMILES, SDF or JSON format (used for deriving the predictive
       model)
    -sm | --smiles                           [SMILES]
       SMILES string to predict, can optionally include a blank space and a molecule
       name/identfier
    -p  | --predict-file                     [URI | path]
       File to predict. Accepted formats are SMILES, SDF or JSON
    -e  | --endpoint                         [text]
       Endpoint property that should be used for modeling (the endoint of the model)
    -l  | --labels                           [label label] | [label,label]
       Label(s) for response values in classification mode. If a label is a negative
       numerical number, the minus sign must be escaped so that the command parser does
       not think it's a new option flag. E.g.: --labels [-1,1] (no blank-space permitted!)
       or --labels "\-1" 1

Example Usage

TCP classification with chemical input data:

> java -jar cpsign-[version].jar online-predict \
   --license /path/to/Standard-license.license \
   --smiles O=Cc1ccc(O)c(OC)c1 \
   --endpoint "Ames test categorisation" \
   --labels mutagen,nonmutagen \
   --time \
   --percentiles 0 \
   --train-data data/ames_small.sdf.gz

Running with Standard License registered to [Name] at [Company]. Expiry
date is [Date]

Reading train file and performing signature generation..
Successfully parsed 123 molecules. Detected labels: 'mutagen'=64, 'nonmutagen'=59.
Generated 1930 new signatures.
(1 s)

Training TCP predictor..
Finished
(0 s)

Starting to do predictions..
{
     "prediction": {
             "pValues": {
                     "nonmutagen": 0.204,
                     "mutagen": 0.0
             }
     },
     "molecule": {
             "SMILES": "O=Cc1ccc(O)c(OC)c1"
     }
}
Successfully predicted 1 molecule
(0 s)
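
The p-values in the JSON output can be turned into a prediction set for a chosen confidence level. The sketch below is a post-processing illustration only, not part of CPSign: the function name prediction_set is hypothetical, and it assumes the common conformal-prediction convention that a label is included when its p-value is strictly greater than the significance level (1 - confidence). The JSON record is copied from the output above.

```python
import json

# JSON record copied from the example output above.
raw_output = """
{
    "prediction": {
        "pValues": {
            "nonmutagen": 0.204,
            "mutagen": 0.0
        }
    },
    "molecule": {
        "SMILES": "O=Cc1ccc(O)c(OC)c1"
    }
}
"""


def prediction_set(p_values, confidence):
    """Return the labels whose p-value exceeds the significance level.

    Assumption (not taken from the CPSign documentation): a label is kept
    when p-value > 1 - confidence.
    """
    significance = 1.0 - confidence
    return sorted(label for label, p in p_values.items() if p > significance)


record = json.loads(raw_output)
p_values = record["prediction"]["pValues"]

# Print the prediction set at two confidence levels.
for confidence in (0.7, 0.8):
    print(confidence, prediction_set(p_values, confidence))
```

Under this assumption the molecule gets an empty prediction set at confidence 0.7 and the single label 'nonmutagen' at confidence 0.8, since only the p-value 0.204 exceeds the significance level of 0.2.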

The parameters are largely a mix of those for train and predict, except that there are no arguments for choosing the predictor type, as only TCP is available.
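
When online-predict is called from a script, it can help to assemble the argument list in one place. The following is a minimal sketch that only uses flags listed in the usage menu above; the helper build_online_predict_args is hypothetical and not part of CPSign, and the paths are the placeholders from the example. It builds the command without running it. Negative numeric labels would need the escaping described under --labels, which this sketch does not handle.

```python
import shlex


def build_online_predict_args(jar, license_path, train_data, endpoint, labels,
                              smiles=None, predict_file=None, confidences=None):
    """Assemble the argument list for an online-predict run (nothing is executed).

    Hypothetical helper: it only strings together flags from the usage menu.
    """
    if smiles is None and predict_file is None:
        raise ValueError("give a SMILES string, a file to predict, or both")
    args = [
        "java", "-jar", jar, "online-predict",
        "--license", license_path,
        "--train-data", train_data,
        "--endpoint", endpoint,
        # Comma-separated form, as in the example above (no blank spaces).
        "--labels", ",".join(labels),
    ]
    if smiles is not None:
        args += ["--smiles", smiles]
    if predict_file is not None:
        args += ["--predict-file", predict_file]
    if confidences:
        args += ["--confidences", ",".join(str(c) for c in confidences)]
    return args


args = build_online_predict_args(
    jar="cpsign-[version].jar",
    license_path="/path/to/Standard-license.license",
    train_data="data/ames_small.sdf.gz",
    endpoint="Ames test categorisation",
    labels=["mutagen", "nonmutagen"],
    smiles="O=Cc1ccc(O)c(OC)c1",
    confidences=[0.7, 0.8],
)

# Show the command with shell quoting applied, for inspection or logging.
print(shlex.join(args))
```

The resulting list can be passed directly to subprocess.run(args, check=True); passing a list rather than a single string avoids quoting problems with the endpoint name, which contains blank spaces.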