Train command

The train command builds ACP and CCP models that can later be used for prediction.

Parameters

The full usage menu can be retrieved by running the command:

> java -jar cpsign-[version].jar train -h

                                             train
SYNOPSIS
----------------------------------------------------------------------------------------------------
   train [options]
   train @/tmp/runconfigs/parameters.txt [options]
   train @C:\Users\User\runconfigs\parameters.txt [options]


DESCRIPTION
----------------------------------------------------------------------------------------------------
   Train an Aggregated Conformal Predictor (ACP) or Cross Conformal Predictor (CCP). The trained
   models can later be used in predictions. It is also possible to run train for a Transductive
   Conformal Predictor (TCP), but that only means precomputing the training data and setting SVM
   options so that they don't need to be supplied when predicting new examples at a later stage.


OPTIONS
----------------------------------------------------------------------------------------------------
   Input options:
      -t, --trainfile  [URI] or [path]
         Training file in SDF or SMILES format
      -m, --modelfile  [URI] or [path]
         Model file with precomputed data
      -pt, --proper-trainfile  [URI] or [path]
         Training file for molecules that exclusively should be used for training the scoring
         algorithm. In SMILES, SDF or JSON format
      -ct, --calibration-trainfile  [URI] or [path]
         Training file for molecules that exclusively should be used for calibrating the predictions.
         In SMILES, SDF or JSON format
      -rn, --response-name  [text]
         (SDFile) Name of response value to model, should match a property in the train file
         (SMILES file) Name of the column to model, should match header of that column
      -l, --labels  [label1 label2] or [label1, label2]
         Label(s) for response values in classification mode. If a label is a negative numerical
         number, the minus sign must be escaped so that the command parser does not think it's a new
         option flag. E.g.: --labels [-1,1] (no blank-space permitted!) or --labels "\-1" 1

   Training options:
      -nr, --nr-models  [integer]
         Number of ACP models or CCP folds (min 1 for ACP and min 2 for CCP)
         Default: 1
      -cr, --calibration-ratio  [decimal number]
         (ACP) Part of training set used as calibration set, range (0,1)
         Default: 0.2
      --stratified
         (ACP/CCP classification) Stratified splitting of calibration- and proper training set
         Default: false
      --percentiles  [integer]
         (ACP/CCP) The maximum number of molecules used for calculating percentiles. This is a very
         time consuming step in the training. Percentiles are only used for image rendering and
         calculating gradients for predictions, save time and set this flag to 0 if neither of this is
         used.
         Default: 1000
      --seed  [integer]
         Set this flag if an explicit seed should be used in randomization of training data, default
         is using a random seed

   Modeling options:
    * -c, --cptype  [integer]
         Model type: 1) ACP classification, 2) ACP regression, 3) CCP classification, 4) CCP
         regression, 5) TCP classification
      -i, --impl  [text value]
         Options: liblinear or libsvm
         Default: liblinear
      --cost  [number]
         User defined Cost value in SVM training
      --gamma  [number]
         User defined Gamma value in SVM training (only used in libsvm)
      --epsilon  [number]
         User defined Epsilon value in SVM training
      --epsilon-svr  [number]
         User defined epsilon in loss function of epsilon-SVR
         Default: 0.1
      --nonconf-measure  [text value]
         Nonconformity score that should be used, see documentation for clarifications
         (Regression) Options: abs-diff, normalized or log-normalized
         Default: default
      --nonconf-beta  [decimal number]
         If log-normalized nonconformity score is chosen, optionally set a beta value (>= 0)
         Default: 0.0

   Signature generation options:
      -hs, --height-start  [integer]
         Signatures start height
         Default: 1
      -he, --height-end  [integer]
         Signatures end height
         Default: 3
      -sg, --signatures-generator  [text]
         Type of signatures that should be used, note that stereo-signatures take much longer time to
         compute. Options:
          normal (default)
          stereo (experimental mode)
         Default: default

   Output options:
    * -mo, --model-out  [path]
         Model file to generate. Either give a fully specified file including a valid file suffix
         (.cpsign, .osgi, .jar) or a directory where the model should be generated (cpsign will create
         a unique file name for you)
    * -mn, --model-name  [text]
         Model name for the OSGi plugin
      -mc, --model-category  [text]
         The category of the model, will end up as model-endpoint in the OSGi
      -mv, --model-version  [text]
         Optional model version in SemVer versioning format
         Default: 1.0.0_2017-10-09_15:15:57.861

   Encryption options:
      --encrypt  [path]
         Path to the license file that the model should be encrypted by (can be the same as passed to
         --license)
      --two-factor-pin
         If two-factor encryption is used and key has a non-default PIN

   General options:
    * --license  [path]
         Path to license file
      --logfile  [path]
         Path to a user set logfile, will be specific for this run
      --silent
         Silent mode (only print output to logfile)
         Default: false
      --echo
         Echo the input arguments given to CPSign
         Default: false
      -h, --help
         Get help for this command
         Default: false
      --time
         Print wall-time for all individual steps in execution
         Default: false

Example Usage (ACP regression)

> java -jar cpsign-[version].jar train \
   --license /path/to/Standard-license.license \
   -t /path/to/datafile.sdf \
   -rn "BIO" \
   --nr-models 5 \
   -i liblinear \
   --model-out /tmp/datamodels/Chang_BIO.cpsign \
   --model-name Chang_BIO \
   --compress \
   -c 2

Running with Standard license: License registered to: [Name] [Company] . Expiry date is: [Date]

Reading train file and performing signature generation..
Parsed: 34 molecules from SDFile.

Training Aggregated Conformal Predictor with 5 models
 - Trained model 1/5
 - Trained model 2/5
 - Trained model 3/5
 - Trained model 4/5
 - Trained model 5/5

Saving model to file..
Packaged model file: /tmp/datamodels/Chang_BIO.cpsign
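
Classification models are trained in the same way, but the labels must also be supplied. The sketch below is a dry run that only prints an ACP classification command. The data file, the property name "class", the labels -1 and 1, and the model name are hypothetical placeholders. Because the first label is negative, the bracket form without blank space is used, and it is single-quoted so that the shell does not treat the brackets as a filename pattern.

```shell
# Placeholder values (assumptions), replace with your own.
JAR="cpsign-[version].jar"
LICENSE="/path/to/Standard-license.license"
TRAIN_FILE="/path/to/classification-data.sdf"   # hypothetical data file

# Collect the arguments for an ACP classification run (-c 1).
# '[-1,1]' is quoted so the shell passes it to CPSign unchanged.
set -- train \
   --license "$LICENSE" \
   -t "$TRAIN_FILE" \
   -rn "class" \
   --labels '[-1,1]' \
   -c 1 \
   --nr-models 5 \
   --model-out /tmp/datamodels/Example_classification.cpsign \
   --model-name Example_classification

# Dry run: print the command. Remove 'echo' to start the training.
echo java -jar "$JAR" "$@"
```

The alternative form from the usage menu, --labels "\-1" 1, should work equally well if blank space between the labels is preferred.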

"Exclusive" datasets

For the parameters --proper-trainfile and --calibration-trainfile, the same rules apply as for the precompute command; see Exclusive datasets.

Important performance note

By default, CPSign computes percentiles (see Molecule Gradient) when training an ACP or CCP model. In most cases this is the most time-consuming part of training, because it requires a very large number of predictions. Percentiles are only needed for rendering images and calculating molecule gradients, so if you do not intend to use the trained models for either, you can reduce the runtime considerably by setting the --percentiles flag to 0. Alternatively, you can lower the number of molecules used for calculating percentiles below the default of 1000. With LibLinear the effect on runtime is likely to be smaller, since making predictions is very fast.
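
As a rough sketch of this advice, the snippet below repeats the ACP regression example with percentile calculation switched off. The paths are placeholders, and the snippet only prints the command (a dry run) so that it can be inspected before it is executed.

```shell
# Placeholder paths (assumptions), replace with your own files.
JAR="cpsign-[version].jar"
LICENSE="/path/to/Standard-license.license"
TRAIN_FILE="/path/to/datafile.sdf"

# Same ACP regression setup as in the example usage,
# with --percentiles 0 added to skip the percentile step.
set -- train \
   --license "$LICENSE" \
   -t "$TRAIN_FILE" \
   -rn "BIO" \
   --nr-models 5 \
   -c 2 \
   --percentiles 0 \
   --model-out /tmp/datamodels/Chang_BIO.cpsign \
   --model-name Chang_BIO

# Dry run: print the command. Remove 'echo' to start the training.
echo java -jar "$JAR" "$@"
```

A model trained this way cannot be expected to support image rendering or molecule gradients, so keep a non-zero value if either is needed later.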

Nonconformity Measures

See the section Nonconformity measures for information about the available nonconformity measures.