Crossvalidate

Cross-validation can be performed with ACP (classification and regression) or CVAP (classification) predictors. The command performs a k-fold cross-validation, where k is the number of folds.
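To make the procedure concrete, the following is a minimal Python sketch of how a k-fold split works in general. It is not CPSign's implementation: the function name `k_fold_indices` and the way the seed is used are illustrative assumptions. The bounds on k mirror the documented limits for the --cv-folds flag (min 2, max number of training examples).

```python
import random

def k_fold_indices(n_examples, k, seed=None):
    """Yield (train_idx, test_idx) pairs for a k-fold split (illustrative only)."""
    if not 2 <= k <= n_examples:
        raise ValueError("k must be between 2 and the number of examples")
    indices = list(range(n_examples))
    # Shuffle once so that folds are random but reproducible for a given seed
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin: fold sizes differ by at most one
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Every example outside the current fold is used for training
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_examples=10, k=5, seed=42))
for train_idx, test_idx in splits:
    print(len(train_idx), "train /", len(test_idx), "test")
```

Each example ends up in the test set exactly once, so the reported statistics are computed over predictions for the whole dataset.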

Parameters

The full usage menu can be retrieved by running the command:

> java -jar cpsign-[version].jar crossvalidate -h

                                      crossvalidate
SYNOPSIS
------------------------------------------------------------------------------------------
  crossvalidate [options]
  crossvalidate @/tmp/runconfigs/parameters.txt [options]
  crossvalidate @C:\Users\User\runconfigs\parameters.txt [options]


DESCRIPTION
------------------------------------------------------------------------------------------
  Performs a k-fold cross validation of the given dataset. This gives an estimate of how
  good predictions will be given this dataset and these settings.


OPTIONS
------------------------------------------------------------------------------------------
  Input:
    -mi | --model-in                         [URI | path]
       Model file with precomputed data
    -td | --train-data                       [URI | path]
       File with molecules in SMILES, SDF or JSON format
    -md | --model-data                       [URI | path]
       File with molecules that exclusively should be used for training the scoring
       algorithm. In SMILES, SDF or JSON format
    -cd | --calibration-data                 [URI | path]
       File with molecules that exclusively should be used for calibrating predictions. In
       SMILES, SDF or JSON format
    -e  | --endpoint                         [text]
       Endpoint property that should be used for modeling (the endpoint of the model)
    -l  | --labels                           [label label] | [label,label]
       Label(s) for response values in classification mode. If a label is a negative
       numerical number, the minus sign must be escaped so that the command parser does
       not think it's a new option flag. E.g.: --labels [-1,1] (no blank-space permitted!)
       or --labels "\-1" 1

  Predictor:
    -pt | --ptype | --predictor-type         [id | text]
       Predictor type:
         (1) ACP Classification
         (2) ACP Regression
         (5) VAP Classification
       Default: 1
    -ss | --sampling-strategy                [id | text]
       Strategy used for sampling data to aggregated models (non TCP):
         (1) random
         (2) random stratified (classification only)
         (3) folded
         (4) folded stratified (classification only)
       Default: 1
    -nr | --nr-models                        [integer]
       (ACP/VAP) Number of models that should be aggregated
       Default: 1
    -cr | --calibration-ratio                [number]
       (ACP/VAP) Part of training set used as calibration set, range (0,1)
       Default: 0.2
    --seed                                   [integer]
       Set this flag if an explicit seed should be used in randomization of training data,
       default is using a random seed
    --nonconf-measure                        [text]
       (Regression) Nonconformity score that should be used, see documentation for
       clarifications. Options:
         (1) normalized
         (2) log-normalized
         (3) abs-diff
       Default: 1
    --nonconf-beta                           [number]
       If log-normalized nonconformity score is chosen, optionally set a beta value (>= 0)
       Default: 0.0

  Modeling:
    -i  | --impl                             [id | text]
       Scoring algorithm (i.e. underlying machine learning implementation):
         (1) liblinear
         (2) libsvm
       Default: 1
    --cost                                   [number]
       User defined Cost value in SVM training
       Default: 50.0
    --gamma                                  [number]
       User defined Gamma value in SVM training (only used in libsvm)
       Default: 0.002
    --epsilon                                [number]
       User defined tolerance of termination criterion
       Default: 0.001
    --epsilon-svr                            [number]
       User defined epsilon in loss function of epsilon-SVR
       Default: 0.1

  Cross validation:
    -k  | --cv-folds                         [integer]
       Number of folds in cross validation (min 2, max #Training examples)
       Default: 10
    -cp | --calibration-points               [number number ..] | [number,number,..]
       Calibration points used in cross validation, equals confidences in Conformal
       Prediction and observed probabilities for Venn Prediction (each value: min 0, max
       1)
       Default: 0.8
    --calibration-points-width
       (CVAP only) the width around each calibration point that should be considered for
       each calibration point, default is to use 1/[number of calibration points]. Note
       that the parameter is taken as the total width, the intervals will be
       [midpoint-0.5*width, midpoint+0.5*width].

  Signature generation:
    -hs | --height-start                     [integer]
       Signatures start height
       Default: 1
    -he | --height-end                       [integer]
       Signatures end height
       Default: 3
    -sg | --signatures-generator             [id | text]
       Type of signatures that should be used, note that stereo-signatures take much
       longer time to compute. Options:
         (1) default/normal
         (2) stereo (experimental mode)
       Default: 1

  Output:
    -of | --output-format                    [text]
       Output format, options:
         (1) json
         (2) text/plain
         (3) CSV
         (4) TSV
       Default: 2
    -o  | --output                           [path]
       File to write cross validation results to (default is printing to screen)
    --roc
       Output the ROC curve (VAP only), the ROC curve has many points and leads to verbose
       output. Default is to only print the AUC score

  General:
  * --license                                [URI | path]
       Path or URI to license file
    -h  | --help | man
       Get help text
    --short
       Use shorter help text (used together with the --help argument)
    --logfile                                [path]
       Path to a user-set logfile, will be specific for this run
    --silent
       Silent mode (only print output to logfile)
    --echo
       Echo the input arguments given to CPSign
    --progress-bar
       Add a Progress bar in the system error output
    --progress-bar-ascii
       Add a Progress bar in ASCII in the system error output
    --time
       Print wall-time for all individual steps in execution

------------------------------------------------------------------------------------------
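The help text for --calibration-points-width describes each interval as [midpoint-0.5*width, midpoint+0.5*width], with a default width of 1/[number of calibration points]. The Python sketch below only restates that arithmetic. The function name `calibration_intervals` is hypothetical, and how CPSign treats values that fall exactly on an interval boundary is not stated in the help text.

```python
def calibration_intervals(points, width=None):
    """Return a (low, high) interval per calibration point (illustrative only)."""
    if not points:
        raise ValueError("at least one calibration point is required")
    if width is None:
        # Documented default: 1 / [number of calibration points]
        width = 1.0 / len(points)
    # The width is the total width, so half of it goes on each side of the midpoint
    return [(p - 0.5 * width, p + 0.5 * width) for p in points]

points = [0.1, 0.3, 0.5, 0.7, 0.9]
default = calibration_intervals(points)            # width 0.2: intervals tile [0, 1]
narrow = calibration_intervals(points, width=0.1)  # narrower: gaps between intervals

for (low, high), (n_low, n_high) in zip(default, narrow):
    print(f"default [{low:.2f}, {high:.2f}]   narrow [{n_low:.2f}, {n_high:.2f}]")
```

With evenly spaced points the default width makes the intervals cover the whole range. A smaller explicit width leaves gaps, so predictions that fall between intervals would not count towards any calibration point.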

Example Usage

Example (ACP classification):

> java -jar cpsign-[version].jar crossvalidate \
   --license /path/to/Standard-license.license \
   -pt 1 \
   -td /path/to/datafile.sdf \
   -e "Ames test categorisation" \
   -l mutagen,nonmutagen \
   -k 5

Running with Standard License registered to [Name] at [Company]. Expiry
date is [Date]

Randomization seed used: 1531322226985

Reading train file and performing signature generation..
Successfully parsed 123 molecules. Detected labels: 'mutagen'=64, 'nonmutagen'=59.
Generated 1930 new signatures.

Starting the cross validation..
Finished

Cross validation finished with the following stats:
Classification Confidence: 0.951
Classification Credibility: 0.564
Observed Fuzziness: 0.146
Observed Fuzziness (mutagen): 0.105
Observed Fuzziness (nonmutagen): 0.19
Set confidence: 0.8
Accuracy: 0.789
Efficiency: 0.106
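As a rough guide to two of the statistics above, the Python sketch below uses the definitions common in the conformal prediction literature: a label is included in the prediction set when its p-value exceeds 1 - confidence, accuracy is the fraction of examples whose true label is in the set, and observed fuzziness is the mean sum of p-values for the false labels. Whether CPSign computes these in exactly this way is an assumption, and the p-values are made-up numbers, not results from the run above.

```python
def prediction_set(p_values, confidence):
    """Labels whose p-value exceeds the significance level 1 - confidence."""
    significance = 1.0 - confidence
    return {label for label, p in p_values.items() if p > significance}

def accuracy(predictions, true_labels, confidence):
    """Fraction of examples whose true label is inside the prediction set."""
    hits = sum(true in prediction_set(pv, confidence)
               for pv, true in zip(predictions, true_labels))
    return hits / len(true_labels)

def observed_fuzziness(predictions, true_labels):
    """Mean of the summed p-values assigned to the false labels (lower is better)."""
    total = sum(sum(p for label, p in pv.items() if label != true)
                for pv, true in zip(predictions, true_labels))
    return total / len(true_labels)

# Hypothetical p-values for three test molecules (not real CPSign output)
predictions = [
    {"mutagen": 0.70, "nonmutagen": 0.05},
    {"mutagen": 0.15, "nonmutagen": 0.60},
    {"mutagen": 0.10, "nonmutagen": 0.30},
]
true_labels = ["mutagen", "nonmutagen", "mutagen"]

acc = accuracy(predictions, true_labels, confidence=0.8)
fuzz = observed_fuzziness(predictions, true_labels)
print(f"Accuracy: {acc:.3f}")
print(f"Observed Fuzziness: {fuzz:.3f}")
```

For a valid conformal predictor the accuracy should be close to, or above, the set confidence, which is consistent with the 0.789 reported at confidence 0.8 above.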

Example (ACP regression):

> java -jar cpsign-[version].jar crossvalidate \
   --license /path/to/Standard-license.license \
   -pt 2 \
   -td /path/to/datafile.sdf \
   -e BIO \
   --cv-folds 5


Running with Standard License registered to [Name] at [Company]. Expiry
date is [Date]

Randomization seed used: 1531322540354

Reading train file and performing signature generation..
Successfully parsed 34 molecules. Generated 286 new signatures.

Starting the cross validation..
Finished

Cross validation finished with the following stats:
RMSE: 7.593
Set confidence: 0.8
Accuracy: 0.941
Efficiency: 28.883
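For regression, the statistics can be read along the following lines. The Python sketch computes RMSE from point predictions and accuracy as the fraction of true values that fall inside their prediction intervals. It also computes the mean interval width, a common efficiency measure in conformal regression; that the Efficiency value above is an interval width of this kind is an assumption, and the numbers are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of the point predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def interval_accuracy(y_true, intervals):
    """Fraction of true values that fall inside their prediction interval."""
    hits = sum(low <= t <= high for t, (low, high) in zip(y_true, intervals))
    return hits / len(y_true)

def mean_interval_width(intervals):
    """Average width of the prediction intervals (smaller means more informative)."""
    return sum(high - low for low, high in intervals) / len(intervals)

# Hypothetical true values, point predictions and prediction intervals
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 40.0]
intervals = [(5.0, 19.0), (11.0, 25.0), (31.0, 45.0), (33.0, 47.0)]

error = rmse(y_true, y_pred)
acc = interval_accuracy(y_true, intervals)
width = mean_interval_width(intervals)
print(f"RMSE: {error:.3f}")
print(f"Accuracy: {acc:.3f}")
print(f"Mean interval width: {width:.3f}")
```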

Example (AVAP classification):

> java -jar cpsign-[version].jar cv \
   --license /path/to/Standard-license.license \
   -pt 5 \
   -td /path/to/datafile.sdf \
   -e "Ames test categorisation" \
   -l mutagen,nonmutagen \
   -k 5


Running with Standard License registered to [Name] at [Company]. Expiry
date is [Date]

Randomization seed used: 1531323046186

Reading train file and performing signature generation..
Successfully parsed 123 molecules. Detected labels: 'mutagen'=64, 'nonmutagen'=59.
Generated 1930 new signatures.

Starting the cross validation..
Finished

Cross validation finished with the following stats:
Logloss: 0.497
AUC: 0.85
Median interval width: 0.09376
Mean interval width: 0.10487

Calibration curve:
Expected  Observed  Num examples
0.05      0.0       9.0
0.15      0.067     15.0
0.25      0.0       6.0
0.35      0.333     12.0
0.45      0.524     21.0
0.55      0.733     15.0
0.65      1.0       5.0
0.75      0.727     11.0
0.85      0.789     19.0
0.95      0.9       10.0

The VAP outputs a calibration curve, which ideally should be a straight line with slope 1 and intercept 0. For this very small dataset there are too few examples to get a decent calibration curve. If more or fewer points are desired on the calibration curve, pass the desired points to the --calibration-points flag. For instance, running with --calibration-points 0.1:0.9:0.2 gave the following curve instead:

Calibration curve:
Expected  Observed  Num examples
0.1       0.077     13.0
0.3       0.259     27.0
0.5       0.5       32.0
0.7       0.667     30.0
0.9       0.947     19.0
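The calibration curves above can be thought of as a binning of the predicted probabilities. The Python sketch below illustrates the idea under stated assumptions: `expand_range` reads the 0.1:0.9:0.2 argument as start:end:step, which matches the five points in the output above, and `calibration_curve` counts a prediction in a bin when low <= p < high. Both function names and the boundary handling are hypothetical, and the probabilities are made-up numbers.

```python
def expand_range(spec):
    """Expand 'start:end:step' into a list of points, e.g. '0.1:0.9:0.2'."""
    start, end, step = (float(x) for x in spec.split(":"))
    n_steps = int(round((end - start) / step))
    # Round to suppress floating-point noise such as 0.30000000000000004
    return [round(start + i * step, 10) for i in range(n_steps + 1)]

def calibration_curve(probabilities, outcomes, points, width=None):
    """Return (expected, observed, num_examples) per calibration point."""
    if width is None:
        width = 1.0 / len(points)  # documented default width
    rows = []
    for mid in points:
        low, high = mid - 0.5 * width, mid + 0.5 * width
        # Outcomes (1 = positive class) of the predictions falling in this bin
        in_bin = [o for p, o in zip(probabilities, outcomes) if low <= p < high]
        observed = sum(in_bin) / len(in_bin) if in_bin else None
        rows.append((mid, observed, len(in_bin)))
    return rows

points = expand_range("0.1:0.9:0.2")

# Hypothetical predicted probabilities and true outcomes (not real CPSign output)
probabilities = [0.05, 0.12, 0.33, 0.35, 0.52, 0.71, 0.88, 0.93]
outcomes = [0, 0, 0, 1, 1, 1, 1, 1]

rows = calibration_curve(probabilities, outcomes, points)
for expected, observed, n in rows:
    print(expected, observed, n)
```

Fewer, wider bins put more examples behind each observed value, which is why the second curve above is smoother than the first.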