Tune

The tune program optimizes the Support Vector Machine (SVM) parameters C (cost) and gamma. The default values used in CPSign normally work well when the signatures descriptor is used in SVM problems, so tuning is an optional step.

Parameters

The full usage menu can be retrieved by running the command:

> java -jar cpsign-[version].jar tune -h

                                           tune
SYNOPSIS
------------------------------------------------------------------------------------------
  tune [options]
  tune @/tmp/runconfigs/parameters.txt [options]
  tune @C:\Users\User\runconfigs\parameters.txt [options]


DESCRIPTION
------------------------------------------------------------------------------------------
  Perform an exhaustive grid search to find optimal parameter values for Cost and Gamma.
  For regression problems using the log-normalized nonconformity measure, it is also
  possible to grid search for optimizing the beta-parameter of the nonconformity measure.


OPTIONS
------------------------------------------------------------------------------------------
  Input:
    -mi | --model-in                         [URI | path]
       Model file with precomputed data
    -td | --train-data                       [URI | path]
       File with molecules in SMILES, SDF or JSON format
    -md | --model-data                       [URI | path]
       File with molecules that exclusively should be used for training the scoring
       algorithm. In SMILES, SDF or JSON format
    -cd | --calibration-data                 [URI | path]
       File with molecules that exclusively should be used for calibrating predictions. In
       SMILES, SDF or JSON format
    -e  | --endpoint                         [text]
       Endpoint property that should be used for modeling (the endpoint of the model)
    -l  | --labels                           [label label] | [label,label]
       Label(s) for response values in classification mode. If a label is a negative
       numerical number, the minus sign must be escaped so that the command parser does
       not think it's a new option flag. E.g.: --labels [-1,1] (no blank-space permitted!)
       or --labels "\-1" 1

  Predictor:
    -pt | --ptype | --predictor-type         [id | text]
       Predictor type:
         (1) ACP Classification
         (2) ACP Regression
         (5) VAP Classification
       Default: 1
    -ss | --sampling-strategy                [id | text]
       Strategy used for sampling data to aggregated models (non TCP):
         (1) random
         (2) random stratified (classification only)
         (3) folded
         (4) folded stratified (classification only)
       Default: 1
    -nr | --nr-models                        [integer]
       (ACP/VAP) Number of models that should be aggregated
       Default: 1
    -cr | --calibration-ratio                [number]
       (ACP/VAP) Part of training set used as calibration set, range (0,1)
       Default: 0.2
    --seed                                   [integer]
       Set this flag if an explicit seed should be used in randomization of training data,
       default is using a random seed
    --nonconf-measure                        [text]
       (Regression) Nonconformity score that should be used, see documentation for
       clarifications. Options:
         (1) normalized
         (2) log-normalized
         (3) abs-diff
       Default: 1
    --nonconf-beta                           [number]
       If log-normalized nonconformity score is chosen, optionally set a beta value (>= 0)
       Default: 0.0

  Modeling:
    -i  | --impl                             [id | text]
       Scoring algorithm (i.e. underlying machine learning implementation):
         (1) liblinear
         (2) libsvm
       Default: 1

  Grid Search:
    -op | --optimization                     [id | text]
       The criterion that should be used for optimizing the parameters, Options:
         (1) efficiency
         (2) rmse (only for regression)
         (3) logloss (VAP only)
         (4) AUC (VAP only)
       Default: 1
    --gamma-range                            [start:end:step] | [number number ..]
       The range of gamma values that should be used, either specified as a non-empty list
       or using 'start:end:step' (only integers allowed). The values tested will be
       {2^start, 2^(start+step),..,2^end}
       Default: -5:3:2
    --cost-range                             [start:end:step] | [number number ..]
       The range of cost values that should be used, either specified as a non-empty list
       or using 'start:end:step' (only integers allowed). The values tested will be
       {2^start, 2^(start+step),..,2^end}
       Default: -5:15:2
    --beta-values                            [number number ..] | [number,number,..]
       (Regression) If log-normalized nonconformity measure is used, tune the beta value
       by giving a list of values that should be tested. Beta values must be >= 0

  Signature generation:
    -hs | --height-start                     [integer]
       Signatures start height
       Default: 1
    -he | --height-end                       [integer]
       Signatures end height
       Default: 3
    -sg | --signatures-generator             [id | text]
       Type of signatures that should be used, note that stereo-signatures take much
       longer time to compute. Options:
         (1) default/normal
         (2) stereo (experimental mode)
       Default: 1

  Output:
    -a  | --all
       Print all grid search results (otherwise will just print the optimal result)
    -o  | --output                           [path]
       File to write all grid search results to (default is to print to screen)

  Encryption:
    --two-factor-pin
       If two-factor encryption is used and key has a non-default PIN

  General:
  * --license                                [URI | path]
       Path or URI to license file
    -h  | --help | man
       Get help text
    --short
       Use shorter help text (used together with the --help argument)
    --logfile                                [path]
       Path to a user-set logfile, will be specific for this run
    --silent
       Silent mode (only print output to logfile)
    --echo
       Echo the input arguments given to CPSign
    --time
       Print wall-time for all individual steps in execution

------------------------------------------------------------------------------------------

Picking parameter search space

If a larger parameter space than the default needs to be searched, the ranges of the C and gamma values can be set explicitly. Each range is given either as a list of the desired values or as a triplet of the form start:end:step, passed to the flags --gamma-range and --cost-range. With the triplet form, the parameter values that are tried are the set {2^start, 2^(start+step),..,2^end}. When searching a large parameter space, a coarse-grained search with a larger step size can be run first; once the region of interest has been found, lower the step size and run a fine-grained search within that region. Tuning is currently only available in ACP and CVAP modes, but the parameters obtained in ACP/CVAP are transferable to the TCP case.
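To make the start:end:step notation concrete, the following shell sketch lists the values a triplet expands to. The helper name expand_range is hypothetical and not part of CPSign; the sketch also assumes that the expansion stops at the last exponent not exceeding end, which the usage text does not spell out for ranges where the step does not divide the interval evenly.

```shell
#!/bin/sh
# expand_range START:END:STEP
# Hypothetical helper (not part of CPSign): prints 2^START, 2^(START+STEP), ...
# for every exponent up to END, one value per line.
expand_range() {
  printf '%s\n' "$1" | awk -F: '
    # Reject a missing, zero or negative step to avoid an endless loop.
    $3 <= 0 { print "step must be a positive integer" > "/dev/stderr"; exit 1 }
    { for (e = $1; e <= $2; e += $3) printf "2^%d = %g\n", e, 2 ^ e }
  '
}

# The default --gamma-range from the usage text.
expand_range -5:3:2
```

For the default --gamma-range of -5:3:2 this lists five values, from 2^-5 = 0.03125 up to 2^3 = 8. The default --cost-range of -5:15:2 expands to eleven values in the same way. Listing the values before a run is a quick way to check how much a larger step size thins out the grid for a coarse first pass.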

Parameter tuning β

The smoothing factor, β, of the logarithmically normalized nonconformity measure introduced in Nonconformity measures can be optimized with the tune program. This is done slightly differently than for the C and gamma values: instead of a range, you pass a list of the β values that you wish to test to the --beta-values flag (this requires a regression problem with the logarithmically normalized nonconformity measure selected):

> java -jar cpsign-[version].jar tune \
  --beta-values 0.0 0.1 0.2 0.5 \
  ..
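As a rough illustration of how the β search fits together with the other flags, the sketch below assembles a possible command line and prints it for review instead of running it. All flag names are taken from the usage text above; the file names, the endpoint name and the helper name beta_tune_command are placeholders assumed for this example, and passing log-normalized as text to --nonconf-measure is likewise an assumption based on the option list.

```shell
#!/bin/sh
# Hypothetical placeholders: substitute your own files and endpoint name.
TRAIN_DATA="train.sdf"
ENDPOINT="Activity"
LICENSE_FILE="cpsign.license"

# Prints a tune command line for a beta search on a single line.
# Predictor type 2 is ACP Regression according to the usage text.
beta_tune_command() {
  printf '%s ' java -jar "cpsign-[version].jar" tune \
    --license "$LICENSE_FILE" \
    --train-data "$TRAIN_DATA" \
    --endpoint "$ENDPOINT" \
    --predictor-type 2 \
    --nonconf-measure log-normalized \
    --beta-values 0.0 0.1 0.2 0.5
  printf '\n'
}

beta_tune_command
```

Printing the command first makes it easy to confirm that the regression predictor type and the log-normalized measure are both set before the β values are handed to the real program.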