Usage - ACP & CCP

Aggregated Conformal Prediction and Cross Conformal Prediction

ACP vs CCP

Cross-Conformal Prediction (CCP) divides the dataset into a user-defined number of folds, k. A total of k models are built and trained, where each model is trained on k-1 of the folds and uses the remaining fold as its calibration set. The training and calibration sets are picked in such a way that each fold is the calibration set for exactly one model and part of the training set for the remaining k-1 models.

The API for the two classes is the same; it differs only in how the class is instantiated from the CPSignFactory. The only internal difference is how the datasets for each individual ICP are picked: CCP picks them according to the given k folds, whereas ACP picks them randomly for each ICP.
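The fold rotation used by CCP can be sketched in plain Java. This is a minimal illustration and not CPSign code: the class CcpFoldSketch, its Split holder and the "record index modulo k" fold assignment are all assumptions made for the example (a real implementation would typically shuffle the records first).

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: index-based CCP fold rotation, not part of the CPSign API. */
final class CcpFoldSketch {

    /** Record indices used for proper training and for calibration in one ICP. */
    static final class Split {
        final List<Integer> training = new ArrayList<>();
        final List<Integer> calibration = new ArrayList<>();
    }

    /** Builds k splits over records 0..numRecords-1; fold i is the calibration set of split i. */
    static List<Split> splits(int numRecords, int k) {
        if (k < 2 || k > numRecords) {
            throw new IllegalArgumentException("k must be in [2, numRecords]");
        }
        List<Split> result = new ArrayList<>();
        for (int fold = 0; fold < k; fold++) {
            Split split = new Split();
            for (int record = 0; record < numRecords; record++) {
                // Assumption: record r belongs to fold (r mod k)
                if (record % k == fold) {
                    split.calibration.add(record);
                } else {
                    split.training.add(record);
                }
            }
            result.add(split);
        }
        return result;
    }

    public static void main(String[] args) {
        for (Split s : splits(10, 5)) {
            System.out.println("train=" + s.training + " calib=" + s.calibration);
        }
    }
}
```

Every record ends up in the calibration set of exactly one split and in the training set of the other k-1 splits, which is the property described above.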

Instantiation

Instantiation is done through the CPSignFactory factory methods. The general procedure is to create the underlying Conformal Prediction implementation and then wrap it in a Signatures "wrapper" that provides some utility functionality. For ACP Regression the procedure is as follows (given that you've instantiated the CPSignFactory):

// CPSignFactory should be instantiated
CPSignFactory factory = ...

ACPRegression acpLibSVM = factory.createACPRegressionLibSVM(); //LibSVM
ACPRegression acpLibLinear = factory.createACPRegressionLibLinear(); //LibLinear
SignaturesCPRegression signACP = factory.createSignaturesCPRegression(acpLibSVM, 1, 3);

For ACP Classification:

// Use the CPSignFactory (instantiated as above)
ACPClassification acpLibSVM = factory.createACPClassificationLibSVM(); //LibSVM
ACPClassification acpLibLinear = factory.createACPClassificationLibLinear(); //LibLinear
SignaturesCPClassification signACP = factory.createSignaturesCPClassification(acpLibSVM, 1, 3);

Loading data

The SignaturesCPRegression and SignaturesCPClassification classes give access to utility methods for loading data (fromChemFile and fromMolsIterator). These methods are only accessible with the full license. fromChemFile loads data from SMILES files, SDFiles and JSON files (see Input formats in CPSign). You can load data from multiple files by calling fromChemFile or fromMolsIterator once for each file or data source; in this way CPSign can merge multiple data sources in multiple formats. The following examples show how this is done in SignaturesCPRegression; usage for SignaturesCPClassification is equivalent, except that you also have to add labels.

signACPRegression.fromChemFile(dataFile.getURI(), "activityValue"); // SDFile or JSON file
// OR
signACPRegression.fromChemFile(dataFile.getURI(), null); //SMILES-file with activity in second column
// OR
signACPRegression.fromMolsIterator(molsIterator); // Iterator<Pair<IAtomContainer, Double>> - any source!

CPSign version 0.6.0 introduced the possibility to use partitions of data exclusively for either training of models (proper training) or calibration. At the API level this is handled by the new Dataset.java class, which holds a single dataset, and by the Problem.java class, which now holds three datasets: dataset, calibrationExclusive and modelingExclusive. These can be manipulated directly if desired. If the datasets are kept in separate files, this is instead handled at the SignaturesCPClassification and SignaturesCPRegression level, where fromChemFile and fromMolsIterator accept the enum RecordType, as follows:

// Use "dataFile" for only modeling
signACPRegression.fromChemFile(dataFile.getURI(), "activityValue", RecordType.MODELING_EXCLUSIVE); // SDFile or JSON file

// Use records in molsIterator for only calibration set
signACPRegression.fromMolsIterator(molsIterator, RecordType.CALIBRATION_EXCLUSIVE);

At the Problem level it is possible to read in a complete dataset using fromChemFile or fromMolsIterator, and then use the getters and setters of Problem and Dataset to control which records end up in each dataset.
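To make the role of the three datasets concrete, here is a small sketch of how exclusive records could be combined with the ordinary dataset when one ICP is assembled. It is not the real Problem or Dataset class: ThreeDatasetSketch, its fields and both methods are hypothetical, and records are represented as plain strings.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: mimics the three-dataset idea, not the real Problem/Dataset classes. */
final class ThreeDatasetSketch {
    final List<String> dataset = new ArrayList<>();              // may serve either role
    final List<String> modelingExclusive = new ArrayList<>();    // proper training only
    final List<String> calibrationExclusive = new ArrayList<>(); // calibration only

    /** Proper training set: the training share of 'dataset' plus all modeling-exclusive records. */
    List<String> properTraining(List<String> trainingShareOfDataset) {
        List<String> out = new ArrayList<>(trainingShareOfDataset);
        out.addAll(modelingExclusive);
        return out;
    }

    /** Calibration set: the calibration share of 'dataset' plus all calibration-exclusive records. */
    List<String> calibration(List<String> calibrationShareOfDataset) {
        List<String> out = new ArrayList<>(calibrationShareOfDataset);
        out.addAll(calibrationExclusive);
        return out;
    }
}
```

The point of the sketch is only that exclusive records never cross over: a modeling-exclusive record is never used for calibration, and vice versa.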

fromChemFile

If you pass an SDFile to fromChemFile you also need to give the name of the property where the activity of the molecules is recorded. For a SMILES file, you can pass null as the property if the desired activity is in the second column; if it is in a different column, pass the header of that column as the property. Read SMILES file format to see what requirements we put on SMILES files.

Training models

In ACP and CCP problems the underlying ICP models have to be trained. This is done by calling the train() method, which uses the parameters that were set when you instantiated the object from the factory class. The trained models are then re-used for every prediction, rather than being re-trained each time as in transductive prediction. ACP training is based on two parameters: calibrationRatio (the fraction of the dataset that is set aside as the calibration set, in the range [0.0, 1.0]) and numModels (how many models are trained and aggregated). For CCP it is enough to give the number of folds; the calibration part is always 1/numFolds of the dataset.
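As a rough illustration of what calibrationRatio and numModels control in ACP, the sketch below draws one random calibration set per model. It is not CPSign's implementation: AcpSplitSketch, the seed parameter and the rounding of the calibration set size are assumptions, and the sketch rejects ratios of exactly 0.0 or 1.0 because one of the two sets would then be empty.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustrative only: random ACP calibration sets, not part of the CPSign API. */
final class AcpSplitSketch {

    /** Returns numModels calibration index lists; indices not listed form the proper training set. */
    static List<List<Integer>> calibrationSets(int numRecords, double calibrationRatio,
                                               int numModels, long seed) {
        if (calibrationRatio <= 0.0 || calibrationRatio >= 1.0) {
            throw new IllegalArgumentException("calibrationRatio must be strictly between 0 and 1");
        }
        // Assumption: the calibration set size is rounded to the nearest whole record
        int calibrationSize = (int) Math.round(numRecords * calibrationRatio);
        Random rng = new Random(seed);
        List<List<Integer>> result = new ArrayList<>();
        for (int model = 0; model < numModels; model++) {
            List<Integer> indices = new ArrayList<>();
            for (int i = 0; i < numRecords; i++) {
                indices.add(i);
            }
            // A fresh random permutation for every ICP
            Collections.shuffle(indices, rng);
            result.add(new ArrayList<>(indices.subList(0, calibrationSize)));
        }
        return result;
    }
}
```

With 10 records, calibrationRatio 0.2 and numModels 3, each of the three ICPs gets 2 calibration records and 8 proper training records. Unlike CCP, nothing guarantees that every record is used for calibration exactly once.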

Nonconformity Measure

See the general information about the available nonconformity measures in the section Nonconformity measures. Available nonconformity measures can be accessed through the NonconfMeasureFactory, or you can define your own nonconformity score by implementing either the NonconfMeasureClassification or the NonconfMeasureRegression interface. Setting the nonconformity measure is done at the ACP/CCP level and is straightforward:

ACPRegression acpReg = ...
acpReg.setNonconfMeasure(NonconfMeasureFactory.getAbsDiffMeasureRegression());
// ...
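For intuition, an absolute-difference measure for regression scores each calibration example by how far the prediction is from the observed value. The sketch below shows that idea in isolation. The interface RegressionNonconformity and its score method are hypothetical stand-ins for this example and do not reflect the actual signature of NonconfMeasureRegression.

```java
/** Illustrative only: absolute-difference nonconformity, not the CPSign interfaces. */
final class AbsDiffSketch {

    /** Hypothetical stand-in for a regression nonconformity measure. */
    interface RegressionNonconformity {
        double score(double observed, double predicted);
    }

    /** Nonconformity score = |observed - predicted|. */
    static final RegressionNonconformity ABS_DIFF =
            (observed, predicted) -> Math.abs(observed - predicted);

    /** Scores every calibration example with the given measure. */
    static double[] calibrationScores(RegressionNonconformity measure,
                                      double[] observed, double[] predicted) {
        if (observed.length != predicted.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        double[] scores = new double[observed.length];
        for (int i = 0; i < observed.length; i++) {
            scores[i] = measure.score(observed[i], predicted[i]);
        }
        return scores;
    }
}
```

A larger score means the example conforms less well to what the model has learned; these calibration scores are what later determine how wide the prediction intervals become.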

Note The nonconformity score for classification must be handled with care, as LibLinear and LibSVM handle binary classification as a special case. They only produce one predicted probability: the probability of the example belonging to the first class. You have to take care of this when implementing the methods in the interface. In the default implementation we use the negated predicted probability of the example belonging to the first class when handling the second class.

Note When using a custom nonconformity measure (i.e. one you've implemented yourself), you have to set the same nonconformity measure on the predictor if you wish to load a model that you've saved to file. There is no default nonconformity measure, so if you forget to set one the code throws an exception when you try to predict. There is thus no risk of getting strange predictions because of a missing measure.

Computing Percentiles

If you wish to render images of predicted molecules or are interested in the molecule gradient, it's a good idea to normalize the output. The gradients are dataset dependent, and without normalization it's hard to assess what the gradient values actually mean. Read how we calculate gradient to get a deeper understanding of how the molecule gradients are calculated. In short: use the computePercentiles method if you are interested in image rendering or molecule gradients; otherwise skip it, because it's computationally heavy.
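The general idea of percentile-based normalization can be sketched as follows. This is not how computePercentiles is implemented (its algorithm is not described here); PercentileSketch, the nearest-rank percentile definition and the clipping to [-1, 1] are all assumptions chosen to illustrate why a dataset-wide reference value makes gradients comparable.

```java
import java.util.Arrays;

/** Illustrative only: percentile-based scaling of gradient values, not CPSign's algorithm. */
final class PercentileSketch {

    /** Nearest-rank percentile (p in [0, 100]) of the given values. */
    static double percentile(double[] values, double p) {
        if (values.length == 0 || p < 0.0 || p > 100.0) {
            throw new IllegalArgumentException("need at least one value and p in [0, 100]");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        // Rank 0 (only possible for p = 0) maps to the smallest value
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Scales a gradient by a positive reference bound and clips the result to [-1, 1]. */
    static double normalize(double gradient, double bound) {
        if (bound <= 0.0) {
            throw new IllegalArgumentException("bound must be positive");
        }
        return Math.max(-1.0, Math.min(1.0, gradient / bound));
    }
}
```

In this sketch the bound would be a high percentile of the gradient magnitudes observed over the dataset, so that a normalized value near 1 or -1 means "large relative to what is typical for this dataset". Collecting those magnitudes requires computing gradients for many molecules, which is consistent with the step being computationally heavy.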

Save/Load models & Make predictions

CPSign saves both precomputed data and trained models in OSGi/jar format. Save the current model by calling any of the saveModel or saveModelEncrypted methods (models can be saved in plain text, compressed or encrypted). If you have not performed any training, the output model will just contain the precomputed data that was generated when the training file was parsed in the fromChemFile method. If you have trained ICPs, saveModel will save the ICPs and the list of signatures.

Loading models is done by calling the addModel method, which works for both precomputed data and trained models. The method will throw an exception if:

  • The model is of a different type than the Signatures Predictor (i.e. regression vs classification model)
  • Data has already been loaded and the signatures/labels of the new model do not match the current signatures

signACP.fromChemFile( trainingFile.toURI(), "activity" ); // precomputes the data

// save precomputed data (can be loaded from any other regression Signatures Predictor)
signACP.saveModel( precomputedModel, compress );

// train ICPs
signACP.train();

// now 'saveModel' will save the trained ICPs

signACP.saveModel( modelFile, compress );

// OR encrypted
signACP.saveModelEncrypted( modelFile, encryptionSpec );

// ...
// load in models at a later stage
signACP.addModel( modelFile, encryptionSpec ); // second argument: encryptionSpec, or null
signACP.addModel( modelFile2, encryptionSpec ); // can load several models

// Prediction can be done in three ways
// By giving confidences of the prediction (a list of confidences)
List<Double> confidences = Arrays.asList(0.7, 0.8, 0.9);
IAtomContainer testMol = ... // read in using parseSMILESFile or parseSDFFile
ACPRegressionResult result1 = signACP.predict(testMol, confidences);
// By giving a distance from the molecule, and predicting the interval and confidence of that interval
ACPRegressionResult result2 = signACP.predict(testMol, 5.0);
// Predict the significant signature in the molecule
SignificantSignature ss = signACP.predictSignificantSignature(testMol);

The output from ACP and CCP predict is an ACPRegressionResult, which has getters for the interval, confidence, distance and ŷ (the predicted value from the underlying SVMs).
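How interval, confidence, distance and ŷ relate to each other can be illustrated with a standard inductive conformal regression calculation on absolute-difference calibration scores. The sketch below is not CPSign's implementation: IcpIntervalSketch and its methods are hypothetical, and aggregating the individual ICPs by taking the median of their ŷ values and interval half-widths is an assumption made for the example.

```java
import java.util.Arrays;

/** Illustrative only: interval/confidence arithmetic for conformal regression, not CPSign code. */
final class IcpIntervalSketch {

    /** Distance (interval half-width) at the given confidence for one ICP's calibration scores. */
    static double halfWidth(double[] calibrationScores, double confidence) {
        double[] sorted = calibrationScores.clone();
        Arrays.sort(sorted);
        // Use the ceil(confidence * (n + 1))-th smallest calibration score
        int rank = (int) Math.ceil(confidence * (sorted.length + 1));
        if (rank > sorted.length) {
            return Double.POSITIVE_INFINITY; // too few calibration examples for this confidence
        }
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Confidence of the interval [yHat - distance, yHat + distance] for one ICP. */
    static double confidenceFor(double[] calibrationScores, double distance) {
        int covered = 0;
        for (double score : calibrationScores) {
            if (score <= distance) {
                covered++;
            }
        }
        return covered / (double) (calibrationScores.length + 1);
    }

    static double median(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("need at least one value");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Returns {lower, upper}; assumes the ICPs are aggregated by median. */
    static double[] aggregatedInterval(double[] yHats, double[][] scoresPerIcp, double confidence) {
        double[] widths = new double[scoresPerIcp.length];
        for (int i = 0; i < scoresPerIcp.length; i++) {
            widths[i] = halfWidth(scoresPerIcp[i], confidence);
        }
        double yHat = median(yHats);
        double width = median(widths);
        return new double[] {yHat - width, yHat + width};
    }
}
```

The two prediction modes are mirror images of each other: given a confidence you look up the distance, and given a distance you look up the confidence.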

Cross validate to get statistics

To estimate the accuracy, validity and RMSE (in the regression case) you can run a cross validation on a dataset using the settings that you want to use. The cross validation does not need to follow any training; training is done as part of the execution, but no models are saved. You only need to load data and, if you wish to evaluate anything other than the default settings, set your desired SVM parameters.

// instantiate and load data, then it's possible to run crossValidate
CVResult result = signACP.crossValidate(cvFolds, nrModels, calibrationRatio, confidence);
result.getEfficiency();
result.getValidity();
result.getRmse(); // only for regression case
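To clarify what such statistics measure, the sketch below computes them from a set of regression predictions. It is not CPSign's CVResult: CvStatsSketch is hypothetical, validity is taken to be the fraction of true values that fall inside their predicted interval, and efficiency is assumed here to be the mean interval width (the exact definition used by getEfficiency is not stated above).

```java
/** Illustrative only: common cross-validation statistics for conformal regression. */
final class CvStatsSketch {

    /** Fraction of true values that fall inside [lower, upper]. */
    static double validity(double[] trueValues, double[] lower, double[] upper) {
        int inside = 0;
        for (int i = 0; i < trueValues.length; i++) {
            if (trueValues[i] >= lower[i] && trueValues[i] <= upper[i]) {
                inside++;
            }
        }
        return inside / (double) trueValues.length;
    }

    /** Assumed efficiency measure: mean interval width (narrower is better). */
    static double meanIntervalWidth(double[] lower, double[] upper) {
        double sum = 0.0;
        for (int i = 0; i < lower.length; i++) {
            sum += upper[i] - lower[i];
        }
        return sum / lower.length;
    }

    /** Root mean squared error of the point predictions. */
    static double rmse(double[] trueValues, double[] predicted) {
        double sumSquares = 0.0;
        for (int i = 0; i < trueValues.length; i++) {
            double diff = trueValues[i] - predicted[i];
            sumSquares += diff * diff;
        }
        return Math.sqrt(sumSquares / trueValues.length);
    }
}
```

For a well-calibrated predictor, the validity at confidence c should be close to c; the efficiency then tells you how informative the intervals are at that confidence.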