Aggregated Conformal Prediction and Cross Conformal Prediction
Cross-Conformal Prediction (CCP) divides the dataset into a user-defined number of folds, k. There will be k models built and trained, where each model is trained on k-1 of the folds, leaving the last fold as calibration set. The training and calibration sets are picked in such a way that each fold ends up as calibration set once and in the training set of the remaining k-1 models.
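The fold assignment described above can be illustrated with a small, self-contained sketch. It is not CPSign's implementation: the class `CcpFoldSketch`, the `Split` holder and the `kFoldSplits` method are hypothetical names used only for illustration, and the shuffling strategy is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative sketch only; names and shuffling strategy are assumptions, not CPSign code.
class CcpFoldSketch {

    /** One ICP split: record indices used for proper training and for calibration. */
    static final class Split {
        final List<Integer> training;
        final List<Integer> calibration;

        Split(List<Integer> training, List<Integer> calibration) {
            this.training = training;
            this.calibration = calibration;
        }
    }

    /** Shuffles indices 0..n-1, deals them into k folds and uses each fold as calibration set once. */
    static List<Split> kFoldSplits(int n, int k, long seed) {
        if (k < 2 || k > n) {
            throw new IllegalArgumentException("k must be in the range [2, n]");
        }
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));

        List<Split> splits = new ArrayList<>();
        for (int fold = 0; fold < k; fold++) {
            List<Integer> training = new ArrayList<>();
            List<Integer> calibration = new ArrayList<>();
            for (int pos = 0; pos < n; pos++) {
                // The record at shuffled position pos belongs to fold (pos % k)
                if (pos % k == fold) {
                    calibration.add(indices.get(pos));
                } else {
                    training.add(indices.get(pos));
                }
            }
            splits.add(new Split(training, calibration));
        }
        return splits;
    }

    public static void main(String[] args) {
        for (Split s : kFoldSplits(10, 5, 42L)) {
            System.out.println("training=" + s.training + " calibration=" + s.calibration);
        }
    }
}
```

Every record index appears in exactly one calibration list and in the training lists of the other k-1 splits.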
The API for the two classes is the same, differing only in how the class is instantiated.
Internally, the only difference is how the datasets for each individual ICP are picked: CCP picks them
according to the given k folds, while ACP picks them randomly for each ICP.
Instantiation is done through the CPSignFactory factory methods. The general instantiation procedure
is to create the underlying Conformal Prediction implementation and then wrap that implementation with
a Signatures "wrapper" that gives some utility functionality. For ACP Regression the procedure is as follows
(given that you've instantiated the CPSignFactory):
    // CPSignFactory should be instantiated
    CPSignFactory factory = ...
    ACPRegression acpLibSVM = factory.createACPRegressionLibSVM(); // LibSVM
    ACPRegression acpLibLinear = factory.createACPRegressionLibLinear(); // LibLinear
    SignaturesCPRegression signACP = factory.createSignaturesCPRegression(acpLibSVM, 1, 3);
For ACP Classification:
    // Either use the CPSignFactory
    ACPClassification acpLibSVM = factory.createACPClassificationLibSVM(); // LibSVM
    ACPClassification acpLibLinear = factory.createACPClassificationLibLinear(); // LibLinear
    SignaturesCPClassification signACP = factory.createSignaturesCPClassification(acpLibSVM, 1, 3);
The SignaturesCPRegression and SignaturesCPClassification classes give access to some utility
methods for loading data. These methods are only accessible with the full license.
fromChemFile loads data from SMILES, SDFiles and JSON files (see Input formats in CPSign). Note that you
can load data from multiple files in this way, simply by calling fromMolsIterator once for each file/data source.
CPSign can thus merge multiple data sources, in multiple formats. Here are some code examples
of how this is done in SignaturesCPRegression. Usage is equivalent for SignaturesCPClassification,
except that you also have to add labels.
    signACPRegression.fromChemFile(dataFile.getURI(), "activityValue"); // SDFile or JSON file
    // OR
    signACPRegression.fromChemFile(dataFile.getURI(), null); // SMILES file with activity in second column
    // OR
    signACPRegression.fromMolsIterator(molsIterator); // Iterator<Pair<IAtomContainer, Double>> - any source!
CPSign version 0.6.0 introduced the possibility to use partitions of data exclusively for either training of models (proper training)
or for calibration. This is handled at the API level by introducing the Dataset.java class, which holds a single dataset.
The Problem.java class now holds three datasets: dataset, calibrationExclusive and modelingExclusive. These can be manipulated
directly if you would like to do so. If the datasets are kept in separate files, this is instead handled at the
SignaturesCPRegression level with the fromChemFile and fromMolsIterator methods that take the enum
RecordType, as follows:
    // Use "dataFile" for only modeling
    signACPRegression.fromChemFile(dataFile.getURI(), "activityValue", RecordType.MODELING_EXCLUSIVE); // SDFile or JSON file
    // Use records in molsIterator for only calibration set
    signACPRegression.fromMolsIterator(molsIterator, RecordType.CALIBRATION_EXCLUSIVE);
At the Problem level it is possible to read in a complete dataset and
then use the getters and setters for
Dataset to manipulate which records should be in each dataset.
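The intended effect of the three datasets on a single ICP can be sketched as follows. This is a minimal illustration under stated assumptions, not CPSign code: the class `ExclusiveDataSketch` and the method `buildSplit` are hypothetical, the general dataset is assumed to be shuffled already, and the way CPSign actually combines the datasets internally may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the Problem/Dataset classes of CPSign.
class ExclusiveDataSketch {

    /**
     * Builds one ICP split and returns [properTraining, calibration].
     * Records in modelingExclusive are only ever used for training, records in
     * calibrationExclusive are only ever used for calibration, and the general
     * dataset (assumed to be shuffled already) is divided between the two.
     */
    static <T> List<List<T>> buildSplit(List<T> dataset,
                                        List<T> modelingExclusive,
                                        List<T> calibrationExclusive,
                                        int numCalibrationFromDataset) {
        if (numCalibrationFromDataset < 0 || numCalibrationFromDataset > dataset.size()) {
            throw new IllegalArgumentException("invalid calibration size");
        }
        // The first records of the general dataset go to calibration
        List<T> calibration = new ArrayList<>(dataset.subList(0, numCalibrationFromDataset));
        calibration.addAll(calibrationExclusive);

        // The remaining records go to proper training
        List<T> training = new ArrayList<>(dataset.subList(numCalibrationFromDataset, dataset.size()));
        training.addAll(modelingExclusive);

        List<List<T>> split = new ArrayList<>();
        split.add(training);
        split.add(calibration);
        return split;
    }

    public static void main(String[] args) {
        List<List<Integer>> split = buildSplit(
                List.of(1, 2, 3, 4), List.of(10), List.of(20, 21), 1);
        System.out.println("training=" + split.get(0) + " calibration=" + split.get(1));
    }
}
```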
If you pass an SDFile to fromChemFile you also need to give the name of the property where the
activity of the molecules is recorded. In the case of a SMILES file, you can pass
null as property if the desired activity is in the second column of the file. If the desired
activity is in a different column, simply send the header of that column as property.
Read SMILES file format to see what requirements we put on SMILES files.
In ACP and CCP problems the underlying ICP models have to be trained. This is done by calling the
train() method, which uses the parameters set when you instantiated the object from the factory class.
These models are later re-used, rather than re-trained for every prediction as in transductive prediction.
The ACP training is based on two parameters: calibrationRatio (the fraction of the dataset
that is set aside as calibration set, in the range [0.0, 1.0])
and numModels, which controls how many models are trained and aggregated. For CCP it is enough
to give the number of folds that should be used, and the calibration part will always be 1/numFolds.
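As a rough illustration of how calibrationRatio and numModels interact in ACP, the sketch below draws an independent random calibration set for each model. It is a sketch under assumptions rather than CPSign's implementation: the class `AcpSplitSketch`, the method `calibrationSets` and the rounding of the calibration size are hypothetical choices.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative sketch only; names and rounding are assumptions, not CPSign code.
class AcpSplitSketch {

    /**
     * Draws one random calibration set per model. Each list holds the record
     * indices used for calibration; all other indices form the proper training set.
     */
    static List<List<Integer>> calibrationSets(int n, double calibrationRatio, int numModels, long seed) {
        if (calibrationRatio < 0.0 || calibrationRatio > 1.0) {
            throw new IllegalArgumentException("calibrationRatio must be in the range [0.0, 1.0]");
        }
        int calibrationSize = (int) Math.round(n * calibrationRatio);
        Random random = new Random(seed);

        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }

        List<List<Integer>> result = new ArrayList<>();
        for (int model = 0; model < numModels; model++) {
            // A fresh shuffle per model: unlike CCP folds, calibration sets may overlap between models
            Collections.shuffle(indices, random);
            result.add(new ArrayList<>(indices.subList(0, calibrationSize)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> sets = calibrationSets(20, 0.2, 3, 1L);
        for (List<Integer> calibration : sets) {
            System.out.println("calibration=" + calibration);
        }
    }
}
```

With CCP the corresponding calibration size is fixed by the number of folds, for example 5 folds give a calibration part of 1/5 of the data in every model.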
See the general information about the available nonconformity measures in the section Nonconformity measures.
Available nonconformity measures can be accessed in the NonconfMeasureFactory, or you can define your own
nonconformity score by implementing the corresponding interface (NonconfMeasureRegression for regression).
Setting the nonconformity measure is done at the ACP/CCP level and is straightforward:
    ACPRegression acpReg = ...
    acpReg.setNonconfMeasure(NonconfMeasureFactory.getAbsDiffMeasureRegression());
Note: The nonconformity score for classification must be handled with care, as LibLinear and LibSVM handle binary classification as a special case. They only produce one predicted probability: the probability of the example belonging to the first class. You have to take this into account when implementing the methods in the interface. The default implementation uses the negated predicted probability of the example belonging to the first class when handling the second class.
Note: When using a custom nonconformity measure (i.e. one you've implemented yourself), you have to set the same nonconformity measure on the predictor if you wish to load a model that you've saved to file. There is no default nonconformity measure, so if you forget to set one the code will throw an exception when trying to predict. You can therefore not get strange predictions because a measure was left out.
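To make the idea of a regression nonconformity measure concrete, here is a minimal, generic sketch of an absolute-difference measure and of how calibration scores are derived from it. The interface `RegressionNonconformity` is hypothetical and is not CPSign's NonconfMeasureRegression interface, whose methods are not shown in this document; only the formula alpha = |y - ŷ| is the standard absolute-difference score.

```java
import java.util.Arrays;

// Illustrative sketch only; the interface below is hypothetical, not a CPSign type.
class NonconformitySketch {

    /** Maps an observed value and a predicted value to a nonconformity score. */
    interface RegressionNonconformity {
        double score(double observed, double predicted);
    }

    /** Absolute-difference measure: alpha = |y - yHat|. */
    static final RegressionNonconformity ABS_DIFF =
            (observed, predicted) -> Math.abs(observed - predicted);

    /** Computes the nonconformity scores of a calibration set, sorted in ascending order. */
    static double[] calibrationScores(RegressionNonconformity measure,
                                      double[] observed, double[] predicted) {
        if (observed.length != predicted.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        double[] scores = new double[observed.length];
        for (int i = 0; i < observed.length; i++) {
            scores[i] = measure.score(observed[i], predicted[i]);
        }
        Arrays.sort(scores);
        return scores;
    }

    public static void main(String[] args) {
        double[] scores = calibrationScores(ABS_DIFF,
                new double[] {1.0, 2.0, 5.0}, new double[] {1.5, 2.0, 3.0});
        System.out.println(Arrays.toString(scores));
    }
}
```

A custom measure would supply a different `score` function, for example one that scales the absolute error by an estimate of the prediction difficulty.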
If you wish to render images of predicted molecules or are interested in the molecule gradient, it's a
good idea to normalize the output. The gradients are dataset dependent, and without normalization it's hard
to assess what the gradient values actually mean. Read how we calculate gradient to get a deeper
understanding of how the molecule gradients are calculated. In short: normalize
if you are interested in image rendering or molecule gradients, otherwise skip it because it's computationally heavy.
CPSign saves both precomputed and trained models in OSGi/jar-format. Saving the current model
is done by calling any of the save methods
(here you can choose to save models in plain text, compressed or encrypted).
If you have not performed any training, the output model will just be the precomputed data that
was generated when you parsed the training file.
If you have trained ICPs,
saveModel will save the ICPs and the list of signatures.
Loading models is done by calling the
addModel method, which works both for precomputed
data and for trained models. The method will throw an exception in case:
- The model is of a different type than the Signatures Predictor (i.e. regression vs classification model)
- Data has already been loaded and the signatures/labels of the new model do not match the current signatures
    signACP.fromChemFile(trainingFile.toURI(), "activity"); // precomputes the data

    // save precomputed data (can be loaded from any other regression Signatures Predictor)
    signACP.saveModel(precomputedModel, compress);

    // train ICPs
    signACP.train();

    // now 'saveModel' will save the trained ICPs
    signACP.saveModel(modelFile, compress);
    // OR encrypted
    signACP.saveModelEncrypted(modelFile, encryptionSpec);

    // ...

    // load in models at a later stage (pass null instead of encryptionSpec for unencrypted models)
    signACP.addModel(modelFile, encryptionSpec);
    signACP.addModel(modelFile2, encryptionSpec); // can load several models

    // Prediction can be done in three ways

    // By giving confidences of the prediction (a list of confidences)
    List<Double> confidences = Arrays.asList(0.7, 0.8, 0.9);
    IAtomContainer testMol = ... // read in using parseSMILESFile or parseSDFFile
    ACPRegressionResult result1 = signACP.predict(testMol, confidences);

    // By giving a distance from the molecule, and predicting the interval and confidence of that interval
    ACPRegressionResult result2 = signACP.predict(testMol, 5.0);

    // Predict the significant signature in the molecule
    SignificantSignature ss = signACP.predictSignificantSignature(testMol);
The resulting output from ACP and CCP
predict is an ACPRegressionResult, which has getters for the interval,
confidence, distance and the ŷ (which is the predicted value from the underlying SVMs).
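The relation between confidence, distance and interval can be sketched for a single ICP using the absolute-difference nonconformity measure. This is a textbook-style illustration, not CPSign's code: the class `IcpIntervalSketch` and its methods are hypothetical, and how CPSign aggregates the outputs of several ICPs into one ACPRegressionResult is not covered here.

```java
import java.util.Arrays;

// Illustrative sketch only, for one ICP with scores alpha = |y - yHat|; not CPSign code.
class IcpIntervalSketch {

    /**
     * Half-width of the prediction interval at the given confidence, i.e. the
     * ceil(confidence * (n + 1))-th smallest calibration score. Returns infinity
     * when the calibration set is too small to support the confidence.
     */
    static double halfWidth(double[] sortedScores, double confidence) {
        int n = sortedScores.length;
        // The small offset guards against floating-point noise in confidence * (n + 1)
        int rank = (int) Math.ceil(confidence * (n + 1) - 1e-9);
        if (rank > n) {
            return Double.POSITIVE_INFINITY;
        }
        return sortedScores[Math.max(rank, 1) - 1];
    }

    /** Confidence that can be claimed for the interval [yHat - distance, yHat + distance]. */
    static double confidenceForDistance(double[] sortedScores, double distance) {
        int covered = 0;
        for (double score : sortedScores) {
            if (score <= distance) {
                covered++;
            }
        }
        return covered / (double) (sortedScores.length + 1);
    }

    public static void main(String[] args) {
        double[] scores = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9};
        Arrays.sort(scores); // both methods expect ascending order
        double yHat = 3.0;
        double d = halfWidth(scores, 0.8);
        System.out.println("interval at 0.8: [" + (yHat - d) + ", " + (yHat + d) + "]");
        System.out.println("confidence for distance 0.45: " + confidenceForDistance(scores, 0.45));
    }
}
```

The two methods are inverses in the sense that `confidenceForDistance` returns the largest confidence whose interval half-width does not exceed the given distance.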
To estimate the accuracy, validity and RMSE (in the regression case) it is possible to do a cross validation on a dataset using the settings that you want to use. The cross validation does not need to follow any training: the training procedure is performed within the execution, but no models are saved. You only need to load data and, if you wish to check anything other than the default settings, set your desired SVM parameters.
    // instantiate and load data, then it's possible to run cross_validate
    CVResult result = signACP.crossValidate(cvFolds, nrModels, calibrationRatio, confidence);
    result.getEfficiency();
    result.getValidity();
    result.getRmse(); // only for regression case
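As a rough guide to what such metrics measure, the sketch below computes them from held-out predictions. It is not how CVResult is implemented: the class `CvMetricsSketch` and its methods are hypothetical, and taking efficiency to be the mean interval width is an assumption, since this document does not define what getEfficiency reports.

```java
// Illustrative sketch only; not the CVResult class of CPSign.
class CvMetricsSketch {

    /** Fraction of test records whose true value falls inside the predicted interval. */
    static double validity(double[] yTrue, double[] lower, double[] upper) {
        int covered = 0;
        for (int i = 0; i < yTrue.length; i++) {
            if (yTrue[i] >= lower[i] && yTrue[i] <= upper[i]) {
                covered++;
            }
        }
        return covered / (double) yTrue.length;
    }

    /** Mean interval width, used here as a stand-in for efficiency (an assumption). */
    static double meanIntervalWidth(double[] lower, double[] upper) {
        double sum = 0.0;
        for (int i = 0; i < lower.length; i++) {
            sum += upper[i] - lower[i];
        }
        return sum / lower.length;
    }

    /** Root mean squared error of the point predictions. */
    static double rmse(double[] yTrue, double[] yHat) {
        double sumSquared = 0.0;
        for (int i = 0; i < yTrue.length; i++) {
            double error = yTrue[i] - yHat[i];
            sumSquared += error * error;
        }
        return Math.sqrt(sumSquared / yTrue.length);
    }

    public static void main(String[] args) {
        double[] yTrue = {1.0, 2.0, 3.0, 4.0};
        double[] yHat = {1.0, 3.0, 3.0, 2.0};
        double[] lower = {0.0, 2.5, 2.0, 3.0};
        double[] upper = {2.0, 3.5, 4.0, 5.0};
        System.out.println("validity=" + validity(yTrue, lower, upper));
        System.out.println("mean width=" + meanIntervalWidth(lower, upper));
        System.out.println("rmse=" + rmse(yTrue, yHat));
    }
}
```

For a valid predictor the validity should be close to, or above, the confidence passed to the cross validation, while narrower intervals indicate more informative predictions.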