Conformal Prediction Background

Conformal Prediction is described in greater depth in [4-5], but here is a short introduction to the basics.

Conformal Prediction light introduction

Conformal prediction is a well-established mathematical framework that delivers object-based predictions, an alternative approach to domain applicability estimation. The main advantages of conformal prediction are:

  • Object-based prediction: in regression, more difficult examples get a larger prediction interval and easier examples a narrower one.
  • The user sets a desired confidence, and the framework then guarantees that the predictions are at least of that confidence. The alternative is to estimate the accuracy of the model without being able to change it in any way.
  • Classification is done in a Mondrian fashion, meaning that the predictions are class-based, which solves problems with uneven class distributions.

Conformal Predictor Validity

Validity as computed in CPSign

In the regression case, CPSign computes the validity of a model as the fraction of examples whose prediction interval contains the true or observed value. Validity is thus in the range [0,1], where 0 means that no examples were correctly predicted and 1 means that all of them were.

In the classification case, CPSign computes the validity of a model as the fraction of examples whose prediction region contains the true or observed label, regardless of whether the region contains a single label or multiple ones.
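
The two validity definitions can be sketched in a few lines of Python. This is only an illustration of the definitions above, not CPSign code: the function names and the toy data are assumptions made for the example.

```python
def regression_validity(intervals, y_true):
    """Fraction of examples whose interval (low, high) contains the true value."""
    hits = sum(1 for (low, high), y in zip(intervals, y_true) if low <= y <= high)
    return hits / len(y_true)


def classification_validity(regions, labels):
    """Fraction of examples whose prediction region contains the true label."""
    hits = sum(1 for region, label in zip(regions, labels) if label in region)
    return hits / len(labels)


# Toy regression data: the second interval misses its true value.
intervals = [(1.0, 2.0), (0.5, 0.9), (3.0, 4.5), (2.0, 2.4)]
y_true = [1.5, 1.0, 4.0, 2.2]
print(regression_validity(intervals, y_true))  # 0.75

# Toy classification data: a region may hold one label, several, or none.
regions = [{"active"}, {"active", "inactive"}, {"inactive"}, set()]
labels = ["active", "inactive", "active", "inactive"]
print(classification_validity(regions, labels))  # 0.5
```

Note that the two-label region counts as valid because it contains the true label, even though it is not a single-label prediction.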

Conformal Predictor Efficiency

Efficiency as computed in CPSign

Efficiency in the regression case is defined as the size of the prediction interval. A smaller prediction interval is more efficient than a larger one. The Efficiency value that CPSign outputs from cross-validation and grid search/parameter tuning is the median prediction interval size. A smaller Efficiency value is thus considered better, or more efficient.

Efficiency in the classification case is defined as the number of prediction regions that contain two or more labels (in binary classification a region can contain at most 2 labels, but the general case can have more than 2). The Efficiency value is computed relative to the number of tested molecules, and thus lies in the range [0,1]. A smaller Efficiency value is thus considered better, or more efficient.
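
As a rough sketch of the two efficiency definitions (again with hypothetical function names and toy data, not CPSign's own code):

```python
from statistics import median


def regression_efficiency(intervals):
    """Median width of the prediction intervals (smaller is better)."""
    return median(high - low for low, high in intervals)


def classification_efficiency(regions):
    """Fraction of prediction regions holding two or more labels (smaller is better)."""
    multi = sum(1 for region in regions if len(region) >= 2)
    return multi / len(regions)


intervals = [(1.0, 2.0), (0.5, 1.0), (3.0, 5.0)]  # widths 1.0, 0.5 and 2.0
print(regression_efficiency(intervals))  # 1.0

regions = [{"a"}, {"a", "b"}, {"b"}, {"a", "b"}]
print(classification_efficiency(regions))  # 0.5
```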

Conformal Prediction Gradient

In CPSign we have implemented a way to compute the gradient for the prediction of an example. Here we aim to clarify how this is done. The calculation is performed as follows for each Inductive Conformal Predictor or Transductive Conformal Predictor.

  1. We do a normal prediction with the example as it is.


    Regression: the normal prediction gives a predicted value ŷN.

    Classification: the normal prediction gives a p-value for each class. The class with the largest p-value is taken as the selected class, and only the p-value for this class is considered from this point on. We call this pvalN.

  2. For each present feature f in the example we increase f by an amount called stepsize*. For each altered example we make a new prediction. For an example with N non-zero features, we make N of these predictions, each with only one feature being altered.

  3. From the altered predictions we compute the gradient value of each altered feature.

    Regression: for each of the N altered predictions we get a predicted value ŷA(i), i = 1, 2, ..., N, that may differ from the non-altered prediction in step 1. The difference quotient (ŷA(i) - ŷN)/stepsize can then be interpreted as the gradient for feature i.

    Classification: for each of the N altered predictions we get the p-value for the selected class, pvalA(i), i = 1, 2, ..., N. This p-value may differ from the non-altered p-value pvalN. The difference quotient (pvalA(i) - pvalN)/stepsize can then be interpreted as the gradient for feature i.

* stepsize can be changed by setting the value in IACPClassificationImpl, IACPRegressionImpl, ICCPClassificationImpl or ICCPRegressionImpl respectively.
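
The finite-difference procedure in steps 1 to 3 can be sketched in Python as follows. This is a minimal illustration, not CPSign's implementation: `feature_gradient`, the `predict` callable, the dict-based sparse example and the default `stepsize` of 0.25 are all assumptions made for the example. For classification, `predict` would return the p-value of the selected class instead of a regression value.

```python
def feature_gradient(predict, example, stepsize=0.25):
    """Finite-difference gradient over the non-zero features of one example.

    `example` maps feature index -> non-zero value (hypothetical sparse format);
    `predict` maps such a dict to a single number.
    """
    baseline = predict(example)  # step 1: prediction on the unaltered example
    gradient = {}
    for feature, value in example.items():
        altered = dict(example)  # step 2: alter exactly one feature
        altered[feature] = value + stepsize
        # step 3: difference quotient for this feature
        gradient[feature] = (predict(altered) - baseline) / stepsize
    return gradient


# Toy linear model standing in for a trained predictor (assumption).
weights = {0: 2.0, 3: -1.0, 7: 0.5}


def toy_predict(x):
    return sum(weights.get(f, 0.0) * v for f, v in x.items())


example = {0: 1.0, 3: 2.0}  # feature 7 is absent, so it is never altered
grad = feature_gradient(toy_predict, example)
print(grad)  # for a linear model the gradient equals the weights: {0: 2.0, 3: -1.0}
```

An example with N non-zero features costs N + 1 predictions, which is why only the present features are altered.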

Note 1 (classification) If the gradient value for feature i is positive, the altered prediction has given a larger pvalA(i) than pvalN, meaning that adding more of this feature would make the prediction more likely to be of the selected class. Note that we only compute the gradient for the class that has the highest p-value to start with.

Note 1 (regression) If the gradient value for feature i is positive, the altered prediction has given a larger regression value, meaning that adding more of this feature would move the prediction towards a higher response value, and vice versa if the gradient value is negative.

Note 2 The gradients are not normalized at this level: classification gradient values can be within [-1,1] and regression values can potentially be anywhere in (-∞,∞). At the Signatures level we can provide normalized gradients, see Molecule Gradient.

Aggregated Conformal Predictor- and Cross-conformal Predictor gradients

Aggregated Conformal Predictors (ACPs) and Cross-conformal Predictors (CCPs) use several Inductive Conformal Predictors (ICPs), each of which produces its own gradient. It is fully possible that the gradients contradict each other. In CPSign we typically use the median value produced by the ICPs so that individual ICPs do not get too much influence on the results. Accordingly, when computing the gradient of the prediction, we use the median value for each feature.
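
A per-feature median over the ICP gradients might look like the sketch below. The function name and data layout are hypothetical, and treating a feature that is missing from one ICP's gradient as 0.0 is an assumption of this example.

```python
from statistics import median


def aggregate_gradients(icp_gradients):
    """Per-feature median over a list of gradients, one dict per ICP."""
    features = set().union(*icp_gradients)
    return {
        f: median(g.get(f, 0.0) for g in icp_gradients)  # missing -> 0.0 (assumption)
        for f in sorted(features)
    }


# Three ICPs that disagree on the sign of both features.
icp_gradients = [
    {0: 0.2, 3: -0.5},
    {0: 0.4, 3: 0.1},
    {0: -0.1, 3: -0.3},
]
print(aggregate_gradients(icp_gradients))  # {0: 0.2, 3: -0.3}
```

The single ICP that reports a positive gradient for feature 3 does not change the aggregated sign, which is the robustness the median is chosen for.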

Molecule Gradient

At the Signatures level we can infer further knowledge from the raw gradients produced in Conformal Prediction Gradient. If a Signatures Problem is predicted, each feature is in reality mapped to one or multiple atoms in a molecule. At the Signatures level, we convert the signature-based gradients into atom-based gradients. This is done in the following steps:

  1. Each atom is initially set to have a gradient value of 0.
  2. For every index i in the signature-based gradient (gS(i)), get the set of atoms Ai that are part of that signature. For every atom in Ai, add the gradient value gS(i) to the atom's current gradient.

The molecule gradient holds the total contribution of each atom (from all the signatures that it is part of), and each atom gradient can take any real value.

Note that classification gradients will no longer have any restrictions in their range.
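
The two steps above amount to a simple accumulation, sketched here in Python. The names `atom_gradients`, `signature_gradient` and `signature_atoms`, and the integer atom indices, are assumptions made for illustration.

```python
def atom_gradients(num_atoms, signature_gradient, signature_atoms):
    """Sum signature-based gradient values onto the atoms of each signature."""
    atoms = [0.0] * num_atoms  # step 1: every atom starts at 0
    for sig, g in signature_gradient.items():
        for atom in signature_atoms[sig]:  # step 2: add g to each atom in A_i
            atoms[atom] += g
    return atoms


signature_gradient = {0: 0.5, 1: -0.2, 2: 0.8}    # gS(i) per signature index
signature_atoms = {0: {0, 1}, 1: {1, 2}, 2: {1}}  # A_i per signature index

grads = atom_gradients(4, signature_gradient, signature_atoms)
print(grads)  # atom 1 accumulates 0.5 - 0.2 + 0.8, about 1.1; atom 3 stays at 0.0
```

Atom 1 ends up above 1 even though every individual value lies within [-1,1], which illustrates why the classification range restriction no longer holds at this level.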

Normalization of gradients

The molecule gradients can theoretically take any value on the real axis for each atom. The gradient values are also model- and dataset-dependent, so we need some sort of normalization to be able to gain more information. CPSign handles this in the CLI by predicting the gradient for a large set of molecules from the training set. All gradient values are then ordered, and the values at the lowest 10 % (lowPercentile) and the highest 10 % (highPercentile) are taken as the lower and upper limits of the gradient values. The range [lowPercentile, highPercentile] is then used to linearly normalize the results to the range [-1, 1], where positive and negative values are handled separately (each with their own linear normalization). Once normalized, the results are much easier to interpret: values close to 0 have low influence on the prediction, values close to +1 have a significant positive influence and values close to -1 have a significant negative influence.

Note If you are running CPSign through the API and wish to get normalized gradients, you have to call the computePercentiles method after you have trained your models. This has to be done with a "large enough" set of molecules, otherwise the lower and upper values might be misleading and affect the normalization. If you have not computed the lower and upper values, the molecule gradients will be given non-normalized and a logger.warning message will be printed.
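
One possible reading of the percentile normalization is sketched below. It is not CPSign's implementation: the Python function `compute_percentiles` here is a stand-in written for this example, and the exact percentile indexing, the clipping of values outside the limits to -1 or +1, and the handling of values whose limit has the wrong sign are all assumptions.

```python
def compute_percentiles(values, percent=10):
    """Return the values at the lowest and highest `percent` of the ordered data."""
    ordered = sorted(values)
    k = (len(ordered) - 1) * percent // 100  # simple index-based percentile (assumption)
    return ordered[k], ordered[-1 - k]


def normalize(value, low, high):
    """Scale negatives by |low| and positives by high, clipped to [-1, 1]."""
    if value < 0 and low < 0:
        return max(value / abs(low), -1.0)
    if value > 0 and high > 0:
        return min(value / high, 1.0)
    return 0.0  # zero, or no usable limit on that side (assumption)


# Atom gradients collected from a (here very small) reference set of molecules.
values = [-2.0, -1.0, -0.5, -0.2, 0.0, 0.1, 0.4, 1.0, 3.0, 6.0, 8.0]
low, high = compute_percentiles(values)
print(low, high)  # -1.0 6.0

print(normalize(-0.5, low, high))  # -0.5
print(normalize(3.0, low, high))   # 0.5
print(normalize(8.0, low, high))   # 1.0 (clipped)
```

Because the negative and positive sides use different scale factors, a gradient of -0.5 and a gradient of 3.0 end up equally far from 0 in this toy data set.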

Significant Signature

The Significant Signature of a molecule is simply the signature that produced the largest absolute gradient value in the signature-based gradient. It is then easy to get the mapping from the signature to the atoms that belong to it.

In the classification case, the significant signature atom mapping is a map from atom to the class that has the highest p-value.
In the regression case, the significant signature atom mapping is a map from atom to 1 (simply indicating that the atom is part of the significant signature).

The Significant Signature atom mapping can be used in image rendering (coloring atoms either by the class with the highest p-value or simply coloring the atoms that belong to the significant signature).
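
Selecting the significant signature and building a membership mapping can be sketched as follows. All names and data here are hypothetical and only illustrate the definition above.

```python
def significant_signature(signature_gradient):
    """Return the signature index with the largest absolute gradient value."""
    return max(signature_gradient, key=lambda s: abs(signature_gradient[s]))


signature_gradient = {0: 0.5, 1: -0.9, 2: 0.8}
signature_atoms = {0: {0, 1}, 1: {1, 2}, 2: {1}}

best = significant_signature(signature_gradient)
print(best)  # 1, since |-0.9| is the largest absolute value

# Membership mapping: each atom of the significant signature maps to 1.
atom_mapping = {atom: 1 for atom in sorted(signature_atoms[best])}
print(atom_mapping)  # {1: 1, 2: 1}
```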

Nonconformity measures

A central concept of Conformal Prediction is the nonconformity measure, which is simply a way to compute how different an example is compared to the other examples in the dataset (see [4-5] for a thorough explanation). Here we simply state that there are different ways to compute the nonconformity of an example, and in the regression case we support three different measures described below. Furthermore, in the API it is possible to add your own custom measure.

Definitions The nonconformity value of example i is generally denoted as αi. In the regression case it is common to train and use an error model generally denoted as e. The predicted error for an example i is denoted êi. Moreover, the true label of an example i is denoted yi and the predicted label for the same example is denoted ŷi.

Absolute difference measure

When using the prediction error as the nonconformity score for an example i, we do not need to train any error model. The nonconformity of example i, denoted αi, is calculated as:

αi = | yi - ŷi |

It is worth noting that because this nonconformity measure does not require an error model, training takes roughly half the runtime of any nonconformity measure that does require one.

Normalized measure

By using a normalized nonconformity measure, one hopes to acquire a better prediction result by also training an error model that will increase the interval size for "difficult examples" and decrease the interval size for "easy examples". The nonconformity is calculated as follows:

αi = | yi - ŷi | / êi

The error model is then trained on the absolute error for each example, | yi - ŷi |.
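
To show how the normalized measure changes the interval size per example, here is a minimal sketch of the standard inductive conformal regression recipe: compute the normalized scores on a calibration set, pick the score at rank ceil((n + 1) · confidence), and scale it by the predicted error of the new example. The function names and numbers are assumptions made for the example, and the predictions are given as plain lists rather than coming from trained models.

```python
import math


def normalized_scores(y_true, y_pred, err_pred):
    """alpha_i = |y_i - y_hat_i| / e_hat_i for each calibration example."""
    return [abs(y - p) / e for y, p, e in zip(y_true, y_pred, err_pred)]


def calibration_score(scores, confidence):
    """Score at rank ceil((n + 1) * confidence) among the sorted calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * confidence)
    if rank > n:
        return math.inf  # too few calibration examples for this confidence
    return sorted(scores)[rank - 1]


def interval(y_hat, e_hat, alpha):
    """Prediction interval y_hat +/- alpha * e_hat."""
    half = alpha * e_hat
    return (y_hat - half, y_hat + half)


# Calibration set: true values, model predictions and predicted errors (toy data).
y_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y_pred = [1.5, 2.0, 2.0, 4.5, 5.25, 5.0, 7.5]
err_pred = [0.5, 0.5, 1.0, 0.25, 0.5, 2.0, 0.5]

scores = normalized_scores(y_true, y_pred, err_pred)
alpha = calibration_score(scores, confidence=0.75)
print(alpha)  # 1.0

print(interval(10.0, 0.5, alpha))  # "easy" example: (9.5, 10.5)
print(interval(10.0, 2.0, alpha))  # "difficult" example: (8.0, 12.0)
```

With the absolute difference measure the same recipe applies with every predicted error fixed to 1, so all examples get the same interval width.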

Logarithmically normalized measure

A logarithmic normalization, proposed in [7], instead uses the logarithm of the predicted error in the error model. It also introduces a smoothing factor, β, that is used for "smoothing" the interval sizes, making the small intervals a bit larger and the very large intervals a bit smaller. The smoothing factor β must itself be chosen; as found in [7], a β of 0.5 was already too big. Optimization of β can be done with the tune command.

αi = | yi - ŷi | / (exp(êi) + β), β>=0

The error model is then trained on the natural logarithm of the absolute error, ln(| yi - ŷi |).
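
The effect of β can be illustrated with a short sketch. The function names are hypothetical, and the predicted log-errors are chosen by hand rather than produced by a trained error model.

```python
import math


def log_normalized_score(y, y_hat, log_err_pred, beta):
    """alpha_i = |y_i - y_hat_i| / (exp(e_hat_i) + beta), with beta >= 0."""
    return abs(y - y_hat) / (math.exp(log_err_pred) + beta)


def scale(log_err_pred, beta):
    """Factor exp(e_hat) + beta that the interval half-width is proportional to."""
    return math.exp(log_err_pred) + beta


print(log_normalized_score(3.0, 2.0, 0.0, beta=0.0))   # 1.0
print(log_normalized_score(3.0, 2.0, 0.0, beta=0.25))  # 0.8

# Ratio between the scale of a "difficult" and an "easy" example.
easy, difficult = math.log(0.1), math.log(10.0)
ratio_without = scale(difficult, 0.0) / scale(easy, 0.0)
ratio_with = scale(difficult, 0.5) / scale(easy, 0.5)
print(ratio_without, ratio_with)  # roughly 100 versus roughly 17.5
```

A larger β pulls the scale factors of easy and difficult examples closer together, which is the smoothing described above: with a very large β the measure approaches the unnormalized behaviour where all intervals have similar size.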