# Conformal Prediction Background

Conformal Prediction is described in greater depth in [4-5]; what follows is a short introduction to the basics.


## Conformal Prediction light introduction

Conformal prediction is a well-established mathematical framework that delivers object-based predictions, an alternative approach to domain applicability estimation. The main advantages of conformal prediction are:

- Object-based prediction, giving a larger **prediction interval** (regression) for more difficult examples and a narrower one for easier examples.
- The user can set a desired **confidence**, and the framework then guarantees that the predictions will be at least of that confidence. This replaces estimating the accuracy of the model without being able to change it in any way.
- Classification is done in a **Mondrian** fashion, meaning that the predictions are **class-based**, which solves problems with uneven class distributions.

## Conformal Predictor Validity

### Validity as computed in CPSign

In the **regression** case, CPSign computes the validity of a model as the fraction of examples whose prediction interval
contains the true or observed value. Validity thus lies in the range [0,1], where 0 means that
no examples were correctly predicted and 1 means that all of them were.

In the **classification** case, CPSign computes the validity of a model as the fraction of examples whose predicted region contains
the true or observed label, regardless of whether the predicted region contains a single label or multiple labels.
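The two validity definitions above can be illustrated with a minimal Python sketch. This is not CPSign code: the function names and the toy data are assumptions made for illustration, with intervals represented as `(low, high)` tuples and prediction regions as sets of labels.

```python
def regression_validity(intervals, y_true):
    """Fraction of examples whose (low, high) interval contains the true value."""
    hits = sum(1 for (low, high), y in zip(intervals, y_true) if low <= y <= high)
    return hits / len(y_true)


def classification_validity(regions, labels):
    """Fraction of examples whose predicted label set contains the true label."""
    hits = sum(1 for region, label in zip(regions, labels) if label in region)
    return hits / len(labels)


# Toy data (hypothetical): three of the four intervals cover the true value.
intervals = [(1.0, 2.0), (0.5, 0.9), (3.0, 4.5), (2.0, 2.5)]
y_true = [1.5, 1.0, 4.0, 2.2]
reg_validity = regression_validity(intervals, y_true)

# Toy data (hypothetical): two of the four regions contain the true label.
# A region with several labels still counts as correct if the true label is in it.
regions = [{"A"}, {"A", "B"}, {"B"}, set()]
labels = ["A", "B", "A", "B"]
clf_validity = classification_validity(regions, labels)
```

Here `reg_validity` is 0.75 and `clf_validity` is 0.5.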

## Conformal Predictor Efficiency

### Efficiency as computed in CPSign

Efficiency in the regression case is defined as the size of the prediction interval. A *smaller* prediction interval is more
efficient than a larger one. The Efficiency that CPSign outputs from cross-validation and grid search/parameter tuning
is the median prediction interval size. A smaller Efficiency value is thus considered better, or *more efficient*.

Efficiency in the classification case is defined as the number of prediction regions that contain two or more labels
(in binary classification a region contains at most 2 labels, but the general case allows more than 2). The Efficiency
value is computed relative to the number of tested molecules, and thus lies in the range [0,1]. A smaller Efficiency
value is thus considered better, or *more efficient*.
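As a rough illustration of the two efficiency definitions, the following Python sketch computes the median interval size for regression and the fraction of multi-label regions for classification. The function names and data are hypothetical and do not come from CPSign.

```python
from statistics import median


def regression_efficiency(intervals):
    """Median width of the (low, high) prediction intervals; smaller is better."""
    return median(high - low for low, high in intervals)


def classification_efficiency(regions):
    """Fraction of prediction regions holding two or more labels; smaller is better."""
    multi_label = sum(1 for region in regions if len(region) >= 2)
    return multi_label / len(regions)


# Toy data (hypothetical): interval widths are 1.0, 0.5 and 2.0.
intervals = [(1.0, 2.0), (0.5, 1.0), (3.0, 5.0)]
reg_efficiency = regression_efficiency(intervals)

# Toy data (hypothetical): two of the four regions contain two labels.
regions = [{"A"}, {"A", "B"}, {"B"}, {"A", "B"}]
clf_efficiency = classification_efficiency(regions)
```

Here `reg_efficiency` is 1.0 and `clf_efficiency` is 0.5.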

## Conformal Prediction Gradient

In CPSign we have implemented a way to compute the *gradient* for the prediction of an example. Here we aim to clarify how this
is done. The calculation is performed as follows for each Inductive Conformal Predictor or Transductive Conformal Predictor.

1. We make a normal prediction with the example as it is.
   - Regression: the normal prediction gives a predicted value *ŷ*_{N}.
   - Classification: the normal prediction gives a p-value for each class. The class with the largest p-value is taken as the *selected class*, and only the p-value for this class is considered from this point on. We call this *pval*_{N}.
2. For each present feature *f* in the example, we increase *f* by an amount called *stepsize*\*. For each *altered example* we make a new prediction. For an example with *N* non-zero features, we make *N* of these predictions, each with only one feature altered.
   - Regression: for each of the *N* altered predictions, we get a predicted value *ŷ*_{A(i)} {i = 1,2,..*N*} that *might* differ from the non-altered prediction in step 1. The difference in predicted values, (*ŷ*_{A(i)} - *ŷ*_{N})/*stepsize*, can then be interpreted as the gradient for feature *i*.
   - Classification: for each of the *N* altered predictions, we get the p-value for the *selected class*: *pval*_{A(i)} {i = 1,2,..*N*}. This p-value *might* differ from the non-altered p-value *pval*_{N}. The difference (*pval*_{A(i)} - *pval*_{N})/*stepsize* can then be interpreted as the gradient for feature *i*.

\* *stepsize* can be changed by setting the value in `IACPClassificationImpl`, `IACPRegressionImpl`, `ICCPClassificationImpl` or `ICCPRegressionImpl` respectively.
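The procedure above amounts to a forward finite difference over the non-zero features. The Python sketch below shows the idea under stated assumptions: `feature_gradient`, `toy_predict`, the sparse `{feature_index: value}` representation and the default `stepsize` are all hypothetical and are not part of the CPSign API. For classification, `predict` would return the p-value of the class that was selected in the unaltered prediction.

```python
def feature_gradient(predict, example, stepsize=0.1):
    """Forward finite-difference gradient over the non-zero features of an example.

    predict: hypothetical callable mapping {feature_index: value} to a number
             (predicted value in regression, p-value of the selected class
             in classification).
    """
    baseline = predict(example)  # step 1: normal prediction
    gradient = {}
    for index, value in example.items():
        if value == 0:
            continue  # only features present in the example are altered
        altered = dict(example)  # copy, so only one feature changes at a time
        altered[index] = value + stepsize
        gradient[index] = (predict(altered) - baseline) / stepsize
    return gradient


# Toy linear model standing in for a trained predictor (assumption).
weights = {0: 2.0, 3: -1.0, 7: 0.5}


def toy_predict(example):
    return sum(weights.get(i, 0.0) * v for i, v in example.items())


example = {0: 1.0, 3: 2.0}
grad = feature_gradient(toy_predict, example, stepsize=0.5)
```

For a linear toy model the gradient simply recovers the weights of the present features, so `grad` maps feature 0 to 2.0 and feature 3 to -1.0. Feature 7 is absent from the example and therefore gets no gradient.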

**Note 1 (classification)** If the gradient value for feature *i* is **positive**, the altered prediction has given a larger *pval*_{A(i)} than *pval*_{N}, meaning that
adding more of this feature would make the prediction **more likely** to be of the selected class. Note that we only compute the gradient for the class that
has the highest p-value to start with.

**Note 1 (regression)** If the gradient value for feature *i* is **positive**, the altered prediction has given a larger regression value, meaning that adding more of this feature
would move the prediction towards a higher response value, and vice versa if the gradient value is negative.

**Note 2** The gradients are not normalized at this level: classification gradient values can be within [-1,1] and regression values can potentially be anywhere in (-∞,∞).
On the Signatures level we can provide normalized gradients, see Molecule Gradient.

### Aggregated Conformal Predictor- and Cross-conformal Predictor gradients

Aggregated Conformal Predictors (ACPs) and Cross-conformal Predictors (CCPs) use several Inductive Conformal Predictors (ICPs), each of which produces its own gradient.
It is fully possible that the gradients contradict each other. In CPSign we typically use the median value produced by the ICPs so that individual ICPs do not get too much
influence on the results. When computing the gradient of the prediction, we use the median value for *each feature*.
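A per-feature median over the ICP gradients could look like the following Python sketch. The function name and data are hypothetical, and the sketch assumes that every ICP reports a gradient for the same set of features, since all of them predict the same example.

```python
from statistics import median


def aggregate_gradients(icp_gradients):
    """Per-feature median over a list of {feature_index: gradient} dicts.

    Assumes all ICPs report the same feature indices.
    """
    features = icp_gradients[0].keys()
    return {f: median(g[f] for g in icp_gradients) for f in features}


# Three hypothetical ICPs; the third one disagrees strongly on feature 0.
icp_gradients = [
    {0: 0.20, 3: -0.10},
    {0: 0.30, 3: 0.05},
    {0: -0.90, 3: -0.20},
]
aggregated = aggregate_gradients(icp_gradients)
```

The outlying value -0.90 from the third ICP does not dominate the result: `aggregated` maps feature 0 to 0.20 and feature 3 to -0.10.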

### Molecule Gradient

At the *Signatures level* we can infer further knowledge from the raw gradients produced in Conformal Prediction Gradient. If a Signatures Problem is predicted, each feature
is in reality mapped to one or multiple atoms in a molecule. At the signatures level, we convert the *signature based* gradients into *atom based* gradients. This is done by
the following steps:

- Each atom is initially set to have a gradient value of 0.
- For every index in the *signature based* gradient (*g*_{S(i)}), get the set of atoms *A*_{i} that are part of that signature. For every atom in *A*_{i}, add the gradient value *g*_{S(i)} to the current atom-gradient.

The molecule gradient will have the total contribution of an atom (from all the signatures that it's part of) and each atom-gradient can have any real value.

**Note** that classification gradients will no longer have any restrictions in their range.
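The conversion from signature-based to atom-based gradients can be sketched in Python as below. The names and the dictionary-based mapping from signature index to atom indices are assumptions for illustration, not CPSign data structures.

```python
def atom_gradients(signature_gradients, signature_atoms, n_atoms):
    """Sum each signature's gradient onto every atom that is part of that signature.

    signature_gradients: {signature_index: gradient value}
    signature_atoms:     {signature_index: set of atom indices} (hypothetical mapping)
    """
    result = [0.0] * n_atoms  # every atom starts with a gradient of 0
    for signature, gradient in signature_gradients.items():
        for atom in signature_atoms[signature]:
            result[atom] += gradient
    return result


# Toy molecule with three atoms and three signatures (hypothetical).
signature_gradients = {0: 0.5, 1: -0.25, 2: 0.125}
signature_atoms = {0: {0, 1}, 1: {1, 2}, 2: {2}}
atoms = atom_gradients(signature_gradients, signature_atoms, n_atoms=3)
```

Atom 1 belongs to signatures 0 and 1, so it receives 0.5 - 0.25 = 0.25; the full result is `[0.5, 0.25, -0.125]`. Because contributions are summed, an atom that is part of many signatures can end up outside [-1,1] even in classification.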

#### Normalization of gradients

The molecule gradients can theoretically take any value on the real axis for each atom. The gradient values are also model- and dataset-dependent, so some form of normalization is needed to make them interpretable. CPSign handles this in the CLI by predicting the gradient for a large set of molecules from the training set. All gradient values are then ordered, and the values at the lowest 10 % (lowPercentile) and the highest 10 % (highPercentile) are taken as the lower and upper limits of the gradient values. The range [lowPercentile, highPercentile] is then used to linearly normalize the results to the range [-1, 1]; positive and negative values are handled separately, each with its own linear normalization. Once normalized, the results are much easier to interpret: values close to 0 have low influence on the prediction, values close to +1 have a significant positive influence and values close to -1 have a significant negative influence.
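One way to read the normalization described above is sketched below in Python. Several details are assumptions rather than documented CPSign behaviour: the nearest-rank percentile rule, the clipping of values that fall outside [lowPercentile, highPercentile], and the requirement that lowPercentile is negative and highPercentile positive. The function names are hypothetical.

```python
def percentile_limits(values, percent=10):
    """Return (low, high): the values `percent` % in from each end of the sorted data.

    Uses a simple nearest-rank rule (assumption).
    """
    ordered = sorted(values)
    k = (len(ordered) - 1) * percent // 100
    return ordered[k], ordered[-1 - k]


def normalize(value, low, high):
    """Scale a gradient to [-1, 1], treating positive and negative values separately.

    Assumes low < 0 < high; values beyond the limits are clipped (assumption).
    """
    if value >= 0:
        return min(value / high, 1.0)
    return max(-(value / low), -1.0)


# Hypothetical reference gradients collected from a large set of molecules.
reference = list(range(-10, 11))  # -10, -9, ..., 10
low, high = percentile_limits(reference)
normalized = [normalize(v, low, high) for v in (4, -2, 20, -16, 0)]
```

With this reference set the limits are -8 and 8, so the five example values normalize to `[0.5, -0.25, 1.0, -1.0, 0.0]`.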

**Note** If you are running CPSign through the API and wish to get normalized gradients, you have to call the `computePercentiles` method after you've trained your models. This has to be done with
a "large enough" set of molecules, otherwise the lower and upper values might be misleading and affect the normalization. If you have not computed the lower and upper values, the molecule gradients will
be returned non-normalized and a logger.warning message will be printed.

### Significant Signature

The Significant Signature of a molecule is simply the signature that produced the largest absolute gradient value in the *signature based* gradient. It is then easy to map the signature
to the atoms that it covers.

- Classification: the significant signature atom mapping maps each atom to the class that has the highest p-value.
- Regression: the significant signature atom mapping maps each atom to 1 (simply indicating that the atom is part of the significant signature).

The Significant Signature atom mapping can be used in image rendering, either coloring atoms by the class with the highest p-value or simply coloring the atoms that belong to the significant signature.
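Selecting the significant signature and building its atom mapping can be sketched in Python as follows. The function names, the signature-to-atoms dictionary and the class label `"active"` are hypothetical and only serve as an illustration.

```python
def significant_signature(signature_gradients):
    """Index of the signature with the largest absolute gradient value."""
    return max(signature_gradients, key=lambda s: abs(signature_gradients[s]))


def atom_mapping(signature, signature_atoms, label=1):
    """Map every atom of the signature to `label`.

    label=1 mirrors the regression case; for classification, pass the class
    with the highest p-value instead.
    """
    return {atom: label for atom in sorted(signature_atoms[signature])}


# Toy data (hypothetical): signature 1 has the largest absolute gradient.
signature_gradients = {0: 0.5, 1: -0.75, 2: 0.125}
signature_atoms = {0: {0, 1}, 1: {1, 2}, 2: {2}}

best = significant_signature(signature_gradients)
regression_map = atom_mapping(best, signature_atoms)
classification_map = atom_mapping(best, signature_atoms, label="active")
```

Here `best` is 1, `regression_map` is `{1: 1, 2: 1}` and `classification_map` is `{1: "active", 2: "active"}`.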

## Nonconformity measures

A central concept of Conformal Prediction is the nonconformity measure, which is simply a way to compute how different an example is compared to the other examples in the dataset (see [4-5] for a thorough explanation).
Here we simply state that there are different ways to compute the nonconformity of an example, and in the regression case we support three different measures described below. Furthermore, in the API it is possible
to add your own *custom* measure.

**Definitions** The nonconformity value of example *i* is generally denoted as *α*_{i}. In the regression case it is common to train and
use an *error model* generally denoted as *e*. The *predicted error* for an example *i* is denoted *ê*_{i}. Moreover, the true label of an example *i* is
denoted *y*_{i} and the predicted label for the same example is denoted *ŷ*_{i}.

### Absolute difference measure

When using the prediction error as the nonconformity score for an example *i*, we do not need to train any error model. The nonconformity of example *i*, denoted
*α*_{i} is calculated as:

*α*_{i} = | *y*_{i} - *ŷ*_{i} |

Because this nonconformity measure requires no error model, training takes roughly half the runtime of any nonconformity measure that does require one.

### Normalized measure

By using a *normalized* nonconformity measure, one hopes to acquire a better prediction result by also training an *error model* that will increase the interval size
for "difficult examples" and decrease the interval size for "easy examples". The nonconformity is calculated as follows:

*α*_{i} = | *y*_{i} - *ŷ*_{i} | / *ê*_{i}

The error model is then trained on the absolute error for each example, | *y*_{i} - *ŷ*_{i} | .
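The normalized measure and the targets for its error model can be written out as a short Python sketch. The function names and numbers are hypothetical; `predicted_errors` stands in for the output *ê*_{i} of an already trained error model and is assumed to be strictly positive.

```python
def error_model_targets(y_true, y_pred):
    """Absolute errors |y_i - ŷ_i| that the error model is trained to predict."""
    return [abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)]


def normalized_nonconformity(y_true, y_pred, predicted_errors):
    """alpha_i = |y_i - ŷ_i| / ê_i, with ê_i assumed to be strictly positive."""
    return [
        abs(y - y_hat) / e_hat
        for y, y_hat, e_hat in zip(y_true, y_pred, predicted_errors)
    ]


# Toy data (hypothetical).
y_true = [3.0, 5.0, 2.0]
y_pred = [2.5, 6.0, 2.0]
predicted_errors = [0.25, 2.0, 0.5]

targets = error_model_targets(y_true, y_pred)
alphas = normalized_nonconformity(y_true, y_pred, predicted_errors)
```

Here `targets` is `[0.5, 1.0, 0.0]` and `alphas` is `[2.0, 0.5, 0.0]`. The second example has the largest absolute error but a lower nonconformity than the first, because the error model expected it to be difficult.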

### Logarithmically normalized measure

A logarithmic normalization, proposed in [7], instead uses the logarithm of the predicted error in the error model. It also introduces a smoothing factor, β, that is used for "smoothing" the interval sizes, making the small intervals a bit larger and the very large intervals a bit smaller. The smoothing factor β must itself be chosen; as found in [7], a β of 0.5 was already too big. Optimization of β can be done with the tune command.

*α*_{i} = | *y*_{i} - *ŷ*_{i} | / (exp(*ê*_{i}) + β), β ≥ 0

The error model is then trained on the natural logarithm of the absolute error, ln(| *y*_{i} - *ŷ*_{i} |).
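A minimal Python sketch of the logarithmically normalized measure follows. The function names and numbers are hypothetical, and `predicted_log_errors` stands in for the output *ê*_{i} of an error model trained on log-errors. How examples with an absolute error of exactly 0 are handled is not described here, so the sketch leaves that case unhandled.

```python
import math


def log_error_targets(y_true, y_pred):
    """ln(|y_i - ŷ_i|), the targets for the error model.

    math.log raises ValueError for a zero error; handling that case is
    left open in this sketch.
    """
    return [math.log(abs(y - y_hat)) for y, y_hat in zip(y_true, y_pred)]


def log_normalized_nonconformity(y_true, y_pred, predicted_log_errors, beta=0.0):
    """alpha_i = |y_i - ŷ_i| / (exp(ê_i) + beta), with beta >= 0."""
    if beta < 0:
        raise ValueError("beta must be non-negative")
    return [
        abs(y - y_hat) / (math.exp(e_hat) + beta)
        for y, y_hat, e_hat in zip(y_true, y_pred, predicted_log_errors)
    ]


# Toy data (hypothetical): absolute errors are 1.0 and 4.0.
y_true = [3.0, 5.0]
y_pred = [2.0, 9.0]
predicted_log_errors = [0.0, 0.0]  # exp(0) = 1 for both examples

targets = log_error_targets(y_true, y_pred)
alphas = log_normalized_nonconformity(y_true, y_pred, predicted_log_errors, beta=1.0)
```

With β = 1 each denominator becomes exp(0) + 1 = 2, so `alphas` is `[0.5, 2.0]`. The first target is ln(1) = 0 and the second is ln(4).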