Sparse Prediction

Usage - Sparse Prediction

From version 0.3.10, CPSign also makes it possible to perform Conformal Prediction without any Signature Generation. This allows the user to load sparse data (in LibSVM file format) and make predictions directly on vectors: the user decides which descriptors and features to use, and CPSign takes care of the number crunching.

Usage of CPSign is only slightly changed compared to using CPSign with Signature Generation: instead of wrapping an ACP, CCP or TCP implementation in a Signatures-wrapper, wrap your implementation in a SparsePredictor-wrapper!

Typical usage:

CPSignFactory factory = new CPSignFactory(licenseStream);

ACPClassification acpImpl = factory.createACPClassificationLibLinear(nrModels, calibrationRatio);
// or factory.createACPClassificationLibSVM();

// Wrap the ACP implementation in a SparsePredictor-wrapper
SparsePredictorACPClassification sparseACP = factory.createSparsePredictorACPClassification(acpImpl);

// Load data
sparseACP.loadSparseRecords(recordsStream);

// Train models
sparseACP.train();

// Save models - no need to train the same models again
sparseACP.saveModels(modelsFile, true);
// or sparseACP.saveModelsEncrypted(encryptedModelsFile);

// Generate example-vectors to predict, using any of the utility-methods
List<SparseFeature> example = CPSignFactory.getSparseVector("1:0.44 3:0.88 5:0.44 6:1.32 18:0.44 19:1.76 21:2.2 23:2.2 49:0.222 52:0.444 53:0.37 55:2.413 56:16 57:140");
// or CPSignFactory.getSparseVector(new double[]{1, 3.5, 4.1, 21.3, 64.4});
// or CPSignFactory.getSparseVector(new int[]{1, 5, 10, 11}, new double[] {3.4, 12.2, 12.3, 5});

// Predict the p-values for the new example
double[] pvals = sparseACP.predictMondrian(example);

// Use GridSearch for parameter tuning
GridSearch gs = new GridSearch(); // use default settings or use the other constructor to set all parameters
// can set a Writer to print all grid-search Cross Validation results
gs.setWriter(new OutputStreamWriter(System.out));
// Perform the grid search (note: gridsearchRegression and OPTIMIZE_RMSE apply to regression
// predictors; for a classification predictor, use the corresponding classification grid search)
GridSearchResult gsRes = gs.gridsearchRegression(sparseACP, OptimizationType.OPTIMIZE_RMSE);
System.out.println(gsRes); // print optimal settings

// cross-validate the problem to get the overall efficiency, validity (accuracy) and RMSE (RMSE only for regression)
CVResult result = sparseACP.crossValidate(crossValidationFolds, crossValidationConfidence);

// Loading models into a new SparsePredictor (no need to give the ACP implementation, that's included in the model)
SparsePredictorACPClassification newPredictor = factory.createSparsePredictorACPClassification(null);
newPredictor.addModel(modelsFile.toURI());
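
// Once loaded, the new predictor can be used for prediction in the same way as above,
// e.g. reusing the example vector from earlier (the variable name below is only illustrative)
double[] pvalsFromLoadedModel = newPredictor.predictMondrian(example);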

File format

CPSign stores sparse data in LibSVM format, which has the form:

<value> <index>:<occurrences> <index>:<occurrences> ..
<value> <index>:<occurrences> <index>:<occurrences> ..

where <value> is the endpoint (label) value of the record and each <index>:<occurrences> pair gives a feature index and its feature value. Also note that <index> must start at 1 and not 0, to conform with LibLinear and LibSVM requirements.
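
For example, two records with purely hypothetical endpoint and feature values could look like:

5.6 1:0.44 3:0.88 6:1.32 21:2.2
2.1 2:1.0 3:0.44 19:1.76 55:2.413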

Problem class

The underlying data structure for sparse problems is the Problem class, which is accessible through the API. It provides a means of manipulating data directly, without having to write data to file. Here are some examples of what you can do:

// Problem-class can be accessed without instantiating the CPSignFactory
Problem sparseProblem = Problem.fromSparseFile(inputStream);

// Add data from another file:
// (this of course requires that the feature indexing is the same!)
sparseProblem.readDataFromStream(anotherStream);

// Shuffle the records
sparseProblem.shuffle();

// Clone a dataset to make a deep copy
Problem problemClone = sparseProblem.clone();

// Splitting can be done with random shuffle
Problem[] problems = sparseProblem.splitRandom(0.2); // 20% in first Problem, 80% in second one

// Or splitting can be done statically (keep ordering)
Problem[] problemsStaticSplit = sparseProblem.splitStatic(100); // Split so first 100 records in first Problem
Problem[] problemsStaticSplitFraction = sparseProblem.splitStatic(0.3); // 30% in first, 70% in second

// Write the manipulated Problem to file (sparseProblem now combines two data files and has its records randomly shuffled)
sparseProblem.writeProblemToStream(outputStream, true); // choose whether to compress the output or not

// If you wish to encrypt the Problem, instantiate CPSignFactory with a license that
// supports encryption and get the encryption specification to store/load a Problem
// in encrypted format
IEncryptionSpec spec = encryptionFactory.getEncryptionSpec();
sparseProblem.writeProblemToEncryptedStream(new FileOutputStream(encryptedFile), spec);
// Load the Problem back
Problem fromEncryptedFile = Problem.fromSparseFile(new FileInputStream(encryptedFile), spec);

Make your manipulations of the sparse Problem and then use a Sparse Predictor to do the predictions:

// SparsePredictor needs CPSignFactory to be instantiated
...
ISparsePredictorACPClassification sparseACP = factory.createSparsePredictorACPClassification(acpImpl);

Problem problem = Problem.fromSparseFile(inputStream);
// make desired manipulations

sparseACP.setProblem(problem);

// Train models and predict as normal
sparseACP.trainACP(calibrationRatio, nrModels);
...

Note that the API has been extended with the methods supportEncryption and getEncryptionSpec, which allow the API user to encrypt a Problem themselves. The rest of the code handles encryption transparently for the API user, but the Problem class is exposed without any wrappers, so it is up to the user to handle encryption at this level.
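
As a minimal sketch of how this can be handled (assuming the CPSignFactory was instantiated with an encryption-capable license; the plainFile variable is only illustrative):

// Only request the encryption specification if the loaded license actually supports encryption
if (factory.supportEncryption()) {
    IEncryptionSpec spec = factory.getEncryptionSpec();
    // Store the Problem in encrypted format
    sparseProblem.writeProblemToEncryptedStream(new FileOutputStream(encryptedFile), spec);
} else {
    // Otherwise fall back to writing the Problem in plain (optionally compressed) format
    sparseProblem.writeProblemToStream(new FileOutputStream(plainFile), true);
}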