Input formats in CPSign

Numerical file format

CPSign loads and stores numerical data in LibSVM/Liblinear file format:

<value> <index>:<occurrences> <index>:<occurrences> ..
<value> <index>:<occurrences> <index>:<occurrences> ..
..

Note that <index> must start at 1, not 0, to conform with LibLinear and LibSVM requirements.
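
As an illustration of the 1-based indexing, the following Python sketch formats one record as a single line of this format. The function name to_libsvm_line and the dictionary input are assumptions made for this example and are not part of CPSign.

```python
def to_libsvm_line(value, features):
    """Format one record as '<value> <index>:<count> <index>:<count> ..'.

    `features` maps 1-based feature indices to their counts (hypothetical input shape).
    """
    if any(index < 1 for index in features):
        raise ValueError("feature indices must start at 1, not 0")
    # Write the index:count pairs in ascending index order.
    pairs = " ".join(f"{index}:{features[index]}" for index in sorted(features))
    return f"{value} {pairs}".rstrip()


if __name__ == "__main__":
    print(to_libsvm_line(1, {3: 2, 1: 1}))
```

A record with a feature at index 0 is rejected with a ValueError instead of being written to the file.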

SMILES file format

CPSign requires SMILES input files to follow the OpenSMILES specification (opensmiles.org). This simply means that each line should start with a valid SMILES and can optionally include more information. If the SMILES file includes any other information, CPSign requires the fields to be separated by a tab (\t) character, dividing each line into columns. CPSign also requires this to be consistent throughout the file: each row must have the same number of columns. This requirement simplifies the parsing of SMILES files and allows us to store all data associated with each molecule (i.e. save all data for predicted molecules). Example file (note that all blanks between columns should be tabs):

SMILES   Sample_ID   Activity Additional_Notes
OC(=O)\C=C/C(O)=O.C[C@]12CC=C3[C@@H](CCC4=CC(=O)C=C[C@]34C)[C@@H]1CC[C@@H]2C(=O)CN1CCN(CC1)C1=NC(=NC(=C1)N1CCCC1)N1CCCC1   NCGC00261900-01   POS   Here's some additional information
[Na+].NC1=NC=NC2=C1N=C(Br)N2C1OC2CO[P@]([O-])(=O)O[C@@H]2C1O   NCGC00260869-01   NEG   More notes
O=C1N2CCC3=C(NC4=C3C=CC=C4)C2=NC2=C1C=CC=C2  NCGC00261776-01   NEG
Cl.FC1=CC=C(C=C1)C(OCCCC1=CNC=N1)C1=CC=C(F)C=C1 NCGC00261380-01   POS
CC1=CC=C(C=C1)S(=O)(=O)N[C@@H](CC1=CC=CC=C1)C(=O)CCl  NCGC00261842-01   NEG   Not all lines need to contain the additional notes
...
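
Before sending a file to CPSign, the tab-delimited layout can be checked with a few lines of Python. This is a minimal sketch, not CPSign's own parser: the function name inconsistent_rows is hypothetical, and it only compares the number of tab-separated columns on each line against the first non-empty line.

```python
def inconsistent_rows(lines):
    """Return (line_number, column_count) for every row whose number of
    tab-separated columns differs from the first non-empty line."""
    expected = None
    problems = []
    for number, line in enumerate(lines, start=1):
        row = line.rstrip("\n")
        if not row:
            continue  # skip empty lines
        count = len(row.split("\t"))
        if expected is None:
            expected = count  # the first non-empty line sets the reference
        elif count != expected:
            problems.append((number, count))
    return problems


if __name__ == "__main__":
    sample = [
        "SMILES\tSample_ID\tActivity\n",
        "CCO\tmol-1\tPOS\n",
        "CCN\tmol-2\n",  # one column short
    ]
    print(inconsistent_rows(sample))
```

An empty result means that every row has the same number of columns as the first one.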

Header Line

The SMILES file can optionally include a header line, also tab-delimited, with the name of the property in each column. The header will be used for setting all properties for each molecule and will be kept in the output. The only exception is that CPSign will overwrite the first column header name with "SMILES"; all other names will be kept.

If there is no header in a SMILES file, the first column will be named "SMILES". If there is more data, the second column will be named "activity" (the second column is by default treated as the activity for that SMILES), and the remaining columns will be named "Unnamed property(column-id)", where column-id is the index of that column (the index starts at 0, so e.g. the third column in a file will be called "Unnamed property(2)"). If a SMILES file has the desired activity to model in a different column than the default one, the file must have a header so the correct column can be picked up.
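
The naming rule for files without a header can be written out as a short Python sketch. The function name default_column_names is hypothetical and only mirrors the rule described above; it is not CPSign code.

```python
def default_column_names(n_columns):
    """Names given to the columns of a header-less SMILES file,
    following the rule described in the text (0-based column index)."""
    names = []
    for index in range(n_columns):
        if index == 0:
            names.append("SMILES")
        elif index == 1:
            names.append("activity")
        else:
            names.append(f"Unnamed property({index})")
    return names


if __name__ == "__main__":
    print(default_column_names(4))
```

For a file with four columns this gives "SMILES", "activity", "Unnamed property(2)" and "Unnamed property(3)".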

SMILES Files as input

SMILES files that are sent as input to train and precompute are required to have at least one extra column of data after the SMILES column. If the column to use for modeling is not the column directly after the SMILES column, the file must include a header line so that the correct column can be picked by using the -n/--response-name flag. If the SMILES file is sent to predict, the only requirement is that each line begins with a valid SMILES.

SMILES Files as output

From the 0.3.13 version of CPSign, it is possible to have SMILES as output format from predict. The accepted output formats are json, plain/smiles and sdf. The plain output format has been changed into a valid SMILES file format (simply meaning that the SMILES string comes first) and is tab-delimited; passing plain gives the same output as passing smiles. The output file also includes all properties set in the input data. For input data in SMILES format, the output columns are ordered in the same way as the input columns, with the additional prediction column(s) added at the end. If the input is in SDF format, there is no guarantee on the ordering of the columns.

SMILES as single molecule

The predict command can predict single molecules using the --smiles flag. This flag takes a text string that must start with a valid SMILES and can then optionally include a whitespace character (tab or space) followed by an identifier.
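
The layout of such a string can be illustrated with a small Python sketch that separates the SMILES from the optional identifier at the first whitespace. The function name split_smiles_argument is hypothetical; the sketch shows the structure of the string and makes no claim about how CPSign itself parses it.

```python
def split_smiles_argument(text):
    """Split 'SMILES[<whitespace>identifier]' into (smiles, identifier).

    The identifier is None when the string holds only a SMILES.
    """
    parts = text.strip().split(None, 1)  # split once on the first run of whitespace
    if not parts:
        raise ValueError("the string must start with a SMILES")
    smiles = parts[0]
    identifier = parts[1] if len(parts) > 1 else None
    return smiles, identifier


if __name__ == "__main__":
    print(split_smiles_argument("CCO\tmol-1"))
    print(split_smiles_argument("CCO"))
```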

JSON file format

CPSign supports a JSON input format. The format requires that the top level is a JSON array (meaning that the first character must be an opening square bracket "["). Each element of the array is one record, and each record must include a key-value pair holding the SMILES of the molecule. This key-value pair must have the key "SMILES", "smiles" or "Smiles". Here are some examples of the file format (it is not required that the file is properly indented).

Example classification JSON file:

[
   {
      "cdk:Title" : "1728-95-6",
      "Ames test categorisation" : "mutagen",
      "smiles" : "C1(=C(C=2C=CC=CC2)N=C(N1)C3=CC=C(OC)C=C3)C=4C=CC=CC4"
   },

   {
      "cdk:Title" : "91-08-7",
      "Ames test categorisation" : "mutagen",
      "smiles" : "C=1(C(=C(C=CC1)N=C=O)C)N=C=O"
   },

   ..
]

Example regression JSON file:

[
   {
      "BIO" : "0.43",
      "comment" : "This is a comment",
      "smiles" : "SC1=C(C(F)(F)F)C=CC=C1"
   },

   {
      "BIO" : "1.60",
      "comment" : "Comment for second molecule",
      "smiles" : "SC1=C(C(F)(F)F)C=C([N+]([O-])=O)C=C1"
   },

   ..
]
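
The two structural requirements on JSON input (a top-level array, and a SMILES key in every record) can be checked before running CPSign. The following Python sketch uses only the standard json module; the function name check_records is hypothetical and the check is an illustration, not CPSign's own validation.

```python
import json

# The three key spellings accepted for the SMILES entry, as listed in the text.
SMILES_KEYS = ("SMILES", "smiles", "Smiles")


def check_records(text):
    """Parse `text` and return the number of records, raising ValueError
    if the top level is not an array or a record lacks a SMILES key."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("the top level must be a JSON array")
    for position, record in enumerate(records):
        if not isinstance(record, dict) or not any(key in record for key in SMILES_KEYS):
            raise ValueError(f"record {position} has no SMILES key")
    return len(records)


if __name__ == "__main__":
    sample = '[{"BIO": "0.43", "smiles": "SC1=C(C(F)(F)F)C=CC=C1"}]'
    print(check_records(sample))
```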