DeltaDelta

DeltaDelta description

DeltaDelta is a deep learning-based protein-ligand binding affinity predictor. Its neural network infers binding affinities by learning from hundreds of protein-ligand 3D complex structures.

DeltaDelta's applicability domain is the prediction of delta-delta (relative) binding free energies within congeneric series, a daunting problem during the lead optimization phase.

The initial predictive model has been pre-trained on congeneric series from the BindingDB database. This allows the neural network to learn general features of protein-ligand interactions that can then be transferred and applied to your particular protein target or lead optimization project.

In order to optimize and adapt the initial pre-trained model to the chemical space of interest (i.e. your particular protein target), it is necessary to further train the network with as many examples as possible of compounds that belong to the chemical space you want to predict. This is the reason why the application requires two major inputs: (1) an SDF file containing the ligands you want to predict the affinity for (test dataset), and (2) an SDF file containing a set of compounds with known experimental pIC50 (training dataset). Both the training and test datasets must be docked to the protein of interest.
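As an illustration of the expected training input, the following minimal sketch (assuming RDKit, which is not a requirement of DeltaDelta itself) shows one way to attach the pIC50 field to an SDF of docked training poses; the file names and the activity values are hypothetical:

from rdkit import Chem

# Hypothetical experimental activities, keyed by the molecule name stored in the SDF.
experimental_pic50 = {"ligand_01": 7.2, "ligand_02": 6.5}

writer = Chem.SDWriter("train.sdf")
for mol in Chem.SDMolSupplier("docked_training_poses.sdf", removeHs=False):
    if mol is None:
        continue  # skip entries RDKit could not parse
    name = mol.GetProp("_Name")
    if name in experimental_pic50:
        # DeltaDelta expects the training ligands to carry a field called "pIC50".
        mol.SetProp("pIC50", str(experimental_pic50[name]))
        writer.write(mol)
writer.close()

The resulting train.sdf can then be passed to -sdf_train.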


DeltaDelta inputs

Required arguments:

  • -mol2 <MOL2> .mol2 file of the receptor.
  • -sdf_train <SDF_TRAIN> .sdf file containing the set of docked training ligands. Each ligand must contain a field called pIC50 with its experimental pIC50.
  • -sdf_test <SDF_TEST> .sdf file containing the set of docked test ligands. Optionally, the ligands can contain a field with a custom name (<VALIDATION_FIELD>) holding the experimental pIC50 for benchmarking purposes.

Optional arguments:

  • --help print help and required arguments.
  • -validation_field <VALIDATION_FIELD> <SDF_TEST> field name to be used as reference for benchmarking. If this option is provided, the application will predict the <SDF_TEST> ligands and will automatically assess the accuracy of the prediction by reporting the Pearson and Spearman correlations between predicted and experimental pIC50 values.

DeltaDelta outputs

The outputs of an execution are:

  1. output_ddG.csv : delta-delta predictions for all combinations of test and training ligands.
  2. output_dG.csv : delta predictions for all test ligands.
  3. output.sdf : SDF of the test dataset including the predicted pIC50 fields.
  4. benchmark.csv (only for benchmark) : Pearson and Spearman correlations between predicted and experimental data for the test dataset.
  5. pearson_delta_pIC50.png (only for benchmark) : regression plot between predicted and experimental delta pIC50 for the test dataset.
  6. pearson_pIC50.png (only for benchmark) : regression plot between predicted and experimental pIC50 for the test dataset.
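If you want to inspect the predictions programmatically, the following is a minimal sketch (assuming RDKit, which is not a DeltaDelta dependency) that reads output.sdf and prints every SD field attached to each test ligand; since the exact names of the predicted pIC50 fields are not fixed here, it simply lists them all:

from rdkit import Chem

# Print every SD field of each predicted test ligand in output.sdf.
for mol in Chem.SDMolSupplier("output.sdf", removeHs=False):
    if mol is None:
        continue
    print(mol.GetProp("_Name"))
    for field in mol.GetPropNames():
        print(f"  {field} = {mol.GetProp(field)}")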

Examples of pearson_delta_pIC50.png and pearson_pIC50.png are:

[pearson_delta_pIC50.png]  [pearson_pIC50.png]

An example of benchmark.csv:

property,value
spearman_rho,0.9233044733044735
spearman_pval,1.1084895401468131e-23
pearson_corr,0.9465980073604575
pearson_pval,1.0157121742122449e-27
spearman_rho_dG,0.7
spearman_pval_dG,0.1881204043741873
pearson_corr_dG,0.581939832095685
pearson_pval_dG,0.3033062392002662
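For reference, correlations like the ones reported in benchmark.csv can be reproduced for your own analyses with standard statistics libraries. The snippet below is a minimal sketch using SciPy (an assumption, not a DeltaDelta dependency) with hypothetical predicted and experimental pIC50 values:

from scipy.stats import pearsonr, spearmanr

# Hypothetical predicted and experimental pIC50 values for the test ligands.
predicted = [6.8, 7.1, 5.9, 6.4, 7.6]
experimental = [6.5, 7.3, 6.1, 6.2, 7.9]

pearson_corr, pearson_pval = pearsonr(predicted, experimental)
spearman_rho, spearman_pval = spearmanr(predicted, experimental)

print(f"pearson_corr,{pearson_corr}")
print(f"pearson_pval,{pearson_pval}")
print(f"spearman_rho,{spearman_rho}")
print(f"spearman_pval,{spearman_pval}")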


FAQs and frequent errors

I have a different protein conformation for each ligand. Can I use multiple protein conformations as input?

While this is theoretically possible, the program currently accepts only a single protein structure. This means that if you have more than one structure, you will have to either align all the protein-ligand complexes or re-dock the ligands to a single conformation.
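As an illustration of the alignment route, the sketch below superposes several hypothetical protein-ligand complexes onto a common reference using the PyMOL Python API (PyMOL is an assumption, not part of DeltaDelta, and the file names are placeholders):

import pymol
from pymol import cmd

pymol.finish_launching(["pymol", "-qc"])  # run PyMOL headless

# Hypothetical list of protein-ligand complex structures, one per ligand.
complexes = ["complex_lig01.pdb", "complex_lig02.pdb", "complex_lig03.pdb"]

cmd.load(complexes[0], "reference")
for path in complexes[1:]:
    cmd.load(path, "mobile")
    # Superpose the mobile complex onto the reference protein.
    cmd.align("mobile", "reference")
    cmd.save(path.replace(".pdb", "_aligned.pdb"), "mobile")
    cmd.delete("mobile")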


Is there any general rule on how many ligands should be used as training to get accurate predictions for the test set?

The answer, unfortunately, is that there is no golden rule that can be applied to all cases.

In our experience, for some "easy" datasets 3 ligands were enough, but "harder" ones required more.

In fact, we are working on a scientific publication that includes results obtained in collaboration with several pharmaceutical companies, and in one of the examples we show that 10% of the data is usually enough to recover state-of-the-art performance.


Are there any guidelines I can follow to generate the training/test splits?

There are generally two ways you can split the data into the training and test sets:

  1. Random splits. You can randomly select ligands from the pool of existing data and assign them to the test or training sets. For instance, you can randomly select 10%, 20% and 30% of training data and try to predict the remaining 90%, 80% and 70% of data, respectively. You can even run DeltaDelta several times for each condition. This simple experiment can show how robust DeltaDelta is for your specific dataset and how much data is needed for prediction accuracy to converge (a short sketch of both splitting strategies follows this list).

  2. Temporal splits. If you have dates available for each ligand, we strongly recommend using them to generate the splits. Temporal splits allow you to simulate an "online learning" scenario in which experimental data becomes increasingly available to enrich the model, and they prevent "look-ahead", meaning that the model never sees ligands "from the future" when predicting pIC50s "from the past". To do a temporal split, rank your ligands by date and take increasingly larger chunks of training data starting from the beginning. You could try groups of 5 ligands (e.g. 5, 10, 15, ...), portions of data (5%, 10%, 15%, ...) or some event-driven scheme (e.g. batches of synthesized ligands).
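The following is a minimal sketch of both strategies (assuming RDKit; the file names, the 10% training fraction and the "date" field are hypothetical and only illustrate the idea):

import random
from rdkit import Chem

# Load the full pool of docked ligands (hypothetical file name).
mols = [m for m in Chem.SDMolSupplier("all_ligands_docked.sdf", removeHs=False) if m is not None]

def write_sdf(path, molecules):
    writer = Chem.SDWriter(path)
    for m in molecules:
        writer.write(m)
    writer.close()

n_train = max(1, int(0.1 * len(mols)))  # e.g. 10% of the data for training

# 1. Random split.
random.shuffle(mols)
write_sdf("train_random.sdf", mols[:n_train])
write_sdf("test_random.sdf", mols[n_train:])

# 2. Temporal split, assuming each ligand carries a sortable "date" SD field.
by_date = sorted(mols, key=lambda m: m.GetProp("date"))
write_sdf("train_temporal.sdf", by_date[:n_train])
write_sdf("test_temporal.sdf", by_date[n_train:])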


On executing the container I find the following error: "File error: Bad input file..."

This type of error usually occurs when the container cannot access an input file, typically because the current directory has not been mounted correctly.

There are two ways to fix this error:

  1. Place the inputs in any directory inside your home directory, cd to that path and execute the container from there. For instance, if your user is john, you could place the input files in /home/john/workspace/deltadelta and run the container from there; that folder would be auto-mounted by Singularity.

  2. If you can't run the container inside your home directory because you are working from a cluster node or shared filesystem, then you have to explicitly mount the current directory. You can achieve this by adding --bind $(pwd) to the list of arguments, as in the following command:

singularity run --bind $(pwd) -B $(pwd)/mylicense.dat:/data/license.dat --nv DeltaDelta.img -mol2 myprotein.mol2 -sdf_test test.sdf -sdf_train train.sdf