bidd-molmap
MolMap
MolMap is generated by the following steps:
- Step 1: Input structures
- Step 2: Feature extraction
- Step 3: Feature pairwise distance calculation -> cosine, correlation, jaccard
- Step 4: Feature 2D embedding -> umap, tsne, mds
- Step 5: Feature grid arrangement -> grid, scatter
- Step 6: Transform -> minmax, standard
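The pairwise-distance step above (Step 3) can be sketched in pure Python. This is not the MolMap implementation, just an illustration of computing a symmetric cosine-distance matrix over hypothetical feature vectors:

```python
import math

def cosine_distance(a, b):
    # Cosine distance = 1 - cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def pairwise_distances(features):
    # Symmetric matrix of distances over all feature pairs (Step 3);
    # this matrix is what the 2D embedding (Step 4) would operate on.
    n = len(features)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = cosine_distance(features[i], features[j])
    return d

# Three toy feature vectors (hypothetical, for illustration only).
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dist = pairwise_distances(feats)
print(round(dist[0][1], 3))  # orthogonal vectors -> distance 1.0
```

In MolMap itself the metric is configurable (`cosine`, `correlation`, or `jaccard`, per the step list above), and the resulting distance matrix drives the umap/tsne/mds embedding.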
[Figure: MolMap Fmaps for compounds]
[Figure: Construction of the MolMap Objects]
[Figure: The MolMapNet Architecture]
Installation
```
conda create -c conda-forge -n molmap rdkit python=3.7
conda activate molmap
conda install -c tmap tmap
pip install molmap
```
- ChemBench (optional, if you wish to use the dataset and the split indices used in this paper).
- If you have gcc problems when installing molmap, please install g++ first:

```
sudo apt-get install g++
```
Out-of-the-Box Usage
- Example for Regression Task on ESOL (descriptors only)
- Example for Classification Task on BACE (fingerprints only)
- Example for Regression Task on FreeSolv (descriptors plus fingerprints)
- Example for Classification Task on BACE (descriptors plus fingerprints)
- Example for Multi-label Classification Task on ClinTox (descriptors plus fingerprints)
```python
import molmap

# Define your molmap
mp_name = './descriptor.mp'
mp = molmap.MolMap(ftype='descriptor', fmap_type='grid',
                   split_channels=True, metric='cosine', var_thr=1e-4)

# Fit your molmap
mp.fit(method='umap', verbose=2)
mp.save(mp_name)

# Visualization of your molmap
mp.plot_scatter()
mp.plot_grid()

# Batch transform
from molmap import dataset
data = dataset.load_ESOL()
smiles_list = data.x  # list of SMILES strings
X = mp.batch_transform(smiles_list, scale=True,
                       scale_method='minmax', n_jobs=8)
Y = data.y
print(X.shape)

# Train on your data and test on the external test set
from molmap.model import RegressionEstimator
from sklearn.utils import shuffle
import numpy as np
import pandas as pd

def Rdsplit(df, random_state=888, split_size=[0.8, 0.1, 0.1]):
    base_indices = np.arange(len(df))
    base_indices = shuffle(base_indices, random_state=random_state)
    nb_test = int(len(base_indices) * split_size[2])
    nb_val = int(len(base_indices) * split_size[1])
    test_idx = base_indices[0:nb_test]
    valid_idx = base_indices[nb_test:(nb_test + nb_val)]
    train_idx = base_indices[(nb_test + nb_val):]
    print(len(train_idx), len(valid_idx), len(test_idx))
    return train_idx, valid_idx, test_idx

# split your data
train_idx, valid_idx, test_idx = Rdsplit(data.x, random_state=888)
trainX, trainY = X[train_idx], Y[train_idx]
validX, validY = X[valid_idx], Y[valid_idx]
testX, testY = X[test_idx], Y[test_idx]

# fit your model
clf = RegressionEstimator(n_outputs=trainY.shape[1],
                          fmap_shape1=trainX.shape[1:],
                          dense_layers=[128, 64], gpuid=0)
clf.fit(trainX, trainY, validX, validY)

# make prediction
testY_pred = clf.predict(testX)
rmse, r2 = clf._performance.evaluate(testX, testY)
print(rmse, r2)
```
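The 80/10/10 split arithmetic in `Rdsplit` can be checked in isolation. The sketch below mirrors only the size computation (the real function also shuffles the indices with a fixed random state); `split_sizes` is a hypothetical helper, not part of the molmap API:

```python
def split_sizes(n, split_size=(0.8, 0.1, 0.1)):
    # Mirrors the index arithmetic in Rdsplit: the test and validation
    # counts are truncated with int(), and the remainder becomes the
    # training set, so no samples are ever dropped.
    nb_test = int(n * split_size[2])
    nb_val = int(n * split_size[1])
    nb_train = n - nb_test - nb_val
    return nb_train, nb_val, nb_test

# ESOL contains 1128 molecules.
print(split_sizes(1128))  # -> (904, 112, 112)
```

Because the fractional sizes are truncated, the training set absorbs any rounding remainder, which is why the three partitions always sum to the dataset size.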
Out-of-the-Box Performances
| Dataset | Task Metric | MoleculeNet (GCN Best Model) | Chemprop (D-MPNN model) | MolMapNet (MMNB model) |
|---|---|---|---|---|
| ESOL | RMSE | 0.580 (MPNN) | 0.555 | 0.575 |
| FreeSolv | RMSE | 1.150 (MPNN) | 1.075 | 1.155 |
| Lipop | RMSE | 0.655 (GC) | 0.555 | 0.625 |
| PDBbind-F | RMSE | 1.440 (GC) | 1.391 | 0.721 |
| PDBbind-C | RMSE | 1.920 (GC) | 2.173 | 0.931 |
| PDBbind-R | RMSE | 1.650 (GC) | 1.486 | 0.889 |
| BACE | ROC_AUC | 0.806 (Weave) | N.A. | 0.849 |
| HIV | ROC_AUC | 0.763 (GC) | 0.776 | 0.777 |
| PCBA | PRC_AUC | 0.136 (GC) | 0.335 | 0.276 |
| MUV | PRC_AUC | 0.109 (Weave) | 0.041 | 0.096 |
| ChEMBL | ROC_AUC | N.A. | 0.739 | 0.750 |
| Tox21 | ROC_AUC | 0.829 (GC) | 0.851 | 0.845 |
| SIDER | ROC_AUC | 0.638 (GC) | 0.676 | 0.680 |
| ClinTox | ROC_AUC | 0.832 (GC) | 0.864 | 0.888 |
| BBBP | ROC_AUC | 0.690 (Weave) | 0.738 | 0.739 |