Abstract
Tuberculosis (TB) is an infectious disease and one of the leading causes of death worldwide. The rapid emergence of drug resistance among pathogenic mycobacteria has become a global threat, underscoring the need for new drug discovery and development. However, because new drug discovery and development is typically a lengthy and costly process, the strategic use of cutting-edge machine learning (ML) algorithms can help reduce both the cost and the time involved. Given the urgency of new drugs for TB, we have attempted here to develop predictive ML-based models useful for selecting novel potential small molecules for subsequent in vitro validation. For this purpose, we used the GlaxoSmithKline (GSK) TCAMS TB dataset comprising a total of 776 hits, which was made publicly available to the wider scientific community through the ChEMBL Neglected Tropical Diseases (ChEMBL-NTD) database. After exploring different ML classifiers, viz. decision trees (DT), support vector machine (SVM), random forest (RF), Bernoulli naive Bayes (BNB), k-nearest neighbors (k-NN), and linear logistic regression (LLR), as well as ensemble learning models (bagging and AdaBoost), for training on the GSK dataset, we arrived at the three best models, viz. the AdaBoost decision tree (ABDT), RF, and k-NN models, which gave the top prediction results on both the training and test sets. However, when predicting an external set of known anti-tubercular compounds/drugs, we found that each of these models had limitations. The ABDT model correctly predicted 22 molecules as actives, while the RF and k-NN models each correctly predicted 18 molecules as actives; a number of molecules were predicted as actives by two of these models while the third predicted them as inactives.
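The classifier-screening workflow described above can be sketched with scikit-learn. This is a minimal illustration, not the authors' exact pipeline: the synthetic data below merely stands in for the molecular descriptors computed from the 776 GSK hits, and all hyperparameters shown are placeholder choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-labelled data standing in for molecular descriptors
# (active = 1, inactive = 0); 776 rows mirrors the GSK dataset size.
X, y = make_classification(n_samples=776, n_features=20,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# The three best-performing model types reported in the study
models = {
    "ABDT": AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                               n_estimators=100, random_state=0),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, round(model.score(X_test, y_test), 3))
```

In practice the feature matrix would come from a descriptor/fingerprint package such as RDKit, and each model's hyperparameters would be tuned by cross-validation before comparison.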
Therefore, we concluded that when assessing the anti-tubercular potential of a new molecule, one should rely on consensus predictions from these three models; this may lower the attrition rate during in vitro validation. We believe this study may assist the wider anti-tuberculosis research community by providing a platform for predicting small molecules for subsequent validation in drug discovery and development.
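The consensus idea (call a molecule active only when the models agree) amounts to a majority vote over the three classifiers. A minimal sketch, with a toy stand-in class replacing the fitted scikit-learn models:

```python
import numpy as np

class _Fixed:
    """Toy stand-in for a fitted binary classifier returning preset labels."""
    def __init__(self, preds):
        self.preds = np.asarray(preds)
    def predict(self, X):
        return self.preds

def consensus_predict(models, X):
    """Majority vote: a molecule is labelled active (1) only when at
    least two of the three models predict it as active."""
    votes = np.array([m.predict(X) for m in models])  # (n_models, n_samples)
    majority = len(models) // 2 + 1
    return (votes.sum(axis=0) >= majority).astype(int)

# Example: only the first molecule is called active by >= 2 models
m1 = _Fixed([1, 1, 0, 0])
m2 = _Fixed([1, 0, 0, 1])
m3 = _Fixed([1, 0, 1, 0])
print(consensus_predict([m1, m2, m3], X=None))  # → [1 0 0 0]
```

With real fitted models, `X` would be the descriptor matrix of the candidate molecules, and the consensus call would feed the shortlist for in vitro testing.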
Acknowledgements
The authors are thankful to the Director of NIPER, Kolkata, for providing the resources and support. The author M.A.W. is thankful to the Department of Pharmaceuticals and the Ministry of Chemicals and Fertilizers for providing a Ph.D. fellowship.
Funding
None.
Author information
Contributions
M.A.W. and K.K.R. designed the project; M.A.W. carried out model building; M.A.W. and K.K.R. analyzed the results; M.A.W. wrote the initial draft of the manuscript; M.A.W. and K.K.R. revised the manuscript and checked the final version of the manuscript.
Supplementary Information
Cite this article
Wani, M.A., Roy, K.K. Development and validation of consensus machine learning-based models for the prediction of novel small molecules as potential anti-tubercular agents. Mol Divers 26, 1345–1356 (2022). https://doi.org/10.1007/s11030-021-10238-y