Abstract
Tuberculosis (TB) is an infectious disease and one of the leading causes of death worldwide. The rapid emergence of drug resistance among pathogenic mycobacteria has become a global threat, underscoring the need for new drug discovery and development. However, because new drug discovery and development is typically a lengthy and costly process, the strategic use of cutting-edge machine learning (ML) algorithms can help reduce both the cost and the time involved. Given the urgency of new drugs for TB, we have attempted here to develop predictive ML-based models useful for selecting novel potential small molecules for subsequent in vitro validation. For this purpose, we used the GlaxoSmithKline (GSK) TCAMS TB dataset comprising a total of 776 hits, which was made publicly available to the wider scientific community through the ChEMBL Neglected Tropical Diseases (ChEMBL-NTD) database. After exploring different ML classifiers, viz. decision trees (DT), support vector machine (SVM), random forest (RF), Bernoulli naive Bayes (BNB), k-nearest neighbors (k-NN), and linear logistic regression (LLR), as well as ensemble learning models (bagging and AdaBoost), for training on the GSK dataset, we arrived at the three best models, viz. the AdaBoost decision tree (ABDT), RF, and k-NN models, which gave the top prediction results on both the training and test sets. However, when predicting an external set of known anti-tubercular compounds/drugs, we found that each of these models had limitations. The ABDT model correctly predicted 22 molecules as actives, while the RF and k-NN models each correctly predicted 18 molecules as actives; a number of molecules were predicted as actives by two of these models while the third predicted them as inactives.
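The classifier-screening workflow described above can be sketched with scikit-learn. This is a minimal illustration, not the authors' exact pipeline: the synthetic data below merely stands in for the molecular descriptors computed from the 776 GSK hits, and all hyperparameters shown are placeholder choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-labelled data standing in for molecular descriptors
# (active = 1, inactive = 0); 776 rows mirrors the GSK dataset size.
X, y = make_classification(n_samples=776, n_features=20,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# The three best-performing model types reported in the study
models = {
    "ABDT": AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                               n_estimators=100, random_state=0),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, round(model.score(X_test, y_test), 3))
```

In practice the feature matrix would come from a descriptor/fingerprint package such as RDKit, and each model's hyperparameters would be tuned by cross-validation before comparison.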
Therefore, we concluded that when assessing the anti-tubercular potential of a new molecule, one should rely on consensus predictions from these three models; this may lower the attrition rate during in vitro validation. We believe this study may assist the wider anti-tuberculosis research community by providing a platform for predicting small molecules for subsequent validation in drug discovery and development.
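The consensus idea (call a molecule active only when the models agree) amounts to a majority vote over the three classifiers. A minimal sketch, with a toy stand-in class replacing the fitted scikit-learn models:

```python
import numpy as np

class _Fixed:
    """Toy stand-in for a fitted binary classifier returning preset labels."""
    def __init__(self, preds):
        self.preds = np.asarray(preds)
    def predict(self, X):
        return self.preds

def consensus_predict(models, X):
    """Majority vote: a molecule is labelled active (1) only when at
    least two of the three models predict it as active."""
    votes = np.array([m.predict(X) for m in models])  # (n_models, n_samples)
    majority = len(models) // 2 + 1
    return (votes.sum(axis=0) >= majority).astype(int)

# Example: only the first molecule is called active by >= 2 models
m1 = _Fixed([1, 1, 0, 0])
m2 = _Fixed([1, 0, 0, 1])
m3 = _Fixed([1, 0, 1, 0])
print(consensus_predict([m1, m2, m3], X=None))  # → [1 0 0 0]
```

With real fitted models, `X` would be the descriptor matrix of the candidate molecules, and the consensus call would feed the shortlist for in vitro testing.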
Acknowledgements
The authors are thankful to the Director of NIPER, Kolkata, for providing the resources and support. The author M.A.W. is thankful to the Department of Pharmaceuticals and the Ministry of Chemicals and Fertilizers for providing a Ph.D. fellowship.
Funding
None.
Author information
Contributions
M.A.W. and K.K.R. designed the project; M.A.W. carried out model building; M.A.W. and K.K.R. analyzed the results; M.A.W. wrote the initial draft of the manuscript; M.A.W. and K.K.R. revised the manuscript and checked the final version of the manuscript.
Supplementary Information
Cite this article
Wani, M.A., Roy, K.K. Development and validation of consensus machine learning-based models for the prediction of novel small molecules as potential anti-tubercular agents. Mol Divers 26, 1345–1356 (2022). https://doi.org/10.1007/s11030-021-10238-y