Development and validation of consensus machine learning-based models for the prediction of novel small molecules as potential anti-tubercular agents

  • Original Article · Molecular Diversity

Abstract

Tuberculosis (TB) is an infectious disease and a leading cause of death globally. Rapidly emerging drug resistance among pathogenic mycobacteria poses a global threat, urging the need for new drug discovery and development. However, because new drug discovery and development is typically a lengthy and costly process, strategic use of cutting-edge machine learning (ML) algorithms can help reduce both the cost and the time involved. Considering the urgency of new drugs for TB, we have attempted to develop predictive ML-based models useful for selecting novel potential small molecules for subsequent in vitro validation. For this purpose, we used the GlaxoSmithKline (GSK) TCAMS TB dataset, comprising a total of 776 hits, which was made publicly available to the wider scientific community through the ChEMBL Neglected Tropical Diseases (ChEMBL-NTD) database. After exploring different ML classifiers, viz. decision tree (DT), support vector machine (SVM), random forest (RF), Bernoulli naive Bayes (BNB), k-nearest neighbors (k-NN), and linear logistic regression (LLR), as well as ensemble learning models (bagging and AdaBoost), for training on the GSK dataset, we arrived at three best-performing models, viz. the AdaBoost decision tree (ABDT), RF, and k-NN models, which gave the top prediction results for both the training and test sets. However, when predicting an external set of known anti-tubercular compounds/drugs, each of these models showed limitations: the ABDT model correctly predicted 22 molecules as actives, while the RF and k-NN models each correctly predicted 18 molecules as actives, and several molecules predicted as actives by two of the models were predicted as inactives by the third. We therefore concluded that, when assessing the anti-tubercular potential of a new molecule, one should rely on consensus predictions from these three models; this may lessen the attrition rate during in vitro validation. We believe this study may assist the wider anti-tuberculosis research community by providing a platform for predicting small molecules for subsequent validation in drug discovery and development.
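To make the consensus idea concrete, below is a minimal scikit-learn sketch of the workflow the abstract describes: train the three selected classifiers and flag a molecule as a likely active only when the models agree. It is illustrative only; the synthetic feature matrix, train/test split, hyperparameters, and the two-of-three vote threshold are assumptions for demonstration, not the descriptors or tuned settings used in the study (those are described in the full text).

```python
# Minimal consensus-prediction sketch (assumptions: synthetic data,
# illustrative hyperparameters, two-of-three vote threshold).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for a featurized dataset of 776 hits (1 = active, 0 = inactive);
# with real data, X would hold molecular fingerprints or descriptors.
X, y = make_classification(n_samples=776, n_features=128, n_informative=20,
                           weights=[0.6, 0.4], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

models = {
    # AdaBoost's default base learner is a depth-1 decision tree,
    # i.e., an AdaBoost decision tree (ABDT)-style model.
    "ABDT": AdaBoostClassifier(n_estimators=100, random_state=42),
    "RF": RandomForestClassifier(n_estimators=200, random_state=42),
    "kNN": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    model.fit(X_train, y_train)

# Consensus call: a molecule is flagged active only when at least
# two of the three classifiers predict the active class.
votes = np.column_stack([m.predict(X_test) for m in models.values()])
consensus_active = votes.sum(axis=1) >= 2
print(f"{consensus_active.sum()} of {len(consensus_active)} "
      "test molecules flagged as consensus actives")
```

Raising the vote threshold from 2 to 3 (unanimity across ABDT, RF, and k-NN) yields a stricter shortlist, which is the trade-off the abstract points to: fewer candidates advance, but with a lower expected attrition rate during in vitro validation.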



Acknowledgements

The authors are thankful to the Director of NIPER, Kolkata, for providing the resources and support. M.A.W. is thankful to the Department of Pharmaceuticals, Ministry of Chemicals and Fertilizers, for providing a Ph.D. fellowship.

Funding

None.

Author information

Contributions

M.A.W. and K.K.R. designed the project; M.A.W. carried out the model building; M.A.W. and K.K.R. analyzed the results; M.A.W. wrote the initial draft of the manuscript; and M.A.W. and K.K.R. revised and checked the final version of the manuscript.

Corresponding author

Correspondence to Kuldeep K. Roy.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wani, M.A., Roy, K.K. Development and validation of consensus machine learning-based models for the prediction of novel small molecules as potential anti-tubercular agents. Mol Divers 26, 1345–1356 (2022). https://doi.org/10.1007/s11030-021-10238-y

