Soft voting technique to improve the performance of global filter based feature selection in text corpus

  • Published in: Applied Intelligence

Abstract

In text classification, the Global Filter-based Feature Selection Scheme (GFSS) selects the top-N ranked words as features. It discards low-ranked features from some classes either partially or completely. A low rank usually results from the varying occurrence of words (terms) across classes. Latent Semantic Analysis (LSA) can address this issue, as it eliminates redundant terms: it assigns an equal rank to terms that represent similar concepts or meanings, e.g., the four terms “carcinoma”, “sarcoma”, “melanoma”, and “cancer” all represent the concept “cancer”. Thus, whichever of these four terms an algorithm selects, the classifier performance is unaffected. However, LSA does not guarantee that the top-N LSA-ranked terms selected by GFSS are representative terms of each class. An Improved Global Feature Selection Scheme (IGFSS) solves this issue by selecting an equal number of representative terms from all the classes. However, it has two shortcomings. First, it assigns the class label and membership of each term on the basis of an individual vote of the Odds Ratio (OR) method, thereby limiting its decision-making capability. Second, the IGFSS determines the ratio of selected terms empirically and applies one common ratio to all classes when assigning the positive and negative membership of terms. Yet the proportion of positive-nature and negative-nature terms varies from class to class; it may be very low for one class and high for others. A single common negative-feature ratio therefore penalises those classes of a dataset in which positive-nature and negative-nature words are imbalanced. To address these issues of the IGFSS, a new Soft Voting Technique (SVT) is proposed to improve the performance of GFSS. This paper makes two main contributions: (i) the weighted average score (soft vote) of three methods, viz. OR, the Correlation Coefficient (CC), and the GSS coefficient (GSS), improves the numerical discrimination of words in identifying their positive and negative membership to a class; (ii) a mathematical expression is incorporated into the IGFSS that computes a varying ratio of positive and negative memberships of the terms for each class, based on the occurrence of the terms in the classes. The proposed SVT is evaluated using four standard classifiers on five benchmark datasets. Experimental results based on the Macro_F1 and Micro_F1 measures show that SVT achieves a significant improvement in classifier performance compared with the standard methods.
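To make the soft vote concrete, below is a minimal Python sketch (not the authors' implementation). The OR, CC, and GSS scores are computed from the standard per-class contingency counts; the equal weights, the min-max normalisation, and the toy data are illustrative assumptions, and the paper's exact weighting scheme and its expression for the per-class positive/negative ratio are not reproduced here.

```python
import numpy as np

def term_class_counts(X, y, cls):
    """Per-term document counts for class `cls`.
    X: binary (n_docs x n_terms) document-term matrix; y: class labels."""
    in_cls = (y == cls)
    a = X[in_cls].sum(axis=0)       # term present, document in class
    b = X[~in_cls].sum(axis=0)      # term present, document outside class
    c = in_cls.sum() - a            # term absent, document in class
    d = (~in_cls).sum() - b         # term absent, document outside class
    return a, b, c, d, len(y)

def soft_vote(X, y, cls, weights=(1/3, 1/3, 1/3), eps=1e-9):
    """Weighted average (soft vote) of the OR, CC, and GSS scores of every
    term with respect to one class; the weights here are an assumption."""
    a, b, c, d, n = term_class_counts(X, y, cls)
    odds = np.log((a * d + eps) / (b * c + eps))              # Odds Ratio
    cc = (a * d - b * c) / np.sqrt(
        (a + b) * (c + d) * (a + c) * (b + d) + eps)          # Correlation Coefficient
    gss = (a * d - b * c) / n ** 2                            # GSS coefficient
    # The three scores live on different scales, so each is min-max
    # normalised before averaging (an assumption made for this sketch).
    scale = lambda s: (s - s.min()) / (s.max() - s.min() + eps)
    w1, w2, w3 = weights
    return w1 * scale(odds) + w2 * scale(cc) + w3 * scale(gss)

# Toy example: score all terms for class 0 of a 4-document, 3-term corpus.
X = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 1, 0]])
y = np.array([0, 0, 1, 1])
print(soft_vote(X, y, cls=0))
```

In an IGFSS-style pipeline, the highest-scoring terms of each class would be kept as its positive features and the lowest-scoring ones used as negative features, with the split point set per class by the paper's varying-ratio expression rather than by one common ratio.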


Notes

  1. http://blog.josephwilk.net/projects/latent-semantic-analysis-in-python.html

  2. http://mccormickml.com/2016/03/25/lsa-for-text-classification-tutorial/

  3. http://nbviewer.ipython.org/gist/rjweiss/7158866

  4. http://www.keel.es/

  5. https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection

  6. https://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html

  7. http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/

  8. http://www.dataminingresearch.com/index.php/category/dataset/

  9. http://www.ncbi.nlm.nih.gov/pubmed/?term=mouse+gene+ontology

  10. https://cran.r-project.org/web/packages/XML/XML.pdf

  11. http://scikit-learn.org/stable/

  12. http://www.kdnuggets.com/2016/07/softmax-regression-related-logistic-regression.html

  13. http://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-f-measure-metrics


Author information

Corresponding author

Correspondence to Deepak Agnihotri.

About this article

Cite this article

Agnihotri, D., Verma, K., Tripathi, P. et al. Soft voting technique to improve the performance of global filter based feature selection in text corpus. Appl Intell 49, 1597–1619 (2019). https://doi.org/10.1007/s10489-018-1349-1

