Abstract
In text classification, the Global Filter-based Feature Selection Scheme (GFSS) selects the top-N ranked words as features, discarding low-ranked features from some classes either partially or completely. The low rank is usually due to the varying occurrence of words (terms) across classes. Latent Semantic Analysis (LSA) can address this issue as it eliminates redundant terms: it assigns an equal rank to terms that represent similar concepts or meanings, e.g. the four terms “carcinoma”, “sarcoma”, “melanoma”, and “cancer” represent the same concept, i.e. “cancer”. Thus, whichever of these four terms an algorithm selects does not affect classifier performance. However, this does not guarantee that the top-N LSA-ranked terms selected by GFSS are representative terms of each class. The Improved Global Feature Selection Scheme (IGFSS) solves this issue by selecting an equal number of representative terms from all the classes, but it has two shortcomings. First, it assigns the class label and membership of each term on the basis of an individual vote of the Odds Ratio (OR) method, which limits its decision-making capability. Second, IGFSS determines the ratio of selected terms empirically and applies one common ratio to all classes when assigning the positive and negative membership of terms. In practice, the ratio of positive-nature to negative-nature terms varies from one class to another: it may be very low for one class and high for others. Thus, the single common negative-feature ratio used by IGFSS penalizes those classes of a dataset in which positive- and negative-nature words are imbalanced. To address these issues of IGFSS, a new Soft Voting Technique (SVT) is proposed to improve the performance of GFSS. This paper makes two main contributions: (i) the weighted average score (soft vote) of three methods, viz. OR, the Correlation Coefficient (CC), and the GSS Coefficient (GSS), improves the numerical discrimination of words when identifying their positive and negative membership to a class; (ii) a mathematical expression is incorporated into IGFSS that computes a varying ratio of positive and negative memberships of the terms for each class, based on the occurrence of the terms in the classes. The proposed SVT is evaluated with four standard classifiers on five benchmark datasets. Experimental results based on the Macro_F1 and Micro_F1 measures show that SVT achieves a significant improvement in classifier performance compared with the standard methods.
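Contribution (i), the soft vote over OR, CC, and GSS, can be sketched in Python as below. The three scores are computed from the standard 2×2 term–class contingency table; the weights, the smoothing constant, and all function names are illustrative assumptions, not the paper's tuned values. A minimal sketch:

```python
import math

def contingency(term, cls, docs):
    # docs: list of (set_of_terms, class_label) pairs.
    # A: docs of cls containing term, B: other docs containing term,
    # C: docs of cls without term,  D: other docs without term.
    A = sum(1 for terms, y in docs if y == cls and term in terms)
    B = sum(1 for terms, y in docs if y != cls and term in terms)
    C = sum(1 for terms, y in docs if y == cls and term not in terms)
    D = sum(1 for terms, y in docs if y != cls and term not in terms)
    return A, B, C, D

def odds_ratio(A, B, C, D, eps=1e-9):
    # Log odds ratio; eps is an assumed smoothing term to avoid log(0).
    return math.log((A * D + eps) / (B * C + eps))

def corr_coeff(A, B, C, D):
    # Chi-square-style correlation coefficient of term and class.
    N = A + B + C + D
    denom = math.sqrt((A + B) * (C + D) * (A + C) * (B + D)) or 1.0
    return math.sqrt(N) * (A * D - B * C) / denom

def gss(A, B, C, D):
    # GSS coefficient: P(t,c)P(~t,~c) - P(t,~c)P(~t,c).
    N = A + B + C + D
    return (A * D - B * C) / (N * N)

def soft_vote(term, cls, docs, weights=(0.4, 0.3, 0.3)):
    # Weighted average of the three scores; the weights are illustrative.
    # A positive vote suggests positive membership of term in cls,
    # a negative vote suggests negative (discriminating-against) membership.
    A, B, C, D = contingency(term, cls, docs)
    scores = (odds_ratio(A, B, C, D), corr_coeff(A, B, C, D), gss(A, B, C, D))
    return sum(w * s for w, s in zip(weights, scores))
```

On a toy corpus, a term occurring only in one class receives a positive soft vote for that class and a negative one elsewhere, which is the sign the scheme uses to assign positive or negative membership.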
Agnihotri, D., Verma, K., Tripathi, P. et al. Soft voting technique to improve the performance of global filter based feature selection in text corpus. Appl Intell 49, 1597–1619 (2019). https://doi.org/10.1007/s10489-018-1349-1