Abstract
Despite the growth of bilingual word embeddings, no work so far has directly evaluated them for the English–Tamil language pair. In this paper, we present a dataset and an evaluation paradigm for English–Tamil bilingual word vector models. The dataset contains words that cover a range of concepts occurring in natural language. Word pairs are scored on similarity rather than association or relatedness; hence, pairs that are associated but not literally similar receive a low rating. The measures are further quantified to ensure consistency in the dataset, mimicking the cognitive phenomena. Consequently, the dataset can be used by non-native speakers with minimal effort. We also present some inferences and insights into the semantics captured by word vectors and by human cognition.
Notes
1. Each word is appended with an id (:id) to show the mapping between the sentences.
2. All participants were informed about this study and gave their consent to take part in it.
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Sanjanasri, J.P., Menon, V.K., Rajendran, S., Soman, K.P., Anand Kumar, M. (2020). Intrinsic Evaluation for English–Tamil Bilingual Word Embeddings. In: Thampi, S., et al. Intelligent Systems, Technologies and Applications. Advances in Intelligent Systems and Computing, vol 910. Springer, Singapore. https://doi.org/10.1007/978-981-13-6095-4_3
DOI: https://doi.org/10.1007/978-981-13-6095-4_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-6094-7
Online ISBN: 978-981-13-6095-4
eBook Packages: Intelligent Technologies and Robotics (R0)