Intrinsic Evaluation for English–Tamil Bilingual Word Embeddings

  • Conference paper
Intelligent Systems, Technologies and Applications

Abstract

Despite the growth of bilingual word embeddings, no work so far directly evaluates them for the English–Tamil language pair. In this paper, we present a dataset and an evaluation paradigm for English–Tamil bilingual word vector models. The dataset contains words that cover a range of concepts occurring in natural language. Word pairs are scored on similarity rather than association or relatedness; hence, pairs that are associated but not literally similar receive a low rating. The measures are quantified further to ensure consistency in the dataset, mimicking the cognitive phenomena. As a result, the dataset can be used by non-native speakers with minimal effort. We also present some inferences and insights into the semantics captured by word vectors and human cognition.
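A similarity dataset of this kind is commonly used by comparing the model's cosine similarity for each word pair against the human rating, summarised as a Spearman rank correlation. The sketch below illustrates that general practice only; the abstract does not state the exact procedure used in the paper. The function names, the toy vectors, the placeholder tokens (`ta_dog`, `ta_cat`, `ta_road` stand in for Tamil words) and the ratings are all hypothetical and are not taken from the dataset.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def evaluate(pairs, en_vectors, ta_vectors):
    """Correlate model similarities with human ratings, skipping unknown words."""
    model_scores, human_scores = [], []
    for en_word, ta_word, rating in pairs:
        if en_word in en_vectors and ta_word in ta_vectors:
            model_scores.append(cosine(en_vectors[en_word], ta_vectors[ta_word]))
            human_scores.append(rating)
    return spearman(model_scores, human_scores)


# Toy data (hypothetical): vectors assumed to live in one shared bilingual space.
en_vectors = {"dog": [1.0, 0.0]}
ta_vectors = {
    "ta_dog": [0.9, 0.1],
    "ta_cat": [0.7, 0.3],
    "ta_road": [0.1, 0.9],
}
# (English word, Tamil word, human similarity rating); ratings are made up.
pairs = [("dog", "ta_dog", 9.5), ("dog", "ta_cat", 4.0), ("dog", "ta_road", 0.5)]

rho = evaluate(pairs, en_vectors, ta_vectors)
```

A higher `rho` means the model orders the pairs more like the annotators did. Because the ratings target similarity rather than association, a model that scores merely related pairs highly would be penalised under this measure.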

Notes

  1. Each word is appended with an id (:id) to understand the mapping between the sentences.

  2. All the participants were informed about this study, and they have provided their consent to be part of this.

Corresponding author

Correspondence to J. P. Sanjanasri.

Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

Cite this paper

Sanjanasri, J.P., Menon, V.K., Rajendran, S., Soman, K.P., Anand Kumar, M. (2020). Intrinsic Evaluation for English–Tamil Bilingual Word Embeddings. In: Thampi, S., et al. Intelligent Systems, Technologies and Applications. Advances in Intelligent Systems and Computing, vol 910. Springer, Singapore. https://doi.org/10.1007/978-981-13-6095-4_3
