Putting the Horses Before the Cart: Identifying Multiword Expressions Before Translation

  • Conference paper
Computational and Corpus-Based Phraseology (EUROPHRAS 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10596)

Abstract

Translating multiword expressions (MWEs) is notoriously difficult. Part of the challenge stems from analysing non-compositional expressions in source texts, which cannot be translated literally. Therefore, it is crucial to locate MWEs in the source text before translating them. We would be putting the cart before the horses if we tried to translate MWEs before ensuring that they are correctly identified in the source text. This paper discusses the current state of affairs in automatic MWE identification, covering rule-based methods and sequence taggers. While MWE identification is not a solved problem, significant advances have been made in recent years. Hence, we can hope that MWE identification will be integrated into MT in the near future, thus avoiding the clumsy translations that have often been mocked and used to motivate the urgent need for better MWE processing.

I would like to thank the chairs of MUMTTT 2017 for inviting me to the event and for giving me the opportunity to publish this invited contribution. This paper includes materials published in other venues and co-written with: Mathieu Constant, Silvio Cordeiro, Benoit Favre, Marco Idiart, Gülşen Eryiğit, Johanna Monti, Lonneke van der Plas, Michael Rosner, Manon Scholivet, Amalia Todirascu, and Aline Villavicencio. Work reported here has been partly funded by projects PARSEME (Cost Action IC1207), PARSEME-FR (ANR-14-CERA-0001), and AIM-WEST (FAPERGS-INRIA 1706-2551/13-7).

Notes

  1. Translations obtained using Google’s online translation service (http://translate.google.com) on September 6, 2017.

  2. http://multiword.sf.net/.

  3. http://dimsum16.github.io.

  4. http://multiword.sf.net/sharedtask2017.

  5. http://mwetoolkit.sf.net/.

  6. In this toy example, the “lexicon” is formed by abstract POS patterns. In our implementation, lexicons can contain lemmas, surface forms, POS patterns or a mix of all these.

  7. In the remainder of the paper, we abbreviate the POS tag NOUN as N.

  8. The 13 most frequent non-literal particles: about, around, away, back, down, in, into, off, on, out, over, through, up.

  9. B is used for a token that appears at the Beginning of an MWE, I is used for a token Included in the MWE, and O for tokens Outside any MWE.

  10. http://www.chokkan.org/software/crfsuite/.

  11. http://bach.arts.kuleuven.be/dicovalence/.

  12. \(t_i\) is a shortcut denoting the group of features \(w_i\), \(l_i\) and \(p_i\) for a token \(t_i\). In other words, each token \(t_i\) is a tuple \((w_i, l_i, p_i)\). The same applies to n-grams.
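
The B/I/O scheme of note 9 and the token tuples of note 12 can be illustrated with a minimal Python sketch. It is not taken from the paper or from the mwetoolkit: the names `Token`, `bio_encode` and `bio_decode` are hypothetical, the reading of w, l and p as surface form, lemma and POS tag is an assumption based on note 6, and the sketch only covers contiguous, non-overlapping MWEs.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Token(NamedTuple):
    """A token as a (w, l, p) tuple; field meanings are assumed, see lead-in."""
    w: str  # surface form (assumption)
    l: str  # lemma (assumption)
    p: str  # POS tag (assumption)


def bio_encode(n_tokens: int, mwe_spans: Sequence[Tuple[int, int]]) -> List[str]:
    """Turn MWE spans (start inclusive, end exclusive) into B/I/O tags."""
    tags = ["O"] * n_tokens
    for start, end in mwe_spans:
        if not 0 <= start < end <= n_tokens:
            raise ValueError(f"span out of range: {(start, end)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end)}")
        tags[start] = "B"              # first token of the MWE
        for i in range(start + 1, end):
            tags[i] = "I"              # remaining tokens of the MWE
    return tags


def bio_decode(tags: Sequence[str]) -> List[Tuple[int, int]]:
    """Recover (start, end) spans from B/I/O tags; a stray I opens a span."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                spans.append((start, i))   # close the previous MWE
            start = i
        elif tag == "O":
            if start is not None:
                spans.append((start, i))
                start = None
        elif tag == "I" and start is None:
            start = i                      # lenient handling of a stray I
    if start is not None:
        spans.append((start, len(tags)))
    return spans


# Toy sentence with one idiom; the POS tags are illustrative only.
sentence = [
    Token("He", "he", "PRON"),
    Token("kicked", "kick", "VERB"),
    Token("the", "the", "DET"),
    Token("bucket", "bucket", "NOUN"),
    Token("yesterday", "yesterday", "ADV"),
]
tags = bio_encode(len(sentence), [(1, 4)])
for token, tag in zip(sentence, tags):
    print(token.w, token.l, token.p, tag, sep="\t")
```

A sequence tagger would predict one such tag per token from features over the tuples; discontinuous expressions would need a richer tag set than this sketch provides.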

References

  1. Baldwin, T.: Deep lexical acquisition of verb-particle constructions. Comput. Speech Lang. 19(4), 398–414 (2005). doi:10.1016/j.csl.2005.02.004

  2. Baldwin, T., Kim, S.N.: Multiword expressions. In: Indurkhya, N., Damerau, F.J. (eds.) Handbook of Natural Language Processing, 2nd edn., pp. 267–292. CRC Press, Taylor and Francis Group, Boca Raton (2010)

  3. Baroni, M., Bernardini, S. (eds.): WaCky! Working papers on the Web as Corpus. GEDIT, Bologna, 224 p. (2006)

  4. Barreiro, A., Monti, J., Batista, F., Orliac, B.: When multiwords go bad in machine translation. In: Mitkov, R., et al. [21], pp. 26–33

  5. Calzolari, N., Fillmore, C., Grishman, R., Ide, N., Lenci, A., MacLeod, C., Zampolli, A.: Towards best practice for multiword expressions in computational lexicons. In: Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC-2002), pp. 1934–1940. Las Palmas (2002)

  6. Cap, F., Nirmal, M., Weller, M., Schulte im Walde, S.: How to account for idiomatic German support verb constructions in statistical machine translation. In: Proceedings of the 11th Workshop on Multiword Expressions (MWE 2015), pp. 19–28. Association for Computational Linguistics, Denver (2015). http://aclweb.org/anthology/W15-0903

  7. Carpuat, M., Diab, M.: Task-based evaluation of multiword expressions: a pilot study in statistical machine translation. In: Proceedings of Human Language Technology: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2010), pp. 242–245. Association for Computational Linguistics, Los Angeles, June 2010. http://www.aclweb.org/anthology/N10-1029

  8. Constant, M., Eryiğit, G., Monti, J., van der Plas, L., Ramisch, C., Rosner, M., Todirascu, A.: Multiword expression processing: a survey. Computational Linguistics (2017)

  9. Constant, M., Nivre, J.: A transition-based system for joint lexical and syntactic analysis. In: Proceedings of ACL 2016, Berlin, Germany, pp. 161–171 (2016)

  10. Constant, M., Sigogne, A.: MWU-aware part-of-speech tagging with a CRF model and lexical resources. In: Proceedings of the ACL 2011 Workshop on MWEs, Portland, OR, USA, pp. 49–56 (2011)

  11. Constant, M., Tellier, I.: Evaluating the impact of external lexical resources into a CRF-based multiword segmenter and part-of-speech tagger. In: Proceedings of LREC 2012, Istanbul, Turkey (2012)

  12. Cordeiro, S., Ramisch, C., Villavicencio, A.: Token-based MWE identification strategies in the mwetoolkit. In: Proceedings of the 4th PARSEME General Meeting. Valletta, Malta, March 2015. https://typo.uni-konstanz.de/parseme/images/Meeting/2015-03-19-Malta-meeting/WG2-WG3-CORDEIRO-et-al-abstract.pdf

  13. Cordeiro, S., Ramisch, C., Villavicencio, A.: UFRGS&LIF at SemEval-2016 task 10: rule-based MWE identification and predominant-supersense tagging. In: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pp. 910–917. Association for Computational Linguistics, San Diego, June 2016. http://www.aclweb.org/anthology/S16-1140

  14. van den Eynde, K., Mertens, P.: La valence: l’approche pronominale et son application au lexique verbal. J. Fr. Lang. Stud. 13, 63–104 (2003)

  15. Finlayson, M., Kulkarni, N.: Detecting multi-word expressions improves word sense disambiguation. In: Kordoni, V., et al. [16], pp. 20–24. http://www.aclweb.org/anthology/W/W11/W11-0805

  16. Kordoni, V., Ramisch, C., Villavicencio, A. (eds.): Proceedings of the ACL Workshop on Multiword Expressions: from Parsing and Generation to the Real World (MWE 2011). Association for Computational Linguistics, Portland, June 2011. http://www.aclweb.org/anthology/W11-08

  17. Kulkarni, N., Finlayson, M.: jMWE: A Java toolkit for detecting multi-word expressions. In: Kordoni, V., et al. [16], pp. 122–124. http://www.aclweb.org/anthology/W/W11/W11-0818

  18. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pp. 282–289. Morgan Kaufmann Publishers Inc., San Francisco (2001). http://dl.acm.org/citation.cfm?id=645530.655813

  19. Le Roux, J., Rozenknop, A., Constant, M.: Syntactic parsing and compound recognition via dual decomposition: application to French. In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pp. 1875–1885. Dublin City University and Association for Computational Linguistics, Dublin, August 2014. http://www.aclweb.org/anthology/C14-1177

  20. Losnegaard, G.S., Sangati, F., Escartín, C.P., Savary, A., Bargmann, S., Monti, J.: PARSEME survey on MWE resources. In: Proceedings of LREC 2016, Portorož, Slovenia (2016)

  21. Mitkov, R., Monti, J., Pastor, G.C., Seretan, V. (eds.): Proceedings of the MT Summit 2013 Workshop on Multi-word Units in Machine Translation and Translation Technology (MUMTTT 2013), Nice, France, September 2013

  22. Monti, J., Sangati, F., Arcan, M.: TED-MWE: a bilingual parallel corpus with MWE annotation: towards a methodology for annotating MWEs in parallel multilingual corpora. In: Proceedings of the Second Italian Conference on Computational Linguistics (CLiC-it 2015). Accademia University Press, Trento, Torino (2015)

  23. Monti, J., Seretan, V., Pastor, G.C., Mitkov, R.: Multiword units in machine translation and translation technology. In: Mitkov, R., Monti, J., Pastor, G.C., Seretan, V. (eds.) Multiword Units in Machine Translation and Translation Technology. John Benjamins (2017)

  24. Nasr, A., Ramisch, C., Deulofeu, J., Valli, A.: Joint dependency parsing and multiword expression tokenization. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (v 1: Long Papers), pp. 1116–1126. Association for Computational Linguistics, Beijing, July 2015. http://aclweb.org/anthology/P15-1108

  25. Okazaki, N.: CRFsuite: a fast implementation of conditional random fields (CRFs) (2007). http://www.chokkan.org/software/crfsuite/

  26. Ramisch, C.: Multiword Expressions Acquisition: A Generic and Open Framework, Theory and Applications of Natural Language Processing, vol. XIV. Springer, Cham (2015). doi:10.1007/978-3-319-09207-2

  27. Ramisch, C., Besacier, L., Kobzar, O.: How hard is it to automatically translate phrasal verbs from English to French? In: Mitkov, R., et al. [21], pp. 53–61

  28. Ramisch, C., Villavicencio, A.: Computational treatment of multiword expressions. In: Mitkov, R. (ed.) Oxford Handbook of Computational Linguistics, 2nd edn. Oxford University Press (2016)

  29. Ramshaw, L., Marcus, M.: Text chunking using transformation-based learning. In: Third Workshop on Very Large Corpora (1995). http://aclweb.org/anthology/W95-0107

  30. Riedl, M., Biemann, C.: Impact of MWE resources on multiword recognition. In: Proceedings of the 12th Workshop on Multiword Expressions (MWE 2016), pp. 107–111. Association for Computational Linguistics, Berlin, Germany (2016). http://anthology.aclweb.org/W16-1816

  31. Rosén, V., De Smedt, K., Losnegaard, G.S., Bejcek, E., Savary, A., Osenova, P.: MWEs in treebanks: from survey to guidelines. In: Proceedings of LREC 2016, pp. 2323–2330, Portorož, Slovenia (2016)

  32. Sag, I.A., Baldwin, T., Bond, F., Copestake, A., Flickinger, D.: Multiword expressions: a pain in the neck for NLP. In: Gelbukh, A. (ed.) CICLing 2002. LNCS, vol. 2276, pp. 1–15. Springer, Heidelberg (2002). doi:10.1007/3-540-45715-1_1

  33. Savary, A.: Multiflex: a multilingual finite-state tool for multi-word units. In: Maneth, S. (ed.) CIAA 2009. LNCS, vol. 5642, pp. 237–240. Springer, Heidelberg (2009). doi:10.1007/978-3-642-02979-0_27

  34. Savary, A., Ramisch, C., Cordeiro, S., Sangati, F., Vincze, V., QasemiZadeh, B., Candito, M., Cap, F., Giouli, V., Stoyanova, I., Doucet, A.: The PARSEME shared task on automatic identification of verbal multiword expressions. In: [48], pp. 31–47

  35. Savary, A., Sailer, M., Parmentier, Y., Rosner, M., Rosén, V., Przepiórkowski, A., Krstev, C., Vincze, V., Wójtowicz, B., Losnegaard, G.S., Parra Escartín, C., Waszczuk, J., Constant, M., Osenova, P., Sangati, F.: PARSEME - parsing and multiword expressions within a European multilingual network. In: Proceedings of LTC 2015, Poznań (2015)

  36. Schneider, N., Danchik, E., Dyer, C., Smith, N.A.: Discriminative lexical semantic segmentation with gaps: running the MWE gamut. In: TACL, vol. 2, pp. 193–206 (2014)

  37. Schneider, N., Hovy, D., Johannsen, A., Carpuat, M.: SemEval-2016 task 10: detecting minimal semantic units and their meanings (DiMSUM). In: Proceedings of SemEval 2016, pp. 546–559, San Diego, CA, USA (2016)

  38. Schneider, N., Onuffer, S., Kazour, N., Danchik, E., Mordowanec, M.T., Conrad, H., Smith, N.A.: Comprehensive annotation of multiword expressions in a social web corpus. In: Proceedings of LREC 2014, Reykjavik, Iceland, pp. 455–461 (2014)

  39. Scholivet, M., Ramisch, C.: Identification of ambiguous multiword expressions using sequence models and lexical resources. In: [48], pp. 167–175. http://aclweb.org/anthology/W17-1723

  40. Seretan, V.: On translating syntactically-flexible expressions. In: Mitkov, R., et al. [21], pp. 11–11

  41. Silberztein, M.: The lexical analysis of natural languages. In: Finite-State Language Processing, pp. 175–203. MIT Press (1997)

  42. Taslimipoor, S., Desantis, A., Cherchi, M., Mitkov, R., Monti, J.: Language resources for Italian: towards the development of a corpus of annotated Italian multiword expressions. In: Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, Final Workshop (EVALITA 2016), Napoli, Italy, 5–7 December 2016

  43. Tu, Y., Roth, D.: Learning English light verb constructions: contextual or statistical. In: Kordoni, V., et al. [16], pp. 31–39. http://www.aclweb.org/anthology/W/W11/W11-0807

  44. Tu, Y., Roth, D.: Sorting out the most confusing English phrasal verbs. In: Proceedings of the First Joint Conference on Lexical and Computational Semantics - v 1: Proceedings of the Main Conference and the Shared Task, and v 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, SemEval 2012, pp. 65–69. Association for Computational Linguistics, Stroudsburg (2012)

  45. Vauquois, B.: A survey of formal grammars and algorithms for recognition and transformation in mechanical translation. In: IFIP Congress (2), pp. 1114–1122 (1968)

  46. Vincze, V., Nagy, I., Berend, G.: Multiword expressions and named entities in the Wiki50 corpus. In: Proceedings of RANLP 2011, pp. 289–295, Hissar, Bulgaria (2011)

  47. Vincze, V.: Light verb constructions in the SzegedParalellFX English-Hungarian parallel corpus. In: Proceedings of LREC 2012, pp. 2381–2388, Istanbul, Turkey (2012)

  48. Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017). Association for Computational Linguistics, Valencia, Spain (2017). http://aclweb.org/anthology/W17-17

Author information

Corresponding author

Correspondence to Carlos Ramisch.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Ramisch, C. (2017). Putting the Horses Before the Cart: Identifying Multiword Expressions Before Translation. In: Mitkov, R. (ed.) Computational and Corpus-Based Phraseology. EUROPHRAS 2017. Lecture Notes in Computer Science, vol 10596. Springer, Cham. https://doi.org/10.1007/978-3-319-69805-2_6

  • DOI: https://doi.org/10.1007/978-3-319-69805-2_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-69804-5

  • Online ISBN: 978-3-319-69805-2

  • eBook Packages: Computer Science (R0)