skip to main content

Regex matching with counting-set automata

Published:13 November 2020Publication History
Skip Abstract Section

Abstract

We propose a solution to the problem of efficient matching regular expressions (regexes) with bounded repetition, such as (ab){1,100}, using deterministic automata. For this, we introduce novel counting-set automata (CsAs), automata with registers that can hold sets of bounded integers and can be manipulated by a limited portfolio of constant-time operations. We present an algorithm that compiles a large sub-class of regexes to deterministic CsAs. This includes (1) a novel Antimirov-style translation of regexes with counting to counting automata (CAs), nondeterministic automata with bounded counters, and (2) our main technical contribution, a determinization of CAs that outputs CsAs. The main advantage of this workflow is that the size of the produced CsAs does not depend on the repetition bounds used in the regex (while the size of the DFA is exponential to them). Our experimental results confirm that deterministic CsAs produced from practical regexes with repetition are indeed vastly smaller than the corresponding DFAs. More importantly, our prototype matcher based on CsA simulation handles practical regexes with repetition regardless of sizes of counter bounds. It easily copes with regexes with repetition where state-of-the-art matchers struggle.

Skip Supplemental Material Section

Supplemental Material

oopsla20main-p480-p-video.mp4

mp4

223.6 MB

References

  1. Parosh Aziz Abdulla, Pavel Krčál, and Wang Yi. 2008. R-Automata. In CONCUR'08 (LNCS, Vol. 5201 ). Springer, 67-81. https://doi.org/10.1007/978-3-540-85361-9_9 Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Cyril Allauzen and Mehryar Mohri. 2006. A Uni ed Construction of the Glushkov, Follow, and Antimirov Automata. In Mathematical Foundations of Computer Science 2006. Springer Berlin Heidelberg, Berlin, Heidelberg, 110-121. https: //doi.org/10.1007/11821069_10 Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Valentin Antimirov. 1996. Partial derivatives of regular expressions and nite automaton constructions. Theoretical Computer Science 155, 2 ( 1996 ), 291-319. https://doi.org/10.1016/ 0304-3975 ( 95 ) 00182-4 Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Adam Baldwin. 2016. Regular Expression Denial of Service a ecting Express.js. http://web.archive.org/web/20170116160113/ https://medium.com/node-security/ regular-expression-denial-of-service-a ecting-express-js-9c397c164c43Google ScholarGoogle Scholar
  5. Sébastien Bardin, Alain Finkel, Jérôme Leroux, and Laure Petrucci. 2008. FAST: acceleration from theory to practice. STTT 10, 5 ( 2008 ), 401-424. https://doi.org/10.1007/s10009-008-0064-3 Google ScholarGoogle ScholarCross RefCross Ref
  6. Gerard Berry and Ravi Sethi. 1986. From regular expressions to deterministic automata. Theoretical Computer Science 48, 3 ( 1986 ), 117-126. https://doi.org/10.1016/ 0304-3975 ( 86 ) 90088-5 Google ScholarGoogle ScholarCross RefCross Ref
  7. Jean Berstel and Jean-Éric Pin. 1996. Local languages and the Berry-Sethi algorithm. Theoret. Comput. Sci. 155, 2 ( 1996 ), 439-446. https://doi.org/10.1016/ 0304-3975 ( 95 ) 00104-2 Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Henrik Björklund, Wim Martens, and Thomas Timm. 2015. E cient Incremental Evaluation of Succinct Regular Expressions. In CIKM' 15 (ACM). https://doi.org/10.1145/2806416.2806434 Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Robert S. Boyer and J. Strother Moore. 1977. A Fast String Searching Algorithm. Commun. ACM 20, 10 (Oct. 1977 ), 762-772. https://doi.org/10.1145/359842.359859 Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Anne Brüggemann-Klein and Derick Wood. 1998. One-unambiguous regular languages. Information and Computation 140, 2 ( 1998 ), 229-253. https://doi.org/10.1006/inco. 1997.2695 Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Janusz A. Brzozowski. 1964. Derivatives of Regular Expressions. J. ACM 11, 4 ( 1964 ), 481-494. https://doi.org/10.1145/ 321239.321249 Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Pascal Caron, Jean-Marc Champarnaud, and Ludovic Mignot. 2011. Partial Derivatives of an Extended Regular Expression. In Language and Automata Theory and Applications. Springer Berlin Heidelberg, Berlin, Heidelberg, 179-191. https: //doi.org/10.1007/978-3-642-21254-3_13 Google ScholarGoogle ScholarCross RefCross Ref
  13. Jean-Marc Champarnaud and Djelloul Ziadi. 2001. Computing the equation automaton of a regular expression in O (s 2) space and time. In Proceedings of CPM 2001 (LNCS, Vol. 2089 ). Springer, 157-168. https://doi.org/10.1007/3-540-48194-X_15 Google ScholarGoogle ScholarCross RefCross Ref
  14. Haiming Chen and Ping Lu. 2015. Checking determinism of regular expressions with counting. Information and Computation 241 ( 2015 ), 302-320. https://doi.org/10.1016/j.ic. 2014. 12.001 Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Kwang-Ting Cheng and A. S. Krishnakumar. 1993. Automatic Functional Test Generation Using the Extended Finite State Machine Model. In Proceedings of the 30th Design Automation Conference. Dallas, Texas, USA, June 14-18, 1993. ACM Press, 86-91. https://doi.org/10.1145/157485.164585 Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Wikipedia contributors. 2019. Regular expression-Wikipedia. https://en.wikipedia.org/w/index.php?title=Regular_expression&%20oldid= 852858998Google ScholarGoogle Scholar
  17. Russ Cox. 2010. Regular Expression Matching in the Wild. https://swtch.com/~rsc/regexp/regexp3.html.Google ScholarGoogle Scholar
  18. Loris D'Antoni and Margus Veanes. 2020. Automata Modulo Theories. Commun. ACM ( 2020 ).Google ScholarGoogle Scholar
  19. James C. Davis. 2019. Rethinking Regex Engines to Address ReDoS. In Proceedings of ESEC/FSE'19 (Tallinn, Estonia) (ESEC/FSE 2019). ACM, New York, NY, USA, 1256-1258. https://doi.org/10.1145/3338906.3342509 Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. James C. Davis, Christy A. Coghlan, Francisco Servant, and Dongyoon Lee. 2018. The Impact of Regular Expression Denial of Service (ReDoS) in Practice: An Empirical Study at the Ecosystem Scale. In Proceedings of ESEC/FSE'18 ( Lake Buena Vista, FL, USA) ( ESEC/FSE 2018). ACM, New York, NY, USA, 246-256. https://doi.org/10.1145/3236024.3236027 Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. James C. Davis, Louis G. Michael IV, Christy A. Coghlan, Francisco Servant, and Dongyoon Lee. 2019. Why Aren't Regular Expressions a Lingua Franca? An Empirical Study on the Re-use and Portability of Regular Expressions. In Proceedings of ESEC/FSE'19 (Tallinn, Estonia) (ESEC/FSE 2019). ACM, New York, NY, USA, 1256-1258. https://doi.org/10.1145/3338906. 3338909 Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Stack Exchange. 2016. Outage Postmortem. http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016Google ScholarGoogle Scholar
  23. Sebastian Fischer, Frank Huch, and Thomas Wilke. 2010. A Play on Regular Expressions: Functional Pearl. SIGPLAN Not. 45, 9 ( 2010 ), 357-368. https://doi.org/10.1145/1863543.1863594 Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Wouter Gelade, Marc Gyssens, and Wim Martens. 2012. Regular Expressions with Counting: Weak versus Strong Determinism. SIAM J. Comput. 41, 1 ( 2012 ), 160-190. https://doi.org/10.1137/100814196 Extended version of paper in MFCS' 09. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Wouter Gelade, Wim Martens, and Frank Neven. 2007. Optimizing schema languages for XML: Numerical constraints and interleaving. In Proceedings of ICDT'07 (LNCS, Vol. 4353 ). Springer, 269-283. https://doi.org/10.1007/11965893_19 Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. V. M. Glushkov. 1961. The abstract theory of automata. Russian Math. Surveys 16 ( 1961 ), 1-53. https://doi.org/10.1070/ RM1961v016n05ABEH004112 Google ScholarGoogle ScholarCross RefCross Ref
  27. Google. [n.d.]. RE2. https://github.com/google/re2.Google ScholarGoogle Scholar
  28. John Graham-Cumming. 2019. Details of the Cloud are outage on July 2, 2019. https://blog.cloud are. com/details-of-thecloud are-outage-on-july-2-2019/Google ScholarGoogle Scholar
  29. Mike Haertel. [n.d.]. why GNU grep is fast. https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html.Google ScholarGoogle Scholar
  30. Dag Hovland. 2009. Regular Expressions with Numerical Constraints and Automata with Counters. In ICTAC (LNCS, Vol. 5684 ). Springer, 231-245. https://doi.org/10.1007/978-3-642-03466-4_15 Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Dag Hovland. 2012. The Membership Problem for Regular Expressions with Unordered Concatenation and Numerical Constraints. In Language and Automata Theory and Applications. Springer Berlin Heidelberg, Berlin, Heidelberg, 313-324. https://doi.org/10.1007/978-3-642-28332-1_27 Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Lucian Ilie and Sheng Yu. 2003. Follow automata. Information and Computation 186, 1 ( 2003 ), 146-162. https://doi.org/10. 1016/S0890-5401 ( 03 ) 00090-7 Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Pekka Kilpeläinen and Rauno Tuhkanen. 2003. Regular Expressions with Numerical Occurrence Indicators-preliminary results. In Proceedings of the Eighth Symposium on Programming Languages and Software Tools, SPLST'03, Kuopio, Finland, June 17-18, 2003. University of Kuopio, Department of Computer Science, 163-173.Google ScholarGoogle Scholar
  34. Pekka Kilpeläinen and Rauno Tuhkanen. 2007. One-unambiguity of regular expressions with numeric occurrence indicators. Information and Computation 205, 6 ( 2007 ), 890-916. https://doi.org/10.1016/j.ic. 2006. 12.003 Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Sylvain Lombardy and Jacques Sakarovitch. 2005. Derivatives of rational expressions with multiplicity. Theoretical Computer Science 332, 1 ( 2005 ), 141-177. https://doi.org/10.1016/j.tcs. 2004. 10.016 Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Microsoft. 2020.. https://docs.microsoft.com/en-us/dotnet/api/system.text.regularexpressions.regex.matchGoogle ScholarGoogle Scholar
  37. Scott Owens, John Reppy, and Aaron Turon. 2009. Regular-expression Derivatives Re-examined. J. Funct. Program. 19, 2 ( 2009 ), 173-190. https://doi.org/10.1017/S0956796808007090 Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Olli Saarikivi, Margus Veanes, Tiki Wan, and Eric Xu. 2019. Symbolic Regex Matcher. In TACAS'2019 (LNCS, Vol. 11427 ), Tomás Vojnar and Lijun Zhang (Eds.). Springer, 372-378. https://doi.org/10.1007/978-3-030-17462-0_24 Google ScholarGoogle ScholarCross RefCross Ref
  39. Thomas R. Shiple, James H. Kukula, and Rajeev K. Ranjan. 1998. A Comparison of Presburger Engines for EFSM Reachability. In Computer Aided Veri cation, 10th International Conference, CAV ' 98, Vancouver, BC, Canada, June 28-July 2, 1998, Proceedings (Lecture Notes in Computer Science, Vol. 1427 ). Springer, 280-292. https://doi.org/10.1007/BFb0028752 Google ScholarGoogle ScholarCross RefCross Ref
  40. Michael Sipser. 2006. Introduction to Theory of Computation. Vol. 2. Thomson Course Technology Boston. https://doi.org/10. 1145/230514.571645 Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Randy Smith, Cristian Estan, and Somesh Jha. 2008a. XFA: Faster Signature Matching with Extended Automata. In IEEE Symposium on Security and Privacy. IEEE. https://doi.org/10.1109/SP. 2008.14 Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Randy Smith, Cristian Estan, Somesh Jha, and Ida Siahaan. 2008b. Fast Signature Matching Using Extended Finite Automaton (XFA). In ICISS'08 (LNCS, Vol. 5352 ). Springer, 158-172. https://doi.org/10.1007/978-3-540-89862-7_15 Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Henry Spencer. 1994. Software Solutions in C. Academic Press Professional, Inc., San Diego, CA, USA, Chapter A Regularexpression Matcher, 35-71. http://dl.acm.org/citation.cfm?id= 156626. 184689Google ScholarGoogle Scholar
  44. Michael Sperberg-McQueen. [n.d.]. Notes on nite state automata with counters. https://www.w3.org/XML/ 2004 /05/msmcfa.html Accessed: 2018-08-08.Google ScholarGoogle Scholar
  45. Ken Thompson. 1968. Programming Techniques: Regular Expression Search Algorithm. Commun. ACM 11, 6 ( June 1968 ), 419-422. https://doi.org/10.1145/363347.363387 Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Margus Veanes, Peli de Halleux, and Nikolai Tillmann. 2010. Rex: Symbolic Regular Expression Explorer. In Third International Conference on Software Testing, Veri cation and Validation, ICST 2010, Paris, France, April 7-9, 2010. 498-507. https: //doi.org/10.1109/ICST. 2010.15 Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Liu Yang, Rezwana Karim, Vinod Ganapathy, and Randy Smith. 2010. Improving NFA-Based Signature Matching Using Ordered Binary Decision Diagrams. In Recent Advances in Intrusion Detection. Springer Berlin Heidelberg, Berlin, Heidelberg, 58-78. https://doi.org/10.1007/978-3-642-15512-3_4 Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. Regex matching with counting-set automata

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in

          Full Access

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader