Abstract
We propose a solution to the problem of efficient matching regular expressions (regexes) with bounded repetition, such as (ab){1,100}, using deterministic automata. For this, we introduce novel counting-set automata (CsAs), automata with registers that can hold sets of bounded integers and can be manipulated by a limited portfolio of constant-time operations. We present an algorithm that compiles a large sub-class of regexes to deterministic CsAs. This includes (1) a novel Antimirov-style translation of regexes with counting to counting automata (CAs), nondeterministic automata with bounded counters, and (2) our main technical contribution, a determinization of CAs that outputs CsAs. The main advantage of this workflow is that the size of the produced CsAs does not depend on the repetition bounds used in the regex (while the size of the DFA is exponential to them). Our experimental results confirm that deterministic CsAs produced from practical regexes with repetition are indeed vastly smaller than the corresponding DFAs. More importantly, our prototype matcher based on CsA simulation handles practical regexes with repetition regardless of sizes of counter bounds. It easily copes with regexes with repetition where state-of-the-art matchers struggle.
Supplemental Material
- Parosh Aziz Abdulla, Pavel Krčál, and Wang Yi. 2008. R-Automata. In CONCUR'08 (LNCS, Vol. 5201 ). Springer, 67-81. https://doi.org/10.1007/978-3-540-85361-9_9 Google ScholarDigital Library
- Cyril Allauzen and Mehryar Mohri. 2006. A Uni ed Construction of the Glushkov, Follow, and Antimirov Automata. In Mathematical Foundations of Computer Science 2006. Springer Berlin Heidelberg, Berlin, Heidelberg, 110-121. https: //doi.org/10.1007/11821069_10 Google ScholarDigital Library
- Valentin Antimirov. 1996. Partial derivatives of regular expressions and nite automaton constructions. Theoretical Computer Science 155, 2 ( 1996 ), 291-319. https://doi.org/10.1016/ 0304-3975 ( 95 ) 00182-4 Google ScholarDigital Library
- Adam Baldwin. 2016. Regular Expression Denial of Service a ecting Express.js. http://web.archive.org/web/20170116160113/ https://medium.com/node-security/ regular-expression-denial-of-service-a ecting-express-js-9c397c164c43Google Scholar
- Sébastien Bardin, Alain Finkel, Jérôme Leroux, and Laure Petrucci. 2008. FAST: acceleration from theory to practice. STTT 10, 5 ( 2008 ), 401-424. https://doi.org/10.1007/s10009-008-0064-3 Google ScholarCross Ref
- Gerard Berry and Ravi Sethi. 1986. From regular expressions to deterministic automata. Theoretical Computer Science 48, 3 ( 1986 ), 117-126. https://doi.org/10.1016/ 0304-3975 ( 86 ) 90088-5 Google ScholarCross Ref
- Jean Berstel and Jean-Éric Pin. 1996. Local languages and the Berry-Sethi algorithm. Theoret. Comput. Sci. 155, 2 ( 1996 ), 439-446. https://doi.org/10.1016/ 0304-3975 ( 95 ) 00104-2 Google ScholarDigital Library
- Henrik Björklund, Wim Martens, and Thomas Timm. 2015. E cient Incremental Evaluation of Succinct Regular Expressions. In CIKM' 15 (ACM). https://doi.org/10.1145/2806416.2806434 Google ScholarDigital Library
- Robert S. Boyer and J. Strother Moore. 1977. A Fast String Searching Algorithm. Commun. ACM 20, 10 (Oct. 1977 ), 762-772. https://doi.org/10.1145/359842.359859 Google ScholarDigital Library
- Anne Brüggemann-Klein and Derick Wood. 1998. One-unambiguous regular languages. Information and Computation 140, 2 ( 1998 ), 229-253. https://doi.org/10.1006/inco. 1997.2695 Google ScholarDigital Library
- Janusz A. Brzozowski. 1964. Derivatives of Regular Expressions. J. ACM 11, 4 ( 1964 ), 481-494. https://doi.org/10.1145/ 321239.321249 Google ScholarDigital Library
- Pascal Caron, Jean-Marc Champarnaud, and Ludovic Mignot. 2011. Partial Derivatives of an Extended Regular Expression. In Language and Automata Theory and Applications. Springer Berlin Heidelberg, Berlin, Heidelberg, 179-191. https: //doi.org/10.1007/978-3-642-21254-3_13 Google ScholarCross Ref
- Jean-Marc Champarnaud and Djelloul Ziadi. 2001. Computing the equation automaton of a regular expression in O (s 2) space and time. In Proceedings of CPM 2001 (LNCS, Vol. 2089 ). Springer, 157-168. https://doi.org/10.1007/3-540-48194-X_15 Google ScholarCross Ref
- Haiming Chen and Ping Lu. 2015. Checking determinism of regular expressions with counting. Information and Computation 241 ( 2015 ), 302-320. https://doi.org/10.1016/j.ic. 2014. 12.001 Google ScholarDigital Library
- Kwang-Ting Cheng and A. S. Krishnakumar. 1993. Automatic Functional Test Generation Using the Extended Finite State Machine Model. In Proceedings of the 30th Design Automation Conference. Dallas, Texas, USA, June 14-18, 1993. ACM Press, 86-91. https://doi.org/10.1145/157485.164585 Google ScholarDigital Library
- Wikipedia contributors. 2019. Regular expression-Wikipedia. https://en.wikipedia.org/w/index.php?title=Regular_expression&%20oldid= 852858998Google Scholar
- Russ Cox. 2010. Regular Expression Matching in the Wild. https://swtch.com/~rsc/regexp/regexp3.html.Google Scholar
- Loris D'Antoni and Margus Veanes. 2020. Automata Modulo Theories. Commun. ACM ( 2020 ).Google Scholar
- James C. Davis. 2019. Rethinking Regex Engines to Address ReDoS. In Proceedings of ESEC/FSE'19 (Tallinn, Estonia) (ESEC/FSE 2019). ACM, New York, NY, USA, 1256-1258. https://doi.org/10.1145/3338906.3342509 Google ScholarDigital Library
- James C. Davis, Christy A. Coghlan, Francisco Servant, and Dongyoon Lee. 2018. The Impact of Regular Expression Denial of Service (ReDoS) in Practice: An Empirical Study at the Ecosystem Scale. In Proceedings of ESEC/FSE'18 ( Lake Buena Vista, FL, USA) ( ESEC/FSE 2018). ACM, New York, NY, USA, 246-256. https://doi.org/10.1145/3236024.3236027 Google ScholarDigital Library
- James C. Davis, Louis G. Michael IV, Christy A. Coghlan, Francisco Servant, and Dongyoon Lee. 2019. Why Aren't Regular Expressions a Lingua Franca? An Empirical Study on the Re-use and Portability of Regular Expressions. In Proceedings of ESEC/FSE'19 (Tallinn, Estonia) (ESEC/FSE 2019). ACM, New York, NY, USA, 1256-1258. https://doi.org/10.1145/3338906. 3338909 Google ScholarDigital Library
- Stack Exchange. 2016. Outage Postmortem. http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016Google Scholar
- Sebastian Fischer, Frank Huch, and Thomas Wilke. 2010. A Play on Regular Expressions: Functional Pearl. SIGPLAN Not. 45, 9 ( 2010 ), 357-368. https://doi.org/10.1145/1863543.1863594 Google ScholarDigital Library
- Wouter Gelade, Marc Gyssens, and Wim Martens. 2012. Regular Expressions with Counting: Weak versus Strong Determinism. SIAM J. Comput. 41, 1 ( 2012 ), 160-190. https://doi.org/10.1137/100814196 Extended version of paper in MFCS' 09. Google ScholarDigital Library
- Wouter Gelade, Wim Martens, and Frank Neven. 2007. Optimizing schema languages for XML: Numerical constraints and interleaving. In Proceedings of ICDT'07 (LNCS, Vol. 4353 ). Springer, 269-283. https://doi.org/10.1007/11965893_19 Google ScholarDigital Library
- V. M. Glushkov. 1961. The abstract theory of automata. Russian Math. Surveys 16 ( 1961 ), 1-53. https://doi.org/10.1070/ RM1961v016n05ABEH004112 Google ScholarCross Ref
- Google. [n.d.]. RE2. https://github.com/google/re2.Google Scholar
- John Graham-Cumming. 2019. Details of the Cloud are outage on July 2, 2019. https://blog.cloud are. com/details-of-thecloud are-outage-on-july-2-2019/Google Scholar
- Mike Haertel. [n.d.]. why GNU grep is fast. https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html.Google Scholar
- Dag Hovland. 2009. Regular Expressions with Numerical Constraints and Automata with Counters. In ICTAC (LNCS, Vol. 5684 ). Springer, 231-245. https://doi.org/10.1007/978-3-642-03466-4_15 Google ScholarDigital Library
- Dag Hovland. 2012. The Membership Problem for Regular Expressions with Unordered Concatenation and Numerical Constraints. In Language and Automata Theory and Applications. Springer Berlin Heidelberg, Berlin, Heidelberg, 313-324. https://doi.org/10.1007/978-3-642-28332-1_27 Google ScholarDigital Library
- Lucian Ilie and Sheng Yu. 2003. Follow automata. Information and Computation 186, 1 ( 2003 ), 146-162. https://doi.org/10. 1016/S0890-5401 ( 03 ) 00090-7 Google ScholarDigital Library
- Pekka Kilpeläinen and Rauno Tuhkanen. 2003. Regular Expressions with Numerical Occurrence Indicators-preliminary results. In Proceedings of the Eighth Symposium on Programming Languages and Software Tools, SPLST'03, Kuopio, Finland, June 17-18, 2003. University of Kuopio, Department of Computer Science, 163-173.Google Scholar
- Pekka Kilpeläinen and Rauno Tuhkanen. 2007. One-unambiguity of regular expressions with numeric occurrence indicators. Information and Computation 205, 6 ( 2007 ), 890-916. https://doi.org/10.1016/j.ic. 2006. 12.003 Google ScholarDigital Library
- Sylvain Lombardy and Jacques Sakarovitch. 2005. Derivatives of rational expressions with multiplicity. Theoretical Computer Science 332, 1 ( 2005 ), 141-177. https://doi.org/10.1016/j.tcs. 2004. 10.016 Google ScholarDigital Library
- Microsoft. 2020.. https://docs.microsoft.com/en-us/dotnet/api/system.text.regularexpressions.regex.matchGoogle Scholar
- Scott Owens, John Reppy, and Aaron Turon. 2009. Regular-expression Derivatives Re-examined. J. Funct. Program. 19, 2 ( 2009 ), 173-190. https://doi.org/10.1017/S0956796808007090 Google ScholarDigital Library
- Olli Saarikivi, Margus Veanes, Tiki Wan, and Eric Xu. 2019. Symbolic Regex Matcher. In TACAS'2019 (LNCS, Vol. 11427 ), Tomás Vojnar and Lijun Zhang (Eds.). Springer, 372-378. https://doi.org/10.1007/978-3-030-17462-0_24 Google ScholarCross Ref
- Thomas R. Shiple, James H. Kukula, and Rajeev K. Ranjan. 1998. A Comparison of Presburger Engines for EFSM Reachability. In Computer Aided Veri cation, 10th International Conference, CAV ' 98, Vancouver, BC, Canada, June 28-July 2, 1998, Proceedings (Lecture Notes in Computer Science, Vol. 1427 ). Springer, 280-292. https://doi.org/10.1007/BFb0028752 Google ScholarCross Ref
- Michael Sipser. 2006. Introduction to Theory of Computation. Vol. 2. Thomson Course Technology Boston. https://doi.org/10. 1145/230514.571645 Google ScholarDigital Library
- Randy Smith, Cristian Estan, and Somesh Jha. 2008a. XFA: Faster Signature Matching with Extended Automata. In IEEE Symposium on Security and Privacy. IEEE. https://doi.org/10.1109/SP. 2008.14 Google ScholarDigital Library
- Randy Smith, Cristian Estan, Somesh Jha, and Ida Siahaan. 2008b. Fast Signature Matching Using Extended Finite Automaton (XFA). In ICISS'08 (LNCS, Vol. 5352 ). Springer, 158-172. https://doi.org/10.1007/978-3-540-89862-7_15 Google ScholarDigital Library
- Henry Spencer. 1994. Software Solutions in C. Academic Press Professional, Inc., San Diego, CA, USA, Chapter A Regularexpression Matcher, 35-71. http://dl.acm.org/citation.cfm?id= 156626. 184689Google Scholar
- Michael Sperberg-McQueen. [n.d.]. Notes on nite state automata with counters. https://www.w3.org/XML/ 2004 /05/msmcfa.html Accessed: 2018-08-08.Google Scholar
- Ken Thompson. 1968. Programming Techniques: Regular Expression Search Algorithm. Commun. ACM 11, 6 ( June 1968 ), 419-422. https://doi.org/10.1145/363347.363387 Google ScholarDigital Library
- Margus Veanes, Peli de Halleux, and Nikolai Tillmann. 2010. Rex: Symbolic Regular Expression Explorer. In Third International Conference on Software Testing, Veri cation and Validation, ICST 2010, Paris, France, April 7-9, 2010. 498-507. https: //doi.org/10.1109/ICST. 2010.15 Google ScholarDigital Library
- Liu Yang, Rezwana Karim, Vinod Ganapathy, and Randy Smith. 2010. Improving NFA-Based Signature Matching Using Ordered Binary Decision Diagrams. In Recent Advances in Intrusion Detection. Springer Berlin Heidelberg, Berlin, Heidelberg, 58-78. https://doi.org/10.1007/978-3-642-15512-3_4 Google ScholarCross Ref
Index Terms
- Regex matching with counting-set automata
Recommendations
Regular Expression Matching using Bit Vector Automata
Regular expressions (regexes) are ubiquitous in modern software. There is a variety of implementation techniques for regex matching, which can be roughly categorized as (1) relying on backtracking search, or (2) being based on finite-state automata. ...
Efficient determinization of visibly and height-deterministic pushdown automata
New algorithms for the determinization of nondeterministic visibly and nondeterministic real-time height-deterministic pushdown automata are presented. The algorithms improve the results of existing algorithms. They construct only accessible states and ...
Descriptional complexity of determinization and complementation for finite automata
CATS 2011: Proceedings of the Seventeenth Computing on The Australasian Theory Symposium - Volume 119In this paper we study the subset construction that transforms nondeterministic finite automata (NFA) to deterministic finite automata (DFA). It is well known that given a n-state NFA, the subset construction algorithm produces a 2n-state DFA in the ...
Comments