Task 1: Models, Measures, and Classes
The focus of this project is study of the structural properties of regular language representations and associated complexity measures, towards the development of practical efficient manipulation methods. We analyse the models along three interconnected axes:
average-case complexity;
measures of nondeterminism and of ambiguity;
operational complexity.
In Task 1, the objective was to advance in novel analytic combinatorics methods and identify nondeterministic models and classes of languages for which decision procedures and behaviours may be feasible on average. Three primary axes of work were planned:
Analytic combinatorics methods
Analysis of nondeterministic models (NFA and RE)
Models with symbolic labels, parametrised by specific monoids
The work developed focus on the first two axes. Regarding the third axe was decided to consider it only in the second year of the project as a follow up of some activities and results of the second axe related with succinct and compressed representations.
In the next sections, we provide a detailed account of the work accomplished in each axis, including the challenges faced, the progress made, and the resulting outcomes.
Analytic Combinatorics Methods
The plan activities included the following.
To develop algebraic techniques, using Galois Theory, to decide whether a partial derivative of an algebraic curve at the point corresponding to a singularity of an algebraic curve embedded is, or not, zero. This is crucial to get the minimal polynomial of the singularity, in order to obtain the asymptotic behaviour of the generating function around that singularity.
To extend our previous work to exponential generating functions;
To refine the methods we have developed so far to more complex situations, in particular to deal with sets of combinatorial objects such as expressions and integers.
And the expected results were to obtain analytic methods for the average complexity estimates for sets of expressions based on inductive defined objects
In the following we detail the work developed within this axis, some preliminary results, and planned work.
For this activity it was not possible to hire a young researcher although the Ph. D student Guilherme Duarte that join the team started working on the new methods analytic combinatorial methods developed before by tem members (and described in the proposal of this project) having already obtained new results [@DuarteMR:2025aa; @duarte26:_shuff_words; @duarte26:_averag_state_compl_shuff_ideal_finit_languag].
Regarding item 1., although we had foreseen the possibility of using some ideas of Galois Theory to deal with the question above mentioned, in all the more than 20 papers that we, since then, wrote on the subject of the asymptotic behaviour of measures of descriptional complexity, we have not yet encountered a single case where we needed to resort to that method, and thus pursue its study.
The work on exponential generating functions (item 2.) is on going without results yet but the same kind of functions appeared in the work related with Boolean Shuffle products developed in [@broda26:_autom_regul_expres_synch_shuff; @broda:_partial_deriv_boolean_produc_averag].
Finally, for item 3., we have started the study of the combinatorial behaviour of sets measured by the total sum of the weights of its elements, rediscovering some results that we then found in the OEIS database (https://oeis.org).
Nondeterministic Models
The plan activities included the following.
Study structural properties of automata's underlying graphs based on several measures of nondeterminism such as size of the transition monoid, branching or tree width.
Examples to be considered are: (partially)-ordered, series-parallel, acyclic, strongly connected graphs or bipartite graphs.
Consider NFAs for subregular languages such as block, finite, definite, Wheeler, deterministic, bounded or commutative.
Consider NFAs with different types of ambiguities degrees
Development of uniform random generators for some classes of NFAs and extensive experimental tests of the behaviour of the random generated objects or existing data sets.
Study the use of polynomial approximation/randomised algorithms (PAX/PRAX)
Focus on the conversions from REs to DFAs studied by team members in [BHMMR19] for the identification of inductive definitions for that conversions. We can start with Brzozowski derivatives for unambiguous expressions (deterministic or unique operations, as mentioned in the plan). Then develop techniques to reduce the ambiguity of REs using simplification rules and consider other constructions RE to DFAs.
The expected results included
characterisations of properties of the automata underlying graphs towards a better understanding of the conversions between representations;
practical analysis of determinisation and the identification of subclasses for which it is feasible on average;
development of random generators for (sub)classes of NFAs under given probabilistic distributions, as well as PRAX algorithms
develop techniques to reduce the ambiguity of REs based on simplification rules
In the following sections, we detail the work developed within this axis, some preliminary results, and planned work. The grant holders of the project have mainly worked within this axis. Reports, articles, or software developed will be indicated in each case.
Characterisation of NFAs Properties
Reducing the size of NFAs
In order to reduce the size of determinisation, a first step is to reduce the size of the NFA. It is well known that NFA minimisation is a PSPACE hard problem. However, to reduce the size of the NFAs there are several techniques based on the use right (or left) invariant equivalence relations on the set of states of the NFA. For bisimilarity (i.e. the largest right invariant relation), the \(O(n\log{n})\) Paige-Tarjan algorithm [@tarjan87] is well-known. Moreover, simulation pre-orders may lead to higher reductions. Both algorithms were implemented in the FAdo system. Experiments were performed with both algorithms using the Ondrik benchmark [@ondrik]. On the average a \(30\%\) reduction was obtained, but using pre-orders was too time expensive. Similar results were obtained considering NFAs converted from uniform random generated regular expressions.
Multi-entry DFAs
Multi-entry DFAs (MDFAs) are DFAs that permit multiple initial states, thereby representing a highly restricted form of nondeterminism. Recently, MDFAs have been employed in parallel matching algorithms [@bbcm24]. This application has prompted a thorough examination of this class of NFAs. Notably, MDFAs can only exhibit finite ambiguity. We studied some aspects of descriptional complexity of this class of NFAs [@Teixeira:2026aa; @moreira26:_multi_dfas] as well as, the computational complexity of several decision problems. We plan to study specific reduction algorithms, as well as, their average complexity. Within this activity we collaborated with a group of researchers from the Politecnico di Milano (Angelo Borsotti, Luca Breveglieri, Stefano Crespi Reghezzi and Angelo Morzenti) on the use of MDFAs for parallel matching. We provided the NFAs obtained as explained before and also developed a random generator for words accepted by NFAs.
NFAs with special induced monoids
The transformation monoids of DFAs are well studied and play an important role in state complexity . In particular, the syntactic monoid of a regular language \(L\) is isomorphic to the transformation monoid of its minimal DFA. Transformation monoids for NFAs has been less studied, but they play a role in the characterisation of the NFAs and their determinisation [@BaburinC25; @Holzer:2023aa]. We considered the class of NFAs which transformation monoids correspond to upper triangular Boolean matrices. This class corresponds to a lower level of the Straubing-Thérien hierarchy and also to shuffle ideals. We studied the average complexity of regular expressions representing this class of languages and compared with representations of finite languages [@duarte26:_averag_state_compl_shuff_ideal_finit_languag]. We plan to research the average size of the DFAs resulting from determinization for this class. It will be interesting to study other (lower) levels of the Straubing-Thérien hierarchy.
Block Languages
Block languages are languages where all words have the same length. In this period we continued previous on the studying this class and prepared a journal article [@duarte26:_block_languag] with a characterisation for this class of languages based on bitmaps and algorithms to obtain minimal DFAs and NFAs directly from those maps. Moreover, we studied the operational state complexity both in the deterministic and non-deterministic case for several operations. Although determinisation is in the worst-case exponential (\(O(2^{\sqrt{n}})\)) except for reversal, all the bounds are polynomial. We plan to deepen this study in order to obtain average case results. In particular, one case considers succinct regular expressions for these cases of languages.
Operational Complexity
For regular languages, the most studied measure has been the state complexity (SC) , i.e. the size of its minimal DFA. Operational state complexity heavily depends on the conversion from NFAs to DFAs (determinisation). For instance, the worst-case state complexity of concatenation is \(m2^n -2^{n-1}\), where operands have SC n and m, respectively. Thus, fine-grained analysis of determinisation is crucial for a better understanding of the state complexity of language operations and the main motivation of this project. The team has previously developed DesCo, a web-based knowledge system on descriptional complexity results. But it relied on outdated technologies that no longer receive active support, making continued development difficult. Moreover the database need to be redesigned in order to include new measures and models. Thus this project was an opportunity to have a grant holder assigned to this activity [@Teixeira:2025aa] which is public available (https:/ /desco.up.pt).
Descriptional Complexity of Shuffle Operations
The shuffle operation models several behaviours of systems from concurrency to biological frameworks and document processing. Moreover succinctly represent regular languages. Team members have studied the descriptional complexity of several shuffle operators mainly considering regular expressions and conversions to NFAs. We extended the notion of synchronised shuffles to Boolean Products of an arbitrary number of operands where synchronisation and interleaving of each symbol is controlled by by a Boolean function [@broda26:_boolean_produc_regul_languag]. These operations formalize the framework of Team automata introduced by Beek et al [@BeekEKR03] and that has been extensively studied. We studied the average complexity for the case of multiple intersections [@broda:_partial_deriv_boolean_produc_averag]. This extends the results obtained in [@Broda:2026aa]. For the general shuffle, we also consider the shuffle of words. In this case, we studied the deterministic and nondeterministic state complexity as well as the average case for the conversion from regular expressions to NFAs [@DuarteMR:2025aa; @duarte26:_shuff_words]. Finally, we also studied the descriptive complexity of literal shuffles (where only single symbols can alternate). In case the operands are DFAs, the $1$-limited automaton of the literal shuffle has a polynomial blow-up [@duarte26:_descr_compl_liter_shuff]. Thus, $1$-limited automata are interesting models to obtain feasible manipulations [@PighizziniPS24].
PRAX Algorithms
PRAX algorithms combine approximation and randomisation to tackle hard problems such as universality of NFAs [@konstantinidis23:_approx_nfa_univer_relat_probl; @AndreouKS25]. A major question that arises here is whether it is possible to give a PRAX algorithm for problems \(U_T\) relative to a family \(T\) of polynomially samplable uniform distributions. More specifically, the question is whether every problem (language) in the class NP has a PRAX algorithm relative to a family uniform distributions. This was possible in the case of the block NFA universality problem. Now, we shown that every NP language has a a PRAX algorithm relative to a family of uniform distributions. Moreover we also presented a solution to tackle the bias of ehe Dirichlet distribution that is meant to be a heuristic for the fictitious uniform distribution and that we have used in the previous works cited above. A paper describing this work and also with experimental tests are planed until the end of the project.
Random Generators
Apart from the random generator mention above for sample words of the language of a given NFA, the method of and Tabakov and Vardi [@vardi05:_evaluat_class_autom_theor_algor] was implemented. Several classes of NFAs were also considered namely Wheeler [@bec24], extensions of FIFAs [@FerreiraMR18] and multi-entry. However extensive tests are still needed to be made.
Regular Expression Simplifications and Matching
The simplification of REs is a major problem with practical impact but not well-studied. In the last 10 years a lot of effort were made to tackle with the so called ReDoS (regular denial of service) attack that was caused by the use of backtrack algorithms for matching regular expressions but as well for the REs used [@BerglundDM14; @SungCH22; @TuronovaHHLVV22]. Several extensions of standard regular expressions are used in the usual RE software libraries and some of them lead to non-regular features. In particular, counting expressions are widely used. We started studying these expressions and to develop derivative based systems and their relation with multiple matching [@Resende:2025aa]. Regarding the simplification of (generalised) REs we are developing feasible rewriting systems that may lead to REs with less ambiguities. For the resulting REs we plan to study the size of the Brzozowski DFA (based on derivatives).
References
Pantelis Andreou, Stavros Konstantinidis, and Taylor J. Smith. Improved randomized approximation of hard universality and emptiness problems. , 30(1-3):5--26, 2025.
Ivan Baburin and Ryan Cotterell. A close analysis of the subset construction. In Andreas Malcher and Luca Prigioniero, editors, Proc. DCFS 2025, volume 15759 of LNCS, pages 17--33. Springer, 2025.
Ruben Becker, Davide Cenzato, Sung-Hwan Kim, Bojana Kodric, Riccardo Maso, and Nicola Prezza. . In Symposium on Combinatorial Pattern Matching, pages 5:1--5:15, 2024.
Martin Berglund, Frank Drewes, and Brink van der Merwe. Analyzing catastrophic backtracking behavior in practical regular expression matching. In Zoltán Ésik and Zoltán Fülöp, editors, Proceedings 14th International Conference on Automata and Formal Languages, AFL 2014, Szeged, Hungary, May 27-29, 2014, volume 151 of EPTCS, pages 109--123, 2014.
Angelo Borsotti, Luca Breveglieri, Angelo Morzenti, and Stefano Crespi Reghizzi. Minimizing speculation overhead in a parallel recognizer for regular texts. In Symposium on Principles and Practice of Parallel Programming, 2024.
Sabine Broda, António Machiavelo, Nelma Moreira, and Rogério Reis. Partial derivatives for boolean products and average complexity of multiple intersections. Submitted.
Sabine Broda, António Machiavelo, Nelma Moreira, and Rogério Reis. . , 2026. Submitted.
Sabine Broda, António Machiavelo, Nelma Moreira, and Rogério Reis. , pages 1--21, 2026.
Sabine Broda, António Machiavelo, Nelma Moreira, and Rogério Reis. Boolean products of (regular) languages. In 30th CIAA 2026, 2026. Submitted.
Guilherme Duarte, Markus Holzer, Nelma Moreira, and Rogério Reis. Average state complexity of shuffle ideals and finite languages. In 27th DCFS, 2026. In preparation.
Guilherme Duarte, Nelma Moreira, Luca Prigioniero, and Rogério Reis. . , 2026. Submitted.
Guilherme Duarte, Nelma Moreira, Luca Prigioniero, and Rogério Reis. On the descriptional complexity of literal shuffle. In DLT 2026, 2026. Submitted.
Guilherme Duarte, Nelma Moreira, and Rogério Reis. Two-word shuffle: Some results. In Andreas Malcher and Luca Prigioniero, editors, Descriptional Complexity of Formal Systems, 26th International Workshop (DCFS 2025), volume 15759, pages 79--93. Springer, 2025.
Guilherme Duarte, Nelma Moreira, and Rogério Reis. On the shuffle of words. , 2026. Submitted.
Miguel Ferreira, Nelma Moreira, and Rogério Reis. Forward injective finite automata: Exact and random generation of nonisomorphic nfas. In Stavros Konstantinidis and Giovanni Pighizzini, editors, Descriptional Complexity of Formal Systems - 20th IFIP WG 1.02 International Conference, DCFS 2018, Halifax, NS, Canada, July 25-27, 2018, Proceedings, volume 10952 of Lecture Notes in Computer Science, pages 88--100. Springer, 2018.
Markus Holzer. Comments on monoids induced by nfas. Technical Report IFIG Research Report 2301, 2023.
Stavros Konstantinidis, Mitja Mastnak, Nelma Moreira, and Rogério Reis. Approximate nfa universality and related problems motivated by information theory. , 972:114076, 2023.
Ondra Lengál. A repository for automata benchmarks. https://github.com/ondrik/automata-benchmarks, Last accessed: 2025.
Nelma Moreira, Rogério Reis, and Gonçalo Teixeira. On the complexity of multi-entry dfas. In 27th DCFS 2026, 2026. In preparation.
R. Paige and R. E. Tarjan. Three partition refinement algorithms. , 16(6):973--989, 1987.
Giovanni Pighizzini, Luca Prigioniero, and Simon Sádovský. Performing regular operations with 1-limited automata. , 68(3):465--486, 2024.
Daniel Resende. Efficient regular pattern matching avoiding denial of service. Mestrado em segurança informática,, Faculdade de Ciêncais da Universidade do Porto, 2025.
Sicheol Sung, Hyunjoon Cheon, and Yo-Sub Han. How to settle the redos problem: Back to the classical automata theory. In Pascal Caron and Ludovic Mignot, editors, Proc. 26th CIAA 2022, volume 13266 of LNCS, pages 34--49. Springer, 2022.
D. Tabakov and M. Vardi. Evaluating classical automata-theoretic algorithms. In LPAR'05, 2005.
Ana Sofia Teixeira. Desco: A descriptional desco: A descriptional complexity platform. Master's thesis, Faculdade de Ciêncais da Universidade do Porto, Mestrado em Engenharia de Redes e Sistemas Informáticos, 2025.
Gonçalo Teixeira. On the plurality of initial states, 2026.
Maurice H. ter Beek, Clarence A. Ellis, Jetty Kleijn, and Grzegorz Rozenberg. Synchronizations in team automata for groupware systems. , 12(1):21--69, 2003.
- Lenka Turonová, Lukás Holı́k, Ivan Homoliak, Ondrej Lengál, Margus
- Veanes, and Tomás Vojnar. Counting in regexes considered harmful:
- Exposing redos vulnerability of nonbacktracking matchers. In Kevin R. B.
- Butler and Kurt Thomas, editors, 31st USENIX Security Symposium, USENIX Security 2022, Boston, MA, USA, August 10-12, 2022, pages 4165--4182.
- USENIX Association, 2022.
- ::