Free access
Proceedings
Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms

Regular Languages meet Prefix Sorting

Abstract

Indexing strings via prefix (or suffix) sorting is, arguably, one of the most successful algorithmic techniques developed in the last decades. Can indexing be extended to languages? The main contribution of this paper is to initiate the study of the sub-class of regular languages accepted by an automaton whose states can be prefix-sorted. Starting from the recent notion of Wheeler graph [Gagie et al., TCS 2017]— which extends naturally the concept of prefix sorting to labeled graphs—we investigate the properties of Wheeler languages, that is, regular languages admitting an accepting Wheeler finite automaton. We first characterize this family as the natural extension of regular languages endowed with the co-lexicographic ordering: the sorted prefixes of strings belonging to a Wheeler language are partitioned into a finite number of co-lexicographic intervals, each formed by elements from a single Myhill-Nerode equivalence class. We proceed by proving several results related to Wheeler automata: (i) We show that every Wheeler NFA (WNFA) with n states admits an equivalent Wheeler DFA (WDFA) with at most 2n – 1 – |Σ| states (Σ being the alphabet) that can be computed in O(n3) time. (ii) We describe a quadratic algorithm to prefix-sort a proper superset of the WDFAs, a O(n log n)-time online algorithm to sort acyclic WDFAs, and an optimal linear-time offline algorithm to sort general WDFAs. (iii) We provide a minimization theorem that characterizes the smallest WDFA recognizing the same language of any input WDFA. The corresponding constructive algorithm runs in optimal linear time in the acyclic case, and in O(n log n) time in the general case. (iv) We show how to compute the smallest WDFA equivalent to any acyclic DFA in nearly-optimal time. Our contributions imply new results of independent interest. Contributions (i-iii) provide a new class of NFAs for which the minimization problem can be approximated within a constant factor in polynomial time. Contribution (iv) provides a provably minimum-size solution for the well-studied problem of indexing deterministicacyclic graphs for linear-time pattern matching queries.

Formats available

You can view the full content in the following formats:

Information & Authors

Information

Published In

cover image Proceedings
Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms
Pages: 911 - 930
Editor: Shuchi Chawla
ISBN (Online): 978-1-611975-99-4

History

Published online: 23 December 2019

Authors

Affiliations

Notes

*
Partially supported by the project MIUR-SIR CMACBioSeq (“Combinatorial methods for analysis and compression of biological sequences”) grant n. RBSI146R5L.

Metrics & Citations

Metrics

Citations

If you have the appropriate software installed, you can download article citation data to the citation manager of your choice. Simply select your manager software from the list below and click Download.

Cited By

There are no citations for this item

Media

Figures

Other

Tables

Share

Share

Copy the content Link

Share with email

Email a colleague

Share on social media