warebas.blogg.se - Regex golf it never ends

Hence, if the golf instance is solvable, there is a set of at most $k$ letters from $\Sigma$ so that each word in $A$ is covered by this set of letters. Thus, any regex solving the golf problem has to contain at least one letter from each of the words in $A$.

A regex matching the empty subword would match $x$.

If $c_1, \ldots, c_k$ is a solution for the set cover instance, the regex $c_1 + \cdots + c_k$ is a solution to regex golf.

This reduction is obviously in P and equivalence is also quite simple to see: The word consists of exactly the characters representing subsets in $C$ that contain $e$ (in arbitrary order).

$A$ contains one word for each element $e$ of $U$.

$\Sigma$ contains one character for each subset in $C$ and one additional character (denoted $x$ in the following).

We translate an input for Set cover into one for regex golf as follows: Given a universe $U$ and a collection $C$ of subsets of $U$, is there a set $C' \subseteq C$ of size $\leq k$ so that $\bigcup_ S = U$?

In order to show NP-hardness, we reduce Set cover: That it is in NP is straightforward, as explained in the comments (verify a candidate-RE by translating it into an NFA and running that on all words from $A$ and $B$).

(Changing any of these assumptions should only influence the complexity of the construction below, but not the general result.)

As in the comic strip, we consider a regex to match a word, if it matches a substring of the word. Length of a regex is defined as the number of characters from $\Sigma$. letters from $\Sigma$, matching themselves,Īnd nothing else.It seems highly likely to me that it is NP-complete (and in fact, I would expect the reduction to be to some sort of cover problem) but a few searches haven't turned up anything particularly useful.Īssuming the TCS-variant of regex, the problem is indeed NP-complete. Is anything known about the complexity of this particular separation problem? (Note that since I've specified $A$ and $B$ as finite sets of strings, the natural notion of size for the problem is the total lengths of all strings in $A$ and $B$ this swamps any contribution from $k$). Given two (finite) sets $A$ and $B$ of strings over some alphabet $\Sigma$, is there a regular expression of length $\leq k$ that accepts every string in $A$ and rejects every string in $B$? Norvig's post includes an algorithm for generating a reasonably short candidate, and he notes that his approach involves solving an NP-complete Set Cover problem, but he's also careful to point out that his approach doesn't consider every possible regular expression, and of course his isn't necessarily the only algorithm, so his solutions aren't guaranteed to be optimal, and it's also possible that some other assuredly polynomial-time algorithm could find equivalent or better solutions.įor concreteness' sake and to avoid having to solve the optimization question, I think the most natural formulation of Regular Expression Separation would be: As seen in this recent XKCD strip and this recent blog post from Peter Norvig (and a Slashdot story featuring the latter), "regex golf" (which might better be called the regular expression separation problem) is the puzzle of defining the shortest possible regular expression that accepts every word in set A and no word in set B.