Given an input string, suppose you want to capture every maximal substring formed of at least two letters and ending with an 'x'. (To keep it simple, we will accept that a letter is anything that belongs to the \w class.) The solution is trivial: just run the regex /\w+x/g on the input string.

TEST #1 (ExtendScript/JavaScript)
input :: "Box x taxi x vaxx."
regex :: /\w+x/g
3 MATCHES FOUND -- See the ┊...┊ parts
✔ ┊Box┊ x taxi x vaxx.
✔ Box x ┊tax┊i x vaxx.
✔ Box x taxi x ┊vaxx┊.

Note. — We represent the consecutive matches nested in ┊...┊ based on a basic while(regex.exec(input)){...} loop that inspects RegExp.leftContext and RegExp.rightContext at each step. This presentation is more instructive than a simple display of the array of matches, because it tells us how the internal pointer of the regular expression progresses during execution.
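
The display loop described in this note can be sketched as follows. The function name showMatches is hypothetical, and the left and right contexts are computed from the match index instead of being read from RegExp.leftContext and RegExp.rightContext; that substitution is an assumption made to keep the sketch portable, and it should give the same display.

```javascript
// Hypothetical helper: returns one line per match, with the match wrapped in ┊...┊.
function showMatches(input, regex) {
    // Without the g flag, exec() would restart at 0 forever.
    if (!regex.global) throw new Error("showMatches expects a /g regex");
    var lines = [];
    var m;
    regex.lastIndex = 0; // always scan from the beginning of the input
    while ((m = regex.exec(input)) !== null) {
        var left = input.slice(0, m.index);              // text before the match
        var right = input.slice(m.index + m[0].length);  // text after the match
        lines.push(left + "┊" + m[0] + "┊" + right);
        // An empty match does not advance lastIndex, so step over it manually.
        if (m[0].length === 0) regex.lastIndex++;
    }
    return lines;
}

var lines = showMatches("Box x taxi x vaxx.", /\w+x/g);
// lines[0] is "┊Box┊ x taxi x vaxx."
```

Each line shows where the internal pointer stood when the match was reported, which is exactly what the test listings display.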

A few things worth pointing out:
   • The isolated 'x's are not captured because we require at least two letters.
   • ┊tax┊ is found in "taxi" since no \b assertion is specified.
   • ┊vaxx┊ is found rather than ┊vax┊x because \w+ is greedy, meaning that the + quantifier looks for a maximal match.

The most important fact, of course, is that the final character 'x' also belongs to the class \w. So, whenever \w+ eats too many 'x's, the regular expression is forced to backtrack. It must give back what \w+ has captured in excess to make room for the x part. If the quantifier had been made non-greedy (/\w+?x/g) then the last match would be ┊vax┊x, which is no longer maximal.
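
The greedy and non-greedy behaviours can be compared on the last word alone. This is a minimal sketch for a standard JavaScript engine; the helper name firstMatch is hypothetical.

```javascript
// Hypothetical helper: returns the text of the first match, or null.
function firstMatch(input, regex) {
    var m = regex.exec(input);
    return m ? m[0] : null;
}

// Greedy: \w+ first swallows "vaxx", then gives back one 'x' for the final x.
var greedy = firstMatch("vaxx", /\w+x/);  // "vaxx"

// Non-greedy: \w+? grows one character at a time and stops at the first 'x'.
var lazy = firstMatch("vaxx", /\w+?x/);   // "vax"
```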

So far, so good.

Now to the Bad News

Now suppose your script must also treat some special syntax, say '%' followed by two digits, as representing a letter in the input string. Your goal is still to retrieve all maximal substrings formed of letters (including the %## meta-codes) and ending with 'x'. What would be the proper regex then? Easy:

   /(%\d\d|\w)+x/g

is the logical answer you probably have in mind, and it works perfectly fine… in JavaScript!

TEST #2a (JavaScript only!)
input :: "Box x A%22%34x V%99xx."
regex :: /(%\d\d|\w)+x/g
✔ ┊Box┊ x A%22%34x V%99xx.
✔ Box x ┊A%22%34x┊ V%99xx.
✔ Box x A%22%34x ┊V%99xx┊.
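
For reference, here is a minimal sketch of the check behind TEST #2a, assuming a conforming JavaScript engine (a browser or Node.js, not ExtendScript). The variable names are illustrative only.

```javascript
var input = "Box x A%22%34x V%99xx.";

// String.prototype.match with a /g regex returns the full text of every match.
var matches = input.match(/(%\d\d|\w)+x/g);

// In a conforming engine the group repeats over letters and %## codes,
// then backtracks by one 'x' so that the trailing x can match:
// matches is ["Box", "A%22%34x", "V%99xx"]
```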

Unfortunately, ExtendScript has had a backtracking bug since the dawn of time (one which, in my opinion, will never be fixed) and cannot properly handle “quantified alternatives,” that is, subpatterns of the form (A|B)+, (A|B)*, etc.

TEST #2b (ExtendScript bug!)
input :: "Box x A%22%34x V%99xx."
regex :: /(%\d\d|\w)+x/g
✖ Bo┊x┊ x A%22%34x V%99xx.
✖ Box ┊x┊ A%22%34x V%99xx.
✖ Box x A%22%34┊x┊ V%99xx.
✖ Box x A%22%34x V%99┊x┊x.
✖ Box x A%22%34x V%99x┊x┊.

Since our regex /(%\d\d|\w)+x/g should parse input strings without %## codes just as well, the bug is already visible with our original example "Box x taxi x vaxx."

TEST #3 (ExtendScript bug!)
input :: "Box x taxi x vaxx."
regex :: /(%\d\d|\w)+x/g
✖ Bo┊x┊ x taxi x vaxx.
✖ Box ┊x┊ taxi x vaxx.
✖ Box x ta┊x┊i x vaxx.
✖ Box x taxi ┊x┊ vaxx.
✖ Box x taxi x va┊x┊x.
✖ Box x taxi x vax┊x┊.
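
The symptom shown in TEST #3 suggests a simple runtime probe. The sketch below is an assumption built only on the results reported above: the function name hasQuantifiedAlternationBug is hypothetical, and it has not been checked against every ExtendScript version. It reuses the exact pattern and input of TEST #3, since that combination is reported to terminate.

```javascript
// Hypothetical probe: true when the engine shows the symptom of TEST #3.
function hasQuantifiedAlternationBug() {
    var m = /(%\d\d|\w)+x/.exec("Box x taxi x vaxx.");
    // A correct engine reports "Box" at index 0.
    // TEST #3 reports a lone "x" at index 2 instead.
    return !(m && m.index === 0 && m[0] === "Box");
}

var buggy = hasQuantifiedAlternationBug();
```

A script could use such a flag to choose between a direct regex and a slower fallback.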

And in fact you could put anything else in the alternation structure, even a single character (% instead of %\d\d); this won't clear the symptoms. The bug is not related to the complexity of the regular expression, only to its (A|B)+ form. In many circumstances, ExtendScript's engine will simply fall into an infinite loop (which you can sometimes prevent by appending to the input string a suffix that the alternation can never match).

Note, in the above results, that a ┊x┊ match is wrongly reported at every position of 'x' in the input string, including isolated 'x's that couldn't even be regarded as the tail of a valid match. So we cannot rely on the value of regex.lastIndex either as a hint for completing the matches.

Just remember this sad law:

   Any Greedy Quantified Alternative Is
   Deeply And Irreparably Flawed
   (in ExtendScript.)

Ironically, I found a partial workaround that only works when, well, we don't actually need it! Indeed, if your pattern looks like /(%|\w)+x/g in the sense that it consumes a single character on each side of the alternation, then it seems that the “lookahead trick” /((?=%|\w).)+x/g escapes the bug:

TEST #4 (Lookahead trick for single char. alternates)
input :: "Box x taxi x vaxx."
regex :: /((?=%|\w).)+x/g
✔ ┊Box┊ x taxi x vaxx.
✔ Box x ┊tax┊i x vaxx.
✔ Box x taxi x ┊vaxx┊.
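
In a conforming JavaScript engine, the lookahead form and the plain alternation should return the same matches, which gives a quick sanity check of the trick. A minimal sketch; the helper name allMatches is hypothetical.

```javascript
// Hypothetical helper: all full matches of a /g regex, or an empty array.
function allMatches(input, regex) {
    return input.match(regex) || [];
}

var input = "Box x taxi x vaxx.";

// Plain quantified alternation (the form that misbehaves in ExtendScript).
var viaAlternation = allMatches(input, /(%|\w)+x/g);

// Lookahead trick: assert that the next char is % or \w, then consume it with a dot.
var viaLookahead = allMatches(input, /((?=%|\w).)+x/g);

// Both arrays are ["Box", "tax", "vaxx"] in a conforming engine.
```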

But as I said we don't need it at all, since %|\w could be just encoded as a class, [%\w], which entirely eliminates the alternation structure.
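
The class idea might be stretched to the multi-character %## case by first replacing each code with a single placeholder character, so that no alternation is needed at all. The sketch below is an assumption, not something verified in ExtendScript: the function name findMaximal and the placeholder U+E000 (a private-use character assumed absent from the input) are hypothetical choices.

```javascript
// Hypothetical workaround sketch: encode each %## code as one placeholder
// character, match with a character class, then restore the original codes.
function findMaximal(input) {
    var PH = "\uE000";  // assumed never to occur in the input
    var codes = [];     // original %## codes, in order of appearance
    var encoded = input.replace(/%\d\d/g, function (code) {
        codes.push(code);
        return PH;
    });

    var re = /[\uE000\w]+x/g; // a class replaces the (%\d\d|\w) alternation
    var results = [];
    var m;
    while ((m = re.exec(encoded)) !== null) {
        // Placeholders before the match tell us which code comes first in it.
        var k = encoded.slice(0, m.index).split(PH).length - 1;
        results.push(m[0].replace(/\uE000/g, function () {
            return codes[k++]; // put the codes back in order
        }));
    }
    return results;
}

var found = findMaximal("Box x A%22%34x V%99xx.");
// found is ["Box", "A%22%34x", "V%99xx"]
```

Note that match positions refer to the encoded string; mapping them back to offsets in the original input would need extra bookkeeping.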

Sometimes you can be satisfied with a non-greedy quantifier. In this case the RegExp instance seems to work much better, but keep in mind that it won't capture maximal matches:

TEST #5 (Non-greedy quantifier)
input :: "Box x taxi x vaxx."
regex :: /(%|\w)+?x/g
✔ ┊Box┊ x taxi x vaxx.
✔ Box x ┊tax┊i x vaxx.
✔ Box x taxi x ┊vax┊x.

(Note that the last match is then ┊vax┊x, not ┊vaxx┊.)


The bug documented above is without a doubt one of the most dreadful in the ExtendScript regular expression engine. Barring a miracle, it will never be resolved since ExtendScript has been sentenced to death in the medium term. However, hundreds of existing scripts are likely to use RegExp objects, and sometimes very advanced patterns, which can collide at any time with the simple case of quantified alternations.

If you found a solution that I hadn't thought of, thank you in advance for sharing it!