Indiscripts :: Making String.split() support U+0000 in ExtendScript CS4

In JavaScript a String value is nothing but an ordered sequence of zero or more 16-bit unsigned integer(s). Although each integer in that sequence usually represents a character in the scope of UTF-16, the ECMAScript specification “does not place any restrictions or requirements on the values except that they must be 16-bit unsigned integers.” (ECMA-262, section 4.3.16). Therefore, zero is treated as any other value and may appear anywhere in a string. (Unlike the C language, JavaScript does not consider it a string-terminator.)

You can easily build strings with zero values using the literal expression "\x00" (or "\u0000")—which is equivalent to String.fromCharCode(0).
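As a quick illustration, the sketch below checks that the three notations yield the same one-unit string and that a zero value does not truncate what follows it. The variable names are arbitrary, and the values in the comments are those expected from a conforming ECMAScript engine.

```javascript
// Three equivalent ways to write a string holding the single code unit 0.
var a = "\x00";
var b = "\u0000";
var c = String.fromCharCode(0);

var same = (a === b) && (b === c);   // true
var len = a.length;                  // 1: NUL counts as a regular code unit
var code = a.charCodeAt(0);          // 0

// NUL does not terminate the string: the trailing "B" is preserved.
var t = "A\x00B";
var tLen = t.length;                 // 3
var last = t.charAt(2);              // "B"
```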

There are many situations in which we need to handle “non-human-readable” strings in JavaScript (file parsing, data compression, etc.), and the zero value is of course very common in such binary streams. Unfortunately, ExtendScript 3.x (CS4) has several flaws in processing the NUL character, especially in the context of the split() method. Here are a few examples:

 
// Malfunction of String.prototype.split() in CS4
// ============================
 
var s;
 
s = "A\x00B\x00C\x00D";
alert( s.split('\x00').toSource() );
// CS4 => ["", "", "", "", "", "", "", ""]
// CS5 => ["A", "B", "C", "D"]
 
s = "\x00A B C D";
alert( s.split(' ').toSource() );
// CS4 => ["\x00A B C D"]
// CS5 => ["\x00A", "B", "C", "D"]
 
s = "A B \x00C D";
alert( s.split(' ').toSource() );
// CS4 => ["A", "B", "\x00C D"]
// CS5 => ["A", "B", "\x00C", "D"]

As you can see, the CS4 engine cannot properly split a string that contains NUL characters. Everything happens as if "\x00" were interpreted as a reserved value, or as some kind of breakpoint.
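The failures above suggest a simple way to detect the problem at run time. The following is only a sketch: isSplitNulSafe is a hypothetical helper name, and the two probes merely replay the kind of cases listed above, so a passing result does not prove that every NUL-related case is handled.

```javascript
// Hypothetical helper (not part of the original article): returns true
// when the running engine passes two basic NUL-related split() probes.
function isSplitNulSafe()
{
    // Probe 1: NUL used as the separator.
    var a = "A\x00B".split("\x00");
    // Probe 2: NUL located before a regular separator.
    var b = "\x00A B".split(" ");

    return 2 === a.length && "A" === a[0] && "B" === a[1]
        && 2 === b.length && "\x00A" === b[0] && "B" === b[1];
}
```

A script could call such a probe once and fall back to a workaround only when it returns false.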

I needed to fix this problem in a recent project. It seemed simple at first, but it took me a long time to implement a fair substitute for String.prototype.split() that accounts for both the NUL character issues and the parameters this method should support.

Below is the most advanced version of the solution I have reached so far, in case it is of some use to other developers:

var splitCS4 = function F(str, separator, limit)
// -------------------------------------
// Splits a string even if it contains
// occurrences of U+0000, in ExtendScript CS4
// Version 1.00 alpha  |  30-Jan-2013
// -------------------------------------
// <str> :        The source string
// <separator> :  String or RegExp
// <limit> :      Max. number of results [optional]
// -------------------------------------
// Returns the resulting Array of substrings
// Cf. String.prototype.split() for details
{
    // Cache
    // ---
    F.Q || (F.Q = {
        splitZero: function(s, l)
            {
            ('undefined'== typeof l) && (l = 0xFFFFFFFF);
            var a = [], z = 0, p;
            if( !s ) return a;
            while( (z < l) && -1 < (p = s.indexOf('\x00')) )
                {
                a[z] = s.substr(0,p);
                s = s.substr(1+p);
                ++z;
                }
            (z < l) && (a[z] = s);
            return a;
            },
        mergeSplit: function(a, sep, l, MERGE)
            {
            MERGE || (MERGE = '');
 
            var ll = 0xFFFFFFFF > l ? (1+l) : l,
                n = a.length,
                r = a[0].split(sep, ll),
                z = r.length,
                i, t;
 
            for( i= 1 ; i < n ; ++i )
                {
                t = r[--z];
                r.length = z;
                r = r.concat(a[i].split(sep, ll-z));
                r[z] = t + MERGE + r[z];
                if( ll <= (z=r.length) ) break;
                }
            (l < r.length) && (r.length = l);
            return r;
            },
        subSplit: function(a, sep, l)
            {
            var n = a.length,
                r = [],
                z,
                i;
            for( i = 0 ; l > (z=r.length) && i < n ; ++i )
                {
                r = r.concat(a[i].split(sep, l-z));
                }
            return r;
            }
        });
 
    // Default limit is 2^32-1 (cf. ECMA-262)
    // ---
    if( 'undefined'== typeof limit )
        { limit = 0xFFFFFFFF; }
 
    // If:   limit <= 0
    // then  return the empty array
    // ---
    if( 0 >= limit )
        { return []; }
 
    // If:   (a) separator is undefined/empty OR
    //       (b) str does not contain U+0000 OR
    //       (c) separator is a regexp that matches ''
    // then  invoke the regular split() method
    // ---
    if( (!separator) || -1 == str.indexOf('\x00') ||
        ((separator instanceof RegExp) && separator.test('')) )
        { return str.split(separator,limit); }
 
    // If:   separator is U+0000
    // then  directly invoke the splitZero routine
    // ---
    if( '\x00'===separator )
        { return F.Q.splitZero(str, limit); }
 
    // If separator matches U+0000...
    // ---
    return F.Q[
        (separator instanceof RegExp) && separator.test('\x00') ?
        'subSplit' :    // used if separator matches U+0000
        'mergeSplit'    // otherwise
        ](F.Q.splitZero(str), separator, limit, '\x00');
};
 
// =============================================
// Sample code (tested from InDesign CS4/Win XP)
// =============================================
 
var s;
 
s = "A\x00B\x00C\x00D";
alert( splitCS4(s,'\x00').toSource() );
// => ["A", "B", "C", "D"]
 
s = "AB\x00 CD\x00EF GH\x00\x00IJ";
alert( splitCS4(s, " ").toSource() );
// => ["AB\x00", "CD\x00EF", "GH\x00\x00IJ"]
 
alert( splitCS4(s, "\x00", 3).toSource() );
// => ["AB", " CD", "EF GH"]
 
alert( splitCS4(s, "").toSource() );
// => ["A", "B", "\x00", " ", "C", "D", "\x00", "E",
//     "F", " ", "G", "H", "\x00", "\x00", "I", "J"]
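The core idea used when the separator is a plain string that does not contain U+0000 can be restated in isolation: cut the source at every NUL using only indexOf() and substr(), split each NUL-free chunk natively, then glue the last piece of one chunk to the first piece of the next with the NUL put back. The sketch below is a simplified restatement, not the code above: the names cutAtNul and splitAroundNul are made up for illustration, the limit parameter is ignored, and only non-empty string separators without NUL are handled.

```javascript
// Illustrative names only; simplified from the idea used in splitCS4.

// Cut s at every U+0000 using only indexOf() and substr().
function cutAtNul(s)
{
    var a = [], p;
    while( -1 < (p = s.indexOf('\x00')) )
        {
        a.push(s.substr(0, p));
        s = s.substr(1 + p);
        }
    a.push(s);
    return a;
}

// Split s on a non-empty string separator that does not contain U+0000.
function splitAroundNul(s, sep)
{
    var chunks = cutAtNul(s),
        r = chunks[0].split(sep),
        i, parts, last;

    for( i = 1 ; i < chunks.length ; ++i )
        {
        parts = chunks[i].split(sep);
        // The NUL sat between the last piece so far and the first new
        // piece: rejoin them around it, then append the remaining pieces.
        last = r.pop();
        parts[0] = last + '\x00' + parts[0];
        r = r.concat(parts);
        }
    return r;
}
```

On an engine whose native split() is correct, splitAroundNul(s, " ") should return the same array as s.split(" "), which gives a cheap way to cross-check the approach.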

Thanks for any feedback.

PostScript. One might also note that String.prototype.replace() cannot properly digest U+0000 in CS4, CS5, or CS6 (!), while String.prototype.match() seems to work. Volunteers to write a similar patch for replace() are welcome!
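As a possible starting point for such a patch, here is a deliberately narrow sketch that has not been tested in ExtendScript. The name replaceFirstLiteral is hypothetical, and the function covers only a plain string search with a literal replacement. It assumes that indexOf() and substr() cope with U+0000, as their use in splitCS4 suggests for CS4.

```javascript
// Hypothetical helper, untested in ExtendScript: replaces the first
// occurrence of a plain string without calling String.prototype.replace().
// Assumes indexOf() and substr() handle U+0000 correctly, and treats the
// replacement literally (no "$&"-style patterns, no RegExp search).
function replaceFirstLiteral(str, search, replacement)
{
    var p = str.indexOf(search);
    if( -1 == p ) return str;
    return str.substr(0, p) + replacement + str.substr(p + search.length);
}
```

Like the native method called with a string pattern, this sketch replaces the first occurrence only. RegExp patterns, function replacements, and special replacement sequences are left open.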
