A word is not a word!

First of all, a word in InDesign's sense is not a lexical unit. If you create a new text frame and enter an arbitrary string such as *;§!:-/~_%$». then the frame is, surprisingly, regarded as containing exactly one word. That is what the Info panel displays, and the following test confirms it:

// selected is a text frame that only contains:
// *;§!:-/~_%$».
var myTextFrame = app.selection[0];
alert( myTextFrame.words.length ); // => 1

Note that in this simple case we use myTextFrame.words rather than myTextFrame.parentStory.words, but of course the distinction is crucial when you deal with threaded text frames, or text frames that have overset contents.
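To make the frame/story distinction concrete, here is a minimal sketch. The function name compareWordCounts is hypothetical; it only assumes that its argument exposes words and parentStory.words, as an InDesign text frame does.

```javascript
// Sketch only: compares the word count of one frame with that of its story.
// `frame` is expected to expose `words` and `parentStory.words`,
// like an InDesign TextFrame; the function name is hypothetical.
function compareWordCounts(frame) {
    var inFrame = frame.words.length,
        inStory = frame.parentStory.words.length;
    return {
        inFrame: inFrame,
        inStory: inStory,
        // Words outside this frame (other threaded frames or overset text),
        // assuming no word straddles a frame boundary.
        elsewhere: inStory - inFrame
    };
}
```

In InDesign, with a text frame selected, compareWordCounts(app.selection[0]) should report a nonzero elsewhere value as soon as the frame is threaded or overset.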

Obviously, an empty story has no words, since it has no characters. However, do not conclude that a story contains a word as soon as it contains a character. The Info panel is misleading on this point and may in fact display wrong counts. For example, if a text frame contains only white space (say, one tab and one ordinary space), the Info panel will claim:
    Characters: 2
    Words: 1 // this is wrong!
    Lines: 1
    Paragraphs: 1
whereas actually myTextFrame.words.length==0.

So, what is the right rule? From the InDesign scripting perspective, a Word is a maximal range of characters that does not contain any word separator (space, tab, etc.). Words are simply the pieces between these non-word regions. Hence, the whole point is to identify what a word separator is.

To break or not to break (words)

We already know that the space and tab characters are word separators. Other specimens are not so easy to find! General punctuation characters (comma, colon, semicolon, slash, and so on) are not word separators. The figure below shows some (surprising) examples of character strings that each count as a single word:

All stories in this screenshot contain a single word!

Note that special characters such as the footnote marker, the table placeholder, a text variable, or any object anchor, do not break words.

In the end, only three kinds of characters act as word separators: white space (including tab), break characters, and dashes (excluding hyphens). As I could not find a complete list in Adobe's documentation, I used the following script to identify every word separator:

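// Run with a text frame (or text) selected: its story is overwritten!
var s = app.selection[0].parentStory,
    c = s.characters[1],   // the middle character of "a_b"
    u = 0,
    r = [],
    z = -1,
    OK, t;

s.contents = "a_b";

for( u=0 ; u <= 0xFFFC ; ++u )
    {
    // Some code points cannot be assigned: skip them.
    try {
        OK = 0;
        c.contents = String.fromCharCode(u);
        OK = 1;
        }
    catch(_){}
    if( !OK ) continue;

    // The character splits "a?b" into two words: it is a separator.
    if( 1 < s.words.length )
        {
        t = u.toString(16).toUpperCase();
        while( t.length<4 ) t = '0'+t;
        r[++z] = 'U+'+t;
        }
    }

alert( r.join('\r') );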

Here is the result we obtain in CS4 and CS5:

List of InDesign Word Separators

Code    Unicode name            InDesign character
U+0008  <ctrl> BACKSPACE        Right indent tab.
U+0009  <ctrl> TAB              Regular tabulation.
U+000A  <ctrl> LINE FEED        Forced line break.
U+000D  <ctrl> CARRIAGE RETURN  Reflects several break characters, including paragraph return.
U+0020  SPACE                   Usual space.
U+0085  <ctrl> NEXT LINE        Hidden character. (Behaves like a space.)
U+00A0  NO-BREAK SPACE          Nonbreaking space.
U+1680  OGHAM SPACE MARK        Hidden character. (Behaves like a space.)
U+180E  MONGOLIAN VOWEL SEP.    Hidden character. (Behaves like a space.)
U+2000  EN QUAD                 Hidden character. (Behaves like a space.)
U+2001  EM QUAD                 Flush space.
U+2002  EN SPACE                En space.
U+2003  EM SPACE                Em space.
U+2004  THREE-PER-EM SPACE      Third space.
U+2005  FOUR-PER-EM SPACE       Quarter space.
U+2006  SIX-PER-EM SPACE        Sixth space.
U+2007  FIGURE SPACE            Figure space.
U+2008  PUNCTUATION SPACE       Punctuation space.
U+2009  THIN SPACE              Thin space.
U+200B  ZERO WIDTH SPACE        Discretionary line break.
U+2013  EN DASH                 En dash (–).
U+2014  EM DASH                 Em dash (—).
U+2028  LINE SEPARATOR          Hidden character. (Behaves like a space.)
U+2029  PARAGRAPH SEPARATOR     Hidden character. (Behaves like a space.)
U+202F  NARROW NO-BREAK SPACE   Nonbreaking space (fixed width).
U+205F  MEDIUM MATH. SPACE      Hidden character, actually implemented though.

A curious fact is that Hair Space (U+200A), Non-Joiner (U+200C), End Nested Style Here (U+0003), and Indent To Here (U+0007) do not act as word separators in InDesign.
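As a rough illustration, the separator list can be turned into a plain JavaScript splitter that runs outside InDesign. This is only an approximation built from the CS4/CS5 results above: the names WORD_SEPARATORS and splitWords are hypothetical, and InDesign's special markers (anchors, footnote references, text variables) are not modeled.

```javascript
// Approximation of InDesign's word splitting, based solely on the
// separator code points listed for CS4/CS5. Names are hypothetical.
var WORD_SEPARATORS = new RegExp(
    "[\\u0008\\u0009\\u000A\\u000D\\u0020\\u0085\\u00A0" +
    "\\u1680\\u180E\\u2000-\\u2009\\u200B" +
    "\\u2013\\u2014\\u2028\\u2029\\u202F\\u205F]+"
);

function splitWords(text) {
    var parts = text.split(WORD_SEPARATORS),
        words = [],
        i;
    // Drop the empty pieces produced by leading or trailing separators.
    for (i = 0; i < parts.length; i++) {
        if (parts[i].length) words.push(parts[i]);
    }
    return words;
}
```

With this sketch, splitWords("*;§!:-/~_%$».") yields a single word and splitWords("\t ") yields none, which matches the words.length values reported earlier. A hair space (U+200A) does not split, whereas an en dash does.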

Counting and extracting words

Because layout components nest recursively, addressing the entire set of textual entities can be a real headache. The general strategy is to browse every Story in the Document.stories collection. This lets you explore text contents exhaustively at any sub-level of the document hierarchy, since any text is supposed to belong to a story. Well, this is almost true, but there are two critical exceptions: footnote and table contents are managed through special ‘strands’ which are not seen as story containers. That's why the usual word counter below misses footnotes and tables:

// Superficial Word Counter
// (ignoring footnotes and table cells)
alert( app.activeDocument.stories.everyItem().words.length );

Given a story, you need to inspect footnotes and table cells separately. And you have to use a recursive algorithm because both Cell and Footnote objects may contain nested table(s). Here is a generic utility that implements a deep word count, including footnotes and cells at every level:

// Deep Word Counter
// (considering footnotes and tables)
// Like any digit sequence, each number that starts a footnote
// counts itself as a word--unless you use an empty separator!
var countWords = function F(/*Story|Cell|Footnote*/every)
{
    var ret, t;

    // Default target: every story of the active document.
    every = every || app.activeDocument.stories.everyItem();
    if( !every.isValid ) return 0;

    // Words directly owned by the targeted container(s).
    ret = every.words.length;

    t = every.texts &&
        every.texts.everyItem &&
        every.texts.everyItem();
    if( !t ) return ret;

    // Recurse into table cells, then into footnotes.
    t.tables.length &&
        ( ret += F( t.tables.everyItem().cells.everyItem() ) );
    t.footnotes.length &&
        ( ret += F( t.footnotes.everyItem() ) );

    t = null;
    return ret;
};

alert( "Number of words: " + countWords() );

Note: In the above code, the every parameter is a specifier which may address a Story, Cell, or Footnote object. Thanks to the everyItem() syntax, this specifier can also encapsulate a collective command, so the recursive countWords function never needs to create, manage, or browse JavaScript arrays. Everything is done through the command subsystem, which I think improves the performance of the function.
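To look at the recursion in isolation, here is a sketch that works on a hypothetical plain-object model rather than on the InDesign DOM. Each container is assumed to have the shape { words: Number, tables: [ [cell, ...], ... ], footnotes: [container, ...] }; the name deepCount and this data shape are illustrative only. Unlike countWords, it loops over arrays, since no everyItem() specifier exists outside InDesign.

```javascript
// Hypothetical plain-object model, NOT the InDesign DOM.
// A container (story, cell or footnote) looks like:
//   { words: 12, tables: [ [cell, cell], ... ], footnotes: [ note, ... ] }
function deepCount(container) {
    var total = container.words || 0,
        tables = container.tables || [],
        notes = container.footnotes || [],
        i, j;
    // Every cell of every table may itself contain tables or footnotes.
    for (i = 0; i < tables.length; i++) {
        for (j = 0; j < tables[i].length; j++) {
            total += deepCount(tables[i][j]);
        }
    }
    // Footnotes may contain nested tables as well.
    for (i = 0; i < notes.length; i++) {
        total += deepCount(notes[i]);
    }
    return total;
}
```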

Finally, turning our word counter into a word extractor is not too difficult:

// Deep Word Extractor
// - considering footnotes and tables
// - removing duplicates
// This script is not optimized for long documents!
var extractWords = function(MIN_LENGTH)
{
    var obj = {},
        // Control characters and object/replacement markers to strip.
        reSkip = /[\x00-\x1F\uFFFC\uFFFD]/g,
        // Registers each cleaned word as a key of obj (no duplicates).
        cleanKeys = function(a)
            {
            var re = reSkip,
                i = a.length >>> 0,
                o = obj, k;
            while( i-- )
                {
                k = a[i].replace(re,'');
                (MIN_LENGTH <= k.length) && ( o[' '+k] = null );
                }
            re = o = null;
            },
        // Recursively visits stories, table cells and footnotes.
        browse = function(every)
            {
            var t;
            if( !every.isValid ) return;
            every.words.length &&
                cleanKeys( every.words.everyItem().contents );
            t = every.texts &&
                every.texts.everyItem &&
                every.texts.everyItem();
            if( !t ) return;
            t.tables.length &&
                browse( t.tables.everyItem().cells.everyItem() );
            t.footnotes.length &&
                browse( t.footnotes.everyItem() );
            t = null;
            };

    browse( app.activeDocument.stories.everyItem() );
    reSkip = cleanKeys = browse = null;

    // Turn the keys back into an array of words.
    var k,
        z = -1,
        r = [];
    for( k in obj )
        {
        if( !obj.hasOwnProperty(k) ) continue;
        r[++z] = k.substr(1);
        }
    obj = null;
    return r;
};

alert(
    "Words that contain 5+ characters:\r\r" +
    extractWords(5).sort().join(' | ')
    );
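The de-duplication step can be tried outside InDesign on a plain array of strings. The sketch below isolates it under a hypothetical name, uniqueWords. It reuses the same skip pattern and the same trick of prefixing each key with a space. My reading, which is an assumption, is that the prefix keeps a word such as "constructor" or "hasOwnProperty" from clashing with built-in object members.

```javascript
// Sketch of the key-based de-duplication used by the extractor,
// applied to a plain array of strings. The function name is hypothetical.
function uniqueWords(words, minLength) {
    var seen = {},
        reSkip = /[\x00-\x1F\uFFFC\uFFFD]/g,
        result = [],
        i, k;
    for (i = 0; i < words.length; i++) {
        // Strip control characters and object/replacement markers.
        k = words[i].replace(reSkip, '');
        // The leading space keeps keys apart from built-in member names.
        if (k.length >= minLength) seen[' ' + k] = null;
    }
    for (k in seen) {
        if (seen.hasOwnProperty(k)) result.push(k.substr(1));
    }
    return result.sort();
}
```

Each distinct word appears once in the result, and words shorter than minLength after cleaning are dropped.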

• See also:
InDesign Special Characters;
On ‘everyItem()’ – Part 1;
On ‘everyItem()’ – Part 2.