[wp-hackers] Portable tokenising from the shell

Sat Dec 1 16:57:08 UTC 2012

Hi,

Some of you may remember an earlier discussion about parsing JSON 
output, which is one of the formats available from api.wordpress.org. 
Of those formats, JSON looked the most suitable for parsing portably 
from a Bourne/Bash shell.

This guy has implemented such a parser already: 
http://github.com/dominictarr/JSON.sh

One part of that parser is the tokeniser, which splits the JSON into 
its individual tokens:

     local ESCAPE='(\\[^u[:cntrl:]]|\\u[0-9a-fA-F]{4})'
     local CHAR='[^[:cntrl:]"\\]'
     local STRING="\"$CHAR*($ESCAPE$CHAR*)*\""
     local NUMBER='-?(0|[1-9][0-9]*)([.][0-9]*)?([eE][+-]?[0-9]*)?'
     local KEYWORD='null|false|true'
     local SPACE='[[:space:]]+'
     grep -E -o "$STRING|$NUMBER|$KEYWORD|$SPACE|."

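It's an interesting use of grep: because the pattern ends in a catch-all 
".", it matches *everything* on each line, but splits it into pieces. 
Strings, numbers, keywords and runs of whitespace come out whole; any 
other character comes out by itself.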
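
To see the effect on a small input, here is a cut-down illustration. The 
pattern is a simplified stand-in made up for this example (it is not the 
JSON.sh one), demo_tokens is just an illustrative name, and it assumes a 
grep that supports -o:

```shell
# Simplified stand-in for the real pattern: quoted strings (no escapes),
# unsigned integers, runs of whitespace, and "." as the catch-all.
demo_tokens() {
    grep -E -o '"[^"]*"|[0-9]+|[[:space:]]+|.'
}

printf '%s\n' '{"a": 10}' | demo_tokens
```

With GNU grep this prints six lines: {, "a", :, a single space, 10 and }.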

However... my research shows that the "-o" switch (which causes grep to 
output only each matched portion, one per line) is not part of POSIX, 
but is nonetheless available in GNU (hence Linux and Cygwin), 
Free/Net/OpenBSD and Mac OS X - but not in Solaris (either in the grep 
in /usr/bin or in /usr/xpg4/bin).
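
If a script does end up needing to know whether the local grep has -o, 
one option is to probe for it at run time rather than guess from the OS 
name. A minimal sketch; grep_has_o is a hypothetical helper name, and the 
check simply looks for the two output lines a working -o would produce:

```shell
# Returns success if "grep -o" splits "ab" into two one-character lines.
# A grep without -o is expected to print an error (discarded here) and
# no matches, so the line count will not be 2.
grep_has_o() {
    [ "$(printf 'ab\n' | grep -o . 2>/dev/null | wc -l | tr -d ' ')" = 2 ]
}

if grep_has_o; then
    echo "grep -o is available"
else
    echo "grep -o is not available"
fi
```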

So it's not fully portable. My question: does anyone have 
sufficient sed or awk skills to advise me how to reproduce the above in 
one of those? As I said, it's a tokeniser, that splits the input into 
the discrete chunks indicated. I'm an awk novice. I'm trying to write 
code that assumes only POSIX, or failing that the common subset of 
GNU/BSD/Mac/Solaris. If I fail I can use various hacks (e.g. search for 
perl, use that if found, search for PHP, use that), but it'd be nice if 
I didn't have to resort to multiple code paths in that way.
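
For reference, one possible shape for an awk version is a loop around 
match(), which reports the leftmost-longest match in RSTART and RLENGTH. 
This is only a sketch under assumptions: it needs a POSIX-conforming awk 
(on Solaris that would presumably mean /usr/xpg4/bin/awk or nawk rather 
than /usr/bin/awk), it has not been tried on Solaris, tokenise_awk is a 
made-up name, and the "{4}" is spelled out as four hex classes because 
some awks have historically not supported interval expressions:

```shell
tokenise_awk() {
    awk '
    BEGIN {
        # Same pieces as the grep version. Inside an awk string a literal
        # backslash in the regex has to be written as four backslashes.
        HEX     = "[0-9a-fA-F]"
        ESCAPE  = "(\\\\[^u[:cntrl:]]|\\\\u" HEX HEX HEX HEX ")"
        CHAR    = "[^[:cntrl:]\"\\\\]"
        STRING  = "\"" CHAR "*(" ESCAPE CHAR "*)*\""
        NUMBER  = "-?(0|[1-9][0-9]*)([.][0-9]*)?([eE][+-]?[0-9]*)?"
        KEYWORD = "null|false|true"
        SPACE   = "[[:space:]]+"
        TOKEN   = STRING "|" NUMBER "|" KEYWORD "|" SPACE "|."
    }
    {
        rest = $0
        # Peel one token at a time off the front of the line. The final
        # "." alternative means a match normally starts at position 1.
        while (length(rest) > 0 && match(rest, TOKEN) > 0) {
            print substr(rest, RSTART, RLENGTH)
            rest = substr(rest, RSTART + RLENGTH)
        }
    }'
}

printf '%s\n' '{"a": [1, true]}' | tokenise_awk
```

Like grep -o, this works line by line and prints one token per output 
line, so it should be usable as a drop-in replacement for the last line 
of the tokeniser, subject to the assumptions above.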

Many thanks,
David

-- 
WordShell - WordPress fast from the CLI - www.wordshell.net