KushalKumaran - 2011-09-19:
Regular expressions, as implemented by common programming languages, are far more powerful than "regular expressions" in the sense computer science theorists use in connection with finite state automata. There are very good reasons behind this power. Even something as basic as a backreference is beyond the capabilities of regular expressions in the theoretical sense. I found this:
http://dev.perl.org/perl6/doc/design/apo/A05.html, where Larry Wall says: "... generally having to do with what we call "regular expressions", which are only marginally related to real regular expressions".
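To make the backreference point concrete, here is a minimal Python sketch. The set of strings made of some word repeated twice (such as "abab") is a standard example of a language no finite state automaton can recognize, yet one backreference handles it. The names DOUBLED and is_doubled are made up for this illustration.

```python
import re

# ([ab]+) captures a non-empty word over {a, b}; \1 demands the very same
# word again. Matching "the same thing twice" needs unbounded memory,
# which a finite state automaton does not have.
DOUBLED = re.compile(r"^([ab]+)\1$")


def is_doubled(s: str) -> bool:
    """Return True if s consists of some word over {a, b} repeated twice."""
    return DOUBLED.match(s) is not None


if __name__ == "__main__":
    for candidate in ("abab", "abba", "aa", "a"):
        print(candidate, is_doubled(candidate))
```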
PeterThoeny - 2011-09-19:
Thanks for the pointer, Kushal. I see there are dramatic changes/fixes to regexes coming with Perl 6.
EdGrimm - 2011-09-19:
That tagging is a great idea. I've used recursion in some toy code of mine, which works well, but has less than wonderful performance, and it has the limitation that the inner parentheses must be processed away or the outer parentheses cannot be evaluated. I haven't tested this yet, but it does have potential.
That having been said, it looks to me like it'd be better to put the tag after the parenthesis. That way, the parenthesis itself indicates that the next character will be the first character of a tag. There's no way any non-parenthesis character will be temporarily mistaken for a tag, triggering a backtrack. You then need a sequence to end the tag, but since you can't possibly be looking at a false positive, a single non-digit character will do.
So, instead of
$ROUND-esc-1( $TIME-esc-2( -esc-2), $TIMEDIFF-esc-2( $TIME-esc-3(
$T-esc-4( R$ROW-esc-5( -esc-5):C$COLUMN-esc-5( -1 -esc-5) -esc-4)
-esc-3), day -esc-2) -esc-1)
you have
$ROUND(1. $TIME(2. )2., $TIMEDIFF(2. $TIME(3.
$T(4. R$ROW(5. )5.:C$COLUMN(5. -1 )5. )4.
)3., day )2. )1.
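A rough Python sketch of the tag-after-parenthesis layout, for anyone who wants to try it. It assumes "." as the terminator and that the formula text does not already contain sequences that look like a tag. The names tag_parens, CALL and top_level_calls are invented for this example and are not taken from any existing code.

```python
import re


def tag_parens(text: str) -> str:
    """Append 'N.' after every parenthesis, where N is its nesting depth."""
    out = []
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
            out.append(f"({depth}.")
        elif ch == ")":
            if depth == 0:
                raise ValueError("unbalanced ')'")
            out.append(f"){depth}.")
            depth -= 1
        else:
            out.append(ch)
    if depth != 0:
        raise ValueError("unbalanced '('")
    return "".join(out)


# One $NAME(...) call in tagged text. The backreference \2 forces the closing
# tag to carry the same depth as the opening tag, and the '.' terminator keeps
# depth 1 from being confused with depth 10, 11, and so on.
CALL = re.compile(r"\$([A-Z]+)\((\d+)\.(.*?)\)\2\.", re.DOTALL)


def top_level_calls(text: str):
    """Return (name, tagged_argument_text) for each outermost call."""
    tagged = tag_parens(text)
    return [(m.group(1), m.group(3)) for m in CALL.finditer(tagged)]


if __name__ == "__main__":
    formula = "$ROUND($TIME(), $TIMEDIFF($TIME($T(R$ROW():C$COLUMN(-1))), day))"
    print(tag_parens(formula))
    print(top_level_calls(formula))
```

Because no closing parenthesis of depth N can appear between an opening parenthesis of depth N and its partner, the lazy `.*?` stops at the correct close without any recursion.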
PeterThoeny - 2011-09-20:
Thanks, Ed, for your insights. I'm not sure how much performance you gain by swapping the parenthesis with the escape token. The -esc- was just for visuals; in reality it is a null character. Does it make a difference in performance if you scan for .N( vs (N.? Your approach is better in the sense that you can take any non-digit char to terminate the sequence, vs. a null character to start the sequence.