 |
|
| Computers Forum Index » Computer Compilers » How to rewrite a regexp without word boundaries?... |
|
Page 1 of 1 |
|
| Author |
Message |
| Hallvard B Furuseth... |
Posted: Sun Jul 05, 2009 11:11 pm |
|
|
|
Guest
|
dave_140390 at (no spam) hotmail.com writes:
Quote: I have been wondering, with limited success, how to rewrite a regexp
without word boundaries.
Why do you want to? Most likely, the answer is that your regexps are
getting too clever and thus too unreadable/bug-prone, so you should
break them up and use more ordinary programming instead.
However:
Quote: (...)
Thus, regexp "ex" matches "example" and "text", whereas regexp "\bex"
matches "example" but not "text".
Now, if "\b" occurs at the beginning (or end) of the regexp, I think
it's easy to rewrite the regexp without using "\b". For example,
"\bex" could be rewritten as "\Wex".
No, that would not match "ex" at the beginning of the string being
matched. You could use (?:^|\W)ex. But the matched substring (Perl's
$&) will differ from with \bex, so it depends on how the regexp is used.
Quote: But what if "\b" occurs within the regexp? For example, how to get rid
of "\b" in "<RE>\bex" (with "<RE>" being any regexp)? "<RE>\Wex"
wouldn't work here: for example (with "<RE>" = "\W"), "\W\Wex" is not
equivalent to "\W\bex".
Rewrite <RE> to a regexp which ends with \W or \W|^ and is equivalent
in the cases where it is followed by \b\w.
--
Hallvard |
|
|
| Back to top |
|
|
|
| ... |
Posted: Mon Jul 06, 2009 8:43 pm |
|
|
|
Guest
|
Quote: I have been wondering, with limited success, how to rewrite a regexp
without word boundaries.
Why do you want to? Most likely, the answer is that your regexps are
getting too clever and thus too unreadable/bug-prone, so you should
break them up and use more ordinary programming instead.
The regexps are not mine... Sorry, I should have explained. I am
actually writing a tool that takes regexps as input and transforms
them internally into NFAs/DFAs. Since the regexps are not really in my
hands, I should be ready for weird regexps - for example, regexps with
"\b" preceded or followed by other regexps. And I don't know how to
transform a regexp that contains "\b" at an arbitrary position into an
equivalent NFA/DFA.
Quote: Now, if "\b" occurs at the beginning (or end) of the regexp, I think
it's easy to rewrite the regexp without using "\b". For example,
"\bex" could be rewritten as "\Wex".
No, that would not match "ex" at the beginning of the string being
matched. You could use (?:^|\W)ex.
You are right. My mistake.
Quote: But the matched substring (Perl's
$&) will differ from with \bex, so it depends on how the regexp is used.
I am not interested in the matched substring, so "(?:^|\W)ex" (or
rather the NFA/DFA corresponding to "(?:^|\W)ex") is fine... in this
particular case.
Quote: But what if "\b" occurs within the regexp? For example, how to get rid
of "\b" in "<RE>\bex" (with "<RE>" being any regexp)? "<RE>\Wex"
wouldn't work here: for example (with "<RE>" = "\W"), "\W\Wex" is not
equivalent to "\W\bex".
Rewrite <RE> to a regexp which ends with \W or \W|^ and is equivalent
in the cases where it is followed by \b\w.
Hmm, not so simple in the general case, but I will have to think about
this possibility.
-- dave |
|
|
| Back to top |
|
|
|
| Hans Aberg... |
Posted: Tue Jul 07, 2009 3:10 am |
|
|
|
Guest
|
dave_140390 at (no spam) hotmail.com wrote:
Quote: I have been wondering, with limited success, how to rewrite a regexp
without word boundaries.
...I am
actually writing a tool that takes regexps as input and transforms
them internally into NFAs/DFAs. ... I don't know how to
transform a regexp that contains "\b" at an arbitrary position into an
equivalent NFA/DFA.
I don't think that is possible: that is an additional mechanism on top
of the reg-ex structure.
Formally, let C be the character set, and C* the set of strings (the
free monoid on the set C). Then a language L is a subset of C*. The
regular expressions are operations on subsets of C* that define
languages. If L is a regular expression language, then a string either
matches - it is a subset of L - or it does not.
Now, when used in a regular expression matcher program, one selects a
parsing point in the string and looks at the set of strings of
successive lengths from that point. Usually, the match is the longest of
those strings that are in L. But there is a choice, which does not
depend on L.
So I think you need to implement a mechanism without the NFA/DFA
structure that identifies the \b boundaries.
Hans |
|
|
| Back to top |
|
|
|
| Andrew Tomazos... |
Posted: Tue Jul 07, 2009 4:33 pm |
|
|
|
Guest
|
On Jul 6, 10:43 pm, dave_140... at (no spam) hotmail.com wrote:
Quote: I have been wondering, with limited success, how to rewrite a regexp
without word boundaries.
Why do you want to? Most likely, the answer is that your regexps are
getting too clever and thus too unreadable/bug-prone, so you should
break them up and use more ordinary programming instead.
The regexps are not mine... Sorry, I should have explained. I am
actually writing a tool that takes regexps as input and transforms
them internally into NFAs/DFAs. Since the regexps are not really in my
hands, I should be ready for weird regexps - for example, regexps with
"\b" preceded or followed by other regexps. And I don't know how to
transform a regexp that contains "\b" at an arbitrary position into an
equivalent NFA/DFA.
Why don't you study Perl's regex engine to see how they implement it?
It is all open source. I did this a while ago. It is very
interesting to look at, Perl has the most advanced and heavily used
regex engine out of just about anything.
Also one thing to note is that in formal language theory "regular
expression" has a well-defined meaning. See Chomsky. Perl's regular
expressions do *not* classify as regular expressions under the formal
definition.
-Andrew.
[Perl's regex engine is swell, but if performance is an issue it's nowhere
near as fast as a DFA generated by flex or re2c. -John] |
|
|
| Back to top |
|
|
|
| Hallvard B Furuseth... |
Posted: Tue Jul 07, 2009 6:27 pm |
|
|
|
Guest
|
dave_140390 at (no spam) hotmail.com writes:
Quote: I am actually writing a tool that takes regexps as input and
transforms them internally into NFAs/DFAs.
It's been way too long since I played with NFAs/DFAs, but anyway:
Since you are saying NFA, maybe this a somewhat theoretical exercise
so it's OK to produce one with a godawful number of states. Is the
following possible?
Make a \b-tracking DFA, one which only notices word boundaries.
Make an NFA for your regexp without '\b's, but rewrite it so any jump
from a \b in the regexp is the only jump into the destination state.
Join the two machines (combining all possible states from the two) so
they in effect run in parallel in a big machine. Give \b as additional
input from the 1st submachine to the "coming from \b" states in the 2nd.
--
Hallvard |
|
|
| Back to top |
|
|
|
| Hannah Schroeter... |
Posted: Sun Aug 16, 2009 12:37 am |
|
|
|
Guest
|
Hello!
Andrew Tomazos <andrew at (no spam) tomazos.com> wrote:
Quote: [...]
Also one thing to note is that in formal language theory "regular
expression" has a well-defined meaning. See Chomsky. Perl's regular
expressions do *not* classify as regular expressions under the formal
definition.
-Andrew.
[Perl's regex engine is swell, but if performance is an issue it's nowhere
near as fast as a DFA generated by flex or re2c. -John]
IIRC perl's regex implementation (as well as glibc, btw) even lose
compared to *NFA emulation*, for example with a?^na^n, matched against
a^n (where "^n" is manually expanded to mean n times, i.e. with n=3
it'd be the regular expression a?a?a?aaa against string aaa). Java
loses, too (been there, tested it). OpenBSD's libc doesn't (seems to
be plain vanilla NFA emulation when it sees the expression is in fact
regular; yields the expected, roughly quadratic runtime behaviour).
See http://swtch.com/~rsc/regexp/regexp1.html for information about that
problem.
I also tested the same thing with flex, using a small script that
generates flex input and a test program for a given n. Horrendous
turnaround times (flex generation mostly) for increasing n, blazingly
fast match times, both as expected.
Kind regards,
Hannah.
[Spamassassin has an option to compile its patterns with re2c and link that into
perl. It takes the better part of an hour to compile, but it runs really fast. -John] |
|
|
| Back to top |
|
|
|
| Tony Finch... |
Posted: Sun Aug 16, 2009 3:58 pm |
|
|
|
Guest
|
Quote: [Spamassassin has an option to compile its patterns with re2c and link
that into perl. It takes the better part of an hour to compile, but it
runs really fast. -John]
On my 2Ghz Xeon E5335 server it takes 15 seconds to download an updated
ruleset and compile it using re2c.
Regarding the performance of parsing engines, the PEG formalism is
nice in that it gives the programmer much more straightforward control
over backtracking than perl-style regexes and nuch more predictable
performance, while allowing small and fast implementations (e.g. LPEG).
Tony.
--
f.anthony.n.finch <dot at (no spam) dotat.at> http://dotat.at/
[I timed it, it's not an hour, on my server it takes about 15 seconds to run re2c to create the C
source, but iover 15 minutes to compile and link that. Still hugely worth it for spamassassin which
runs several times per minute. -John] |
|
|
| Back to top |
|
|
|
|
|
All times are GMT
The time now is Sun Dec 06, 2009 10:07 am
|
|