 |
|
| Computers Forum Index » Computer Compilers » Parser classifications?... |
|
Page 1 of 1 |
|
| Author |
Message |
| Alexander Morou... |
Posted: Tue Aug 11, 2009 11:58 pm |
|
|
|
Guest
|
Hello,
First, thanks to everyone else who's given me advice in the past, I'm
not sure if I've said it, but I do appreciate it.
Previous posts I've made, about a parser compiler project I'm working
on, have discussed my efforts in trying to find a good method to
represent a top-down class parser that uses the utmost minimal
backtracking method possible. My original aim was a recursive descent
parser, after getting into it and evaluating other such programs that
use this approach, I've decided that recursive descent probably isn't
my ideal solution.
I think I've solved the major problem: finding a way to evaluate the
grammar in a deterministic manner, that can be represented on a
machine with finite limitations.
I'm not very good at explaining things in a formal manner, so if I'm a
bit cryptic, I apologize. So far, my goal has been to break a complex
problem down into smaller steps. Here's an overview:
1. Define each of the productions (aka rules) in a deterministic view
Separating transitions from state to state into two categories:
a. Token transitions move from one state to another, using only
information about the small, named, lexical patterns defined
by a given language.
b. Rule Transitions move from one state to another using the
identifier of a rule as the requirement, and also give a
pointer to the initial state of the rule represented by the
transition. This essentially is the primary bit of
information needed to calculate the follow sets, since it
specifies the internal target state of matching a rule, as
well the initial state for that rule.
2. Using the above deterministic model of the productions, define
a lazily evaluated, also deterministic, view of the language from
the start production. To determine the look ahead at any given
point, the process is further divided into two phases:
a. First evaluate the given state-trail used to reach a given
production. Due to finite limitations, this trail only needs
to be aware of the rules that are active in the current path
(for follow set inclusion) and the individual states of that
rule which can be used in place of that rule in the path (since
they point back to it). The segmented view does not provide
enough context information for parse graph production, but it
does provide ample data to give the valid set of tokens at any
given state of the language.
b. The first step operates by looking at every rule introduced
into the path 'stream', as well as the states that are
encountered as a result of a rule terminating. Since two
different paths can enter (or exit) the same rule, the second
phase concerns itself with only the original states, since the
path itself does not matter for calculation (but they are
remembered for sake of continuing the path[s].)
The above describes what I have so far, it's not going to give me a
parser, but it might be suitable for recognition, I'm presently
working on a supplemental step that will work with the above system to
handle path remembrance to build a proper parse graph.
From initial testing, it appears to handle left recursion fine.
Ironically, because left recursion repeats its follow information
infinitely until there are no more tokens associated to that state
set, it performs better than other types of recursion (since for a
left-most derivative parser, inner recursion and right recursion have
to have as many trailing states as they have starting states, so the
paths become longer and longer)
http://alexandermorou.com/text/CSStreamState.html
http://alexandermorou.com/text/CSStreamState+SourcesBuilder.html
http://alexandermorou.com/text/CSStreamState+BuildTransition.html
http://alexandermorou.com/text/CSStreamState+BuildTransition+ParentChildPairing.html
http://alexandermorou.com/text/CSStreamState+RulePath.html
http://alexandermorou.com/text/CSStreamState+RulePath+FollowInfo.html
http://alexandermorou.com/text/CSStreamState+SourcedFrom.html
http://alexandermorou.com/text/CSStreamState+SourcesInfo.html
http://alexandermorou.com/text/CSStreamState+StatePath.html
http://alexandermorou.com/text/CSStreamState+TransitionTable.html
http://alexandermorou.com/text/CSStreamState+TransitionTable+KeysCollection.html
http://alexandermorou.com/text/CSStreamState+TransitionTable+ValuesCollection.html
http://alexandermorou.com/text/TokenTransition.html
The above thirteen files is the bulk of the code associated to the
second phase. They were generated by a code generator, in which I
hand wrote the logic for a smaller language and embedded the logic
into the generator. For the non-colored versions, use .cs for the
file extension.
Does anyone here know what class of grammar this will be by the time
I'm done? Are there serious issues with the concept? Initial path
discovery and look-ahead generation can be slow (100 ms on 229 tokens
for the first attempt, 135 microseconds on the second pass.) I think
a majority of this is just the price paid for it being lazily
evaluated. The good news is (if there is such a thing), parses that
have similar components to them will use many of the same states.
As an example, if I were to step through the states on the production
of a method, regardless of the internal structure of that method, the
exit state on both should be identical (or at least, the states
leaving that exit state should be).
I've tried reading other articles about LL-class parsers, but I can't
seem to find a specific example of something similar. Perhaps I'm
looking for the wrong thing. Knowing how to refer to the kind of
parser used is helpful for discussion purposes.
Thanks,
PS: Yes I realize I'm using LL and top-down interchangeably, if this
is wrong, please let me know.
PPS: The above works a lot with paths and set construction what would
a formal description of its function look like? What am I lacking
knowledge wise so that I can describe it in a terser manner, because
long messages like this are painful to write, or read for that matter. |
|
|
| Back to top |
|
|
|
| Kenneth 'Bessarion' Boyd... |
Posted: Thu Aug 13, 2009 4:55 pm |
|
|
|
Guest
|
On Aug 11, 6:58 pm, Alexander Morou <alexander.mo... at (no spam) gmail.com> wrote:
Quote: I think I've solved the major problem: finding a way to evaluate the
grammar in a deterministic manner, that can be represented on a
machine with finite limitations.
I'm not very good at explaining things in a formal manner, so if I'm a
bit cryptic, I apologize. So far, my goal has been to break a complex
problem down into smaller steps. Here's an overview:
1. ....
2. Using the above deterministic model of the productions, define
a lazily evaluated, also deterministic, view of the language from
the start production. To determine the look ahead at any given
point, the process is further divided into two phases:
a. First evaluate the given state-trail used to reach a given
production. Due to finite limitations, this trail only needs
to be aware of the rules that are active in the current path
(for follow set inclusion) and the individual states of that
rule which can be used in place of that rule in the path (since
they point back to it). The segmented view does not provide
enough context information for parse graph production, but it
does provide ample data to give the valid set of tokens at any
given state of the language.
Looking ahead and seeing the performance concerns:
* May I assume this *is* being done incrementally, in a way that
calculates each step along the state-trail once?
* If not, do we have a strong memory conservation design requirement
that mandates this?
(I see this is being written in C#, so I'll hold off on commentary
specific to C++ .)
Quote: ....
Does anyone here know what class of grammar this will be by the time
I'm done?
You're setting this up to support arbitrarily long lookahead. I
assume that if I'm wrong that this approach could support LL(infinity)
(aside from resource limitations), that an actual expert will correct
my misreading.
Quote: ....
I've tried reading other articles about LL-class parsers, but I can't
seem to find a specific example of something similar. Perhaps I'm
looking for the wrong thing. Knowing how to refer to the kind of
parser used is helpful for discussion purposes.
Thanks,
...
PPS: The above works a lot with paths and set construction what would
a formal description of its function look like? ....
A formal description would be a state machine. I think your
difficulty in searching the literature is the same as mine:
* there is remarkably little concern about how to implement the state
machine, and
* the untutored ways of improvising state machines are very different
from what is generated by tools like yacc.
So, for instance, while distinguishing where the rules are coming from
(1a and 1b) is important for hand-maintainability, it's not nearly so
important for a formal description and it actually complicates the
data representation.
Likewise, working with paths is a specific way of using the
combinatoric graph used in formally defining the state machine. The
literature concentrates on how the mature automatic generators work
(e.g., rigging the control flow of the generated parser to to process
the initial conditions of rules with shared initial conditions only
once, and working well with an incoming token stream). |
|
|
| Back to top |
|
|
|
| Alexander Morou... |
Posted: Thu Aug 13, 2009 7:56 pm |
|
|
|
Guest
|
On Aug 13, 11:55 am, "Kenneth 'Bessarion' Boyd" <zaim... at (no spam) zaimoni.com>
wrote:
Quote: . . .
Looking ahead and seeing the performance concerns:
* May I assume this *is* being done incrementally, in a way that
calculates each step along the state-trail once?
* If not, do we have a strong memory conservation design requirement
that mandates this?
(I see this is being written in C#, so I'll hold off on commentary
specific to C++ .)
The process is done on an as-needed basis. So yes, it's incremental.
The primary reason that it reuses states within the system is due to
the way most rules in the examples I've been using work. From the
provided example (where I omitted the flat-DFA view of the rules),
methods always end with a '}', so while the path that reaches the '}'
might vary, leading to alternate versions of that '}' state, the
actual sub-set construction for what's -after- that is always
identical, so the states exiting are usually repeated. Exceptions
most notably to cases where the actual state set differ, as an
example, a 'compilation unit' from C# would use a different initial
state set for its very first state than it would after encountering
its first class or first namespace. After that class, certain rules
go out of scope, such as includes. This changes the initial/first set
after the first class' '}' brace.
The overall caching mechanism is two fold. Each state has its own
transition table, and there's a higher level unhinged cache for the
states. The actual paths themselves are fairly lightweight. While
they cross reference one another, it's in a single direction, the GC
for C# should be able to detect when useless paths were created and
clean them up on its own. Though I've often thought it might not be a
bad idea to do self cleanup, I want to get it working before I worry
about performance concerns (since those are really only important
after it works, if you optimize too early and you might end up making
it worse when you need to change something). State caching is used,
which works by checking the initial/first/follow set groups calculated
when a token transition is requested on the transition table of a
given state. If the resulted initial/first/follow set, is a set of
paths that exists, that match is found and reused (from the unhinged
cache). Since the paths themselves are very simple structures, the
time involved in checking for them shouldn't be terribly high (since
C#'s lists, and dictionaries use hash codes to avoid useless full
checks.) The issue I've found so far is non-left recursion, if you
were working with a parametered expression, there would only be as
many right parenthesis as there were left parenthesis, which is
similar to a recursive descent parser, except you could consider the
paths an emulation of that theme, though instead of looking at one
rule at a time, it looks at all of them. I don't know enough about
Big-O notation to tell you what kind of time concern is involved;
however as an example, if there was an expression with 200 left
parenthesis it would take 0.7 seconds to calculate all the lookaheads
as the paths get deeper, but 5.39 seconds if there were just twice as
many (400 left parenthesis, and for 800, 55.39 seconds). The number
of sources per level is constant. I doubt someone's going to use 200,
400, or even 800 parenthesis in the same expression. I'm guessing the
time requirement is quadratic in nature, 7.7 times longer from
200->400, 10.27 times longer from 400->800. Since the time
requirement is a multiplier versus a linear climb, it makes me think
quadratic. Though the multiplier doesn't appear constant, so there's
probably other variables involved in calculating it. Could someone
pipe in if they know more about this?
Quote: Does anyone here know what class of grammar this will be by the time
I'm done?
You're setting this up to support arbitrarily long lookahead. I
assume that if I'm wrong that this approach could support LL(infinity)
(aside from resource limitations), that an actual expert will correct
my misreading.
This is primarily due to the focus on no back tracking. It still
doesn't include cases where tokens might overlap. The parser that
drives the data aspect of this project is a very simple [badly]
hand-written parser. So I ignored the maximal munch principle, though
I think I will add a small bit of information to the parser to define
precedence (on equal length tokens) since grammars can span multiple
files.
Quote: So, for instance, while distinguishing where the rules are coming
from (1a and 1b) is important for hand-maintainability, it's not
nearly so important for a formal description and it actually
complicates the data representation.
The primary reason for including path information was to hopefully aid
in path discovery. Figuring out which rule you're on should be as
simple as evaluating the initial state set for a given state in the
language. From what I could tell, even if two productions shared the
first 100 tokens, from a shared sub-production, based upon a few
states that were distinct from either, the actual type of production
made itself clear, pretty quickly after the shared point in common
(the shared sub-production). Without this context information to aid
in such discovery, I'll have to rediscover which rules are involved,
and essentially redetermine which rule is being parsed when the parse
graph is constructed. So it's more than just about maintainability.
Quote: A formal description would be a state machine. I think your
difficulty in searching the literature is the same as mine:
* there is remarkably little concern about how to implement the state
machine, and
* the untutored ways of improvising state machines are very different
from what is generated by tools like yacc.
Improvising? Yes, true; however, everything was improvised at one
point. If I had a more complete educational background, perhaps I
would have had better luck defining a perfect theory the first time.
It might be improvised, but that doesn't mean it will always be so.
I hope to go back to school and research these things further, in a
directed manner where I can avoid common mistakes. I don't have that
luxury (school is a business, without money, no amount of interest
can help you), so I have to make due with what I have.
Quote: Likewise, working with paths is a specific way of using the
combinatoric graph used in formally defining the state machine. The
literature concentrates on how the mature automatic generators work
(e.g., rigging the control flow of the generated parser to to process
the initial conditions of rules with shared initial conditions only
once, and working well with an incoming token stream).
If you don't mind me asking, what literature?
Thanks, |
|
|
| Back to top |
|
|
|
| Kenneth 'Bessarion' Boyd... |
Posted: Mon Aug 17, 2009 4:30 am |
|
|
|
Guest
|
On Aug 13, 2:56 pm, Alexander Morou <alexander.mo... at (no spam) gmail.com> wrote:
Quote: On Aug 13, 11:55 am, "Kenneth 'Bessarion' Boyd" <zaim... at (no spam) zaimoni.com
wrote:
....
Does anyone here know what class of grammar this will be by the time
I'm done?
You're setting this up to support arbitrarily long lookahead. I
assume that if I'm wrong that this approach could support LL(infinity)
(aside from resource limitations), that an actual expert will correct
my misreading.
This is primarily due to the focus on no back tracking. It still
doesn't include cases where tokens might overlap. The parser that
drives the data aspect of this project is a very simple [badly]
hand-written parser. So I ignored the maximal munch principle, though
I think I will add a small bit of information to the parser to define
precedence (on equal length tokens) since grammars can span multiple
files.
Ok; no comment there (not sure how no back-tracking and the maximal
munch principle/optimization interact).
Quote: So, for instance, while distinguishing where the rules are coming
from (1a and 1b) is important for hand-maintainability, it's not
nearly so important for a formal description and it actually
complicates the data representation.
The primary reason for including path information was to hopefully aid
in path discovery. Figuring out which rule you're on should be as
simple as evaluating the initial state set for a given state in the
language. From what I could tell, even if two productions shared the
first 100 tokens, from a shared sub-production, based upon a few
states that were distinct from either, the actual type of production
made itself clear, pretty quickly after the shared point in common
(the shared sub-production). Without this context information to aid
in such discovery, I'll have to rediscover which rules are involved,
and essentially redetermine which rule is being parsed when the parse
graph is constructed. So it's more than just about maintainability.
Ok. This is an example of something I'd have never guessed just from
reading about parsers, but is blindingly obvious when actually having
to implement one along these lines.
Quote: A formal description would be a state machine. I think your
difficulty in searching the literature is the same as mine:
* there is remarkably little concern about how to implement the state
machine, and
* the untutored ways of improvising state machines are very different
from what is generated by tools like yacc.
Improvising? Yes, true; however, everything was improvised at one
point. If I had a more complete educational background, perhaps I
would have had better luck defining a perfect theory the first time.
Unlikely, when one is doing any kind of new research.
Quote: It might be improvised, but that doesn't mean it will always be so.
I hope to go back to school and research these things further, in a
directed manner where I can avoid common mistakes. I don't have that
luxury (school is a business, without money, no amount of interest
can help you), so I have to make due with what I have.
Understood.
Quote: Likewise, working with paths is a specific way of using the
combinatoric graph used in formally defining the state machine. The
literature concentrates on how the mature automatic generators work
(e.g., rigging the control flow of the generated parser to to process
the initial conditions of rules with shared initial conditions only
once, and working well with an incoming token stream).
If you don't mind me asking, what literature?
Ok...consider how yacc and lex interact.
* yacc (and its successor bison) are basically tools for automatically
constructing finite state machines in C that parse token sequences.
So the relevant literature would be on finite state machines, and how
to reduce state machines from the easy-to-specify ones
(nondeterministic ones) to the easy-to-automatically-generate ones
(deterministic). One wouldn't do badly starting with Wikipedia's
citations for finite state machines (
http://en.wikipedia.org/wiki/Finite_state_machine
); note that I haven't carefully checked that the article itself is
accurately stating its citations.
* lex is a tool that generates a C library (and a driver main() ) for
incrementally lexing a text file. Again, there's a state machine in
the background. The constraint of returning tokens in sequence,
rather than doing all of them at once, seems to be a historical
efficiency measure. (I'm speculating, but based on what I've read
about the 1970's systems this seems very plausible.)
However, while state machines are relatively easy to write automatic
generators for, and even to automatically optimize -- they are very
hard to explicitly verify. So they're not closely related to how one
would write a parser directly, as explicit verification (both proof,
and extensive unit-testing) becomes very important.
[Lex and yacc both generate state machines to match their input. Yacc
also has a state stack that lets it match nested and recursive
patterns, which the lex state machine can't do. A lex scanner calls
code you write every time it matches a token; you can return it
immediately to a calling routine (usually a yacc parser) or buffer it
somewhere and keep going, lex doesn't care. Compilers frequently feed
hints back from the parser to the scanner, that modify its behavior
for subsequent tokenizing, which wouldn't work if the scanner blasted
through the whole input first. -John] |
|
|
| Back to top |
|
|
|
|
|
All times are GMT
The time now is Tue Dec 01, 2009 8:59 am
|
|