I still have a little way to go, but OILexer is nearly capable of bootstrapping.
I found oddities within the grammar where my hand-written parser handled things incorrectly or, more accurately, inconsistently.
I have a few more things to fix in the predictions: follow ambiguities don't appear to behave correctly when a rule is being parsed as part of a reduction (which is itself done within a prediction).
As a result, if a rule is being parsed as a reduction and it enters an exit state on a token that is potentially ambiguous, the next token's identity ends up marked as 'none'. A quick fix would be to let that rule terminate, check for the 'none' on the stack, and clear the look-ahead, but I feel that's the wrong approach and I'd just be putting a band-aid on the problem. It's especially concerning if the ambiguity that's possible on that edge is valid for the current rule to continue. Allowing lexical ambiguities opened up a whole new can of worms!
After I fix this, I'm thinking about allowing the tokenizer to terminate its scan early based on the rule's context. This would cover situations where two tokens are identical up to some point but one of them continues on: if the longer token isn't valid in the language at that point, terminating early could enable certain oddly lexically ambiguous languages to parse accurately.
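To make the idea concrete, here is a minimal sketch in Python. It is not OILexer's code: the token names, the patterns, and the `next_token` helper are all hypothetical, and I'm assuming the parser can hand the tokenizer the set of tokens that are valid at the current point. It shows a longest-match scan that only considers those tokens, so the shorter token wins when the longer one isn't allowed.

```python
import re

# Hypothetical token table; names and patterns are illustrative only.
TOKENS = {
    "ShiftRight": re.compile(r">>"),
    "GreaterThan": re.compile(r">"),
    "Identifier": re.compile(r"[A-Za-z_]\w*"),
}

def next_token(text, pos, allowed):
    """Return (name, lexeme) for the longest match among the allowed tokens."""
    best = None
    for name in allowed:
        m = TOKENS[name].match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

# Context-free: every token is a candidate, so the longer '>>' wins.
unrestricted = next_token("a>>", 1, TOKENS.keys())

# Context-aware: the current rule only permits '>', so the scan
# effectively stops after one character.
restricted = next_token("a>>", 1, {"GreaterThan"})
```

Filtering the candidates up front is only one way to get early termination; a state-machine tokenizer would more likely prune transitions as it scans, but the observable result should be the same.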
Take type identifiers, for instance: the type name can be 'Version=1.0.0.0'. Unless you made your version token parse 'Version', '=', and #.#.#.# as a single unit, the terminals that specify 'Version' and then '=' will never be registered on the token stream, because the QualifiedIdentifier would consume all the characters for them. One option is to break the QualifiedIdentifier apart into a rule and separate its elements with a '=', but in testing this is hackish and doesn't appear to work for the worst case:
Version=1.0.0.0, Version\=1.0.0.0, Version=1.0.0.0,...
Why someone would do that is beyond me, but .NET itself has no issues parsing it, though it does cause issues with the Visual Studio Hosting Process. It seems to really dislike even the assembly name matching that pattern, let alone the type (which would have to be built using ilasm.)
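As a rough illustration of why that worst case hurts, here is a small Python sketch. The patterns are my own guesses for the sake of the example (in particular, I'm assuming a QualifiedIdentifier accepts anything up to an unescaped comma), and `scan` is a hypothetical helper, not part of OILexer. It runs a plain longest-match scan at the third 'Version', then the same scan limited to what the rule would expect there.

```python
import re

# Hypothetical patterns; the real grammar is not shown here.
TOKENS = {
    # Assumption: any run of characters up to an unescaped comma.
    "QualifiedIdentifier": re.compile(r"(?:\\.|[^,\\])+"),
    "VersionKeyword": re.compile(r"Version"),
    "Equals": re.compile(r"="),
    "VersionNumber": re.compile(r"\d+\.\d+\.\d+\.\d+"),
}

def scan(text, pos, allowed):
    """Longest match at pos among the allowed token names, or None."""
    best = None
    for name in allowed:
        m = TOKENS[name].match(text, pos)
        if m and (best is None or m.end() > pos + len(best[1])):
            best = (name, m.group())
    return best

text = r"Version=1.0.0.0, Version\=1.0.0.0, Version=1.0.0.0"
start = text.rfind("Version")  # position of the third 'Version'

# Without context, the identifier swallows the keyword, '=' and number.
greedy = scan(text, start, TOKENS.keys())

# With context, only the keyword is a candidate at this point.
in_context = scan(text, start, {"VersionKeyword"})
```

In this sketch `greedy` comes back as a QualifiedIdentifier covering all of 'Version=1.0.0.0', while `in_context` yields just the 'Version' keyword, leaving '=' and the number to be scanned as their own tokens.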