Correct way to derive parse trees, for a given input, to show ambiguity.
Is it correct to state that the grammar below can derive the expression a - b * c via the two different parse trees below?
Code:
Grammar:
Expression = Expression "-" Expression
           | Expression "*" Expression
           | Factor
Factor     = "a" | "b" | "c"
Parse tree #1:
Expression -> Expression - Expression
=> Factor - Expression
=> a - Expression
=> a - Expression * Expression
=> a - Factor * Expression
=> a - b * Expression
=> a - b * Factor
=> a - b * c
Parse tree #2:
Expression -> Expression * Expression
=> Expression - Expression * Expression
=> Factor - Expression * Expression
=> a - Expression * Expression
=> a - Factor * Expression
=> a - b * Expression
=> a - b * Factor
=> a - b * c
I feel that the second derivation above is completely wrong, as it does not consider the left-to-right scan of the input string.
The decision should have been based on the lookahead symbol only.
If possible, show by a C language code snippet how the implementation of the second derivation is correct, as the book Compilers and Compiler Generators: An Introduction with C++, by Patrick D. Terry, shows the above example in Section 6.4, page 131.
For me, the ambiguity can only be shown by the second derivation below:
Code:
Parse tree #2:
Expression -> Expression - Expression
=> Expression - Expression * Expression
=> Factor - Expression * Expression
=> a - Expression * Expression
=> a - Factor * Expression
=> a - b * Expression
=> a - b * Factor
=> a - b * c
The second way is inspired by Louden's book example on page 115, as shown in the attachment below.
But the bigger question is that I am not sure whether an implementation of the above grammar, say in C, would leave any chance of generating the second parse tree, either of the first type (Patrick Terry's book) or of the second type (inspired by Louden's book, whose screenshot is attached).
Otherwise, I don't know what to tell my students about which approach is correct.
It would be better if you could state some implementation details about building an integrated lexer-parser from this grammar, with the lexer's lookahead passed on to the parser.
But I have to explain why one approach is correct.
Or, if both approaches are correct, then why?
Edit:
It is not a question of whether the grammar is ambiguous or not; the topic is mandatory in the syllabus, and hence the example must be drawn from an ambiguous grammar.
And to simplify the parse tree examples let's shorten the names and add rule numbers:
Code:
1 E -> E - E
2 | E * E
3 | F
4 F -> "a"
5 | "b"
6 | "c"
It is important that you see that this grammar is the same as the one you posted, but more readable.
Quote:
Originally Posted by ajiten
I feel that the second derivation above is completely wrong, as not considers the left-to-right scan of the input string.
The decision should have been based on the lookahead symbol only.
What decision? What lookahead symbol?
I think you may be confusing yourself by trying to see code where there is none. The question and example grammar you have posted have nothing obvious to do with scanning the input line or any lookahead symbols - nothing to do with any parser or code at all, nor making of decisions.
It appears to be all about understanding that the grammar itself is ambiguous because there is more than one way to produce the sentence a-b*c from it (i.e., more than one parse tree). Deriving those parse trees (production graphs) from the grammar is what makes the ambiguity visible, or as the question says, "to show the ambiguity".
Important things to remember are that the grammar produces or generates the sentence, and the production graph (i.e., parse tree) visually represents how it does so. This is why the grammars are called generative grammars, and the rules are called productions.
So, to produce a-b*c from this grammar we start with the start symbol E, and replace it with each of its right sides. We have two possible choices which produce the sentence: Rules 1 and 2, E -> E - E and E -> E * E respectively.
First, using production 1 (rule 1), both leftmost and rightmost derivations produce the same parse tree:
Code:
Rule order, leftmost derivation: (1,3,4) (2,3,5) (3,6)
Rule order, rightmost derivation: (1,2,3,6) (3,5) (3,4)
E(-)
/|\
/ | \
/ | \
E | \
| | \
| | \
F | E(*)
| | /|\
| | / | \
| | / | \
| | E | E
| | | | |
| | | | |
| | F | F
| | | | |
| | | | |
a - b * c
But production 2 also leads to the sentence. Using production 2 (rule 2), both leftmost and rightmost derivations produce the same parse tree, but different from the one above:
Code:
Rule order, leftmost derivation: (2,1,3,4) (3,5) (3,6)
Rule order, rightmost derivation: (2,3,6) (1,3,5) (3,4)
E(*)
/|\
/ | \
/ | \
/ | E
/ | |
/ | |
E(-) | F
/|\ | |
/ | \ | |
/ | \ | |
E | E | |
| | | | |
| | | | |
F | F | |
| | | | |
| | | | |
a - b * c
Leftmost and rightmost derivations are not about scanning the input from left or right - they are about the order in which you replace nonterminals with their alternatives at each step of the production process.
This is the correct way to derive parse trees, for a given input, to show ambiguity.
Last edited by astrogeek; 08-09-2023 at 01:55 AM.
Reason: less obscure
Request inputs, as I am unable to grasp this. I can only understand that if a left-to-right scan is not needed, then the lexer should have read the entire input line, up to the end-of-line marker, before parsing can begin.
[Very sorry, but I cannot resist asking: since C follows rightmost derivation (bottom-up parsing), i.e. one statement is processed on the stack at a time, C has a single statement in its context. Why then is error reporting in C so imprecise that an error is often ascribed to a different line number?]
Though that means the whole line must be kept in primary storage, i.e. on the stack (for bottom-up parsing).
Hence the parser must be a separate phase from the lexer.
Since a bottom-up parser can apply rule #2 before rule #1, I would not focus on that.
The focus should be on the top-down parser.
For top-down parsing, storage can be more economical, as there can be a closed-loop parser-scanner (or, let me call it, a loop-type parser), where the lexer returns one token on demand to the parser.
But for this loop-type parser, where the derivation is decided based on each subsequent token supplied by the lexer, please tell:
how can left-to-right parsing pick rule #2 first (as is done in your second diagram), rather than only being able to pick rule #1 first?
I can only derive the first diagram under the leftmost derivation, as shown below; and I am unable to derive the second diagram's derivation using a leftmost derivation.
For the given input a - b * c, with the loop-type parser, the parser is fed the token 'a' first.
It will apply rule #4 first.
Next, it should apply rule #3.
Next, the lexer provides the '-' token.
This provides an incomplete context for applying rule #1.
At this point we need the next token, which is 'b'.
But the leftmost derivation has successfully got: E - b
till now.
That means rule #1 can be applied.
However, lookahead is needed to save on an unsuccessful application.
That yields the token: *.
Now we have: E - b *
Next, rule #5 is applied to get: E - F *
Then rule #3 is applied to get: E - E *
Next, more input is needed to decide whether rule #2 can be applied, or the input is erroneous, i.e. whether there is no suitable token after *.
The next token is: c.
Hence we get: E - E * c
Then rule #6 is applied to get: E - E * F
Next, rule #3 is applied to get: E - E * E
Next, rule #2 is applied to get: E - E
Finally, rule #1 is applied to get: E.
But even here I cannot see how, for the first diagram, we get the first set, i.e. (1,3,4).
I am somewhat reluctant to add further comment for fear of adding confusion. But it is an important point to understand and I have begun, so one more attempt at clarity...
Only you can judge whether and how this information might fit into your current class plan. Consider carefully.
Quote:
Originally Posted by ajiten
Request inputs as am unable to grasp. Can only understand that if left-to-right scan is not needed, then lexer should have read the entire input, in a given line, till end-of-line marker; before the parsing can begin.
As Neo was told by a young potential in the movie The Matrix, "The important thing to realize is, there is no spoon".
Code:
Correct way to derive parse trees, for a given input, to show ambiguity.
In the scope of this question and these replies, there is no scan, there is no lexer, there is no stack, there is no top-down and no bottom-up - there is no parser or parsing method involved in any way. There is only the grammar and the sentence to be produced, a - b * c (here called the input).
Can the grammar produce this sentence? If so then the sentence is valid in the language described by the grammar.
Can the grammar produce this sentence in more than one way? If so then the grammar is ambiguous.
This grammar was chosen for its ambiguity and the question is how to make the ambiguity visible by deriving the sentence from the grammar in all possible ways (i.e., all parse trees).
Derivation (pencil and paper or chalkboard, not parser) is performed by beginning with the start symbol nonterminal and rewriting it in terms of each of its right-hand sides. Then repeatedly rewriting nonterminals in the resultant forms until there are none left. The sequences of replacements which lead to the sentence provide the parse tree(s).
Quote:
Originally Posted by ajiten
But, even here cannot see how for the first diagram, get the first set, i.e. (1,3,4).
I think it important to note that you have referred to my rewriting-order notation (1,3,4) as the "first set". Your use may have been incidental, or not. The term FIRST set is important in parser implementations, but the notation I have used is not related to that in any way, and I have not used that term at all. Just to ward off further confusion of terms.
Here, reproduced in full, is that first derivation (i.e., parse tree), leaving the similar derivation of the other tree as an exercise.
And remember - there is no scanner, no parser and no parse method in this context.
Code:
The grammar:
1 E -> E - E
2 | E * E
3 | F
4 F -> "a"
5 | "b"
6 | "c"
The first derivation:
E(-)
/|\
/ | \
/ | \
E | \
| | \
| | \
F | E(*)
| | /|\
| | / | \
| | / | \
| | E | E
| | | | |
| | | | |
| | F | F
| | | | |
| | | | |
a - b * c
Leftmost derivation: Rewrite leftmost nonterminal until none remain
E (start symbol)
=> E - E Rule 1
=> F - E Rule 3
=> a - E Rule 4 (1,3,4)
=> a - E * E Rule 2
=> a - F * E Rule 3
=> a - b * E Rule 5 (2,3,5)
=> a - b * F Rule 3
=> a - b * c Rule 6 (3,6)
Rightmost derivation: Rewrite rightmost nonterminal until none remain
E (start symbol)
=> E - E Rule 1
=> E - E * E Rule 2
=> E - E * F Rule 3
=> E - E * c Rule 6 (1,2,3,6)
=> E - F * c Rule 3
=> E - b * c Rule 5 (3,5)
=> F - b * c Rule 3
=> a - b * c Rule 4 (3,4)
Here, the students' exposure (as well as inclination) to practical implementation is minimal, if present at all, and that worries me enough to just state that the ambiguity is a theoretical one.
So the students need to go the reverse way here, i.e. first understand how it can be done.
Though I fear that, if coded, the top-down parser would fail due to left recursion.
But again, that has to be shown before moving on to the LL(1) grammar.
Otherwise the disconnect would be so large that the students would be rote-learning, without a real understanding of the issues involved.
Yes, the theoretical ways of leftmost and rightmost derivation need to be shown, but in the absence of an implementation that would confuse more than help.
The current context is top-down parsing, before reaching LL parsing.
The practical (implementation) part would be needed to show how the leftmost derivation occurs, both by parse-tree derivation and by code. I then need to show how the left-recursive grammar makes it fail.
Similarly, I need to show how the rightmost derivation occurs, by parse-tree derivation, for a given input string.
I think the rightmost derivation needs slightly more effort, as it involves keeping symbols on a stack, and that immediately requires the concept of handles.
So I can postpone the implementation of the rightmost derivation, unless I am wrong.
The focus for now (until we reach LR parsing) can be on the implementation of the leftmost derivation alone.
The gap in my understanding is there, but it has to be plugged somewhere.
Coming back: I must show, for the top-down parser, the different implementation methods:
(1) with top-down parsing, a loop-type parser is a possibility, but not for the rightmost derivation;
(2) how to implement the leftmost derivation, and also show the infinite-loop error for the given left-recursive grammar.
So my last response was along that line, i.e. to show (1).
But I need to derive the leftmost derivation first, in a top-down manner, as the possible implementation of the leftmost derivation.
Though the rightmost derivation still needs to be shown theoretically.
---------------------------------------
I am attempting the coding too, but there as well I am unable to progress further on my own.
What I plan is to take Louden's code, as at: http://www.cs.sjsu.edu/faculty/louden/cmptext/
and adapt (simplify) it to suit a starter class, then move to the actual code for the full course.
-----------------------------------------
I want to add that your leftmost derivation of the parse tree does not look like top-down parsing to me.
Hence I am in a fix about how to present it as done by top-down parsing.
The reason is that in class I would only start from the terminals.
But you are starting from the nonterminal symbols.
Even for the rightmost derivation, I cannot explain it without taking the terminals first, and deriving the left part of a production rule from the right one.
My inability to see parsing without the terminal symbols being put first, and the rules applied later, is basic; otherwise the actual rule-application process cannot be seen.
Please vet my analysis of top-down parsing for the given grammar and input string:
The first input token is: a;
the parser starts with the start symbol, E.
Then the rules are tried successively, starting from rule #1.
The first rule demands one symbol of lookahead, i.e. '-'.
----------------
First, assume that the implementation is without lookahead; then rule #3 is the only one left.
That leads to the application of rule #3.
But now the next input token is '-'.
There is no rule that can handle it, except #1.
Hence we need to backtrack and apply rule #1 first.
Then comes the application of rule #3, and then rule #4.
But how this memory of earlier failed attempts is kept by the top-down parser is the issue.
----------------
Second, assume that lookahead is available.
Here, lookahead is needed in addition to the current input token.
Then it is easy, as rule #1 can be applied in anticipation of the symbol after the lookahead (of '-') being any terminal from a/b/c.
Then rule #3 can be applied.
Then rule #4 is applicable.
We have read up to: a -
The parse tree has the rules applied as: (1, 3, 4)
The next input is: b.
But the lookahead token is: *
For this we need rule #2, then #3, then #5.
So the rules applied are: (1, 3, 4) (2, 3, 5)
Now we read the token *, and the lookahead is: c.
So we need to choose among the six rules; with the end-of-input marker as the following lookahead, only rule #3 can be applied, and then the choice is limited to rule #6 alone.
So the rules applied in sequence, for the branches in the leftmost derivation, are: (1, 3, 4) (2, 3, 5) (3, 6).
But I am confused (even with the lookahead-based implementation) about how to put the code in place.
Also, how to show the infinite-recursion error for a C implementation, and hence the need for an LL parser.
Request vetting of the below, which is part of preparing easily accessible notes on the pumping lemma for my class:
I have read that finite languages are a subset of the regular languages, but their grammars can only have >= 1 terminals on the RHS of any production rule.
That in turn means that the set of productions is the set of all words in the language.
Say, for the below grammar generating a finite language:
Code:
G_f = (q_0, Σ={X,a,b}, q_F, P={X -> aa | bb})
If a regular grammar were to generate, with DFA of the form:
How do I prove, using the pumping lemma on a word, that the language accepted by the C language's parser is not regular?
I want to show, using the pumping lemma, that the language handled by the (LALR, bottom-up) parser of the C language is not regular.
Request some inputs, or some literature sources.
I need a word in the language to show that it fails the test.
But it is not clear what criteria the word must possess.
At best I can think of a programming-language construct, say the while statement.
The construct is:
Code:
while (i > minimum)
    i--;
I cannot see how to split this into the three subparts needed for the proof by contradiction, nor how to find the pumping length.
It seems an arithmetic-expression grammar (as below, which is a CFG) can be a suitable candidate, and ideally I should be able to pick a word in it to prove the claim by contradiction.
Code:
stmt: ID '=' expr ';'
expr: term | expr '-' term | expr '+' term
term: factor | term '*' factor | term '/' factor
factor: ID | NUMBER | '(' expr ')' | '-' factor
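For that grammar, the standard pumping-lemma candidate is nesting via factor: '(' expr ')', not any particular statement. A sketch of the usual argument (assuming ID is a single token, and treating tokens as symbols):

```latex
% Standard non-regularity argument: the expression language contains
% arbitrarily deeply nested parentheses, which no DFA can balance.
Let $L$ be the language generated by \texttt{expr}, and suppose $L$ is
regular with pumping length $p$.  Choose the word
\[
  w = \underbrace{((\cdots(}_{p}\;\mathit{ID}\;\underbrace{)\cdots))}_{p} \in L .
\]
Any split $w = xyz$ with $|xy| \le p$ and $|y| \ge 1$ places $y$ entirely
inside the leading run of parentheses, say $y = ({}^{k}$ with $k \ge 1$.
Then
\[
  xy^{2}z = \underbrace{((\cdots(}_{p+k}\;\mathit{ID}\;\underbrace{)\cdots))}_{p}
\]
has $p+k$ opening but only $p$ closing parentheses, so $xy^{2}z \notin L$,
contradicting the pumping lemma.  Hence $L$ is not regular.
```

The key criterion for the chosen word is exactly this: it must force the automaton to count something unbounded, and nested parentheses do that; a flat while statement does not.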
The two code parts in your post have no connection at all.
There is no ambiguity in parsing the stmt -> "while" '(' expression ')' stmt rule. If you want an ambiguity, consider 'else':
if (color==yellow) if (taste==acerb) fruit= lemon; else fruit= mellon;
How to interpret this?
Code:
#1 if (color==yellow) { if (taste==acerb) fruit= lemon; else fruit= mellon; } /*no else here */
#2 if (color==yellow) { if (taste==acerb) fruit= lemon; /* no else here */} else fruit= mellon;
I never thought that ambiguity in a programming construct (i.e., the if-else statement) alone would be the sole place to apply the pumping lemma.
In fact, I was more hopeful of finding an application of the pumping lemma in arithmetic expressions.
I have understood that the basic if-else statement (i.e., rule #3: stmt -> "if" '(' expr ')' stmt)
would provide the pumping length.
Rule #3 can be repeated any number of times.
In the pumping lemma, we divide the input word (string) into a concatenation of 3 parts, such that:
for any regular language L there exists a constant p such that any word w in L with length at least p can be split into three substrings, w = xyz, where the middle portion y must not be empty, such that the words xz, xyz, xyyz, xyyyz, ... constructed by repeating y zero or more times are still in L.
Hence, if before entering the if-else statement the program execution had some state q_i, and if rule #3 is applied and the program enters another state q_j, then any number of further applications of rule #3 would leave the program in the same state, i.e. q_j.
So we have the program represented as xy^iz, with y being the part concerned with the if-else programming construct.
The block of executable statements occurring sequentially before y is denoted by x, and the block after y by z.
The pumping length would then be the length of the y block.
if (color==yellow) if (taste==acerb) fruit= lemon; else fruit= mellon;
if ((color==yellow) && (taste==acerb)) fruit= lemon;
else if ((color==yellow) && (taste!=acerb)) fruit= mellon;
Quote:
How to interpret this?
Code:
#1 if (color==yellow) { if (taste==acerb) fruit= lemon; else fruit= mellon; } /*no else here */
#2 if (color==yellow) { if (taste==acerb) fruit= lemon; /* no else here */} else fruit= mellon;
Both #1, #2 are handled the same as above.
Edit: It is unclear why the first example is given,
Code:
if (color==yellow) if (taste==acerb) fruit= lemon; else fruit= mellon;
as the outer if does not count in the pumping lemma.
Adding braces to that example makes no difference to the non-applicability of the pumping lemma.
Also, the second example makes me think that both the if and the else parts, standalone, are useful candidates for the pumping lemma, apart from the if-else construct.