Programming: This forum is for all programming questions.
The question does not have to be directly related to Linux, and any language is fair game.
I still don't get why not continue on with Lex and YACC?
It's not that Lex and YACC are too complicated; it's mostly that there are lots of little but important things that aren't mentioned in tutorials. Such as, for example, the type of $$, $1, $2, etc. in YACC.
Quote:
I still don't get why not continue on with Lex and YACC?
It's not that Lex and YACC are too complicated; it's mostly that there are lots of little but important things that aren't mentioned in tutorials. Such as, for example, the type of $$, $1, $2, etc. in YACC.
The little things make it complicated.
The overall approach of a backtracking parser that tries language constructs makes the developer's life easier: the developer implements language constructs and callbacks step by step, of course feeding in the currently supported set of language constructs for testing.
So, by construction, there is no possibility of getting a message saying some piece is not described, or whatever it was in your case. The developer knows exactly (until proven wrong by tests) what is supported and what is not.
Also, as I wrote, it's easy to visualize the trial-error-backtracking process performed by the parser, i.e. it's easy to debug the definitions of language constructs.
Lex & Yacc are great for someone who doesn't want to learn everything about grammars and parsing and the theory of it all. I couldn't imagine writing a decent parser for a non-trivial grammar without some formal background in the art. I've seen enough cases of it being done poorly, when a much better equivalent product could be produced with less effort with a documented tool such as lex & yacc (even including the effort to learn lex & yacc, which is non-trivial itself). Moreover, a roll-your-own parser will be understood by you only, whereas anyone acquainted with BNF will be able to quickly understand a lex & yacc based parser. Even if you can't produce a parser yourself, figuring out what an existing grammar and parser based on yacc does is fairly straightforward.
To Sergei and others who prefer to roll their own, I take my hat off. But for a novice trying to cobble together a calculator or simple programming language, it doesn't make sense to me. If you are interested in the theory behind the subject, that's a different matter, and probably the only way to go is to do it yourself. Write your own lex & yacc, then use it to build a calculator. There's a learning experience.
I agree that it is difficult to find good information/education on the tools. Other than the links already posted in this thread, have you come across any nuggets that you would like to share?
The user must supply a lexical analyzer to read the input stream and communicate tokens (with values, if desired) to the parser. The lexical analyzer is an integer-valued function called yylex. The function returns an integer, the token number, representing the kind of token read.
Quote:
Originally Posted by MTK358
The types of $$, $1, $2, etc.
It is YYSTYPE:
Quote:
Support for Arbitrary Value Types
By default, the values returned by actions and the lexical analyzer are integers. Yacc can also support values of other types, including structures. In addition, Yacc keeps track of the types, and inserts appropriate union member names so that the resulting parser will be strictly type checked. The Yacc value stack (see Section 4) is declared to be a union of the various types of values desired. The user declares the union, and associates union member names to each token and nonterminal symbol having a value. When the value is referenced through a $$ or $n construction, Yacc will automatically insert the appropriate union name, so that no unwanted conversions will take place. In addition, type checking commands such as Lint[5] will be far more silent.
...
Alternatively, the union may be declared in a header file, and a typedef used to define the variable YYSTYPE to represent this union.
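A minimal sketch of what the quoted passage describes, using a hypothetical grammar (the names NUMBER, IDENT, and expr are invented for this example):

```yacc
/* Hypothetical fragment: %union makes YYSTYPE a union,
   and <member> tags tell yacc which member each symbol uses. */
%union {
    int   ival;    /* value of integer literals */
    char *sval;    /* value of identifiers      */
}
%token <ival> NUMBER
%token <sval> IDENT
%type  <ival> expr
%%
expr : NUMBER          { $$ = $1; }       /* $$ and $1 are ints here        */
     | expr '+' NUMBER { $$ = $1 + $3; }  /* yacc inserts .ival for you     */
     ;
```

So the answer to "what type is $$?" is: whatever union member the symbol was declared with; yacc substitutes the member name automatically.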
Quote:
The user must supply a lexical analyzer to read the input stream and communicate tokens (with values, if desired) to the parser. The lexical analyzer is an integer-valued function called yylex. The function returns an integer, the token number, representing the kind of token read.
You don't understand. I am not talking about the return type of yylex(), but about the semantic value of the token.
For example, an integer token will have an int value, and a string token will have a char* value.
Quote:
You don't understand. I am not talking about the return type of yylex(), but about the semantic value of the token.
For example, an integer token will have an int value, and a string token will have a char* value.
And if you use a backtracking parser (at least, this was the case in my implementation), the engine returns start and end pointers into the text (i.e. start line/column numbers and end line/column numbers) corresponding to the recognized language construct, so your question about the token value becomes irrelevant.
I think ntubski actually did answer your question regarding how to return token types. In lex, the standard idiom is to use the symbols defined in y.tab.h, which derive from the %tokens in the yacc grammar, as the return values. This is different from the value of the token, the type of which will vary depending on the data being parsed. The value of the token is stored in yylval, which is usually a union, so that it can store data of differing types. So, there are two things supplied by the lexical analyzer; the type of the data, and the value of the data. The former is the (integer) return value of the lexical analyzer, and the latter is a side effect of the lexical analyzer.
On your second question, the types of the $$, $1, $2 etc. The type of these will vary, depending on the type of the data for the token. The grammar parser code that uses the data knows/assumes that the appropriate data will be in yylval.
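The standard lex idiom described above, sketched as a hypothetical rules file (it assumes a %union with ival and sval members and tokens NUMBER and IDENT declared on the yacc side):

```lex
%{
#include <stdlib.h>
#include <string.h>
#include "y.tab.h"   /* token codes generated by yacc -d */
%}
%%
[0-9]+     { yylval.ival = atoi(yytext);   return NUMBER; } /* value + kind */
[A-Za-z]+  { yylval.sval = strdup(yytext); return IDENT;  }
[ \t\n]    ;                         /* skip whitespace        */
.          { return yytext[0]; }     /* single-char tokens     */
%%
```

Each action does both jobs at once: it stores the token's value in yylval (the side effect) and returns the token's kind (the return value).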
One concept that may be missing in your understanding is that there is a somewhat circular nature to the use of the grammar parser and the lexical analyzer. The grammar parser creates y.tab.h based on the grammar, and in essence says to the lexical analyzer "here is what I expect and here is the data structure(s) I will use to accept the data". At least this was something that I was late to understand; maybe the same for you. When used in combination, it is important to first run yacc to cause any changes in the grammar to be reflected in the y.tab.h, which is then used by lex.
The rest of your questions seem very project specific, and I will defer to those more skilled in the art to make suggestions.
--- rod.
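The build order described above, as a sketch of the usual command sequence (the file names calc.y and calc.l are placeholders, and library flags vary between systems):

```shell
# Run yacc first, with -d so it emits y.tab.h;
# lex.yy.c #includes y.tab.h, so it must exist before lex output is compiled.
yacc -d calc.y                        # produces y.tab.c and y.tab.h
lex calc.l                            # produces lex.yy.c
cc -o calc y.tab.c lex.yy.c -ly -ll   # link parser, lexer, and runtime stubs
```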
Quote:
...
One concept that may be missing in your understanding is that there is a somewhat circular nature to the use of the grammar parser and the lexical analyzer. The grammar parser creates y.tab.h based on the grammar, and in essence says to the lexical analyzer "here is what I expect and here is the data structure(s) I will use to accept the data". At least this was something that I was late to understand; maybe the same for you. When used in combination, it is important to first run yacc to cause any changes in the grammar to be reflected in the y.tab.h, which is then used by lex.
The rest of your questions seem very project specific, and I will defer to those more skilled in the art to make suggestions.
--- rod.
Which just proves my point - tokens are a redundant entity.
So you mean tokens are not necessary if the parser generator can accept regexes as terminals?
Is there such a parser generator?
I gave enough links and "tips". Start thinking differently, i.e. not in terms of tokens.
Regular expressions are just a tool to formulate what a language construct is. The idea is there is an action per language construct, and the language constructs may be compound.
In my engine, regular expressions were just a kind of "built-in" parametrized language construct (the parameter being the regular expression itself).
To better understand what this is about, implement a primitive backtracking parser from scratch. Make it understand, say, assignments in the form of
OK, so they use tokens, but as long as you can easily pass their semantic value, who really cares? And even so, that's not my main problem, which is the AST (which exists no matter what parser you use).