Parser in Perl

MTK358 · 01-25-2010, 12:08 PM

I wonder if you have a large string in Perl, for example this:

Code:

stuff
keyword morestuff { blah{
blah{}
}blah
} test
end

Is it possible to extract everything from the start of the sequence "keyword" up to the brace that matches the first left brace after "keyword"?

For example, it should extract this from the above example:

Code:

keyword morestuff { blah{
blah{}
}blah
}

And is it possible to pass this section to a function, where it is processed, and then to replace it with the function's returned string?

Sergei Steshenko · 01-25-2010, 03:19 PM

Start from here:

http://perldoc.perl.org/Text/Balanced.html ,
http://perldoc.perl.org/perlfaq6.html - in this one look for "Can I use Perl regular expressions to match balanced text?".

...

I once wrote a simple, elegant parser specifically for nested pairs of {}; I hope the above Text::Balanced is a better solution. If not, I'll describe my idea.
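A minimal sketch of how Text::Balanced could handle the original question, assuming the block always starts at the literal word "keyword". The process_block function is a hypothetical stand-in (it only uppercases its input), and the sample text is the one from the first post. extract_bracketed takes the text, the delimiter set, and an optional prefix pattern to skip before the opening bracket; in list context it returns the extracted text, the remainder, and the skipped prefix.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical stand-in for whatever processing the section needs.
sub process_block {
    my ($block) = @_;
    return uc $block;
}

my $text = "stuff\nkeyword morestuff { blah{\nblah{}\n}blah\n} test\nend\n";

my $start = index($text, 'keyword');
die "keyword not found\n" if $start < 0;

# Work on a copy that begins at "keyword".
my $tail = substr($text, $start);

# The third argument is a prefix pattern to skip before the opening
# brace; here it skips "keyword morestuff ".
my ($braces, $rest, $prefix) = extract_bracketed($tail, '{}', '[^{]*');
die "unbalanced braces\n" unless defined $braces;

# Section to process: the skipped prefix plus the balanced braces.
my $section = $prefix . $braces;

# Rebuild the text with the section replaced by the function's result.
my $result = substr($text, 0, $start) . process_block($section) . $rest;
print $result;
```

With only '{}' as delimiters, braces inside string literals or comments are still counted, so this is a starting point rather than a robust parser.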

Sergei Steshenko · 01-25-2010, 03:26 PM

... Just curious - maybe you are parsing a Synopsys library format file? That's because I wrote my nested {} parser exactly for this purpose.

MTK358 · 01-25-2010, 03:32 PM

No, it's for a silly little project I am doing that adds classes and polymorphism to C using a simple Perl search-and-replace script. It actually kind of works now, but the syntax is really terrible, and I thought that using curly braces as the block delimiter and semicolons instead of newlines as separators would make it fit in much nicer.

Basically I want the syntax to be:

Code:

class ClassName SuperClass1 SuperClass2 ... {
    int var;

    char str;

    void method();

    int anotherMethod(int a) {
        return a + this.var;
    }
}

Sergei Steshenko · 01-25-2010, 04:11 PM

Oh, it looks like you're in trouble: think of the following:

Code:

char *s = "a nasty string with ; inside";

- what will your parser do with the ';' inside? Or won't it delve into strings? I.e., are you writing a partial parser?

But still:

Code:

char *s = "a nasty string with { ... } inside";

I.e. a robust parser should:

get rid of comments (and there should be a possibility to restore them);
temporarily get rid of strings.

My point is that to make a half-hearted parser robust, one should make it more than just half-hearted.
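The two steps above can be sketched roughly as follows: comments and string/character literals are swapped for numbered placeholders before any brace matching and put back afterwards. The function names mask_literals and unmask_literals are made up for this illustration, and the sketch assumes the source never contains NUL bytes, which are used as placeholder markers. It does not handle every C corner case (for example, line continuations inside // comments).

```perl
use strict;
use warnings;

# Replace comments and string/char literals with "\x00N\x00" placeholders.
# Returns the masked source and a reference to the saved originals.
sub mask_literals {
    my ($src) = @_;
    my @saved;
    $src =~ s{
        ( /\* .*? \*/                # block comment
        | // [^\n]*                  # line comment
        | " (?: \\. | [^"\\] )* "    # string literal
        | ' (?: \\. | [^'\\] )* '    # character literal
        )
    }{
        push @saved, $1;
        "\x00" . $#saved . "\x00";
    }gsex;
    return ($src, \@saved);
}

# Put the saved comments and literals back in place of the placeholders.
sub unmask_literals {
    my ($src, $saved) = @_;
    $src =~ s/\x00(\d+)\x00/$saved->[$1]/g;
    return $src;
}

my $code = qq{char *s = "a nasty string with { ... } inside"; /* { */ int f() { return 1; }\n};

my ($masked, $saved) = mask_literals($code);
my $open_braces = () = $masked =~ /\{/g;   # braces left after masking
print "open braces outside literals: $open_braces\n";

my $restored = unmask_literals($masked, $saved);
```

After masking, only the braces that belong to real code remain, so a brace matcher can run on the masked text and the literals can be restored once the rewriting is done.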

Sergei Steshenko · 01-25-2010, 04:15 PM

... I have already described my PerlPreProcessor here in part - have a look at it. It can help you a lot with text substitution and it can cope with metadata; unlike C++ templates, it is stateful, i.e. one can pass data between templates - because it's all in Perl.

And no new language is invented - in your case it will be pure "C" + pure Perl. The only new entity is simple reserved comments - like

Code:

// PERL_BEGIN
// PERL_END
// PERL_ONE_LINER

MTK358 · 01-25-2010, 04:45 PM

I haven't thought about strings containing semicolons or braces. The parser will somehow have to ignore characters in double or single quotes.

Sergei Steshenko · 01-26-2010, 07:27 AM

Quote:

Originally Posted by MTK358

I haven't thought about strings containing semicolons or braces. The parser will somehow have to ignore characters in double or single quotes.

And it's more than that - C99 allows anonymous structs to be passed as function parameters, and the structs too contain { ... }.

MTK358 · 01-26-2010, 09:21 AM

Quote:

Originally Posted by Sergei Steshenko

And it's more than that - C99 allows anonymous structs to be passed as function parameters, and the structs too contain { ... }.

That won't be a problem if the parser looks for the matching brace, not the first right brace.

I made a quick little program that does this, but it doesn't seem to work:

Code:

$text = "{ testext } to test.";

print get_matching_brace($text, 0);

# get_matching_brace(string, index of left brace)
# returns text between matching braces, including braces
sub get_matching_brace {
	$start = $index;
	$index = $_[1];
	$depth = 1;
	while(depth > 0) {
		$index++;
		if(substr($_[0], $index, 1) eq '{') {
			$depth++;
		} elsif(substr($_[0], $index, 1) eq '}') {
			$depth--;
		}
	}
	return substr($_[0], $start, $index);
}
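For comparison, here is a sketch of the same function with the likely culprits addressed: $start was assigned before $index had a value, the loop condition used the bareword depth instead of $depth (so the loop never ran), and the third argument of substr is a length, not an end index. The bounds check and the undef return for unbalanced input are additions that were not in the original.

```perl
use strict;
use warnings;

# get_matching_brace(string, index of left brace)
# Returns the text from the left brace through its matching right brace,
# or undef if there is no brace at that index or the braces are unbalanced.
sub get_matching_brace {
    my ($text, $start) = @_;
    return unless substr($text, $start, 1) eq '{';

    my $depth = 1;
    my $index = $start;
    while ($depth > 0) {
        $index++;
        return if $index >= length $text;    # ran off the end: unbalanced
        my $char = substr($text, $index, 1);
        if    ($char eq '{') { $depth++ }
        elsif ($char eq '}') { $depth-- }
    }

    # substr takes (string, offset, length), so convert the end index.
    return substr($text, $start, $index - $start + 1);
}

my $text = "{ testext } to test.";
print get_matching_brace($text, 0), "\n";    # prints "{ testext }"
```

Running under use strict and use warnings would have flagged the bareword and the undeclared variables in the original.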

Sergei Steshenko · 01-26-2010, 01:53 PM

Nah. It's "not" Perl. Use regular expressions - the engine can cope with multi-line strings.

And won't you try Text::Balanced ?
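As a rough illustration of the regular-expression route, Perl 5.10 and later support recursive patterns, which can match nested braces directly. The sketch below uses the sample text from the first post; process_block is a hypothetical placeholder function, and the pattern assumes no braces appear inside strings or comments.

```perl
use strict;
use warnings;
use 5.010;    # recursive patterns need Perl 5.10 or later

# Hypothetical stand-in for the real processing function.
sub process_block {
    my ($block) = @_;
    return "<" . length($block) . " chars>";
}

my $text = "stuff\nkeyword morestuff { blah{\nblah{}\n}blah\n} test\nend\n";

# "keyword", anything up to the first "{", then a balanced {...} group.
# (?&braces) recurses into the named group to handle nesting.
my $block_re = qr/
    ( keyword [^{]*
      (?<braces> \{ (?: [^{}]++ | (?&braces) )* \} )
    )
/x;

# Extraction: in list context the match returns the captures, $1 first.
my ($block) = $text =~ $block_re;
print "$block\n";

# Replacement: /e evaluates the replacement as Perl code.
(my $replaced = $text) =~ s/$block_re/process_block($1)/e;
print $replaced;
```

The negated character classes match newlines too, so the pattern works across lines without any extra modifiers.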

MTK358 · 01-26-2010, 02:12 PM

Quote:

Originally Posted by Sergei Steshenko

And won't you try Text::Balanced ?

I would like to try but I can't find a good, simple explanation.

Sergei Steshenko · 01-26-2010, 03:34 PM

Quote:

Originally Posted by MTK358

I would like to try but I can't find a good, simple explanation.

You know my standard question in such cases: what is the first thing you do not understand ?

MTK358 · 01-26-2010, 03:53 PM

This, from the Text::Balanced SYNOPSIS section:

Code:

use Text::Balanced qw (
    extract_delimited
    extract_bracketed
    extract_quotelike
    extract_codeblock
    extract_variable
    extract_tagged
    extract_multiple
    gen_delimited_pat
    gen_extract_tagged
);

Sergei Steshenko · 01-26-2010, 04:05 PM

This piece of code tells which functions to import from Text::Balanced. You most likely will need 'extract_bracketed', and maybe 'extract_codeblock' - the latter one may help dealing with "nasty" strings.

Just copy the above piece - it will import all the functions; you may need some of them later.

But start from 'extract_bracketed'.
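A minimal sketch of a first experiment with extract_bracketed, using made-up sample strings. In list context it returns the extracted bracketed text followed by the remainder of the string; when the text does not begin with an opening bracket (after optional whitespace), the extracted value is undef.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Text that begins with a brace: extraction succeeds.
my $text = "{ a { b } c } tail";
my ($extracted, $remainder) = extract_bracketed($text, '{}');
print "extracted: $extracted\n";
print "remainder: $remainder\n";

# By default only leading whitespace may precede the opening brace,
# so extraction fails here and $none is undef.
my $other = "x { a }";
my ($none) = extract_bracketed($other, '{}');
print "no bracketed text at the start\n" unless defined $none;
```

Importing only the one function you use keeps the namespace small; the longer qw list from the SYNOPSIS works just as well.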

MTK358 · 01-26-2010, 07:25 PM

Next, I don't understand extract_bracketed()'s third parameter.