[SOLVED] [regexp] Meaning of '?' after '*'

Michael Uplawski · 01-01-2024, 02:44 PM

Code:

/\<!--(.|\s)*?-->/m

Matches HTML-comments. The pattern has given me a hard time and the final version is partly copied from a code-example on the Web. I have added the multi-line option.

I see that it works with the HTML-files at my disposal, but do not understand the question-mark at this position after '*'. During my experiments, it had been my idea to match the closing '-->' greedily like in

Code:

/\<!--(.|\s)*(-->)?/m

. At least I could claim to understand the ? here (probably don't either).

When I omit the question-mark, a script tries for an eternity to match something but I have never been patient enough to wait for a result. The first pattern above makes the routine return quasi immediately and had always been successful.

Thank you for any clarification.

wpeckham · 01-01-2024, 02:50 PM

You do not say which regexp engine is involved, but I presume we are talking about javascript.

The documentation suggests that the ? is in the expression to make the preceding * quantifier non-greedy instead of the default greedy behavior.

NevemTeve · 01-01-2024, 10:34 PM

Also how is `(.|\s)` different to `.` ?

Michael Uplawski · 01-02-2024, 01:06 AM

Quote:

Originally Posted by NevemTeve

Also how is `(.|\s)` different to `.` ?

It is not.
I am reminded by this code example (copied from the Web) that both are matched. Not that I had not used .* instead of (.|\s)*, but as most of my trials were s... sub-optimal, all kinds of doubt and Jeffrey Friedl crossed my mind. The fact that the overly explicit notation finally *works* as I wanted it to was enough to keep me from touching the keyboard for a while.

But what does "one, more or none of the previous" gain by adding '?'? That the rule quoted here should be optional does not make sense to me.

Michael Uplawski · 01-02-2024, 01:08 AM

Quote:

Originally Posted by wpeckham

You do not say which regexp engine is involved, but I presume we are talking about javascript.

The documentation suggests that the ? is in the expression to make the preceding * quantifier non-greedy instead of the default greedy behavior.

I understand, but do not comprehend why this would be necessary...

My script is (of course, and for the rest of my life) in Ruby.

grail · 01-02-2024, 01:55 AM

Try this

https://stackoverflow.com/questions/...-a-phrase-left

Michael Uplawski · 01-02-2024, 02:39 AM

Quote:

Originally Posted by grail

Try this

https://stackoverflow.com/questions/...-a-phrase-left

“as few as possible”

Several ways to explain my difficulties:

Greediness occupied my mind, while it should have been “Laziness”
I feared to miss the first closing “-->” in the comment and concentrated on this detail
Having read the greediness/laziness chapters in Jeffrey Friedl's book much more often than I ever had need for them, my brain got muddy

Retranslated from English to Regexp to German to English:
.*? will “collect” as few matches as possible (from .*), just enough to comply with the entire rule. Thus: when, after having matched nothing (. ~= nothing), a closing --> appears, all is well. As this is not the case (there is not “nothing” between <!-- and -->), only then is another match tried with “something” (. ~= anything). This works just as well, and immediately --> is supposed to follow. It does not. And so on.
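
The walk-through above can be mimicked by hand. The following Ruby sketch is only a simplified model of the lazy search, not how Onigmo actually works internally; the method name lazy_comment_match is made up for illustration.

```ruby
# Simplified model of /<!--.*?-->/m: test for the closing '-->' first,
# and only consume one more character when that test fails.
def lazy_comment_match(text)
  start = text.index('<!--')
  return nil if start.nil?

  pos = start + 4 # position right after '<!--'
  while pos <= text.length - 3
    # Lazy: does the rest of the pattern ('-->') fit right here?
    return text[start...(pos + 3)] if text[pos, 3] == '-->'

    # It does not, so let the wildcard take one more character.
    pos += 1
  end
  nil # no closing '-->' after the opener
end

sample   = "a<!-- one -->b<!-- two -->c"
by_hand  = lazy_comment_match(sample)
by_regex = sample[/<!--.*?-->/m]
```

Both by_hand and by_regex stop at the first closing -->, which is the “as few as possible” behaviour described above.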

My own initial idea was to find '-->' as quickly as possible. Lookahead may be a way to achieve this, but I do not care to try it. The book is back on its shelf.

Sorry folks.
[Solved]
And thank you for helping out.

MadeInGermany · 01-03-2024, 10:46 AM

Code:

/<!--.*?-->/m

The *? is a minimum match; the match will span to the first -->

Code:

/<!--.*-->/m

The * is a greedy match; the match will span to the last -->

Code:

/<!--.*(-->)?/m

The --> is optional; the match will span until the very end. The additional wildcard expression might cost extra time. Effectively it is

Code:

/<!--.*/m
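
A quick way to check this equivalence in Ruby (the engine used in this thread). This is a small sketch; the sample string is borrowed from the grep examples in this post.

```ruby
text = 'bla1<!--bla2-->bla3<!--bla4-->bla5'

with_optional = text[/<!--.*(-->)?/m] # greedy .* runs to the end of the string
without_tail  = text[/<!--.*/m]       # same pattern without the optional group

# The optional group takes no part in the match: .* already consumed
# everything, so (-->)? matches zero times and capture 1 stays nil.
tail_capture = text.match(/<!--.*(-->)?/m)[1]
```

Both with_optional and without_tail hold '<!--bla2-->bla3<!--bla4-->bla5', and tail_capture is nil.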

You can say the ? is a modifier of the preceding quantifier; it changes greediness to minimum matching.
This minimum match comes from Perl/PCRE; it is not defined in ERE or BRE.
grep -P understands it; grep -o prints just the match:

Code:

echo 'bla1<!--bla2-->bla3<!--bla4-->bla5' | grep -Po '<!--.*?-->'

prints the two minimum matches, while

Code:

echo 'bla1<!--bla2-->bla3<!--bla4-->bla5' | grep -Po '<!--.*-->'

prints the one greedy match.
(With color support you can see it without the -o option. But sometimes the color support seems buggy...)
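
For the Ruby case discussed in this thread, the same comparison can be sketched with String#scan, which returns every match much like grep -o prints them. The variable names are arbitrary.

```ruby
text = 'bla1<!--bla2-->bla3<!--bla4-->bla5'

lazy   = text.scan(/<!--.*?-->/m) # => ["<!--bla2-->", "<!--bla4-->"]
greedy = text.scan(/<!--.*-->/m)  # => ["<!--bla2-->bla3<!--bla4-->"]

puts lazy.inspect
puts greedy.inspect
```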

Michael Uplawski · 01-04-2024, 01:05 AM

Quote:

Originally Posted by MadeInGermany

Code:

/<!--.*-->/m

The * is a greedy match; the match will span to the last -->

That is why I wanted to avoid it by “insisting on the very first -->” (wrong) instead of “insisting on the last anything before -->” (right).

Ruby's engine is Onigmo, which is Oniguruma with a little Perl. Put another way, PHP with Perl. Quite PCRE, a lot Perl-like. As far as I could identify differences, they concern patterns that I do not use. Talking about them would render Ruby way more incompatible with PCRE than it ever will be for anybody, in reality.

boughtonp · 01-04-2024, 08:47 AM

Quote:

Originally Posted by NevemTeve

Also how is `(.|\s)` different to `.` ?

"\s" means "whitespace", and is generally equivalent to "[\n\t ]" (can include other whitespace characters).

On the other hand "." means either "all characters" or "all except newline" (depending on regex engine and mode); in the latter case . is equivalent to "[^\n]"

So the expression is similar to "([^\n]|[\n\t ])", and will result in matching all characters, but it's simpler to enable the "dot all" flag (usually "s"; in Ruby it is "m") and just use ".*?".
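
A minimal Ruby sketch of that difference, assuming a comment that spans two lines. In Ruby the "dot all" behaviour is switched on with the m flag.

```ruby
text = "<!-- line one\nline two -->"

plain_dot   = text[/<!--.*?-->/]      # nil: without /m the dot stops at "\n"
alternation = text[/<!--(.|\s)*?-->/] # matches: \s covers the newline
dot_all     = text[/<!--.*?-->/m]     # matches: with /m the dot covers "\n" too
```

Here alternation and dot_all both return the whole comment, while plain_dot is nil.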

In this instance, an even more efficient way to do that would be a greedy match of "[^-]" combined with a negative lookahead for the terminating pattern, e.g: "([^-]+|-(?!->))*"

(And of course one should be wary of parsing HTML with regex, and generally prefer to use an existing, well-tested HTML parser instead.)
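
The lookahead variant suggested above can be tried in Ruby as follows. This is only a sketch: the surrounding <!-- and --> and the non-capturing group (?:...) are added here, and the sample text is invented. Note that on a comment that is never closed, the nested quantifiers may backtrack heavily in some engines, so this is not a guaranteed speed-up.

```ruby
# [^-]+ swallows runs of non-hyphens greedily; a single '-' is accepted
# only when it is not the start of the closing '-->'.
lookahead = /<!--(?:[^-]+|-(?!->))*-->/
lazy      = /<!--.*?-->/m

text = "x<!-- a-b -- c -->y<!--\nsecond\n-->z"

by_lookahead = text.scan(lookahead)
by_lazy      = text.scan(lazy)
```

Both scans return the same two comments; no m flag is needed for the lookahead version, because [^-] matches newlines anyway.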

pan64 · 01-04-2024, 09:45 AM

Just use www.regex101.com, it will be nicely explained (and you can also check how it works).
https://regex101.com/r/uRhHob/1

Michael Uplawski · 01-04-2024, 04:21 PM

Quote:

Originally Posted by boughtonp

(And of course one should be wary of parsing HTML with regex, and generally prefer to use an existing, well-tested HTML parser instead.)

I am using an XML parser, but the comments are obstructive before I handle individual tags. In the program in question, I have to eliminate successive rows full of tabulators ('\t') and a lot of empty lines. I chose to do this and also to eliminate HTML comments before the actual code parsing takes place.
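
A clean-up step of this kind might look roughly like the sketch below. The helper name strip_noise and the exact rules are assumptions made for illustration, not the poster's actual code.

```ruby
# Hypothetical pre-processing helper (name and rules are assumptions).
def strip_noise(markup)
  markup
    .gsub(/<!--.*?-->/m, '') # drop comments: lazy match, dot matches newline
    .gsub(/^[ \t]*\n/, '')   # drop lines that are empty or hold only tabs/spaces
end

raw     = "<p>keep</p>\n\t\t\t\n\n<!-- note\nmore -->\n<p>also keep</p>\n"
cleaned = strip_noise(raw)
puts cleaned
```

The comments are removed first, so that the line they leave behind is caught by the second substitution.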

wpeckham · 01-04-2024, 04:36 PM

XML is not HTML, or even close. A tool for one may not act in a useful way when applied to the other.

NevemTeve · 01-04-2024, 11:33 PM

Off: XML is the poor man's SGML. Neither of those is meant to be parsed with regular expressions, e.g. what seems to be a 'comment' might actually be inside an attribute or a CDATA section:

Code:

<input type="text" value="<!-- not comment -->">
<![CDATA[ <!-- not comment --> ]]>
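
To illustrate the pitfall with the pattern from this thread, here is a small Ruby sketch that runs the comment-stripping regex over exactly these two lines.

```ruby
markup = <<~HTML
  <input type="text" value="<!-- not comment -->">
  <![CDATA[ <!-- not comment --> ]]>
HTML

# The regex has no notion of attributes or CDATA sections, so it removes
# the text in both places although neither is a real comment.
stripped = markup.gsub(/<!--.*?-->/m, '')
puts stripped
```

After the substitution the attribute value is empty and the CDATA section has lost its content.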

Michael Uplawski · 01-05-2024, 02:46 AM

Quote:

Originally Posted by NevemTeve

Off: XML is the poor man's SGML. Neither of those is meant to be parsed with regular expressions, e.g. what seems to be a 'comment' might actually be inside an attribute or a CDATA section:

Code:

<input type="text" value="<!-- not comment -->">
<![CDATA[ <!-- not comment --> ]]>

You may doubt and it is a good thing to doubt. We are, though, not doing rocket science. You would know, if I did (everybody would).

I am using an xml-parser (... actually. Only a few would) which qualifies as a HTML-parser as well and I will not explain, why this is so natural a thing, that you will not worry, anyway. Skip this part. My program was working and *I* only had problems with following its actions in a log file that is automatically created. It had been *my idea* to clear things up, before the parser comes into play and *I* state afterwards that this was not so bad an idea (outside of rocket-science, that is).

The essence of this thread is that there are concepts which need to be *actively* kept apart from each other, because their *uses* seem so similar that it is too late when you stumble over only one of them, seemingly apt to help you. Maybe add that examples are not superfluous when you try to understand lookahead, laziness and greediness.

No need for XML. My fault to have mentioned it.