[regexp] Meaning of '?' after '*'
Code:
/\<!--(.|\s)*?-->/m
I see that it works with the HTML files at my disposal, but I do not understand the question mark at this position after '*'. During my experiments, my idea had been to match the closing '-->' greedily, as in
Code:
/\<!--(.|\s)*(-->)?/m
When I omit the question mark, the script tries for an eternity to match something, and I have never been patient enough to wait for a result. The first pattern above makes the routine return almost immediately and has always been successful. Thank you for any clarification. |
You do not say which regexp engine is involved, but I presume we are talking about JavaScript.
The documentation suggests that the ? is in the expression to make the preceding * non-greedy (lazy) instead of the default greedy behavior. |
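To keep the two roles of '?' apart, here is a minimal sketch in Ruby (the engine the original poster later names). The helper name first_match and the sample strings are my own, made up for illustration only.

```ruby
# Returns the first substring of +text+ matched by +pattern+ (nil if none).
def first_match(text, pattern)
  text[pattern]
end

p first_match("color", /colou?r/)  # '?' after an atom: the 'u' is optional
p first_match("aaa", /a*/)         # greedy '*': as many 'a' as possible
p first_match("aaa", /a*?/)        # lazy '*?': as few as possible, here none
p first_match("aaa", /a+?/)        # lazy '+?': as few as possible, but at least one
```

This should print "color", "aaa", "" and "a". So after an atom the '?' means "optional", while after a quantifier it only changes how much that quantifier tries to take first.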
Also, how is `(.|\s)` different from `.`?
|
Quote:
I am reminded by this code example (copied from the Web) that both are matched. Not that I had not used .* instead of (.|\s)*, but as most of my trials were s... sub-optimal, all kinds of doubt and Jeffrey Friedl crossed my mind. The fact that the overly explicit notation finally *works* as I wanted was enough to keep me from touching the keyboard for a while. But what does "one, more or none of the previous" gain by adding '?'? That the rule quoted here would be optional does not make sense to me. |
Quote:
My script is (of course, and for the rest of my life) in Ruby. |
|
LAZINESS
Quote:
Several ways to explain my difficulties:
Retranslated from English to Regexp to German to English: .*? will “collect” as few matches as possible (from .*), just enough to comply with the entire rule. Thus: when, after having matched nothing (. ~= nothing), a closing --> appears, all is well. As this is not the case (there is not “nothing” between <!-- and -->), only then is another match tried with “something” (. ~= anything). This works just as well, and --> is supposed to follow immediately. It does not. And so on.
My own initial idea was to find '-->' as quickly as possible. Lookahead may be a way to achieve this, but I do not care to try it. The book is back on its shelf. Sorry folks. [Solved] And thank you for helping out. |
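The "try nothing first, then one character more, and so on" description can be imitated by hand. This is only a sketch of the idea in Ruby, not how the engine is actually implemented; the method name lazy_body_length and the sample string are hypothetical.

```ruby
# Hand-written imitation of /<!--.*?-->/m, for illustration only.
# Returns how many characters the lazy '.*?' ends up taking,
# or nil if there is no complete comment.
def lazy_body_length(text)
  open_at = text.index("<!--")
  return nil unless open_at

  body_start = open_at + 4
  # Try 0 characters first, then 1, then 2, ...: the order in which
  # a lazy quantifier grows.
  (0..(text.length - body_start)).each do |taken|
    return taken if text[body_start + taken, 3] == "-->"
  end
  nil
end

sample = "<!--ab-->x-->"
p lazy_body_length(sample)            # attempts with 0 and 1 characters fail, 2 succeeds
p sample[/<!--(.*?)-->/m, 1].length   # the real engine captures the same 2 characters
```

Both lines should print 2: the first '-->' that can follow wins, and the later '-->' in the sample is never looked at.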
Code:
/<!--.*?-->/m
Code:
/<!--.*-->/m
Code:
/<!--.*(-->)?/m
Code:
/<!--.*/m
This minimum match is from Perl/PCRE; it is not defined in ERE or BRE. grep -P understands it; grep -o prints just the match:
Code:
echo 'bla1<!--bla2-->bla3<!--bla4-->bla5' | grep -Po '<!--.*?-->'
Code:
echo 'bla1<!--bla2-->bla3<!--bla4-->bla5' | grep -Po '<!--.*-->'
(With color support you can see it without the -o option. But sometimes the color support seems buggy...) |
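For those who would rather stay in Ruby than use grep, here is a rough equivalent that runs the four patterns above against the same test string. The method name compare_patterns and the hash keys are my own labels, nothing official.

```ruby
# Runs the four patterns from this post against +text+.
def compare_patterns(text)
  {
    lazy:            text[/<!--.*?-->/m],      # first match only
    lazy_all:        text.scan(/<!--.*?-->/m), # every match, like grep -o
    greedy:          text[/<!--.*-->/m],
    optional_closer: text[/<!--.*(-->)?/m],
    no_closer:       text[/<!--.*/m]
  }
end

compare_patterns("bla1<!--bla2-->bla3<!--bla4-->bla5").each do |name, match|
  puts "#{name}: #{match.inspect}"
end
```

The lazy pattern yields the two comments separately, and the greedy one yields a single match running from the first '<!--' to the last '-->'. The last two patterns both swallow everything up to the end of the string, trailing "bla5" included, because the greedy .* has no reason to give anything back when the closing '-->' is optional or absent.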
Quote:
Ruby's engine is Onigmo, which is Oniguruma with a little Perl. Put another way, PHP with Perl. Quite PCRE, a lot Perl-like. As far as I could identify differences, they concern patterns that I do not use. Talking about them would render Ruby way more incompatible with PCRE than it ever will be for anybody, in reality. |
Quote:
On the other hand, "." means either "all characters" or "all except newline" (depending on the regex engine and mode); in the latter case, . is equivalent to "[^\n]".
So the expression is similar to "([^\n]|[\n\t ])" and will result in matching all characters, but it's simpler to enable the "dot all" flag (usually "s") and just use ".*?".
In this instance, an even more efficient way to do that would be a greedy match of "[^-]" combined with a negative lookahead for the terminating pattern, e.g.: "([^-]+|-(?!->))*"
(And of course one should be wary of parsing HTML with regex, and generally prefer to use an existing, well-tested HTML parser instead.) |
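Both points can be checked in Ruby, where the "dot all" flag is spelled m rather than s. This is a sketch with made-up helper names (crosses_newline?, comment_unrolled) and made-up test strings. The lookahead pattern is used as quoted, with '<!--' and '-->' added around it; I have not measured its speed, and on an unterminated '<!--' the '+' nested inside '*' may backtrack heavily on some engine versions.

```ruby
# Does +pattern+ match across the line break in "a\nb"?
def crosses_newline?(pattern)
  "a\nb".match?(pattern)
end

# Lookahead variant: take runs of non-dashes greedily, and accept a single
# '-' only when it does not start the closing '-->'.
def comment_unrolled(text)
  text[/<!--(?:[^-]+|-(?!->))*-->/]
end

p crosses_newline?(/a.b/)       # false: without /m, '.' stops at "\n"
p crosses_newline?(/a.b/m)      # true:  /m lets '.' match "\n" as well
p crosses_newline?(/a(.|\s)b/)  # true:  '\s' covers the "\n" that '.' refuses

p comment_unrolled("x<!-- a - b -- c\nd -->y<!--z-->")
```

The last line should print only the first comment, "<!-- a - b -- c\nd -->", without needing /m at all, since "[^-]" matches a newline anyway.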
Just use www.regex101.com; it explains the pattern nicely (and you can also check how it works).
https://regex101.com/r/uRhHob/1 |
Quote:
|
XML is not HTML, or even close. A tool for one may not act in a useful way when applied to the other.
|
Off: XML is the poor man's SGML. Neither of those is meant to be parsed with regular expressions, e.g. what seems to be a 'comment' might actually be inside an attribute or a CDATA[
Code:
<input type="text" value="<!-- not comment -->"> |
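To see the attribute case go wrong, here is a small Ruby sketch; strip_comments is a hypothetical helper, not anything from the original poster's script.

```ruby
# Removes everything that looks like an HTML comment, using the lazy pattern.
def strip_comments(html)
  html.gsub(/<!--.*?-->/m, "")
end

puts strip_comments('<input type="text" value="<!-- not comment -->">')
```

This prints `<input type="text" value="">`, so the attribute value is silently emptied. The regex has no notion of being inside quotes.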
Quote:
I am using an XML parser (... actually. Only a few would) which qualifies as an HTML parser as well, and I will not explain why this is so natural a thing that you need not worry anyway. Skip this part. My program was working, and *I* only had problems following its actions in a log file that is created automatically. It had been *my idea* to clear things up before the parser comes into play, and *I* state afterwards that this was not so bad an idea (outside of rocket science, that is). The essence of this thread is that there are concepts which need to be *actively* kept apart from each other, because their *uses* seem so similar that it is too late when you stumble over only one of them, seemingly apt to help you. Maybe add that examples are not superfluous when you try to understand lookahead, laziness and greediness. No need for XML. My fault to have mentioned it. |