Matches HTML comments. The pattern has given me a hard time, and the final version is partly copied from a code example on the Web. I have added the multi-line option.
I see that it works with the HTML files at my disposal, but I do not understand the question mark at this position, after '*'. During my experiments, my idea had been to match the closing '-->' greedily, as in
Code:
/\<!--(.|\s)*(-->)?/m
At least here I could claim to understand the ? (though I probably don't, either).
When I omit the question mark, the script tries for an eternity to match something, and I have never been patient enough to wait for a result. The first pattern above makes the routine return almost immediately and has always been successful.
Thank you for any clarification.
Last edited by Michael Uplawski; 01-01-2024 at 02:45 PM.
You do not say which regexp engine is involved, but I presume we are talking about JavaScript.
The documentation suggests that the ? is in the expression to make the preceding * quantifier non-greedy instead of the default greedy behavior.
It is not.
I am reminded by this code example (copied from the Web) that both are matched. Not that I had not used .* instead of (.|\s)*, but as most of my trials were s... sub-optimal, all kinds of doubt, and Jeffrey Friedl, crossed my mind. The fact that the overly explicit notation finally *works* as I wanted was enough to keep me from touching the keyboard for a while.
But what does "none, one or more of the previous" gain from an added '?'? That the rule quoted here should be optional does not make sense to me.
Last edited by Michael Uplawski; 01-02-2024 at 01:09 AM.
Reason: Kraut2English
You do not say which regexp engine is involved, but I presume we are talking about JavaScript.
The documentation suggests that the ? is in the expression to make the preceding * quantifier non-greedy instead of the default greedy behavior.
I understand, but do not comprehend why this would be necessary...
My script is (of course, and for the rest of my life) in Ruby.
Greediness occupied my mind, while it should have been “laziness”.
I feared missing the first closing “-->” in the comment and concentrated on this detail.
Having read the greediness/laziness chapters in Jeffrey Friedl's book much more often than I ever had need for them, my brain got muddy.
Retranslated from English to Regexp to German to English:
.*? will “collect” as few characters as possible (from .*), just enough to satisfy the entire rule. So at first it matches nothing, and if a closing --> followed right away, all would be well. As this is not the case (there is not “nothing” between <!-- and -->), only then is another attempt made with “something” (one character), after which --> is again expected to follow immediately. It does not. And so on, one character at a time, until the first --> is reached.
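To make the step-by-step description above concrete, here is a small Ruby sketch that imitates by hand the order in which a lazy quantifier tries its candidates. The helper `lazy_comment` is purely hypothetical (it is not from the thread and is not how Onigmo is implemented internally); it only mirrors the "try nothing, then one character, then two" idea and compares the result with the real pattern.

```ruby
# Hypothetical helper (an illustration only): imitates the order in which
# a lazy <!--.*?--> tries candidate lengths for the comment body.
def lazy_comment(text)
  start = text.index("<!--")
  return nil unless start

  body_start = start + 4 # first character after "<!--"
  # Try a body of 0 characters, then 1, then 2, ... and stop at the first
  # length after which "-->" follows immediately.
  (0..(text.length - body_start)).each do |len|
    if text[body_start + len, 3] == "-->"
      return text[start...(body_start + len + 3)]
    end
  end
  nil # no closing "-->" anywhere after the opening
end

sample   = "x<!--ab-->y-->"
by_hand  = lazy_comment(sample)
by_regex = sample[/<!--.*?-->/m]

puts by_hand
puts by_regex
```

Both lines should show the same text, ending at the first closing marker and not at the second one.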
My own initial idea was to find '-->' as quickly as possible. Lookahead may be a way to achieve this, but I do not care to try it. The book is back on its shelf.
Sorry folks.
[Solved]
And thank you for helping out.
Last edited by Michael Uplawski; 01-02-2024 at 02:41 AM.
The *? is a minimum match; the match will span to the first -->
Code:
/<!--.*-->/m
The * is a greedy match; the match will span to the last -->
Code:
/<!--.*(-->)?/m
The --> is optional; the match will span until the very end. The additional wildcard expression might cost extra time. Effectively it is
Code:
/<!--.*/m
You can say the ? is a modifier of the preceding quantifier; it changes it from greedy to minimal.
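For comparison, a minimal Ruby sketch that runs the three variants discussed above against one string. The sample HTML and the variable names are made up for illustration; note that in Ruby the /m option is what lets "." match a newline.

```ruby
# Sample input with two comments (invented for this illustration).
html = "<p>a</p><!-- one --><p>b</p><!-- two --><p>c</p>"

lazy     = html[/<!--.*?-->/m]    # minimal: stops at the first -->
greedy   = html[/<!--.*-->/m]     # greedy: runs to the last -->
optional = html[/<!--.*(-->)?/m]  # --> optional: runs to the end of the string

# String#scan collects every non-overlapping match of the lazy pattern.
each_comment = html.scan(/<!--.*?-->/m)

puts lazy
puts greedy
puts optional
puts each_comment.inspect
```

With the lazy pattern, scan returns the two comments as separate strings, which is usually what is wanted when stripping them.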
This minimum match comes from Perl/PCRE; it is not defined in ERE or BRE.
grep -P understands it; grep -o prints just the match:
The * is a greedy match; the match will span to the last -->
That is why I wanted to avoid it by “insisting on the very first -->” (wrong) instead of “insisting on the last anything before -->” (right).
Ruby's engine is Onigmo, which is Oniguruma with a little Perl. Put another way, PHP with Perl. Quite PCRE, a lot Perl-like. As far as I could identify differences, they concern patterns that I do not use. Talking about them would render Ruby way more incompatible with PCRE than it ever will be for anybody, in reality.
Last edited by Michael Uplawski; 01-04-2024 at 01:06 AM.
Reason: kraut2English
"\s" means "whitespace", and is generally equivalent to "[\n\t ]" (can include other whitespace characters).
On the other hand "." means either "all characters" or "all except newline" (depending on regex engine and mode); in the latter case "." is equivalent to "[^\n]".
So the expression is similar to "([^\n]|[\n\t ])", and will result in matching all characters, but it's simpler to enable the "dot all" flag (usually "s") and just use ".*?".
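Since the script in question is in Ruby, a short sketch of how this plays out there may help. One thing to be aware of: Ruby spells the "dot all" behaviour as the m option, not s. The sample strings are invented for illustration.

```ruby
text = "a\nb"

plain   = text.match?(/a.b/)       # "." does not match "\n" by default
dot_all = text.match?(/a.b/m)      # with /m, "." matches "\n" as well
either  = text.match?(/a(.|\s)b/)  # "\s" covers the newline without /m

# A comment spanning two lines, matched with the plain lazy dot under /m.
multi   = "<!-- line 1\nline 2 -->"
matched = multi[/<!--.*?-->/m]

puts plain
puts dot_all
puts either
puts matched == multi
```

So with /m already set, the alternation (.|\s) adds nothing that a bare "." does not already match.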
In this instance, an even more efficient way to do that would be a greedy match of "[^-]" combined with a negative lookahead for the terminating pattern, e.g. "([^-]+|-(?!->))*"
(And of course one should be wary of parsing HTML with regex, and generally prefer to use an existing, well-tested HTML parser instead.)
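A minimal Ruby sketch of the lookahead-based pattern suggested above, wrapped in the comment delimiters. The constant name and sample input are assumptions for illustration, and the group is written as non-capturing (?:...) here only so that String#scan returns whole matches; otherwise it is the same expression.

```ruby
# Comment body: runs of non-dash characters, or a single dash that does
# not begin the closing "-->". No /m is needed, since [^-] matches "\n".
COMMENT_RE = /<!--(?:[^-]+|-(?!->))*-->/

html = "<p>a</p><!-- x - y -- z --><p>b</p><!-- two\nlines -->"

first = html[COMMENT_RE]
all   = html.scan(COMMENT_RE)

puts first
puts all.inspect
```

One caveat, stated as my own assumption and not as something tested in this thread: when a comment is never closed, the nested quantifiers may backtrack heavily on long input, and an atomic group (?>...) around the alternation is one possible guard.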
I am using an XML parser, but the comments get in the way before I handle individual tags. In the program in question, I have to eliminate successive rows full of tabs ('\t') and a lot of empty lines. I chose to do this, and also to eliminate HTML comments, before the actual parsing takes place.
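A rough Ruby sketch of such a clean-up pass might look like the following. The method name `tidy` and the sample input are hypothetical, and this is not the actual script from this thread; it only combines the lazy comment pattern with a second substitution for lines that are empty or hold nothing but tabs and spaces.

```ruby
# Hypothetical pre-processing step, run before the real parser sees the text.
def tidy(html)
  html
    .gsub(/<!--.*?-->/m, "")  # drop comments (lazy match, "." spans newlines)
    .gsub(/^[\t ]*\n/, "")    # drop lines that are empty or only tabs/spaces
end

raw     = "<div>\n\t\t\t\n<!-- note\nmore -->\n\n<p>x</p>\n</div>\n"
cleaned = tidy(raw)

puts cleaned
```

The order matters in this sketch: removing the comments first can leave new empty lines behind, which the second substitution then picks up.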
Off-topic: XML is the poor man's SGML. Neither of those is meant to be parsed with regular expressions; e.g. what seems to be a 'comment' might actually be inside an attribute or a CDATA section:
Code:
<input type="text" value="<!-- not comment -->">
<![CDATA[ <!-- not comment --> ]]>
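To illustrate the point with the two lines above, a small Ruby sketch: a pattern that knows nothing about attributes or CDATA removes the text in both cases, although neither is a real comment. The constant name is made up for this example.

```ruby
# A naive comment pattern with no notion of attributes or CDATA sections.
NAIVE_COMMENT = /<!--.*?-->/m

attribute = '<input type="text" value="<!-- not comment -->">'
cdata     = '<![CDATA[ <!-- not comment --> ]]>'

stripped_attribute = attribute.gsub(NAIVE_COMMENT, "")
stripped_cdata     = cdata.gsub(NAIVE_COMMENT, "")

puts stripped_attribute
puts stripped_cdata
```

In both results the supposed comment is gone, so the attribute value and the CDATA content have been silently altered.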
You may doubt and it is a good thing to doubt. We are, though, not doing rocket science. You would know, if I did (everybody would).
I am using an XML parser (... actually, only a few would) which qualifies as an HTML parser as well, and I will not explain why this is so natural a thing that you need not worry anyway. Skip this part. My program was working, and *I* only had problems following its actions in a log file that is created automatically. It had been *my idea* to clear things up before the parser comes into play, and *I* state afterwards that this was not so bad an idea (outside of rocket science, that is).
The essence of this thread is that there are concepts which need to be *actively* kept apart, because their *uses* seem so similar that it is too late once you stumble over only one of them, seemingly apt to help you. Maybe add that examples are not superfluous when you try to understand lookahead, laziness and greediness.
No need for XML. My fault to have mentioned it.
Last edited by Michael Uplawski; 01-05-2024 at 02:50 AM.
Reason: exitus