How to grep on grep with regex? How to match only numbers and dots from a grep?

debianfella · 02-04-2024, 10:27 PM

Code:

curl -s https://packagist.org/packages/drupal/core | grep -oP '<span class="version-number">'.*

Code:

<span class="version-number">10.2.2</span>

How to take only the 10.2.2 from this?

This command failed (no output is given):

Code:

curl -s https://packagist.org/packages/drupal/core | grep -oP '<span class="version-number">'.* | grep -oP '^\d+(\.\d+)*$'

How to grep on grep with regex? How to match only numbers and dots from a grep?

syg00 · 02-04-2024, 10:33 PM

You seem to misunderstand the usage of the anchors - take them out and see what happens.

debianfella · 02-04-2024, 11:11 PM

Anchors? I don't know the meaning of this word in this context.

Turbocapitalist · 02-05-2024, 01:41 AM

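Anchors tie your pattern to a position in the string, for example the start (^) or the end ($). With both in place, your second grep only matches if the whole line consists of digits and dots, which a line that still carries the span tags never does. Within the pattern, keep in mind that a dot . stands for any single character, so a literal point has to be escaped as \. outside of brackets; inside a bracket expression it is already literal: [0-9.]+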
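To make the effect of the anchors visible without hitting the network, here is a minimal sketch. The sample line is an assumption: it is what the first grep is expected to print, and the variable names are only for illustration.

```shell
# Sample line (assumed): what the first grep in the pipeline prints.
line='<span class="version-number">10.2.2</span>'

# Anchored: ^ and $ require the WHOLE line to be digits and dots.
# grep finds nothing and exits non-zero, hence the "|| true".
anchored=$(printf '%s\n' "$line" | grep -oP '^\d+(\.\d+)*$') || true

# Unanchored: the same pattern may now match anywhere in the line.
unanchored=$(printf '%s\n' "$line" | grep -oP '\d+(\.\d+)*')

printf 'anchored:   [%s]\n' "$anchored"
printf 'unanchored: [%s]\n' "$unanchored"
```

The unanchored form works here only because the tag text happens to contain no other digits, so it is a quick fix rather than a robust one.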

However, if you are parsing HTML, then regular expressions are generally not so appropriate and you'd be better off with a proper parser. There are a lot of choices. One is xmlstarlet though you do have to specify a namespace with -N to use it:

Code:

curl -s https://packagist.org/packages/drupal/core \
| tidy -q -asxml 2>/dev/null \
| xmlstarlet sel -N ns="http://www.w3.org/1999/xhtml" \
        -t -v '//ns:span[@class="version-number"]'

Other HTML parsers can be worked into Perl or Python, such as HTML::TreeBuilder::XPath or lxml.

debianfella · 02-05-2024, 03:21 AM

Turbocapitalist thanks that's a nice approach.

shruggy · 02-05-2024, 06:08 AM

Could be simplified a bit by using the default namespace:

Code:

curl -s https://packagist.org/packages/drupal/core \
| tidy -q -asxml 2>/dev/null \
| xmlstarlet sel -t -v '//_:span[@class="version-number"]'

pan64 · 02-05-2024, 07:51 AM

you can use something like:

Code:

curl -s https://packagist.org/packages/drupal/core | grep -oP '(?<="version-number">)[^<]*'

but as it was explained the correct way is to use xmlstarlet, not a regex
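For comparison, PCRE also offers \K, which drops everything matched so far from the reported match and is often easier to read than a lookbehind. A small sketch on a made-up sample line (the variable names are hypothetical, and this assumes a GNU grep built with -P support):

```shell
# Sample input (assumed), standing in for the downloaded page.
html='<span class="version-number">10.2.2</span>'

# Lookbehind: the prefix must be present but is not part of the match.
v_lookbehind=$(printf '%s\n' "$html" | grep -oP '(?<="version-number">)[^<]*')

# \K: match the prefix normally, then discard it from the output.
v_keep=$(printf '%s\n' "$html" | grep -oP '"version-number">\K[^<]*')

# Stricter variant: accept only digits separated by dots after the prefix.
v_strict=$(printf '%s\n' "$html" | grep -oP '"version-number">\K\d+(\.\d+)*')

printf '%s\n' "$v_lookbehind" "$v_keep" "$v_strict"
```

All three are still regex matching on HTML, so the caveat about using a real parser applies to them as well.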

dugan · 02-05-2024, 09:24 PM

Obligatory:

https://blog.codinghorror.com/parsin...e-cthulhu-way/

MadeInGermany · 02-06-2024, 06:14 AM

Quote:

Originally Posted by pan64

you can use something like:

Code:

curl -s https://packagist.org/packages/drupal/core | grep -oP '(?<="version-number">)[^<]*'

but as it was explained the correct way is to use xmlstarlet, not a regex

The (?<=...) is a lookbehind. It is a hidden match that does not appear in the output.
grep -oP '...' is like perl -lne 'm#...# and print $&'

Code:

curl -s https://packagist.org/packages/drupal/core | perl -lne 'm#(?<="version-number">)[^<]*# and print $&'

Perl is really the master of extended regular expressions.
Another method is a reference $1 to a (capture group):

Code:

curl -s https://packagist.org/packages/drupal/core | perl -lne 'm#"version-number">([^<]*)# and print $1'

grep -oP cannot print a capture group, but bash builtins can:

Code:

var=$(curl -s https://packagist.org/packages/drupal/core); [[ $var =~ '"version-number">'([^<]*) ]]; echo "${BASH_REMATCH[1]}"

<yanetut>
I think I met a bug in my bash 5.0.17: it treats a > in the ERE as a redirection attempt. Would this ever make sense??
The work-around is to escape the >
Here I enclosed it in the 'string' that also quotes the "" quotes.
</yanetut>
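A common way to sidestep the quoting question is to keep the pattern in a variable, so that the shell never has to parse > or ( inside [[ ]]. A minimal sketch, assuming bash and a made-up sample string in place of the curl output:

```shell
# Requires bash: uses [[ =~ ]] and BASH_REMATCH.
# Sample input (assumed), standing in for the downloaded page.
html='<p><span class="version-number">10.2.2</span></p>'

# Keep the ERE in a variable and expand it UNQUOTED on the right-hand side;
# a quoted expansion would be matched as a literal string instead.
re='"version-number">([^<]*)'

if [[ $html =~ $re ]]; then
    version=${BASH_REMATCH[1]}   # text captured by the first ( ) group
else
    version=''                   # no match: avoid reusing a stale value
fi

printf '%s\n' "$version"
```

Testing the result of [[ ]] before reading BASH_REMATCH also avoids printing a leftover value from an earlier match.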

syg00 · 02-06-2024, 06:51 AM

And what, pray tell, is wrong with good ol' sed?

Every time I see you lot spruiking xpath tools I have to go re-learn it. With sed I know I can get an answer immediately. What a dinosaur I be ...

pan64 · 02-06-2024, 07:59 AM

Yes, sed is exactly as good as grep. Or perl. By the way, if you use perl you can download the page with it too, so there is no need for curl. With perl you can also use a proper HTML parser instead of a regex.

MadeInGermany · 02-06-2024, 11:12 AM

<sed nerds>
A simple partial match cannot be used in sed, because a \1 back-reference only works on the current RE, not on a previous RE command.
And sed doesn't know lookbehind or lookahead either.
==> You need a substitution of the full line.

Code:

sed -En 's#.*"version-number">([^<]*).*#\1#p'

The leading and trailing .* are there to cover the full line. This is slower than a partial match.
With GNU sed you can do one .* first, and if successful do the other .*

Code:

sed -En 's#.*"version-number">##; T; s#<.*##; p'

Code:

sed -En '\#.*"version-number"># { s###; s#<.*##; p; }'

The empty RE in s### reuses the most recently used RE; it is in fact the only supported way to refer back to a previous RE command.
</sed nerds>
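To check that the three sed variants agree, they can be run against a small sample. This is only a sketch: the sample HTML is made up, the variable names are hypothetical, and the T command and the empty-RE shortcut are used here as GNU sed provides them.

```shell
# Sample input (assumed): three lines, only the middle one carries the version.
sample=$(printf '%s\n' '<div>' '  <span class="version-number">10.2.2</span>' '</div>')

# 1. Full-line substitution with a capture group.
full=$(printf '%s\n' "$sample" | sed -En 's#.*"version-number">([^<]*).*#\1#p')

# 2. Two steps: strip the prefix; T jumps to the end of the script
#    if that substitution did not happen; otherwise strip the suffix and print.
two_step=$(printf '%s\n' "$sample" | sed -n 's#.*"version-number">##; T; s#<.*##; p')

# 3. Address match first; the empty RE in s### reuses that address RE.
addr=$(printf '%s\n' "$sample" | sed -n '\#.*"version-number"># { s###; s#<.*##; p; }')

printf '%s\n' "$full" "$two_step" "$addr"
```

Lines without the marker produce no output in any of the three, because -n suppresses the default print.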

TB0ne · 02-06-2024, 11:23 AM

Quote:

Originally Posted by debianfella

Code:

curl -s https://packagist.org/packages/drupal/core | grep -oP '<span class="version-number">'.*

Code:

<span class="version-number">10.2.2</span>

How to take only the 10.2.2 from this? This command failed (no output is given):

Code:

curl -s https://packagist.org/packages/drupal/core | grep -oP '<span class="version-number">'.* | grep -oP '^\d+(\.\d+)*$'

How to grep on grep with regex? How to match only numbers and dots from a grep?

Yet another sed solution:

Code:

curl -s https://packagist.org/packages/drupal/core | grep -oP '<span class="version-number">'.* | sed -e 's/.*>\(.*\)<.*/\1/'

pan64 · 02-06-2024, 01:59 PM

Code:

curl -s https://packagist.org/packages/drupal/core | sed -z 's/.*"version-number">//;s/<.*//'

No backreference, no difficult regex, no lookahead or lookbehind, no -E. There is also no need to use both grep and sed.
You can do something similar in awk. It is still unreliable, though.
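The awk variant is not spelled out in the thread, so here is one possible sketch. It uses plain string search instead of a regex for the marker; the sample input and the variable names are assumptions, and it shares the same fragility as the other text-based approaches.

```shell
# Sample input (assumed), standing in for the downloaded page.
sample=$(printf '%s\n' '<div>' '  <span class="version-number">10.2.2</span>' '</div>')

version=$(printf '%s\n' "$sample" | awk -v marker='"version-number">' '
    # index() returns 0 (false) when the marker is absent from the line.
    (i = index($0, marker)) {
        rest = substr($0, i + length(marker))  # text right after the marker
        sub(/<.*/, "", rest)                   # cut at the next tag
        print rest
        exit                                   # first hit only
    }')

printf '%s\n' "$version"
```

Replacing the printf with the curl call from the thread should give the same kind of result on the live page, as long as the markup still looks like the sample.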