LinuxQuestions.org
Old 02-04-2024, 10:27 PM   #1
debianfella
LQ Newbie
 
Registered: Jul 2023
Posts: 15

Rep: Reputation: 0
How to grep on grep with regex? How to match only numbers and dots from a grep?


Code:
curl -s https://packagist.org/packages/drupal/core | grep -oP '<span class="version-number">'.*
Quote:
<span class="version-number">10.2.2</span>
How to take only the 10.2.2 from this?

This command failed (no output is given):

Code:
curl -s https://packagist.org/packages/drupal/core | grep -oP '<span class="version-number">'.* | grep -oP '^\d+(\.\d+)*$'
How to grep on grep with regex? How to match only numbers and dots from a grep?
 
Old 02-04-2024, 10:33 PM   #2
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,152

Rep: Reputation: 4125
You seem to misunderstand the usage of the anchors - take them out and see what happens.
 
Old 02-04-2024, 11:11 PM   #3
debianfella
LQ Newbie
 
Registered: Jul 2023
Posts: 15

Original Poster
Rep: Reputation: 0
Anchors? I don't know the meaning of this word in this context.
 
Old 02-05-2024, 01:41 AM   #4
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,356
Blog Entries: 3

Rep: Reputation: 3767
You'd use anchors to tie your pattern to the start (^) or the end ($) of the string. With both anchors in place, as in your second grep, the whole line has to consist of nothing but digits and dots, and it doesn't, because the <span> tags are still there. Within the pattern, keep in mind that an unescaped dot . stands for any single character, so a literal dot has to be escaped as \. or placed inside a bracket expression, where it needs no escaping: [0-9.]+
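To make the effect of the anchors concrete, here is a minimal sketch that runs both variants against a fixed sample line instead of the live page. The sample line is copied from post #1 and is an assumption about what the first grep prints; -E with [0-9] is used instead of -P with \d only for portability.

```shell
# Sample line as printed by the first grep in post #1 (assumed unchanged).
line='<span class="version-number">10.2.2</span>'

# Anchored: ^ and $ demand that the whole line be digits and dots,
# so grep finds nothing (|| true keeps the non-match from aborting a script).
anchored=$(printf '%s\n' "$line" | grep -oE '^[0-9]+(\.[0-9]+)*$' || true)

# Unanchored: the pattern may match anywhere inside the line.
unanchored=$(printf '%s\n' "$line" | grep -oE '[0-9]+(\.[0-9]+)+')

printf 'anchored:   "%s"\n' "$anchored"
printf 'unanchored: "%s"\n' "$unanchored"
```

The anchored variant prints an empty string and the unanchored one prints 10.2.2. On the real page the unanchored pattern would also pick up any other dotted number on a matching line, which is one more reason to prefer a parser.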

However, if you are parsing HTML, then regular expressions are generally not so appropriate and you'd be better off with a proper parser. There are a lot of choices. One is xmlstarlet though you do have to specify a namespace with -N to use it:

Code:
curl -s https://packagist.org/packages/drupal/core \
| tidy -q -asxml 2>/dev/null \
| xmlstarlet sel -N ns="http://www.w3.org/1999/xhtml" \
        -t -v '//ns:span[@class="version-number"]'
Other HTML parsers can be worked into Perl or Python, such as HTML::TreeBuilder::XPath or lxml.
 
Old 02-05-2024, 03:21 AM   #5
debianfella
LQ Newbie
 
Registered: Jul 2023
Posts: 15

Original Poster
Rep: Reputation: 0
Turbocapitalist, thanks, that's a nice approach.
 
Old 02-05-2024, 06:08 AM   #6
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,678

Rep: Reputation: Disabled
Could be simplified a bit by using the default namespace:
Code:
curl -s https://packagist.org/packages/drupal/core \
| tidy -q -asxml 2>/dev/null \
| xmlstarlet sel -t -v '//_:span[@class="version-number"]'
 
Old 02-05-2024, 07:51 AM   #7
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,039

Rep: Reputation: 7347
you can use something like:
Code:
curl -s https://packagist.org/packages/drupal/core | grep -oP '(?<="version-number">)[^<]*'
but as it was explained the correct way is to use xmlstarlet, not a regex
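As a side note, and only as a sketch: the same result can be had with \K, which tells PCRE to drop everything matched so far from the reported match. This assumes a grep built with -P support (GNU grep usually is) and uses a fixed sample line, copied from post #1, rather than the live page.

```shell
# Sample line from post #1 (an assumption about the page content).
line='<span class="version-number">10.2.2</span>'

# Lookbehind: the prefix must precede the match but is not part of it.
with_lookbehind=$(printf '%s\n' "$line" | grep -oP '(?<="version-number">)[^<]*')

# \K: match the prefix normally, then discard it from the reported match.
with_k=$(printf '%s\n' "$line" | grep -oP '"version-number">\K[^<]*')

printf '%s\n%s\n' "$with_lookbehind" "$with_k"
```

Both commands print 10.2.2 for the sample line. Unlike a lookbehind, the part before \K may have variable length.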
 
Old 02-05-2024, 09:24 PM   #8
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,249

Rep: Reputation: 5323
Obligatory:

https://blog.codinghorror.com/parsin...e-cthulhu-way/
 
Old 02-06-2024, 06:14 AM   #9
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 2,832

Rep: Reputation: 1218
Quote:
Originally Posted by pan64 View Post
you can use something like:
Code:
curl -s https://packagist.org/packages/drupal/core | grep -oP '(?<="version-number">)[^<]*'
but as it was explained the correct way is to use xmlstarlet, not a regex
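The (?<=...) is a lookbehind. It has to match immediately before the current position, but it is not part of the match itself, so it does not appear in the output.
grep -oP '...' is like perl -lne 'm#...# and print $&', except that grep -o prints every match on a line while this Perl one-liner prints only the first.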
Code:
curl -s https://packagist.org/packages/drupal/core | perl -lne 'm#(?<="version-number">)[^<]*# and print $&'
Perl is really the master of extended regular expressions.
Another method is a reference $1 to a (capture group):
Code:
curl -s https://packagist.org/packages/drupal/core | perl -lne 'm#"version-number">([^<]*)# and print $1'
grep -oP cannot do this, but bash builtins can:
Code:
var=$(curl -s https://packagist.org/packages/drupal/core); [[ $var =~ '"version-number">'([^<]*) ]]; echo "${BASH_REMATCH[1]}"
<yanetut>
I think I hit a bug in my bash 5.0.17: it treats a > in the ERE as a redirection attempt. Would this ever make sense??
The workaround is to escape or quote the >.
Here I enclosed it in the 'string' that also quotes the "" quotes.
</yanetut>
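One common way to sidestep quoting trouble inside [[ ... =~ ... ]] is to keep the regex in a variable and expand it unquoted. The following is a minimal sketch of that idea, assuming bash and a fixed sample line from post #1; the variable names html, re and version are made up for the illustration.

```shell
#!/usr/bin/env bash
# Sample input (an assumption; the real input would come from curl).
html='<span class="version-number">10.2.2</span>'

# Keep the ERE in a variable so that > " ( ) need no special quoting later.
re='"version-number">([^<]*)'

version=''
# $re must stay unquoted here, otherwise bash matches it as a literal string.
if [[ $html =~ $re ]]; then
    version=${BASH_REMATCH[1]}
fi

printf '%s\n' "$version"
```

For the sample line this prints 10.2.2.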

Last edited by MadeInGermany; 02-06-2024 at 08:45 AM.
 
Old 02-06-2024, 06:51 AM   #10
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,152

Rep: Reputation: 4125
And what, pray tell, is wrong with good ol' sed?

Every time I see you lot spruiking xpath tools I have to go and re-learn them. With sed I know I can get an answer immediately. What a dinosaur I be ...
 
Old 02-06-2024, 07:59 AM   #11
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,039

Rep: Reputation: 7347
Yes, sed is exactly as good as grep. Or perl. By the way, if you use perl you might as well download the page with it too; there is no need for curl. With perl you can also use a proper HTML parser instead of a regex.
 
Old 02-06-2024, 11:12 AM   #12
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 2,832

Rep: Reputation: 1218
<sed nerds>
A simple partial match cannot be used in sed, because a \1 back-reference only works on the current RE, not on a previous RE command.
And sed doesn't support lookahead or lookbehind either.
==> You need a substitution of the full line.
Code:
sed -En 's#.*"version-number">([^<]*).*#\1#p'
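The leading and trailing .* are there to cover the full line. This is slower than a partial match.
With GNU sed you can strip the leading part first and, only if that substitution succeeded, strip the trailing part (T is a GNU extension that branches to the end of the script when no substitution has been made):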
Code:
sed -En 's#.*"version-number">##; T; s#<.*##; p'
Or, without T, by selecting the line with an address first:
Code:
sed -En '\#.*"version-number"># { s###; s#<.*##; p; }'
The s### (empty RE) reuses the last regular expression that was applied, here the one from the address. It is in fact the only supported back-reference to a previous RE command.
</sed nerds>
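For anyone who wants to try the sed variants without hitting the network, here is a small sketch that feeds them a fixed sample line. The sample line is taken from post #1 and is an assumption about the page content; the variable names are made up for the illustration.

```shell
# Sample line from post #1 (an assumption about the page content).
line='<span class="version-number">10.2.2</span>'

# Full-line substitution: .* on both sides, capture group in the middle.
full=$(printf '%s\n' "$line" | sed -En 's#.*"version-number">([^<]*).*#\1#p')

# Address plus empty RE: s### reuses the address regex to delete the prefix,
# then the second s command deletes everything from the first < onward.
two_step=$(printf '%s\n' "$line" | sed -En '\#.*"version-number"># { s###; s#<.*##; p; }')

printf '%s\n%s\n' "$full" "$two_step"
```

Both commands print 10.2.2 for the sample line.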
 
Old 02-06-2024, 11:23 AM   #13
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,751

Rep: Reputation: 7983
Quote:
Originally Posted by debianfella View Post
Code:
curl -s https://packagist.org/packages/drupal/core | grep -oP '<span class="version-number">'.*
Code:
<span class="version-number">10.2.2</span>
How to take only the 10.2.2 from this? This command failed (no output is given):
Code:
curl -s https://packagist.org/packages/drupal/core | grep -oP '<span class="version-number">'.* | grep -oP '^\d+(\.\d+)*$'
How to grep on grep with regex? How to match only numbers and dots from a grep?
Yet another sed solution:
Code:
 curl -s https://packagist.org/packages/drupal/core | grep -oP '<span class="version-number">'.* | sed -e 's/.*>\(.*\)<.*/\1/'
 
Old 02-06-2024, 01:59 PM   #14
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,039

Rep: Reputation: 7347
Code:
curl -s https://packagist.org/packages/drupal/core | sed -z 's/.*"version-number">//;s/<.*//'
No back-reference, no difficult regex, no lookahead (or lookbehind), no -E. There is also no need to use both grep and sed. Note that -z is a GNU extension; here it makes sed treat the whole input as a single record.
You can do something similar in awk, and it is still unreliable.
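A possible awk version, as a rough sketch only: it uses match() to find the marker plus the text up to the next <, then strips the marker. It is run here against a fixed sample line from post #1 (an assumption about the page content) and it is just as fragile as the other regex approaches if the markup changes.

```shell
# Sample line from post #1 (an assumption about the page content).
line='<span class="version-number">10.2.2</span>'

awk_version=$(printf '%s\n' "$line" | awk '
    # match() sets RSTART/RLENGTH to the marker plus the text up to the next <
    match($0, /"version-number">[^<]*/) {
        s = substr($0, RSTART, RLENGTH)
        sub(/.*>/, "", s)   # drop the marker, keep only the version text
        print s
        exit                # stop after the first hit
    }')

printf '%s\n' "$awk_version"
```

For the sample line this prints 10.2.2.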
 
  

