[SOLVED] Simple REGEX to return price and date from web page for kmymoney online price updater
Hello,
I use KMyMoney to manage my finances, and its online price updater relies on regexes to retrieve investment prices and dates. Unfortunately, most (if not all) of the online sources have proven extremely unreliable, and I am fed up with the constant errors, so I have decided to retrieve prices directly from my investment firms' websites with my own regexes.
Unfortunately, I have never worked with regexes before. KMyMoney seems to "download" a copy of the web page's source code, then use a regex to return the price and date. In my case, the relevant section of the page source looks like this (heavily truncated):
Quote:
Kmymoney seems to "download" a copy of the web page source code, then using regex, return price/date.
I doubt this is what KMyMoney actually does. It fetches stock quotes from finance.yahoo.com, and the latter provides the data as CSV.
Check whether your investment firm provides the data as CSV as well, or whether the site has a JSON API, or similar. Otherwise, use an HTML parser, e.g. HTML-XML-utils.
@shruggy: what you're proposing is very interesting.
However:
1. Most of these investment firms (I am doing business with 6 of them) have NO API of any sort, nor do they offer what I call "raw data" (like a simple web page with just the numbers). They do provide a page with detailed ticker data (the investment's characteristics, price, fluctuation, and so on) and some pages with tables, but each page alone is several MBs because it is full of scripting and useless stuff...
2. Some offer downloadable data (like an Excel file, no CSV), but their spreadsheets are organized by the name of the investment, not the ticker. Moreover, the investments are mixed US/CAD, etc. A real mess.
If I were a conspiracy theorist, I'd say they are making it cumbersome and difficult to automate data retrieval from their sites...
I also looked at the web pages of some investments from the SAME firm, and the URLs are structured differently! If the URLs were identical except for a unique number or ticker, that would make things easier...
I am thinking of writing a script that would be called once per day via cron, make a local copy of all the pages for all the investments, extract the data with regexes or other utilities (such as those you suggested), then dump it into a small HTML file that KMyMoney could connect to and grab the numbers from.
However, even if that works, how long will it keep working? Only until these firms change something on their sites, which happens constantly.
Quote:
2. Some offer downloadable data (like an excel file, no CSV)
Of course, there are ways to convert from Excel to CSV. Gnumeric comes with ssconvert and ssgrep. catdoc has xls2csv for the old Excel format. For the new format (.xlsx), there are XLSX I/O and xlsx2csv. LibreOffice is also an option (it can be started with --headless on the command line, or use a wrapper script like unoconv).
But given all the uncertainties, I would just download quotes from finance.yahoo.com as everybody else does.
I tried to work with the Excel files, but believe it or not, 2 of my investments are NOT listed in them. I've sent an email to the investment firm to ask, but I agree with you: it's already complicated enough, so I'd rather work with web pages the way KMM was intended to work.
I was using Yahoo a while ago, but they made some changes and some investments disappeared. The Globe and Mail, on the other hand, seems to carry all of them. I used to work with that source, but they changed the page format (source code) and the regex that came with KMM stopped working, which is why I switched to Yahoo...
Now I just need a way to get the prices and dates from the Globe and Mail, and everything will work as before.
I need to extract 26.6252 and 02/18/22 from that blob...
Do you have an idea how? I must have tried 100 ways to get the data, each time getting it wrong: returning everything after the number, or everything before it, or only the integer part (19 instead of 19.0026), or nothing at all... If you are good with regexes, could you point me in the right direction?
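Since the actual page excerpt was truncated above, here is a hedged sketch of the kind of extraction being asked about. The sample `page` string is an invented stand-in (the real Globe and Mail markup differs); the two patterns simply grab the first decimal number and the first MM/DD/YY date from the saved source:

```shell
# Hypothetical page snippet; the real markup around the price/date differs.
page='<span class="price">26.6252</span> ... <span class="date">02/18/22</span>'

# -o prints only the matching part of each line; -E enables extended regexes.
price=$(printf '%s' "$page" | grep -oE '[0-9]+\.[0-9]+' | head -n1)
quote_date=$(printf '%s' "$page" | grep -oE '[0-9]{2}/[0-9]{2}/[0-9]{2}' | head -n1)

echo "$price $quote_date"   # 26.6252 02/18/22
```

Inside KMyMoney's own quote-source settings, the equivalent patterns would need a capture group (something like `([0-9]+\.[0-9]+)` for the price), since KMM extracts the first captured group rather than the whole match.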
Thanks, guys, for saving my mental health... That coder who lost his mind had me laughing!!!
For the resolution of this topic: what I ended up doing is writing a very simple bash script, using pup and jq as the main tools, to extract the prices and dates, then passing the output to KMyMoney with the precious help of this project's main developer.
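For readers wondering what the pup-plus-jq combination looks like: pup's `json{}` display function turns the matched elements into a JSON array, and jq then pulls out the `text` field. The JSON below is inlined by hand (an assumed shape, so the jq step can run without pup or a network fetch); on the live page it would come from something like `curl -s "$url" | pup 'span.price json{}'`:

```shell
# Assumed output of: pup 'span.price json{}' — inlined so this runs stand-alone.
json='[{"class":"price","tag":"span","text":"26.6252"}]'

# -r prints the raw string instead of a JSON-quoted value.
price=$(printf '%s' "$json" | jq -r '.[0].text')
echo "$price"   # 26.6252
```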
I'm trying to do something simple (?) with pup: automatically retrieve a specific value from Discogs. The median price is what I'm after. They have an API, which would be the best way to do this, but unfortunately (AFAIK) the stats (lowest, median and highest prices, last sold, etc.) are not available through the API...
Question 1: Would you use section 1 or 2, and does it matter?
Question 2: The output of the above command returns all of the values inside "span" tags. I tried to return only the value preceded by an "h4" tag containing the exact word "Median", but it didn't work.
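One workaround that sidesteps sibling selectors entirely: flatten the stats block to plain text first (e.g. with pup's `text{}`), then let awk pick the value on the line after the exact "Median:" label. The sample text and the selector in the comment are assumptions about the Discogs markup, not its actual structure:

```shell
# Stand-in for the output of something like: curl -s "$url" | pup '.statistics text{}'
stats='Lowest:
$10.00
Median:
$14.50
Highest:
$22.00'

# Match the exact label line, read the next line, print it, and stop.
median=$(printf '%s\n' "$stats" | awk '$0 == "Median:" { getline; print; exit }')
echo "$median"
```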
Now, if I need to extract multiple values from a single page, I'd have to curl that page multiple times, because the piped commands "drill down" into the JSON contents and therefore don't allow me to "climb" back up and perform another query... Is there a better way to perform multiple queries on a single page without "bombing" the remote server with multiple downloads, or is there no issue doing so?
Quote:
Is there a better way to perform multiple queries on a single page without "bombing" the remote server with multiple downloads or is there no issue doing so?
Store the result in a file, then perform the queries on that file; roughly:
The > writes the command's stdout to the filename specified. In some instances you would need `command ... < filename.json` to read from stdin instead, but jq accepts a filename as an argument, so that's not necessary here.
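The pattern above can be sketched end to end. The JSON body and its field names are stand-ins for the real page; the one `printf` simulates the single download:

```shell
# One download (here simulated with printf instead of: curl -s "$url" > "$tmp"),
# then as many jq queries against the saved file as needed.
tmp=$(mktemp)
printf '%s' '{"stats":{"median":"14.50","lowest":"10.00"}}' > "$tmp"

median=$(jq -r '.stats.median' "$tmp")
lowest=$(jq -r '.stats.lowest' "$tmp")
rm -f "$tmp"

echo "$median $lowest"   # 14.50 10.00
```

The remote server sees exactly one request, no matter how many values you pull out afterwards.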
When working on the command line, it's very useful to understand piping and redirecting.