LinuxQuestions.org
Old 03-31-2022, 04:22 AM   #1
aristosv
Member
 
Registered: Dec 2014
Posts: 263

Rep: Reputation: 3
detect and extract phone numbers


I am using a software integration platform to connect to a Google calendar and read appointment events. Each event has a small description and phone number in the Summary and I need a regular expression to extract those phone numbers.

The problem is that each person entering an event formats the phone number differently. Some phone numbers have spaces or a country prefix, and in some cases there are other (irrelevant) numbers in the Summary as well. I don’t mind keeping the country prefix if they enter it, but I do need to detect which number is the phone number.

So, here’s some information on the numbers:

- If the user entered a prefix there will be a 357 or 00357 or +357 in the summary.
- After the prefix (if it exists) each phone number will start with 94 or 95 or 96 or 97 or 99.
- Then 6 more numbers will follow.
- There could also be spaces between the numbers

Eventually a correct phone number will be

99123456 or 35799123456 or +35799123456
99 can also be 94 or 95 or 96 or 97

I don’t know if this is too complicated or if it’s even possible to create a regex to correctly extract the phone numbers. Any help is appreciated.

Thanks
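For illustration only, the rules listed in this post could be sketched as a single pattern. This is a minimal sketch in Python's `re` syntax, which is an assumption since the thread never names the regex engine; `PHONE_RE` and `find_phone_numbers` are hypothetical names, and only plain spaces are tolerated between digits.

```python
import re
from typing import List

# Optional prefix (+357, 00357 or 357), then 94/95/96/97/99, then six digits.
# "[ ]*" between tokens tolerates spaces inside the number.
PHONE_RE = re.compile(
    r"(?<![\d+])"                       # not glued to a preceding digit or '+'
    r"(?:(?:\+|00)?3[ ]*5[ ]*7[ ]*)?"   # optional country prefix
    r"9[ ]*[45679]"                     # 94, 95, 96, 97 or 99
    r"(?:[ ]*\d){6}"                    # six more digits
    r"(?!\d)"                           # no further digit directly after
)


def find_phone_numbers(summary: str) -> List[str]:
    """Return every match in the summary with the inner spaces removed."""
    return [m.group(0).replace(" ", "") for m in PHONE_RE.finditer(summary)]


print(find_phone_numbers("Call +357 99 123 456 room 12"))
```

Because spaces are allowed anywhere inside the number, an unrelated number separated from the phone number only by a space can still confuse the match, so the results are worth spot-checking against real summaries.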
 
Old 03-31-2022, 04:56 AM   #2
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,153

Rep: Reputation: 4125
Anything is possible - but dragons be there.

See what you think after reading this
 
1 members found this post helpful.
Old 03-31-2022, 06:18 AM   #3
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,011

Rep: Reputation: 3194
I am with syg00: "anything" is always possible and is normally limited only by how complicated you make it.

So on top of the link provided, it is great that you have some examples, but what you haven't told us is which regular expression engine you are using, nor given a few examples of what you have already tried.

You will find that your criteria will mostly shape your expression, and that in and of itself gives you an idea of how complicated you might make this.
 
Old 03-31-2022, 06:35 AM   #4
rtmistler
Moderator
 
Registered: Mar 2011
Location: USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 9,883
Blog Entries: 13

Rep: Reputation: 4931
Make it an actual challenge. Write it using absolutely no include files whatsoever.

Start right to left parsing the supposed number.

It's a state machine, simply put.
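One way to read this suggestion, sketched in Python with no imports at all: walk the text from the last character towards the first and, at each digit that ends a run, try to consume six digits, then one of 4/5/6/7/9, then a 9, then optionally 7-5-3 and a leading `+` or `00`. The function names and the exact acceptance rules are assumptions made for this sketch, not something stated in the thread.

```python
DIGITS = "0123456789"
# Allowed characters per position, reading from the LAST digit leftwards.
REQUIRED = [DIGITS] * 6 + ["45679", "9"]   # six digits, then 9x reversed
OPTIONAL = ["7", "5", "3"]                 # country code 357, reversed


def read_leftwards(text, pos, expected):
    """Consume `expected` moving left from pos, skipping spaces.
    Return (chars_read, new_pos), or None on the first mismatch."""
    out = []
    for allowed in expected:
        while pos >= 0 and text[pos] == " ":
            pos -= 1
        if pos < 0 or text[pos] not in allowed:
            return None
        out.append(text[pos])
        pos -= 1
    return out, pos


def number_ending_at(text, end):
    """Return (number, index_left_of_it) if a number ends at text[end]."""
    step = read_leftwards(text, end, REQUIRED)
    if step is None:
        return None
    chars, pos = step
    prefix = read_leftwards(text, pos, OPTIONAL)
    if prefix is not None:
        chars += prefix[0]
        pos = prefix[1]
        if pos >= 0 and text[pos] == "+":
            chars.append("+")
            pos -= 1
        elif pos >= 1 and text[pos] == "0" and text[pos - 1] == "0":
            chars += ["0", "0"]
            pos -= 2
    if pos >= 0 and text[pos] in DIGITS:
        return None                      # tail of a longer digit run
    return "".join(reversed(chars)), pos


def extract_numbers(text):
    """Scan right to left; return the numbers in left-to-right order."""
    found = []
    pos = len(text) - 1
    while pos >= 0:
        ends_run = text[pos] in DIGITS and (
            pos + 1 == len(text) or text[pos + 1] not in DIGITS)
        hit = number_ending_at(text, pos) if ends_run else None
        if hit:
            found.append(hit[0])
            pos = hit[1]
        else:
            pos -= 1
    return found[::-1]


print(extract_numbers("Dentist 99 123 456"))
```

Parsing from the right works here because the tail of the number has a fixed length, so the optional prefix is the only part that needs a decision.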
 
1 members found this post helpful.
Old 03-31-2022, 08:58 AM   #5
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,753

Rep: Reputation: 7983
Very complicated...mainly because you're asking users to enter data, and they'll do whatever they want. rtmistler's approach is the best way to do it given what you have now, but if it were me, I'd address it on the front end. Modify the entry fields to split this stuff up, and make them all required fields; setting those fields to only accept numbers should be fairly easy as well, so (to use the USA number as an example), you'd have:
Code:
COUNTRYCODE (up to 3 digit numeric) AREACODE (3 digit numeric) PREFIX (3 digit numeric) SUFFIX (4 digit numeric).
When outputting the data to whatever else you want, just join things together to get what you need. Don't ALLOW the user to enter junk, and you won't have to deal with junk.
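A rough sketch of that front-end idea in Python, assuming the four fields arrive as separate strings. The field names, the `join_phone_fields` helper and the `+`-prefixed output format are all hypothetical choices for illustration, following the USA layout above.

```python
# Field name -> (minimum digits, maximum digits), per the USA layout above.
FIELD_RULES = {
    "country_code": (1, 3),
    "area_code": (3, 3),
    "prefix": (3, 3),
    "suffix": (4, 4),
}


def join_phone_fields(**fields):
    """Validate each required numeric field, then join them into one string."""
    parts = []
    for name, (lo, hi) in FIELD_RULES.items():
        value = fields.get(name, "")      # a missing field fails the length check
        if not (value.isascii() and value.isdigit() and lo <= len(value) <= hi):
            raise ValueError(f"{name}: expected {lo}-{hi} digits, got {value!r}")
        parts.append(value)
    return "+" + "".join(parts)


print(join_phone_fields(country_code="1", area_code="205",
                        prefix="555", suffix="0123"))
```

Rejecting bad input at entry time means nothing downstream ever has to guess where a number starts or ends.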
 
2 members found this post helpful.
Old 03-31-2022, 06:31 PM   #6
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,249

Rep: Reputation: 5323
Hire people to do manual data cleaning and data entry.

No-one wants to hear this, but I guarantee you it will be less costly and more reliable than any technical solution you kludge up to try to avoid that.

How many entries are we talking about here?

Last edited by dugan; 03-31-2022 at 07:47 PM.
 
Old 03-31-2022, 10:35 PM   #7
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,369

Rep: Reputation: 2753
1. Manual cleaning is possible if both of the following are true:

1.1 limited/reasonable num of recs to fix
1.2 this is a one-off exercise.

Otherwise

2. As per TB0ne, restrict the entry format (if possible !)

3. As per syg00's link to Perlmonks....
Basically get to know your data real well and start coming up with rules that catch the most common and gradually add more for less common formats.

Basic rules for this
3.1 reduce all nums to just a list of digits ie remove non-numeric chars
3.2 make the code very strict, so that it only passes numbers (strings) it recognises and flags all the rest to you for further disposition (and more code).

There is no general answer, but if this is (as it appears to be) a situation with a limited num (sic) of likely formats, it is amenable over time.
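Rules 3.1 and 3.2 might look something like the following. It is shown in Python rather than Perl purely as an illustration; `KNOWN_FORMATS` and `triage` are made-up names, and the accepted shapes are taken from the original post's description of the numbers.

```python
import re

# Strict, explicit list of accepted digit-only shapes. Add a new entry
# only after a flagged record shows that a new format really occurs.
KNOWN_FORMATS = [
    re.compile(r"9[45679]\d{6}"),        # local number
    re.compile(r"3579[45679]\d{6}"),     # 357 or +357 prefix ('+' is stripped)
    re.compile(r"003579[45679]\d{6}"),   # 00357 prefix
]


def triage(candidates):
    """Split raw strings into (accepted digit strings, flagged originals)."""
    accepted, flagged = [], []
    for raw in candidates:
        digits = re.sub(r"\D", "", raw)              # rule 3.1: digits only
        if any(p.fullmatch(digits) for p in KNOWN_FORMATS):
            accepted.append(digits)                  # rule 3.2: known shape
        else:
            flagged.append(raw)                      # left for manual review
    return accepted, flagged


ok, todo = triage(["+357 99-123-456", "99 123 456", "tel 12345"])
print(ok, todo)
```

The flagged list is the useful part: each entry is either junk or a hint that another rule is needed.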

HTH
PS I would definitely use Perl for this - data munging is one of its strengths, especially as regexes are a built-in function as per syg00's link.
 
Old 04-01-2022, 08:07 AM   #8
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,691
Blog Entries: 4

Rep: Reputation: 3947
Most programming libraries have a collection of "standard" regular-expression patterns. Many of these are based on Perl's Regexp::Common library, whose patterns by-the-way can be looked up in the Perl source-code (on line ...), cabbaged and used.

This point bears repeating: Whenever you find yourself trying to figure out how to do something or how to build something, the first thought in your mind – your first reflexive response – ought to be, "Hasn't somebody else out there already done this, and shared it?" GitHub, SourceForge, a polite inquiry on LQ, a search of your programming language's contributed library. You will probably find what you are looking for, so that all that's really left for you to do is a little "kitbashing."

Quote:
Actum Ne Agas: Do Not Do A Thing Already Done.

Last edited by sundialsvcs; 04-01-2022 at 08:13 AM.
 
Old 04-02-2022, 01:42 AM   #9
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 19,872
Blog Entries: 12

Rep: Reputation: 6053
Maybe you can do it in 2 steps - recognize & extract strings of digits (potential phone numbers) first, then massage them into a standard format.
Quote:
Originally Posted by aristosv
- If the user entered a prefix there will be a 357 or 00357 or +357 in the summary.
- After the prefix (if it exists) each phone number will start with 94 or 95 or 96 or 97 or 99.
I think 00357 or +357 is correct, but 357 is wrong.
But since a phone number can only start with a 9, that shouldn't be a problem.

Of course, in the future you should tell your customers to enter phone numbers into a dedicated calendar entry field with quality control.
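The two-step idea from this post could be sketched like this in Python. The `+357XXXXXXXX` target format and the names `CANDIDATE_RE`, `normalise` and `extract_normalised` are assumptions made for the example.

```python
import re

# Step 1: anything that looks like a run of digits, possibly with spaces.
CANDIDATE_RE = re.compile(r"\+?\d[\d ]*\d")


def normalise(candidate):
    """Step 2: reduce a candidate to +357 plus eight digits, or return None."""
    digits = candidate.replace(" ", "")
    # A local number starts with 9, so a leading 357 can only be the prefix.
    for prefix in ("+357", "00357", "357"):
        if digits.startswith(prefix):
            digits = digits[len(prefix):]
            break
    if len(digits) == 8 and digits.isdigit() and digits[0] == "9" and digits[1] in "45679":
        return "+357" + digits
    return None


def extract_normalised(summary):
    """Run both steps and keep only candidates that normalise cleanly."""
    results = []
    for match in CANDIDATE_RE.finditer(summary):
        number = normalise(match.group(0))
        if number:
            results.append(number)
    return results


print(extract_normalised("Anna 99 123 456, Bob 0035797111222; ref #4471"))
```

One known limitation of this sketch: two numbers separated only by spaces are picked up as a single candidate in step 1 and then rejected in step 2.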
 
  

