detect and extract phone numbers

aristosv · 03-31-2022, 04:22 AM

I am using a software integration platform to connect to a Google calendar and read appointment events. Each event has a small description and phone number in the Summary and I need a regular expression to extract those phone numbers.

The problem is that each person entering an event formats the phone number differently. Some phone numbers have spaces or a country prefix, and in some cases there are other (irrelevant) numbers in the Summary as well. I don’t mind keeping the country prefix if they enter it, but I do need to detect which number is the phone number.

So, here’s some information on the numbers:

- If the user entered a prefix, there will be a 357 or 00357 or +357 in the summary.
- After the prefix (if it exists), each phone number will start with 94 or 95 or 96 or 97 or 99.
- Then 6 more digits will follow.
- There could also be spaces between the digits.

So a correct phone number will look like this:

99123456 or 35799123456 or +35799123456
99 can also be 94 or 95 or 96 or 97

I don’t know if this is too complicated or if it’s even possible to create a regex to correctly extract the phone numbers. Any help is appreciated.

Thanks

syg00 · 03-31-2022, 04:56 AM

Anything is possible - but dragons be there.

See what you think after reading this

grail · 03-31-2022, 06:18 AM

I am with syg00: "anything" is always possible, and it is normally limited only by how complicated you make it.

On top of the link provided: it is great that you have some examples, but you haven't said which regular expression engine you are using, or shown a few examples of what you have already tried.

You will find that your criteria will mostly shape your expression, and that in and of itself gives you an idea of how complicated this might get.

rtmistler · 03-31-2022, 06:35 AM

Make it an actual challenge. Write it using absolutely no include files whatsoever.

Start parsing the supposed number from right to left.

It's a state machine, simply put.
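
One way to read this suggestion is a small hand-written state machine that walks the candidate string from its last character to its first. The sketch below is in Python, uses no imports, and assumes the candidate has already been isolated from the rest of the Summary. The function name `parse_right_to_left` and the state names are made up for illustration; the accepted prefixes and leading digit pairs come from the original post.

```python
VALID_STARTS = ("94", "95", "96", "97", "99")
VALID_PREFIXES = ("", "357", "00357", "+357")
DIGITS = "0123456789"


def parse_right_to_left(candidate):
    """Return the number without spaces, or None if it breaks the rules."""
    local = []    # the 8 local digits, collected in reverse order
    prefix = []   # whatever precedes them, collected in reverse order
    state = "LOCAL"
    for ch in reversed(candidate):
        if ch == " ":
            continue                  # spaces may appear anywhere
        if state == "LOCAL":
            if ch not in DIGITS:
                return None
            local.append(ch)
            if len(local) == 8:
                state = "PREFIX"
        elif state == "PREFIX":
            if ch in DIGITS:
                prefix.append(ch)
            elif ch == "+":
                prefix.append(ch)
                state = "DONE"        # nothing may come before the '+'
            else:
                return None
        else:
            return None               # extra character before the '+'
    if len(local) != 8:
        return None
    local_str = "".join(reversed(local))
    prefix_str = "".join(reversed(prefix))
    if local_str[:2] not in VALID_STARTS:
        return None
    if prefix_str not in VALID_PREFIXES:
        return None
    return prefix_str + local_str
```

With this sketch, `parse_right_to_left("+357 99 123 456")` gives `"+35799123456"`, while `parse_right_to_left("12345678")` gives `None` because 12 is not one of the allowed starts.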

TB0ne · 03-31-2022, 08:58 AM

Very complicated...mainly because you're asking users to enter data, and they'll do whatever they want. rtmistler's approach is the best way to do it given what you have now, but if it were me, I'd address it on the front end. Modify the entry fields to split this stuff up, and make them all required fields; setting those fields to only accept numbers should be fairly easy as well, so (to use a USA number as an example) you'd have:

Code:

COUNTRYCODE (up to 3 digit numeric) AREACODE (3 digit numeric) PREFIX (3 digit numeric) SUFFIX (4 digit numeric).

When outputting the data to whatever else you want, just join things together to get what you need. Don't ALLOW the user to enter junk, and you won't have to deal with junk.
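
A rough Python sketch of that idea, assuming the entry form hands over four separate strings. The class name `PhoneEntry`, the helper `_require_digits`, and the method `joined` are hypothetical; the field names and lengths follow the USA example above.

```python
from dataclasses import dataclass


def _require_digits(name, value, min_len, max_len):
    """Raise ValueError unless value is min_len..max_len ASCII digits."""
    ok = value.isascii() and value.isdigit() and min_len <= len(value) <= max_len
    if not ok:
        raise ValueError(f"{name} must be {min_len} to {max_len} digits, got {value!r}")


@dataclass(frozen=True)
class PhoneEntry:
    country_code: str   # up to 3 digits
    area_code: str      # exactly 3 digits
    prefix: str         # exactly 3 digits
    suffix: str         # exactly 4 digits

    def __post_init__(self):
        # Every field is required and numeric; junk is rejected at entry time.
        _require_digits("country_code", self.country_code, 1, 3)
        _require_digits("area_code", self.area_code, 3, 3)
        _require_digits("prefix", self.prefix, 3, 3)
        _require_digits("suffix", self.suffix, 4, 4)

    def joined(self):
        """Join the validated fields into one string for output."""
        return self.country_code + self.area_code + self.prefix + self.suffix
```

For example, `PhoneEntry("1", "212", "555", "0123").joined()` returns `"12125550123"`, and a field such as `"21a"` raises `ValueError` before anything is stored.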

dugan · 03-31-2022, 06:31 PM

Hire people to do manual data cleaning and data entry.

No-one wants to hear this, but I guarantee you it will be less costly and more reliable than any technical solution you kludge up to try to avoid that.

How many entries are we talking about here?

chrism01 · 03-31-2022, 10:35 PM

1. Manual cleaning is possible if both of the following are true:

1.1 there is a limited/reasonable number of records to fix
1.2 this is a one-off exercise

Otherwise

2. As per TB0ne, restrict the entry format (if possible !)

3. As per syg00's link to Perlmonks....
Basically get to know your data real well and start coming up with rules that catch the most common formats, then gradually add more for the less common ones.

Basic rules for this:
3.1 reduce all numbers to just a list of digits, i.e. remove non-numeric characters
3.2 make the code very strict, so that it only passes numbers (strings) it recognises and flags all the rest to you for further disposition (and more code).

There is no general answer, but if this is (as it appears to be) a situation with a limited num (sic) of likely formats, it is amenable over time.
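
Rules 3.1 and 3.2 could be sketched roughly like this. The example is in Python rather than Perl and is only an illustration: the function name `triage` is made up, and it assumes each entry holds at most one phone number. The allowed prefixes and leading digits are the ones from the original post. Note that stripping non-digits also drops the `+`, so +357 and 357 end up looking the same.

```python
import re

# Strict rule: optional 00357 or 357, then 9 plus one of 4/5/6/7/9, then 6 digits.
STRICT = re.compile(r"(?:00357|357)?9[45679][0-9]{6}")


def triage(entries):
    """Split entries into (accepted digit strings, flagged original entries)."""
    accepted, flagged = [], []
    for entry in entries:
        digits = re.sub(r"[^0-9]", "", entry)   # rule 3.1: keep digits only
        if STRICT.fullmatch(digits):            # rule 3.2: pass only known shapes
            accepted.append(digits)
        else:
            flagged.append(entry)               # left for a human to look at
    return accepted, flagged
```

An entry such as `"Room 3 99123456"` collapses to nine digits and is flagged rather than guessed at, which is the point of keeping the code strict.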

HTH
PS I would definitely use Perl for this - data munging is one of its strengths, especially as regexes are built into the language, as per syg00's link.

sundialsvcs · 04-01-2022, 08:07 AM

Most programming libraries have a collection of "standard" regular-expression patterns. Many of these are based on Perl's Regexp::Common library, whose patterns by-the-way can be looked up in the Perl source-code (on line ...), cabbaged and used.

This point bears repeating: Whenever you find yourself trying to figure out how to do something or how to build something, the first thought in your mind – your first reflexive response – ought to be, "Hasn't somebody else out there already done this, and shared it?" GitHub, SourceForge, a polite inquiry on LQ, a search of your programming language's contributed library. You will probably find what you are looking for, so that all that's really left for you to do is a little "kitbashing."

Quote:

Actum Ne Agas: Do Not Do A Thing Already Done.

ondoho · 04-02-2022, 01:42 AM

Maybe you can do it in 2 steps - recognize & extract strings of digits (potential phone numbers) first, then massage them into a standard format.
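
The two steps could look something like this in Python. It is a sketch only: the name `extract_numbers` and the choice of +357 followed by the eight local digits as the "standard format" are assumptions made for illustration, not something the original post asked for. The prefix and leading-digit rules are taken from the original post.

```python
import re

# Step 1: recognize candidates inside free text, allowing spaces between digits.
CANDIDATE = re.compile(
    r"(?<![0-9+])"                          # do not start inside a longer number
    r"(?:(?:\+|00)?[ ]*3[ ]*5[ ]*7[ ]*)?"   # optional 357, 00357 or +357
    r"9[ ]*[45679]"                         # 94, 95, 96, 97 or 99
    r"(?:[ ]*[0-9]){6}"                     # six more digits
    r"(?![0-9])"                            # do not stop inside a longer number
)


def extract_numbers(summary):
    """Return every phone number found in summary, normalized to +357XXXXXXXX."""
    numbers = []
    for match in CANDIDATE.finditer(summary):
        # Step 2: massage into one format. Drop spaces and '+', keep the
        # last 8 digits (the local number), and add a uniform +357 prefix.
        digits = re.sub(r"[^0-9]", "", match.group())
        numbers.append("+357" + digits[-8:])
    return numbers
```

For instance, `extract_numbers("Haircut for Maria, room 3, call 00357 99 123 456")` returns `["+35799123456"]`; the stray 3 is ignored because it does not fit the pattern.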

Quote:

Originally Posted by aristosv

- If the user entered a prefix there will be a 357 or 00357 or +357 in the summary.
- After the prefix (if it exists) each phone number will start with 94 or 95 or 96 or 97 or 99.

I think 00357 and +357 are correct, but a bare 357 is wrong.
But since the local number can only start with a 9, a leading 357 cannot be mistaken for the start of the number itself, so that shouldn't be a problem.

Of course, in the future you should tell your customers to enter phone numbers into a dedicated calendar entry field with quality control.