I am using a software integration platform to connect to a Google calendar and read appointment events. Each event has a small description and phone number in the Summary and I need a regular expression to extract those phone numbers.
The problem is that each person entering an event formats the phone number differently. Some phone numbers have spaces or a country prefix, and in some cases there are other (irrelevant) numbers in the Summary as well. I don’t mind keeping the country prefix if they enter it, but I do need to detect which number is the phone number.
So, here’s some information on the numbers:
- If the user entered a prefix there will be a 357 or 00357 or +357 in the summary.
- After the prefix (if it exists) each phone number will start with 94 or 95 or 96 or 97 or 99.
- Then 6 more digits will follow.
- There could also be spaces between the digits.
So a correct phone number will look like
99123456 or 35799123456 or +35799123456
99 can also be 94 or 95 or 96 or 97
I don’t know if this is too complicated or if it’s even possible to create a regex to correctly extract the phone numbers. Any help is appreciated.
I am with syg00, "anything" is always possible and normally only limited to how complicated you make it.
So on top of the link provided, it is great that you have some examples, but what you haven't told us is which regular expression engine you are using, and you haven't shown any examples of what you may have already tried.
You will find that it is your criteria that will mostly form your expression, and that in and of itself gives you an idea of how complicated you might make this.
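To illustrate how the criteria map onto an expression, here is a minimal sketch in Python. The engine is an assumption (the integration platform's regex flavour was never stated), and the names `PHONE_RE` and `extract_phones` are made up for this example. It accepts an optional +357, 00357 or 357 prefix, then 94/95/96/97/99, then six more digits, with optional whitespace between digits, and it refuses matches that are part of a longer run of digits.

```python
import re

# Hypothetical pattern built from the criteria in the question.
PHONE_RE = re.compile(
    r"""
    (?<!\d)                                  # not glued to a preceding digit
    (?:(?:\+\s*|0\s*0\s*)?3\s*5\s*7\s*)?     # optional +357 / 00357 / 357
    9\s*[45679]                              # 94, 95, 96, 97 or 99
    (?:\s*\d){6}                             # six more digits, spaces allowed
    (?!\d)                                   # not glued to a following digit
    """,
    re.VERBOSE,
)


def extract_phones(summary):
    """Return every phone-like match in the summary with whitespace removed."""
    return [re.sub(r"\s+", "", m.group(0)) for m in PHONE_RE.finditer(summary)]


print(extract_phones("Call +357 99123456 room 12"))
```

Because spaces are allowed between digits, an unrelated number typed directly before the phone number (separated only by a space) can still confuse the match, so treat the results as candidates rather than certainties.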
Very complicated...mainly because you're asking users to enter data, and they'll do whatever they want. rtmistler's approach is the best way to do it given what you have now, but if it were me, I'd address it on the front end. Modify the entry fields to split this stuff up, and make them all required fields; setting those fields to only accept numbers should be fairly easy as well, so (to use the USA number as an example), you'd have:
When outputting the data to whatever else you want, just join things together to get what you need. Don't ALLOW the user to enter junk, and you won't have to deal with junk.
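As a rough sketch of the "validate on entry, join on output" idea, assuming two hypothetical fields (`country_code` and `local_number`) that are not taken from any real form, something like this in Python would reject anything that is not purely digits before joining:

```python
def join_phone(country_code: str, local_number: str) -> str:
    """Join two digits-only fields into one number; reject anything else."""
    fields = (("country_code", country_code), ("local_number", local_number))
    for name, value in fields:
        # Empty strings, spaces, dashes and letters all fail this check.
        if not (value.isascii() and value.isdigit()):
            raise ValueError(f"{name} must contain digits only: {value!r}")
    return f"+{country_code}{local_number}"


print(join_phone("357", "99123456"))
```

The field split is only illustrative; the point is that the junk is refused at entry time instead of being parsed out later.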
Hire people to do manual data cleaning and data entry.
No-one wants to hear this, but I guarantee you it will be less costly and more reliable than any technical solution you kludge up to try to avoid that.
1. Manual cleaning is possible if both of the following are true:
1.1 limited/reasonable num of recs to fix
1.2 this is a one-off exercise.
Otherwise
2. As per TB0ne, restrict the entry format (if possible!)
3. As per syg00's link to Perlmonks....
Basically get to know your data really well and start coming up with rules that catch the most common formats, then gradually add more for the less common ones.
Basic rules for this:
3.1 reduce all nums to just a list of digits, i.e. remove non-numeric chars
3.2 make the code very strict, so that it only passes numbers (strings) it recognises and flags all the rest to you for further disposition (and more code).
There is no general answer, but if this is (as it appears to be) a situation with a limited num (sic) of likely formats, it is amenable over time.
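Rules 3.1 and 3.2 could look roughly like this. It is sketched in Python rather than Perl, and the names `VALID` and `classify` are invented for the example. Note that stripping non-numeric characters also drops a leading "+", so "+357" and "357" end up looking the same.

```python
import re

# Strict rule: optional 00357/357 prefix, then 94/95/96/97/99, then 6 digits.
VALID = re.compile(r"(?:00357|357)?9[45679]\d{6}")


def classify(candidates):
    """Split raw strings into (accepted digit strings, flagged raw strings)."""
    accepted, flagged = [], []
    for raw in candidates:
        digits = re.sub(r"\D", "", raw)   # rule 3.1: keep digits only
        if VALID.fullmatch(digits):       # rule 3.2: pass only what we recognise
            accepted.append(digits)
        else:
            flagged.append(raw)           # left for a human (and more rules)
    return accepted, flagged


ok, todo = classify(["+357 99 123456", "97-123-456", "12345678"])
print(ok, todo)
```

Anything that lands in the flagged list is what you study to decide on the next rule.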
HTH
PS I would definitely use Perl for this - data munging is one of its strengths, especially as regexes are a built-in function as per syg00's link.
Most programming libraries have a collection of "standard" regular-expression patterns. Many of these are based on Perl's Regexp::Common library, whose patterns, by the way, can be looked up in the Perl source-code (on line ...), cabbaged and used.
This point bears repeating: Whenever you find yourself trying to figure out how to do something or how to build something, the first thought in your mind – your first reflexive response – ought to be, "Hasn't somebody else out there already done this, and shared it?" GitHub, SourceForge, a polite inquiry on LQ, a search of your programming language's contributed library. You will probably find what you are looking for, so that all that's really left for you to do is a little "kitbashing."
Quote:
Actum Ne Agas: Do Not Do A Thing Already Done.
Last edited by sundialsvcs; 04-01-2022 at 08:13 AM.
Maybe you can do it in 2 steps - recognize & extract strings of digits (potential phone numbers) first, then massage them into a standard format.
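A minimal sketch of the two steps in Python, under some assumptions: the "standard format" chosen here (+357 followed by the 8-digit number) is just one possible target, and `CANDIDATE`, `normalize` and `phones_in` are hypothetical names.

```python
import re

# Step 1: runs of digits (spaces allowed inside), optionally led by "+".
CANDIDATE = re.compile(r"\+?\d[\d ]*\d")
# Step 2: optional 00357/357 prefix, then the 8-digit local number.
LOCAL = re.compile(r"(?:00357|357)?(9[45679]\d{6})")


def normalize(candidate):
    """Return '+357' plus the 8-digit number, or None if it does not fit."""
    digits = re.sub(r"\D", "", candidate)
    match = LOCAL.fullmatch(digits)
    return "+357" + match.group(1) if match else None


def phones_in(summary):
    """Extract candidates, then keep the ones that normalize cleanly."""
    results = []
    for candidate in CANDIDATE.findall(summary):
        number = normalize(candidate)
        if number is not None:
            results.append(number)
    return results


print(phones_in("Anna +357 96 654321 at 10"))
```

One limitation: an unrelated number separated from the phone number only by a space is swallowed into the same candidate and the whole thing is rejected, so rejected candidates are worth logging.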
Quote:
Originally Posted by aristosv
- If the user entered a prefix there will be a 357 or 00357 or +357 in the summary.
- After the prefix (if it exists) each phone number will start with 94 or 95 or 96 or 97 or 99.
I think 00357 or +357 is correct, but 357 is wrong.
But since a phone number can only start with a 9, that shouldn't be a problem.
Of course, in the future you should tell your customers to enter phone numbers into a dedicated calendar entry field with quality control.