I'm trying to find a regular expression that can validate and parse a string that does not have a fixed number of fields.
Code:
Bruce Willis,Richard Gere
Pink Floyd,Deep Purple,Uriah Heep,Ten Years After,Santana
I've managed to get the validation going while typing this message using the following RE
Code:
^([A-Za-z ]+)([,]([A-Za-z ]+))*$
Unfortunately this RE does not parse properly and I don't know how to get that right.
The current result is (for the second string)
Code:
Total match: 'Pink Floyd,Deep Purple,Uriah Heep,Ten Years After,Santana'
First group: 'Pink Floyd'
Second group: ',Santana'
Third group: 'Santana'
What I want to get out (for the second string) is
Code:
Total match: 'Pink Floyd,Deep Purple,Uriah Heep,Ten Years After,Santana'
First group: 'Pink Floyd'
Second group: 'Deep Purple'
Third group: 'Uriah Heep'
Fourth group: 'Ten Years After'
Fifth group: 'Santana'
1) Is there any way to achieve this using regular expressions?
And another question. I use Tcl and Java (just started) and found out that the result of a regexp can be significantly different.
Tcl returns 'Pink Floyd' for the match and Java returns 'Santana' when using ([A-Za-z ]+) as a regular expression on the second string.
TCL code
Code:
set match [regexp $regexp $text matchstr group1 group2 group3 group4 group5 group6 group7 group8 group9 group10]
2) Is this a coding issue in my code or a difference in implementation between the languages? (I'm aware that there are things like POSIX and Perl implementations.)
Pink Floyd,Deep Purple,Uriah Heep,Ten Years After,Santana
If you have structured data like that and all you ever want to get is those words separated by ",", use the fields/delimiter method, NOT a regular expression. Depending on what language you are using, there will be string splitting methods that can split a string into tokens using a delimiter. Check your language documentation.
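For the Java side, a minimal sketch of the fields/delimiter approach could look like the following. It keeps the regular expression from the first post for validation only and leaves the actual parsing to String.split. The class and method names (SplitFields, isValid, parse) are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class SplitFields {
    // The validation expression from the first post: one or more names
    // made of letters and spaces, separated by single commas.
    private static final Pattern VALID =
            Pattern.compile("^([A-Za-z ]+)([,]([A-Za-z ]+))*$");

    // True if the whole line has the expected shape.
    static boolean isValid(String line) {
        return VALID.matcher(line).matches();
    }

    // Validate first, then let split() do the parsing.
    static List<String> parse(String line) {
        if (!isValid(line)) {
            throw new IllegalArgumentException("Malformed line: " + line);
        }
        return Arrays.asList(line.split(","));
    }

    public static void main(String[] args) {
        String line = "Pink Floyd,Deep Purple,Uriah Heep,Ten Years After,Santana";
        List<String> fields = parse(line);
        for (int i = 0; i < fields.size(); i++) {
            System.out.println("item " + (i + 1) + " is: " + fields.get(i));
        }
    }
}
```

Because the line is validated before it is split, there are no empty fields to worry about, so the plain one-argument split is enough here.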
oh my god isn't java UGLY!
ghostdog is right, use split which splits into a tcl list.
Code:
#!/usr/bin/env tclsh
set n 0
while { [gets stdin line] >= 0 } {
    set n 0
    set list [ split $line ,]
    puts "Total match:$line"
    foreach name $list {
        incr n
        puts stdout "\titem $n is: $name"
    }
    puts ""
}
Code:
$ ./1.tcl < 1
Total match:Bruce Willis,Richard Gere
	item 1 is: Bruce Willis
	item 2 is: Richard Gere

Total match:Pink Floyd,Deep Purple,Uriah Heep,Ten Years After,Santana
	item 1 is: Pink Floyd
	item 2 is: Deep Purple
	item 3 is: Uriah Heep
	item 4 is: Ten Years After
	item 5 is: Santana
I like tcl
p.s. showing your age with that music
Last edited by bigearsbilly; 11-18-2009 at 04:12 AM.
Thanks for the replies. The regular expression implementations in Tcl and Java, and possibly in other languages, make it possible to parse the data into individual 'blobs'. So why not use that if it's possible? After all, I'm a lazy guy.
@bigearsbilly
We (or at least I) know you like Tcl, and you're not the only one. Until now I have managed to write all my applications that needed a GUI in Tcl/Tk. Unfortunately, I'm now forced to look at Java.
Last edited by Wim Sturkenboom; 11-18-2009 at 09:05 AM.
The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification then its previously-captured value, if any, will be retained if the second evaluation fails.
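A small sketch of that rule in action, using the validation expression and the band list from the first post. The class name LastCapture and the helper groups are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastCapture {
    // Returns groups 0..groupCount() after a full match,
    // or null if the line does not match at all.
    static String[] groups(String line) {
        Matcher m = Pattern.compile("^([A-Za-z ]+)([,]([A-Za-z ]+))*$").matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i); // null if the group took no part in the match
        }
        return result;
    }

    public static void main(String[] args) {
        String[] g = groups("Pink Floyd,Deep Purple,Uriah Heep,Ten Years After,Santana");
        for (int i = 0; i < g.length; i++) {
            System.out.println("Group " + i + " : '" + g[i] + "'");
        }
    }
}
```

The pattern contains three pairs of parentheses, so there are only ever three groups regardless of how often the starred group repeats. Groups 2 and 3 are overwritten on every repetition and end up holding ',Santana' and 'Santana', which is the result reported in the first post.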
Quote:
ghostdog is right, use split which splits into a tcl list.
1. regex variations; it is indeed true that different langs/tools often have slightly different regex engines. The best book that explains regexes and differences is here http://regex.info//
2. interesting music
3. BB you have my sympathy, had 8 mths out during the GFC
(Part of) the problem is my Java implementation: resultTextArea.setText clears the textarea, which hides the data from previous iterations of the while loop. I finally figured that out when an 'incorrect' but valid regular expression caused the program to become unresponsive (meaning it ended up in an endless loop); so I added a loop counter, and its output showed that I was only ever seeing the last result.
The revised code
Code:
// clear textarea once, before the loop; everything below appends
resultTextArea.setText(null);
m = p.matcher(text);
int start = 0;
int loopcnt = 1;
// find(start) resets the matcher and searches from index 'start'
while (m.find(start)) {
    resultTextArea.append("Loop : " + loopcnt + "\n");
    resultTextArea.append("Group cnt : " + m.groupCount() + "\n");
    for (int i = 0; i <= m.groupCount(); i++) {
        // groups that did not take part in the match are null
        if (m.group(i) != null) {
            resultTextArea.append("Group " + i + " : '" + m.group(i) + "' (" + m.start(i) + "," + m.end(i) + ")\n");
        }
    }
    // continue searching right after the current match
    start = m.end();
    resultTextArea.append("----\n");
    loopcnt++;
    // safety net: abort after 1000 iterations (guards against an endless loop and an overfull textarea)
    if (loopcnt > 1000) {
        resultTextArea.append("Aborting ... ");
        break;
    }
}
resultTextArea.append("DONE\n");
Using ([A-Za-z ]+),* as the regular expression now gives the following result for the bands:
Code:
Loop : 1
Group cnt : 1
Group 0 : 'Pink Floyd,' (0,11)
Group 1 : 'Pink Floyd' (0,10)
----
Loop : 2
Group cnt : 1
Group 0 : 'Deep Purple,' (11,23)
Group 1 : 'Deep Purple' (11,22)
----
Loop : 3
Group cnt : 1
Group 0 : 'Uriah Heep,' (23,34)
Group 1 : 'Uriah Heep' (23,33)
----
Loop : 4
Group cnt : 1
Group 0 : 'Ten Years After,' (34,50)
Group 1 : 'Ten Years After' (34,49)
----
Loop : 5
Group cnt : 1
Group 0 : 'Santana' (50,57)
Group 1 : 'Santana' (50,57)
----
DONE
Knowing that group 0 is always the actual match and group 1 (and higher) are the groups, I think that this issue is solvable in Java for my purposes.
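Stripped of the GUI parts, the same idea can be sketched as a small helper that collects group 1 of every match into a list. The class name FieldCollector and the method collect are made up for illustration. This is a sketch that assumes the input has already been validated, since the loop itself skips over anything the pattern does not match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FieldCollector {
    // One name (letters and spaces), followed by any number of commas.
    private static final Pattern FIELD = Pattern.compile("([A-Za-z ]+),*");

    // Collects group 1 of each successive match.
    static List<String> collect(String text) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(text);
        // find() without an argument resumes where the previous match ended,
        // so no explicit start index is needed.
        while (m.find()) {
            fields.add(m.group(1));
        }
        return fields;
    }

    public static void main(String[] args) {
        for (String field : collect("Pink Floyd,Deep Purple,Uriah Heep,Ten Years After,Santana")) {
            System.out.println("'" + field + "'");
        }
    }
}
```

Note that this expression on its own does not validate anything: a line such as "Pink Floyd,,Santana" simply yields two fields, so the stricter expression from the first post is still needed if malformed lines have to be rejected.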