Regular Expressions

Wim Sturkenboom · 11-18-2009, 03:15 AM

I'm trying to find a regular expression that can validate and parse a string that does not have a fixed number of fields.

Code:

Bruce Willis,Richard Gere
Pink Floyd,Deep Purple,Uriah Heep,Ten Years After,Santana

I've managed to get the validation going while typing this message using the following RE

Code:

^([A-Za-z ]+)([,]([A-Za-z ]+))*$

Unfortunately this RE does not parse properly and I don't know how to get that right.
The current result is (for the second string)

Code:

Total match:  'Pink Floyd,Deep Purple,Uriah Heep,Ten Years After,Santana'
First group:  'Pink Floyd'
Second group: ',Santana'
Third group:  'Santana'

What I want to get out (for the second string) is

Code:

Total match:  'Pink Floyd,Deep Purple,Uriah Heep,Ten Years After,Santana'
First group:  'Pink Floyd'
Second group: 'Deep Purple'
Third group:  'Uriah Heep'
Fourth group: 'Ten Years After'
Fifth group:  'Santana'

1) Is there any way to achieve this using regular expressions?

And another question. I use Tcl and Java (just started) and found out that the result of a regexp can be significantly different between the two.

Tcl returns 'Pink Floyd' for the match and Java returns 'Santana' when using ([A-Za-z ]+) as a regular expression on the second string.

Tcl code

Code:

set match [regexp $regexp $text matchstr group1 group2 group3 group4 group5 group6 group7 group8 group9 group10]

Java code

Code:

        Pattern p;
        Matcher m;
        try {
             p = Pattern.compile(regexp);
        }
        catch (PatternSyntaxException ePatternSyntaxException) {
            String Error = "" + ePatternSyntaxException;
            JOptionPane.showMessageDialog(null, Error, "Regular expression", JOptionPane.ERROR_MESSAGE);
            return;
        }

        m = p.matcher(text);
        int start = 0;
        while (m.find(start) == true)
        {
            resultTextArea.setText("Group cnt : " + Integer.toString(m.groupCount()) + "\n");
            for (int i=0; i<=m.groupCount(); i++) {
                if (m.group(i) != null) {
                    resultTextArea.append("Group " + Integer.toString(i) + " : '" + m.group(i) + "' (" + m.start(i) + "," + m.end(i) + ")\n");
                }
            }
            start = m.end();
            resultTextArea.append("----\n");
        }

2) Is this a bug in my code or a difference between the regex implementations of the two languages? (I'm aware that there are things like POSIX and Perl flavours.)

ghostdog74 · 11-18-2009, 03:31 AM

Code:

Pink Floyd,Deep Purple,Uriah Heep,Ten Years After,Santana

If you have structured data like that and all you ever want to get is those words separated by ",", use a fields/delimiter method, NOT a regular expression. Depending on what language you are using, there will be string splitting methods that can split a string into tokens using a delimiter. Check your language documentation.

bigearsbilly · 11-18-2009, 04:11 AM

oh my god isn't java UGLY!
ghostdog is right, use split which splits into a tcl list.

Code:

#!/usr/bin/env tclsh

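# Read lines from stdin and split each one on commas.
while { [gets stdin line] >= 0 } {
    set n 0                      ;# item counter, reset for every line
    set fields [split $line ","] ;# split returns a Tcl list of the fields
    puts "Total match:$line"
    foreach name $fields {
        incr n
        puts stdout "\titem $n is: $name"
    }
    puts ""
}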

Code:

$ ./1.tcl < 1
Total match:Bruce Willis,Richard Gere
        item 1 is: Bruce Willis
        item 2 is: Richard Gere

Total match:Pink Floyd,Deep Purple,Uriah Heep,Ten Years After,Santana
        item 1 is: Pink Floyd
        item 2 is: Deep Purple
        item 3 is: Uriah Heep
        item 4 is: Ten Years After
        item 5 is: Santana

I like tcl

p.s. showing your age with that music

syg00 · 11-18-2009, 04:42 AM

Yeah - the kids of today ...

Wim Sturkenboom · 11-18-2009, 09:04 AM

Thanks for the replies. The regular expression implementations in Tcl and Java, and possibly in other languages, make it possible to parse the data into individual 'blobs'. So why not use them if it's possible? After all, I'm a lazy guy.

@bigearsbilly
We (or at least I) know you like Tcl and you're not the only one

Till now I have managed to write all my applications that needed a GUI in Tcl/Tk. I'm now unfortunately forced to look at Java.

bigearsbilly · 11-18-2009, 09:25 AM

that's progress (haha)

at least you have a job I guess.
I ain't had anything since february
:-(

Wim Sturkenboom · 11-18-2009, 09:55 AM

Sorry to hear about the job. I have one but can't be paid; living on my savings for the last 4 months.

ntubski · 11-18-2009, 04:38 PM

The Java API can't give a variable number of blobs from a single match; see "Groups and capturing" in the Pattern documentation.

Quote:

The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification then its previously-captured value, if any, will be retained if the second evaluation fails.
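To illustrate the quoted rule, here is a minimal Java sketch that applies the validation expression from the first post to the band list. The names `RepeatedGroupDemo` and `groupsOf` are made up for this example. Groups 2 and 3 sit inside the `*` repetition, so each of them only keeps what its last iteration matched, and the bands in between are lost.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // Hypothetical helper: returns the text of every group (index 0 is the
    // whole match), or null if the regex does not match the entire input.
    static String[] groupsOf(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] groups = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            groups[i] = m.group(i); // a repeated group holds only its last iteration
        }
        return groups;
    }

    public static void main(String[] args) {
        String regex = "^([A-Za-z ]+)([,]([A-Za-z ]+))*$";
        String bands = "Pink Floyd,Deep Purple,Uriah Heep,Ten Years After,Santana";
        String[] groups = groupsOf(regex, bands);
        for (int i = 0; i < groups.length; i++) {
            System.out.println("Group " + i + " : '" + groups[i] + "'");
        }
    }
}
```

The number of groups is fixed by the number of parentheses in the expression (three here), not by how often the repetition runs.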

Quote:

ghostdog is right, use split which splits into a tcl list.

Java has split also.
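A minimal sketch of the split approach in Java, assuming the fields never contain a comma themselves. `SplitDemo` and `fields` are hypothetical names. Note that `String.split` takes a regular expression as its delimiter, and the `-1` limit is a choice made here so that trailing empty fields are kept rather than silently dropped.

```java
import java.util.Arrays;
import java.util.List;

class SplitDemo {
    // Hypothetical helper: splits one line into its comma-separated fields.
    static List<String> fields(String line) {
        // A negative limit keeps trailing empty strings, so "a,b," gives 3 fields.
        return Arrays.asList(line.split(",", -1));
    }

    public static void main(String[] args) {
        String bands = "Pink Floyd,Deep Purple,Uriah Heep,Ten Years After,Santana";
        int n = 0;
        for (String name : fields(bands)) {
            n++;
            System.out.println("item " + n + " is: " + name);
        }
    }
}
```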

chrism01 · 11-18-2009, 07:54 PM

1. regex variations; it is indeed true that different langs/tools often have slightly different regex engines. The best book that explains regexes and differences is here http://regex.info//
2. interesting music

3. BB you have my sympathy, had 8 mths out during the GFC

Wim Sturkenboom · 11-19-2009, 12:30 AM

(Part of) the problem is my Java implementation: resultTextArea.setText replaces the content of the textarea on every pass through the while loop, which hides the data from previous iterations. Only the last match stayed visible, which would explain why Java seemed to return 'Santana'.

I finally figured that out when an 'incorrect' but valid regular expression made the program unresponsive (it ended up in an endless loop). I added a loop counter to break out of it, and the counter showed that I was only ever seeing the last result.

The revised code

Code:

        // clear textarea
        resultTextArea.setText(null);

        m = p.matcher(text);
        int start = 0;
        int loopcnt=1;
        while (m.find(start) == true)
        {
            resultTextArea.append("Loop : " + Integer.toString(loopcnt) + "\n");
            resultTextArea.append("Group cnt : " + Integer.toString(m.groupCount()) + "\n");
            for (int i=0; i<=m.groupCount(); i++) {
                if (m.group(i) != null) {
                    resultTextArea.append("Group " + Integer.toString(i) + " : '" + m.group(i) + "' (" + m.start(i) + "," + m.end(i) + ")\n");
                }
            }
            start = m.end();
            resultTextArea.append("----\n");
            loopcnt++;
            // safety stop: give up after 1000 iterations so a bad regexp cannot loop forever
            if (loopcnt>1000) {
                resultTextArea.append("Aborting ... ");
                break;
            }
        }
        resultTextArea.append("DONE\n");

Using ([A-Za-z ]+),* as the regular expression now gives the following result for the bands:

Code:

Loop : 1
Group cnt : 1
Group 0 : 'Pink Floyd,' (0,11)
Group 1 : 'Pink Floyd' (0,10)
----
Loop : 2
Group cnt : 1
Group 0 : 'Deep Purple,' (11,23)
Group 1 : 'Deep Purple' (11,22)
----
Loop : 3
Group cnt : 1
Group 0 : 'Uriah Heep,' (23,34)
Group 1 : 'Uriah Heep' (23,33)
----
Loop : 4
Group cnt : 1
Group 0 : 'Ten Years After,' (34,50)
Group 1 : 'Ten Years After' (34,49)
----
Loop : 5
Group cnt : 1
Group 0 : 'Santana' (50,57)
Group 1 : 'Santana' (50,57)
----
DONE

Knowing that group 0 is always the actual match and group 1 (and higher) are the groups, I think that this issue is solvable in Java for my purposes.
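The same loop can be sketched without the manual start index and the loop counter. This assumes nothing beyond java.util.regex; `FindLoopDemo` and `firstGroups` are names invented for the example. Calling `find()` with no argument continues after the previous match and steps past an empty match on its own, whereas `find(start)` with `start = m.end()` restarts at the same position whenever the match is empty, which is one way to end up in an endless loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindLoopDemo {
    // Hypothetical helper: collects group 1 of every successive match.
    static List<String> firstGroups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> result = new ArrayList<>();
        // find() without an argument resumes after the previous match and
        // moves on by itself when that match was empty.
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String bands = "Pink Floyd,Deep Purple,Uriah Heep,Ten Years After,Santana";
        for (String name : firstGroups("([A-Za-z ]+),*", bands)) {
            System.out.println("'" + name + "'");
        }
        // This variant can match the empty string, yet the loop still ends.
        System.out.println(firstGroups("([A-Za-z ]*),*", "a,b"));
    }
}
```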

I'd like to thank everybody for their replies.

A possibly useful link: Regular Expression Playground

And the lesson learned: what you see is not what you get.

Wim Sturkenboom · 11-19-2009, 01:21 AM

OOPS, spoke slightly too early. It works as a parser but no longer as a validator.
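One way to get both back, sketched here under the assumption that a valid line is one or more names made of letters and spaces separated by single commas (the rule the first expression in this thread encodes): validate the whole line with one pattern, then pull out the fields with a second one. `ValidateThenParse`, `WHOLE_LINE`, `FIELD` and `parse` are hypothetical names, and returning null for an invalid line is just a choice for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ValidateThenParse {
    // Validator: the complete line must be name(,name)* with no empty fields.
    private static final Pattern WHOLE_LINE =
            Pattern.compile("[A-Za-z ]+(,[A-Za-z ]+)*");
    // Parser: a single field, applied repeatedly with find().
    private static final Pattern FIELD = Pattern.compile("[A-Za-z ]+");

    // Returns the fields of a valid line, or null if the line is invalid.
    static List<String> parse(String line) {
        if (!WHOLE_LINE.matcher(line).matches()) {
            return null; // validation failed, so do not try to parse
        }
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            fields.add(m.group());
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parse("Pink Floyd,Deep Purple,Uriah Heep,Ten Years After,Santana"));
        System.out.println(parse("Pink Floyd,,Santana"));
    }
}
```

Once the line has passed validation, `line.split(",")` would do the parsing step just as well.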