Embedded regex matching in Perl

GATTACA · 01-16-2007, 04:48 PM

Hello.

I need some Perl regex help.

Let's say I have 6 strings as follows:
1. AAA
2. AAABBB
3. AAABBBCC
4. AAABBBCCDDD
5. AAABBBCCDDDE
6. AAABBBFFF

What I want to do is write a Perl script that, given a set of strings (like the ones above), selects the _longest_ string that still contains all of the preceding strings.

So in this example, the regex should return string #5 since it contains strings 1-4, and then string #6 as a standalone string (strings #1 and #2 were encapsulated by #5 and are not counted again even though they technically appear in #6).

I could do this iteratively with a while loop (if 2 contains 1 then keep 2; if 3 contains 2 then keep 3; etc.), but this is slow for lots of patterns.
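
For reference, the iterative idea can be sketched roughly as below. This is only a sketch: the name longest_in_each_chain is made up for illustration, and it assumes the strings arrive in the order shown above, so that each string either contains the one before it or starts a new chain.

```perl
use strict;
use warnings;

# Hypothetical helper (name is illustrative): walk the list once, in order,
# and keep the last string of each run in which every string contains the
# one before it.
sub longest_in_each_chain {
    my @strings = @_;
    my @kept;
    my $current;
    for my $s (@strings) {
        if (defined $current && index($s, $current) < 0) {
            # $s does not contain the current candidate, so that chain ends.
            push @kept, $current;
        }
        $current = $s;
    }
    push @kept, $current if defined $current;
    return @kept;
}

my @input = qw(AAA AAABBB AAABBBCC AAABBBCCDDD AAABBBCCDDDE AAABBBFFF);
print "$_\n" for longest_in_each_chain(@input);
```

On the six example strings this prints AAABBBCCDDDE and AAABBBFFF. It makes one comparison per string, so the cost grows with the number of strings rather than the number of pairs, but it only works if the input really is ordered this way.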

I think there might be a regular expression that will do the same thing much faster.
Perhaps there is a Perl module that does this.

Can anyone help?
If any of this is unclear please let me know.

Thanks.

matthewg42 · 01-16-2007, 08:46 PM

Not sure what you mean. Can you provide an example input which, when matched against the RE, should extract #5 in your list?

makyo · 01-16-2007, 10:25 PM

Hi.

Perl's index function comes to mind. I don't see that you need regular expressions, because all the strings you mentioned are constants and you appear to be looking for substrings in longer strings.

If I understand you, then processing the list above should result in strings 5 and 6 being "unique" in that they are not contained in other strings ... cheers, makyo
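
A minimal sketch of the index approach, assuming the goal is to keep every string that is not contained in any other string of the list. The name uncontained_strings is hypothetical, not something from a module.

```perl
use strict;
use warnings;

# Hypothetical helper (name is illustrative): return the strings that are
# not contained in any other string of the list.
sub uncontained_strings {
    my @strings = @_;
    # Longest first: a string can only be contained in one that is at
    # least as long, so only the strings kept so far need checking.
    my @by_length = sort { length($b) <=> length($a) } @strings;
    my @kept;
    for my $s (@by_length) {
        # index returns -1 when $s does not occur in the kept string.
        next if grep { index($_, $s) >= 0 } @kept;
        push @kept, $s;
    }
    return @kept;
}

my @input = qw(AAA AAABBB AAABBBCC AAABBBCCDDD AAABBBCCDDDE AAABBBFFF);
print "$_\n" for uncontained_strings(@input);
```

For the six strings above this prints AAABBBCCDDDE and AAABBBFFF. Unlike a single pass over neighbours, it does not depend on the input order. The same containment test could be written as a match against /\Q$s\E/, but for constant strings index does the job without any regex.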

( edit 1: clarify )

GATTACA · 01-17-2007, 07:50 AM

Thanks for the quick replies!

Quote:

Originally Posted by matthewg42

Not sure what you mean. Can you provide an example input which, when matched against the RE, should extract #5 in your list?

Actually the example I gave is what the input file looks like:

Code:

1. AAA
2. AAABBB
3. AAABBBCC
4. AAABBBCCDDD
5. AAABBBCCDDDE
6. AAABBBFFF

The lengths of the strings will vary from 10 to 150 alphabetic characters.
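
Reading a file in that format might look something like the sketch below. It assumes each line is an optional "N. " prefix followed by a purely alphabetic string. The name read_strings is made up, and an in-memory filehandle stands in for the real file.

```perl
use strict;
use warnings;

# Hypothetical helper (name is illustrative): pull the string out of each
# "N. STRING" line read from a filehandle.
sub read_strings {
    my ($fh) = @_;
    my @strings;
    while (my $line = <$fh>) {
        # Optional leading number and dot, then the alphabetic string.
        # Lines that do not fit this shape (e.g. blank lines) are skipped.
        next unless $line =~ /^\s*(?:\d+\.\s*)?([A-Za-z]+)\s*$/;
        push @strings, $1;
    }
    return @strings;
}

# In-memory filehandle standing in for the real input file (an assumption).
my $sample = "1. AAA\n2. AAABBB\n6. AAABBBFFF\n";
open my $fh, '<', \$sample or die "Cannot open sample: $!";
my @strings = read_strings($fh);
close $fh;
print "$_\n" for @strings;
```

The resulting list can then be handed to whatever containment check is used.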

makyo: You are correct, the script should return strings 5 and 6 as unique. I'll give the index function a try. Thanks.

matthewg42 · 01-17-2007, 08:27 AM

I must have a wire crossed - I don't understand your request at all.

GATTACA · 01-17-2007, 09:16 AM

You don't have any wires crossed. I'm just terrible at explaining myself.

Sorry about that.

I think the index() idea might work so I'm giving it a test right now.