[SOLVED] Gawk - regexp [A-Z] matches [a-z]. How is this possible?

b.lundblad@fabula.se · 08-31-2012, 04:47 AM

Hi.

I'm having a problem with a regexp in gawk that gives an unexpected result.

I'm using the following simple code:
--------------------------------
BEGIN {
    s = "version"
    r = "[A-Z]"

    if ( match(s, r) ) {
        printf "%s %s %s\n", s, r, substr(s, RSTART, RLENGTH)
    } else {
        printf "NO MATCH\n"
    }
}
--------------------------------
When I run this I get a match for the first letter "v" in "version"! How is this possible?

I'm running it under the following circumstances:
Operating system: Linux CentOS 5.2
Shell: GNU bash ver. 3.2.25
Env. setting: LANG=en_US.UTF-8
File setting: Code written in the KWrite editor; I tried both the "utf-8" and the "Central European cp 1250" encodings, with the same result

When I run the same code under Windows with gawk 3.1.5, I get the anticipated result of "NO MATCH".

I suspect it has to do with the encoding of the file, but I cannot figure out how. I read "man gawk" but could not find an answer.
I'm grateful for any lead on this "mystery".

I'm a newbie to this forum - hope I'm at the right place.
Thanks
/Bertil

grail · 08-31-2012, 05:36 AM

What version of gawk are you using in your example? A direct copy and paste of your code gave me the desired 'NO MATCH' result.

b.lundblad@fabula.se · 08-31-2012, 07:46 AM

Hi Grail!

Thanks for your time! My Linux version of gawk is the same as my Windows version, 3.1.5.

What do you get if you do the following?

set | grep "LANG"

Is it UTF-8?

/Bertil

firstfire · 08-31-2012, 08:05 AM

Hi.

Here is the relevant passage from "info gawk", in the section on character lists:

Quote:

2.4 Using Character Lists
=========================

Within a character list, a "range expression" consists of two
characters separated by a hyphen. It matches any single character that
sorts between the two characters, using the locale's collating sequence
and character set. For example, in the default C locale, `[a-dx-z]' is
equivalent to `[abcdxyz]'. Many locales sort characters in dictionary
order, and in these locales, `[a-dx-z]' is typically not equivalent to
`[abcdxyz]'; instead it might be equivalent to `[aBbCcDdxXyYz]', for
example. To obtain the traditional interpretation of bracket
expressions, you can use the C locale by setting the `LC_ALL'
environment variable to the value `C'.

Code:

$ echo $LANG
en_US.UTF-8
$ gawk -f test.awk
version [A-Z] v
$ LC_ALL=C gawk -f test.awk
NO MATCH
$ mawk -f test.awk
NO MATCH
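
If changing the locale is not an option, a possible alternative is the POSIX character class [[:upper:]], which selects uppercase letters by class rather than by a collating range. The following is only a minimal sketch: the helper name has_upper is made up for illustration, and it assumes an awk that supports POSIX character classes (gawk does; very old mawk releases may not). Substitute gawk for awk if you want to test gawk specifically.

```shell
# Hypothetical helper (not from the thread): report the first uppercase
# letter in a string, using the POSIX class [[:upper:]] instead of [A-Z].
has_upper() {
    awk -v s="$1" 'BEGIN {
        # [[:upper:]] is defined by character class, not by sort order,
        # so it does not rely on how the locale collates a..z and A..Z.
        if (match(s, /[[:upper:]]/))
            printf "%s %s\n", s, substr(s, RSTART, RLENGTH)
        else
            print "NO MATCH"
    }'
}

has_upper "version"
has_upper "Version"
```

The first call is the test string from the original post; the second contains a real uppercase letter, so it gives a quick check that the class still matches when it should.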

grail · 08-31-2012, 08:58 AM

Code:

$ set | grep "LANG"
LANG=en_US.UTF-8
$ gawk --version
GNU Awk 4.0.1

b.lundblad@fabula.se · 08-31-2012, 11:15 AM

Thanks to you both, grail and firstfire, for your interest!

I tried it, and it worked exactly as firstfire pointed out when using

LC_ALL=C gawk -f test.awk

I think I have now learned something about regexps and the info system, especially "info gawk"! My only excuse for this ignorance is that things have worked so smoothly until now that I never really had a reason to look (this is almost true :-) ).

Also, it might be a good idea to upgrade my gawk.

Thanks again guys for helping me out!
/Bertil