[SOLVED] Gawk - regexp [A-Z] matches [a-z]. How is this possible?
Hi.
I'm having a problem with a regexp in gawk giving an unexpected result.
I'm using the following simple code:
--------------------------------
BEGIN {
    s = "version"
    r = "[A-Z]"
    if ( match(s, r) ) {
        printf "%s %s %s\n", s, r, substr(s, RSTART, RLENGTH)
    } else {
        printf "NO MATCH\n"
    }
}
--------------------------------
When I run this I get a match for the first letter "v" in "version"!!!! How is this possible???
I'm running it under the following circumstances:
Operating system: Linux, CentOS 5.2
Shell: GNU bash ver. 3.2.25
Env. setting: LANG=en_US.UTF-8
File setting: code written in the KWrite editor, saved with both "utf-8" and "Central European cp 1250" encodings, with the same result
When I run the same code under Windows with gawk 3.1.5, I get the anticipated result of "NO MATCH".
I suspect it has to do with encoding of the file but I cannot figure out how. I did "man gawk" but failed to locate an answer.
I'm grateful for any lead on this "mystery".
I'm a newbie to this forum - hope I'm at the right place.
Thanks
/Bertil
2.4 Using Character Lists
=========================
Within a character list, a "range expression" consists of two
characters separated by a hyphen. It matches any single character that
sorts between the two characters, using the locale's collating sequence
and character set. For example, in the default C locale, `[a-dx-z]' is
equivalent to `[abcdxyz]'. Many locales sort characters in dictionary
order, and in these locales, `[a-dx-z]' is typically not equivalent to
`[abcdxyz]'; instead it might be equivalent to `[aBbCcDdxXyYz]', for
example. To obtain the traditional interpretation of bracket
expressions, you can use the C locale by setting the `LC_ALL'
environment variable to the value `C'.
Code:
$ echo $LANG
en_US.UTF-8
$ gawk -f test.awk
version [A-Z] v
$ LC_ALL=C gawk -f test.awk
NO MATCH
$ mawk -f test.awk
NO MATCH
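If changing the locale for the whole run isn't convenient, another option is to avoid the range expression altogether and use the POSIX character class `[[:upper:]]`, which is defined by character type rather than by collation order. The following is only a minimal sketch: the `check` function name and the sample strings are made up for illustration, and it assumes your awk supports POSIX character classes (gawk does; some older mawk builds reportedly do not).

```shell
#!/bin/sh
# check STRING: print the first uppercase letter in STRING, or NO MATCH.
# "check" is a hypothetical helper name, not part of the original test.awk.
check() {
    awk -v s="$1" 'BEGIN {
        # [[:upper:]] matches by character class, not by collating range,
        # so it does not pick up lowercase letters in en_US.UTF-8.
        if (match(s, /[[:upper:]]/))
            print substr(s, RSTART, RLENGTH)
        else
            print "NO MATCH"
    }'
}

check version    # prints NO MATCH
check Version    # prints V
```

Note that in a non-C locale `[[:upper:]]` may also match uppercase letters outside A-Z. If you need strictly the 26 ASCII letters regardless of locale, spelling them out as `[ABCDEFGHIJKLMNOPQRSTUVWXYZ]` or running with `LC_ALL=C` as shown above is the safer route.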
Thanks to you both Grail and Firstfire for your interest!
I tried it and it worked exactly as firstfire pointed out by using
LC_ALL=C gawk -f test.awk
I think I have now learned something about regexps and the info system, especially "info gawk"! My only excuse for this ignorance is that things have worked so smoothly until now that I never really had a reason (this is almost true :-) ...