I have a UTF-8 file that I wish to process as ISO-8859-1.
Leaving aside for a moment the fact that 'characters' may be single-byte or multi-byte, each record in this sample file has a fixed, pre-determined format, with fields defined by character offsets.
e.g.
Field 1 Alphanumeric position 1, length 30
Field 2 Alphanumeric position 31, length 20
etc
etc
So when reading in the data, the first 30 characters relate to Field 1, and Field 2 starts at the 31st character.
Assume for the sake of argument that all are single byte in this example.
When I use iconv to convert this UTF-8 file to ISO-8859-1 it will drop whatever it is unable to convert. The net result of this is that Field 1 may now contain 29 characters and Field 2 may start at the 30th character not the 31st.
e.g. EXAMPLE may become EXMPLE (if, for argument's sake, the 'A' were a character iconv was unable to convert)
Can I switch this behaviour off so that EXAMPLE becomes EX MPLE, with invalid or non-convertible characters set to a placeholder character, say a space? Or is there a combination of commands that achieves a similar effect, preserving the file's fixed position-based structure while still converting its character encoding?
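The offset shift described above can be reproduced outside iconv. The following is only a minimal sketch in Python (not iconv itself), using the built-in "ignore" error handler to mimic a converter that silently drops characters. The sample record, the field contents and the use of U+0100 as the unconvertible character are all assumptions made up for illustration.

```python
# Hypothetical 50-character record: Field 1 is 30 wide, Field 2 is 20 wide.
# U+0100 (A with macron) has no ISO-8859-1 equivalent, so it stands in for
# the "unconvertible" character from the EXAMPLE illustration.
record = "EX\u0100MPLE".ljust(30) + "FIELD2".ljust(20)

# errors="ignore" silently drops anything that cannot be encoded,
# which is the behaviour the question wants to avoid.
dropped = record.encode("iso-8859-1", errors="ignore")

print(len(record))     # characters in the original record
print(len(dropped))    # bytes after conversion: one short
print(dropped[30:50])  # Field 2 read at its defined offset is now misaligned
```

Because one character vanished, every field after it starts one position early, so reading Field 2 from position 31 loses its first character.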
ENCODINGS
The values permitted for --from-code and --to-code can be listed
by the iconv --list command, and all combinations of the listed
values are supported. Furthermore the following two suffixes are
supported:
//TRANSLIT
When the string "//TRANSLIT" is appended to --to-code,
transliteration is activated. This means that when a
character cannot be represented in the target character
set, it can be approximated through one or several
similarly looking characters.
//IGNORE
When the string "//IGNORE" is appended to --to-code,
characters that cannot be represented in the target
character set will be silently discarded.
It seems to me that if you use the //TRANSLIT suffix, your missing characters will be substituted with something similar, instead of dropped.
Can I switch this behaviour off so that EXAMPLE becomes EX MPLE? And invalid or non-convertible characters are set to a control char, say 'space'? Or maybe there is a combination of commands that can be used to similar effect and can preserve a file's fixed position-based structure while still converting its character encoding.
I'm afraid it doesn't. I'd recommend looking into Perl to resolve
that issue: there's a "piconv" command (iconv reinvented in Perl)
that you may be able to modify to do your bidding...
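As an alternative to modifying piconv, the same idea can be sketched in Python rather than Perl. This is not an iconv option and not something proposed in the thread, only an illustration under stated assumptions: the handler name "space", the helper convert_record and the sample record are hypothetical. It relies on codecs.register_error, which lets an encoder call a custom function for each run of unencodable characters.

```python
import codecs


def space_replace(err):
    """Substitute one space for every character that cannot be encoded."""
    # Only encoding failures are handled here; anything else is re-raised.
    if not isinstance(err, UnicodeEncodeError):
        raise err
    # err.start..err.end covers the run of unencodable characters.
    # Returning one space per character keeps all later offsets unchanged.
    return (" " * (err.end - err.start), err.end)


# "space" is an arbitrary, hypothetical handler name chosen for this sketch.
codecs.register_error("space", space_replace)


def convert_record(line: str) -> bytes:
    """Encode one already-decoded record as ISO-8859-1, padding gaps with spaces."""
    return line.encode("iso-8859-1", errors="space")


# Hypothetical record: Field 1 is 30 wide, Field 2 is 20 wide.
record = "EX\u0100MPLE".ljust(30) + "FIELD2".ljust(20)
out = convert_record(record)

print(len(out))   # same length as the input record
print(out[:7])    # the unconvertible character became a space
print(out[30:36]) # Field 2 still starts at position 31
```

To process a whole file, the input would be opened with encoding="utf-8" and each line passed through convert_record before being written in binary mode. Note that this preserves character offsets only; if the UTF-8 input contains multi-byte characters that do convert, the byte length still changes, but the character positions do not.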
$ iconv --version
iconv (GNU libc) 2.7
Copyright (C) 2007 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Ulrich Drepper.
RH:
Code:
~> iconv --version
iconv (GNU libc) 2.5
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Ulrich Drepper.
Heh ... never mind, if Debian's man page describes the proper behaviour,
it's all for the better ;}
Thanks for your guidance. I found that on Solaris, iconv's default behaviour (i.e. no switches) is to replace each non-convertible character with a ? (0x3F). This preserves the layout from a fixed file definition perspective, and is therefore manageable for what I need to do. I can cope with a question mark. In this particular case that is much better than having the character silently dropped from the output.
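For anyone without access to Solaris, a comparable one-for-one "?" substitution can be sketched in Python using the built-in "replace" error handler. This is an illustration only, not a description of how Solaris iconv is implemented; the function name and the sample record are assumptions.

```python
def to_latin1_qmark(text: str) -> bytes:
    """Encode to ISO-8859-1, writing "?" (0x3F) for each unencodable character."""
    return text.encode("iso-8859-1", errors="replace")


# Hypothetical record: Field 1 is 30 wide, Field 2 is 20 wide.
record = "EX\u0100MPLE".ljust(30) + "FIELD2".ljust(20)
converted = to_latin1_qmark(record)

# Fields can still be sliced at their defined positions.
field1 = converted[0:30]
field2 = converted[30:50]

print(field1.rstrip())  # b'EX?MPLE'
print(field2.rstrip())  # b'FIELD2'
```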