iconv usage query

davelms · 02-25-2009, 12:06 PM

I have a UTF-8 file that I wish to process as ISO-8859-1.

Leaving aside for a moment the fact that 'characters' may be single- or multi-byte, each record in this sample file has a fixed, predetermined format, with fields defined by character offsets.

e.g.

Field 1 Alphanumeric position 1, length 30
Field 2 Alphanumeric position 31, length 20
etc
etc

So when reading in the data, the first 30 characters relate to Field 1, and Field 2 starts at the 31st character.

Assume for the sake of argument that all are single byte in this example.

When I use iconv to convert this UTF-8 file to ISO-8859-1, it drops whatever it is unable to convert. The net result is that Field 1 may now contain 29 characters, and Field 2 may start at the 30th character instead of the 31st.

e.g. EXAMPLE may become EXMPLE (if, for argument's sake, the 'A' were a character iconv was unable to convert)

Can I switch this behaviour off so that EXAMPLE becomes EX MPLE, with invalid or non-convertible characters replaced by a filler character, say a space? Or maybe there is a combination of commands that achieves a similar effect and preserves a file's fixed position-based structure while still converting its character encoding?

Thanks.

David the H. · 02-25-2009, 01:13 PM

"man iconv" has a section that says this:

Quote:

ENCODINGS
The values permitted for --from-code and --to-code can be listed
by the iconv --list command, and all combinations of the listed
values are supported. Furthermore the following two suffixes are
supported:

//TRANSLIT
When the string "//TRANSLIT" is appended to --to-code,
transliteration is activated. This means that when a
character cannot be represented in the target character
set, it can be approximated through one or several
similarly looking characters.

//IGNORE
When the string "//IGNORE" is appended to --to-code,
characters that cannot be represented in the target
character set will be silently discarded.

It seems to me that if you use the //TRANSLIT suffix, your missing characters will be substituted with something similar, instead of dropped.
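As a rough illustration of why discarding matters for fixed-width data, here is a minimal sketch in Python rather than iconv itself. Python's "ignore" error handler discards unencodable characters, much as the man page describes for //IGNORE, while "replace" substitutes a `?`. The sample string is hypothetical; Ł (U+0141) is used because it has no ISO-8859-1 equivalent.

```python
# Hypothetical sample: U+0141 (L with stroke) cannot be encoded in ISO-8859-1.
text = "EX\u0141MPLE"

# "ignore" discards the character, so the record gets one position shorter.
dropped = text.encode("iso-8859-1", errors="ignore")

# "replace" substitutes "?", so every later character keeps its offset.
replaced = text.encode("iso-8859-1", errors="replace")

print(dropped)   # b'EXMPLE'
print(replaced)  # b'EX?MPLE'
```

This only shows the length effect; it says nothing about what a given iconv build does with //TRANSLIT, which may substitute one or several characters.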

Tinkster · 02-25-2009, 01:19 PM

Hi, welcome to LQ!

Quote:

Can I switch this behaviour off so that EXAMPLE becomes EX MPLE, with invalid or non-convertible characters replaced by a filler character, say a space? Or maybe there is a combination of commands that achieves a similar effect and preserves a file's fixed position-based structure while still converting its character encoding?

I'm afraid you can't. I'd recommend looking into perl to resolve
that issue; there's a "piconv" command (iconv reinvented in perl)
that you may be able to modify to do your bidding...

Cheers,
Tink
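For anyone who would rather script the space substitution directly, the same idea can be sketched in Python instead of perl. This is only a sketch under assumptions: the handler name `space_pad` and the helper `to_latin1_fixed` are made up for illustration, and the input is assumed to be already decoded text.

```python
import codecs


def space_pad(exc):
    """Replace each unencodable character with one space (hypothetical handler)."""
    if isinstance(exc, UnicodeEncodeError):
        # One space per offending character keeps every later offset unchanged.
        return (" " * (exc.end - exc.start), exc.end)
    raise exc


# Register the handler under a made-up name so str.encode() can refer to it.
codecs.register_error("space_pad", space_pad)


def to_latin1_fixed(line: str) -> bytes:
    """Encode to ISO-8859-1, padding unconvertible characters with spaces."""
    return line.encode("iso-8859-1", errors="space_pad")


print(to_latin1_fixed("EX\u0141MPLE"))  # b'EX MPLE'
```

Because each unconvertible character becomes exactly one byte, the output has the same number of positions as the input had characters.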

Tinkster · 02-25-2009, 01:24 PM

Quote:

Originally Posted by David the H.

"man iconv" has a section that says this:

It seems to me that if you use the //TRANSLIT suffix, your missing characters will be substituted with something similar, instead of dropped.

Interesting :)

'man iconv' in Slackware 12.1 and RH AS5 doesn't have those
sections ... what distro are you looking it up in?

Cheers,
Tink

David the H. · 02-25-2009, 01:35 PM

Quote:

Originally Posted by Tinkster

Interesting

'man iconv' in slackware 12.1 and RH as5 doesn't have those
sections ... what distro are you looking it up in?

Cheers,
Tink

This is the straight stock GNU iconv provided by Debian.

Code:

~/$ iconv --version

iconv (GNU libc) 2.7
Copyright (C) 2007 Free Software Foundation, Inc.

But the man page says this:

Quote:

AUTHOR
iconv was written by Ulrich Drepper as part of the GNU C Library.

This man page was written by Joel Klecker <espy@debian.org>, for the Debian GNU/Linux system.

3rd Berkeley Distribution lenny ICONV(1)

Debian does sometimes provide man pages when the programs don't provide satisfactory ones of their own. Not sure what's going on here, though.

Tinkster · 02-25-2009, 01:45 PM

Slack:

Code:

$ iconv --version
iconv (GNU libc) 2.7
Copyright (C) 2007 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Ulrich Drepper.

RH:

Code:

~> iconv --version
iconv (GNU libc) 2.5
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Ulrich Drepper.

Heh ... never mind, if Debian's man page describes the proper behaviour
it's all for the better ;}

Cheers,
Tink

davelms · 02-25-2009, 03:40 PM

Thanks for the tips. I'll give each a go, see what I come up with, and report back.

Thanks again.

davelms · 02-26-2009, 01:37 PM

Thanks for your guidance. I found that on Solaris, iconv's default behaviour (i.e. no switches) is to replace non-convertible characters with a ? (0x3F). This preserves the layout of the fixed-position file definition, so it is manageable for what I need to do. I can cope with a question mark; in this particular case it is much better than silently dropping the character from the output.
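To show why a one-for-one `?` substitution keeps the layout usable, here is a minimal Python sketch (not the Solaris iconv itself). It uses the two field definitions from the first post; the names `FIELDS`, `convert_record` and `split_fields`, and the sample record, are hypothetical.

```python
# (name, 1-based start position, length), taken from the example layout.
FIELDS = [("field1", 1, 30), ("field2", 31, 20)]


def convert_record(utf8_record: bytes) -> bytes:
    """Decode UTF-8 and re-encode as ISO-8859-1, using '?' for anything unconvertible."""
    return utf8_record.decode("utf-8").encode("iso-8859-1", errors="replace")


def split_fields(record: bytes) -> dict:
    """Slice a converted record by position; one byte is one character here."""
    return {name: record[pos - 1:pos - 1 + length] for name, pos, length in FIELDS}


# Hypothetical 50-character record containing one unconvertible character.
sample = ("EX\u0141MPLE".ljust(30) + "SECOND".ljust(20)).encode("utf-8")

converted = convert_record(sample)
fields = split_fields(converted)

print(len(sample), len(converted))  # 51 50
print(fields["field2"])
```

The UTF-8 input is 51 bytes because the unconvertible character takes two bytes, but the converted record is exactly 50 bytes, so Field 2 still starts at position 31.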