LinuxQuestions.org
02-25-2009, 12:06 PM   #1
davelms
LQ Newbie
 
Registered: Feb 2009
Posts: 3

Rep: Reputation: 0
iconv usage query


I have a UTF-8 file that I wish to process as ISO-8859-1.

Setting aside for a moment the fact that 'characters' may be single- or multi-byte, each record in this sample file has a fixed, predetermined format, with fields defined by character offsets.

e.g.

Field 1 Alphanumeric position 1, length 30
Field 2 Alphanumeric position 31, length 20
etc
etc

So when reading in the data, the first 30 characters relate to Field 1, and Field 2 starts at the 31st character.

Assume for the sake of argument that all are single byte in this example.
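
To make the layout concrete, here is a minimal Python sketch of reading one such record by character position. The `LAYOUT` table and the `split_record` helper are names made up for this illustration; only the two field definitions come from the post above.

```python
# Hypothetical layout table built from the example above:
# (field name, 1-based start position, length in characters).
LAYOUT = [("field1", 1, 30), ("field2", 31, 20)]


def split_record(record: str) -> dict:
    """Slice one decoded record into fields by character position."""
    return {
        name: record[start - 1 : start - 1 + length]
        for name, start, length in LAYOUT
    }


# Build a 50-character sample record: Field 1 padded to 30, Field 2 to 20.
record = "EXAMPLE".ljust(30) + "SECOND".ljust(20)
fields = split_record(record)

print(repr(fields["field1"].rstrip()))  # 'EXAMPLE'
print(repr(fields["field2"].rstrip()))  # 'SECOND'
```

Slicing works on characters here because the record is already a decoded string; on raw UTF-8 bytes the offsets would only line up if every character were single-byte.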

When I use iconv to convert this UTF-8 file to ISO-8859-1 it will drop whatever it is unable to convert. The net result of this is that Field 1 may now contain 29 characters and Field 2 may start at the 30th character not the 31st.

e.g. EXAMPLE may become EXMPLE (if, for argument's sake, the 'A' were a character iconv was unable to convert)
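
The shift can be reproduced with a small Python sketch. This uses Python's own `errors="ignore"` encoding mode as a stand-in for a converter that drops characters; it is not iconv itself, and the euro sign (U+20AC) is just an assumed example of a character with no ISO-8859-1 equivalent.

```python
# A 50-character record whose first field contains U+20AC (euro sign),
# which ISO-8859-1 cannot represent.
record = "EX\u20acMPLE".ljust(30) + "SECOND".ljust(20)

# errors="ignore" silently drops anything that cannot be encoded.
dropped = record.encode("iso-8859-1", errors="ignore")

print(len(record), len(dropped))  # 50 49
print(dropped[:6])                # b'EXMPLE'
print(dropped[30:36])             # b'ECOND ' (Field 2 now starts one byte early)
```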

Can I switch this behaviour off so that EXAMPLE becomes EX MPLE, with invalid or non-convertible characters set to a placeholder character, say a space? Or is there a combination of commands that achieves a similar effect, preserving the file's fixed position-based structure while still converting its character encoding?

Thanks.
 
02-25-2009, 01:13 PM   #2
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
"man iconv" has a section that says this:

Quote:
ENCODINGS
The values permitted for --from-code and --to-code can be listed
by the iconv --list command, and all combinations of the listed
values are supported. Furthermore the following two suffixes are
supported:

//TRANSLIT
When the string "//TRANSLIT" is appended to --to-code,
transliteration is activated. This means that when a
character cannot be represented in the target character
set, it can be approximated through one or several
similarly looking characters.

//IGNORE
When the string "//IGNORE" is appended to --to-code,
characters that cannot be represented in the target
character set will be silently discarded.
It seems to me that if you use the //TRANSLIT suffix, your missing characters will be substituted with something similar, instead of dropped.
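
One caveat for fixed-width data: the man page says a character may be approximated by "one or several" characters, so a transliterated field can grow. The Python sketch below only illustrates that effect with a tiny made-up table; `TRANSLIT` and `transliterate` are hypothetical names, and the mappings are assumptions, not iconv's real transliteration rules.

```python
# Hypothetical transliteration table for illustration only.
# This is NOT the table iconv uses.
TRANSLIT = {"\u20ac": "EUR", "\u201c": '"', "\u2026": "..."}


def transliterate(text: str) -> str:
    """Replace each character found in the table with its approximation."""
    return "".join(TRANSLIT.get(ch, ch) for ch in text)


field = "EX\u20acMPLE"
approx = transliterate(field)

print(approx)                   # EXEURMPLE
print(len(field), len(approx))  # 7 9
```

If a one-character replacement becomes three, everything after it in the record moves two positions to the right, so it is worth checking record lengths after any transliterating conversion.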
 
02-25-2009, 01:19 PM   #3
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
Hi, welcome to LQ!
Quote:
Can I switch this behaviour off so that EXAMPLE becomes EX MPLE, with invalid or non-convertible characters set to a placeholder character, say a space? Or is there a combination of commands that achieves a similar effect, preserving the file's fixed position-based structure while still converting its character encoding?
I'm afraid iconv doesn't offer that. I'd recommend looking into Perl to
resolve that issue: there's a "piconv" command (iconv reinvented
in Perl) that you may be able to modify to do your bidding...
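
As a rough idea of what such a modified converter would do, here is a minimal sketch in Python rather than Perl (a swapped-in language, not piconv). It registers a custom encoding error handler that emits one space per unencodable character. The handler name `space_pad` and the function `pad_with_space` are made up for this example.

```python
import codecs


def pad_with_space(exc):
    """Error handler: replace each unencodable character with one space."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # exc.start/exc.end delimit the run of characters that failed to encode.
    return (" " * (exc.end - exc.start), exc.end)


# "space_pad" is a hypothetical handler name chosen for this sketch.
codecs.register_error("space_pad", pad_with_space)

record = "EX\u20acMPLE".ljust(30) + "SECOND".ljust(20)
converted = record.encode("iso-8859-1", errors="space_pad")

print(converted[:7])      # b'EX MPLE'
print(len(converted))     # 50
print(converted[30:36])   # b'SECOND'
```

The same handler name can be passed as the `errors` argument of `open()` when writing the converted file, so the record length in characters is preserved one-for-one in the output bytes.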



Cheers,
Tink

Last edited by Tinkster; 02-25-2009 at 01:24 PM.
 
02-25-2009, 01:24 PM   #4
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
Quote:
Originally Posted by David the H.
"man iconv" has a section that says this:



It seems to me that if you use the //TRANSLIT suffix, your missing characters will be substituted with something similar, instead of dropped.
Interesting :)

'man iconv' in slackware 12.1 and RH as5 doesn't have those
sections ... what distro are you looking it up in?


Cheers,
Tink
 
02-25-2009, 01:35 PM   #5
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
Quote:
Originally Posted by Tinkster
Interesting

'man iconv' in slackware 12.1 and RH as5 doesn't have those
sections ... what distro are you looking it up in?


Cheers,
Tink
This is the straight stock GNU iconv provided by Debian.

Code:
~/$ iconv --version

iconv (GNU libc) 2.7
Copyright (C) 2007 Free Software Foundation, Inc.
But the man page says this:

Quote:

AUTHOR
iconv was written by Ulrich Drepper as part of the GNU C Library.

This man page was written by Joel Klecker <espy@debian.org>, for the Debian GNU/Linux system.

3rd Berkeley Distribution lenny ICONV(1)
Debian does sometimes provide man pages when the programs don't provide satisfactory ones of their own. Not sure what's going on here, though.
 
02-25-2009, 01:45 PM   #6
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
Slack:
Code:
$ iconv --version                                            
iconv (GNU libc) 2.7
Copyright (C) 2007 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Ulrich Drepper.
RH:
Code:
~> iconv --version
iconv (GNU libc) 2.5
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Ulrich Drepper.

Heh ... never mind, if Debian's man page describes the proper behaviour,
it's all for the better ;}



Cheers,
Tink
 
02-25-2009, 03:40 PM   #7
davelms
LQ Newbie
 
Registered: Feb 2009
Posts: 3

Original Poster
Rep: Reputation: 0
Thanks for the tips. I'll give each a go, see what I come up with, and report back.

Thanks again.

Last edited by davelms; 02-25-2009 at 03:51 PM.
 
02-26-2009, 01:37 PM   #8
davelms
LQ Newbie
 
Registered: Feb 2009
Posts: 3

Original Poster
Rep: Reputation: 0
Thanks for your guidance. I found that on Solaris, with its default behaviour (i.e. no switches), iconv replaces non-convertible characters with a '?' (0x3F). This preserves the fixed-position layout of the file and is therefore manageable for what I need to do. I can cope with a question mark; in this particular case it is much better than silently dropping the character from the output.
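
For anyone without a Solaris box to hand, the same one-for-one '?' substitution can be sketched in Python with the built-in `errors="replace"` mode. This illustrates the effect described above; it is not the Solaris iconv, and the euro sign is again just an assumed example of a non-convertible character.

```python
# A 50-character record with one character ISO-8859-1 cannot represent.
record = "EX\u20acMPLE".ljust(30) + "SECOND".ljust(20)

# errors="replace" substitutes '?' for each unencodable character.
converted = record.encode("iso-8859-1", errors="replace")

print(converted[:7])       # b'EX?MPLE'
print(hex(converted[2]))   # 0x3f
print(len(converted))      # 50
print(converted[30:36])    # b'SECOND' (Field 2 still starts at position 31)
```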
 