LinuxQuestions.org
Old 05-19-2008, 03:26 PM   #1
babag
Member
 
Registered: Aug 2003
Posts: 419

Rep: Reputation: 31
c++ - parsing windows textfile: how to strip extra characters?


i've run into a problem reading a windows-generated textfile
onto my linux (mandriva 2007.1) system using c++. it took me
a long time and lots of help from the good folks here, but
i've finally figured out that the issue i've run into seems
to be one of extra, hidden characters in the original text file.

i started out by processing one variable read from the textfile
and had a lot of problems. i finally got around them by using
substr to parse only the first three characters of the line
read into my variable. that made things work.

my thinking is, however, that the likelihood is that every line
in the text file probably has this same issue. that would argue
in favor of addressing the issue, not at the individual variable
level, but at the file level. in other words, when the text file
is first parsed into my script. either that or by somehow
processing the textfile before it is read.

so, there i have two ideas to pursue: preprocessing the text
file, or processing it as it is read.

i'm using vector to read the text file. how would i strip extra
characters at that stage?

alternately, how would i strip the extra characters before the
text file comes into the script?

the program is below.

thanks,
BabaG
Code:
#include <fstream>
#include <iostream>
#include <iomanip>
#include <string>
#include <vector>
#include <assert.h>

using namespace std;

int main()
{
   int count = 0;

   ifstream infile("file_to_be_parsed.txt");

   if (!infile)
   {
      cerr << "Could not open file." << endl;

      return 1;
   }

   vector<string> ScriptVariables;
   string line;

   while (getline(infile, line))
   {
      ScriptVariables.push_back(line);
   }

   infile.close();

// lots of variables assigned from text file
// this is the one that's been a problem in another thread

   string capformat = ScriptVariables[8]; 

// perform operations

   int cr2W = 4368; 
   int cr2H = 2912; 

   int nefW = 3872; 
   int nefH = 2592; 

   double CtrX = 0; 
   double CtrY = 0; 

   string capformatTrimmed = capformat.substr(0,3);

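   if (capformatTrimmed == "cr2")
      {
      // assign to the outer CtrX/CtrY; redeclaring them with "double" here
      // would create new locals and leave the outer values at 0
      CtrX = cr2W/2.0; 
      CtrY = cr2H/2.0;
      } 
   else if (capformatTrimmed == "nef")
      {
      CtrX = nefW/2.0; 
      CtrY = nefH/2.0;
      } 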
   else
      {
         cout << "something is wrong with cr2/nef line." << endl;
      }

   cout << CtrX << endl; 
   cout << CtrY << endl;

   return 0;
}

Last edited by babag; 05-19-2008 at 03:29 PM.
 
Old 05-19-2008, 05:40 PM   #2
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,786

Rep: Reputation: 2083
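Your file probably has DOS line endings. dos2unix <filename> should fix it. If that's not installed, sed -i 's/\r$//' <filename> should work too (with GNU sed; sed sees each line without its trailing newline, so the pattern has to match the '\r' at the end of the line).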

From inside your program, you can remove '\r' characters from the string, using standard C++ string functions:
Code:
while (getline(infile, line))
   {
      string::size_type cr_idx = line.find('\r', 0);
      if (cr_idx != string::npos) {
         ScriptVariables.push_back(line.substr(0, cr_idx));
      } else {
         ScriptVariables.push_back(line);
      }
   }
 
Old 05-19-2008, 06:17 PM   #3
babag
Member
 
Registered: Aug 2003
Posts: 419

Original Poster
Rep: Reputation: 31
great! thanks, man. will try as soon as i get back in front of the box
that has this stuff on it.

this program is for processing a bunch of files which have been moved
over from a windows box to a linux box. in that move i'll be also moving
the ScriptVariables.txt file. should be simple enough to run dos2unix
as a part of the bash script that moves all the files.

thanks again,
BabaG
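
If dos2unix turned out not to be available on the box doing the move, the same preprocessing could be sketched in C++ along these lines. This is only an illustration under assumptions: crlf_to_lf and convert_file are hypothetical names, and the whole file is read into memory, which is fine for a small variables file but not for huge ones.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: return a copy of `text` with every "\r\n" pair
// replaced by "\n". A lone '\r' not followed by '\n' is left alone.
inline std::string crlf_to_lf(const std::string& text)
{
    std::string out;
    out.reserve(text.size());
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        if (text[i] == '\r' && i + 1 < text.size() && text[i + 1] == '\n') {
            continue;   // skip the CR, the LF is copied on the next pass
        }
        out += text[i];
    }
    return out;
}

// Hypothetical helper: convert the file `src` and write the result to `dst`.
// Returns false if either file cannot be opened or the write fails.
inline bool convert_file(const char* src, const char* dst)
{
    std::ifstream in(src, std::ios::binary);
    if (!in) return false;

    std::ostringstream buf;
    buf << in.rdbuf();              // slurp the whole file

    std::ofstream out(dst, std::ios::binary);
    if (!out) return false;

    out << crlf_to_lf(buf.str());
    out.flush();
    return out.good();
}
```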
 
Old 05-19-2008, 06:21 PM   #4
daniel.santos
LQ Newbie
 
Registered: Jun 2006
Location: Dallas, TX, USA
Distribution: Gentoo
Posts: 9

Rep: Reputation: 0
CRLF shouldn't be the problem

since you are creating your ifstream object without "ifstream::binary", it should open the file in text mode, which will automatically translate CRLF sequences into the native format, which on Linux would be CR. you could certainly test this theory with a bit of debug output: try printing the value of each character and comparing.
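
One way to do that kind of debug output, as a minimal sketch: render every character of a line as a hex value so that invisible characters become visible. hex_dump is a hypothetical name introduced for this example.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical debugging helper: render each byte of `s` as two hex digits
// separated by spaces, so invisible characters such as '\r' (0d) show up.
inline std::string hex_dump(const std::string& s)
{
    std::ostringstream out;
    out << std::hex << std::setfill('0');
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        if (i != 0) out << ' ';
        // go through unsigned char so bytes above 0x7f do not print as negative
        out << std::setw(2)
            << static_cast<int>(static_cast<unsigned char>(s[i]));
    }
    return out.str();
}
```

Something like cerr << hex_dump(ScriptVariables[8]) << endl; would then show whether a 0d byte follows the three expected characters.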
 
Old 05-19-2008, 07:57 PM   #5
osor
HCL Maintainer
 
Registered: Jan 2006
Distribution: (H)LFS, Gentoo
Posts: 2,450

Rep: Reputation: 78
Quote:
Originally Posted by daniel.santos View Post
since you are creating your ifstream object without "ifstream::binary", it should open the file in text mode, which will automatically translate CRLF sequences into the native format, which on Linux would be CR. you could certainly test this theory with a bit of debug output: try printing the value of each character and comparing.
Close, but no cigar. Since the file is opened in text mode, ‘\n’ is automatically translated to and from the line-terminating format native to the machine running the code. On Linux this is just LF, whereas on Windows it is CRLF. So on Linux, a getline (which reads until a ‘\n’ is encountered) will only consume the LF; if the file happens to have a CR before it, the CR stays in the string as its last character. On Windows, on the other hand, a getline will match CRLF as ‘\n’ and the expected behavior will occur.
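
A small sketch of the Linux behavior, using an in-memory stream instead of a file. String streams perform no line-ending translation, which matches what an ifstream does on Linux. first_line is a hypothetical name used only for this illustration.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: return the first line of `text`,
// read with getline the same way the original program reads its file.
inline std::string first_line(const std::string& text)
{
    std::istringstream in(text);
    std::string line;
    std::getline(in, line);   // stops at '\n' and discards it, nothing else
    return line;
}
```

For the input "cr2\r\nnef\r\n" this returns the four-character string "cr2\r", which is why a comparison against "cr2" fails until the trailing CR is removed.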
 
  

