LinuxQuestions.org
Old 06-19-2013, 10:13 AM   #1
devnull10
Member
 
Registered: Jan 2010
Location: Lancashire
Distribution: Slackware Stable
Posts: 572

Rep: Reputation: 120
Parsing a .csv file


Hi,
I need to parse a .csv file, that is, load each value into a data structure (say, an array), in either C or C++ (or optionally Java, though I'm trying to stay away from that).
Fields may contain commas, but any field that does will be wrapped in double quotes.
For example:

Code:
This,is,an,example of some,data
Some,lines,might contain,"commas, however",they will be quoted
A field may possible be,quoted,even,without,commas present
this,is,a,simple,record
If a field,contains,"a ""quote",then it ,is doubled up
Any suggestions or tips on this? Ideally I want to stay away from utilities like awk because the target app needs to run on Windows machines and ideally not have such dependencies.
 
Old 06-19-2013, 10:49 AM   #2
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,880
Blog Entries: 1

Rep: Reputation: 1871
Create a simple parser. Basically it is a Finite State Machine with rules like these:

Code:
State InputChar Action
----- --------- ------
START \n        End of line, no (more) data in this line
START ,         Empty field; stay in START
START "         Start of a quoted field: goto State-B
START other     Start of a field: store character; goto State-A

A     \n        End of field and line
A     ,         End of field; goto START
A     other     store character

B     "         Goto State-C
B     other     store character (including , and \n inside quotes)

C     "         Doubled quote: store one "; goto State-B
C     \n        End of field and line
C     ,         End of field; goto START
C     other     Invalid input
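
A minimal C sketch of a state machine along these lines, not a definitive implementation. The function name csv_split_line and its in-place design are assumptions made for illustration. It also makes two choices the table leaves open: a line that ends right after a comma yields a trailing empty field, and an unterminated quote is reported as invalid input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum csv_state { ST_START, ST_A, ST_B, ST_C };

/* Hypothetical helper: splits one CSV record in place.
 * `line` is modified: each field is unescaped and NUL-terminated inside the
 * buffer, and fields[i] is set to point at it.
 * Returns the number of fields, or -1 on invalid input or too many fields. */
int csv_split_line(char *line, char **fields, int max_fields)
{
    enum csv_state st = ST_START;
    char *in = line;       /* next character to read     */
    char *out = line;      /* next position to write     */
    char *start = line;    /* start of the current field */
    int n = 0;

    assert(line != NULL && fields != NULL);

    for (;;) {
        char c = *in++;
        int eol = (c == '\0' || c == '\n');
        int end_field = 0;

        switch (st) {
        case ST_START:
            if (eol && n == 0) return 0;            /* empty line: no fields */
            if (c == '"') st = ST_B;
            else if (c == ',' || eol) end_field = 1; /* empty field */
            else { *out++ = c; st = ST_A; }
            break;
        case ST_A:                                  /* inside an unquoted field */
            if (c == ',' || eol) end_field = 1;
            else *out++ = c;
            break;
        case ST_B:                                  /* inside a quoted field */
            if (c == '\0') return -1;               /* quote never closed */
            if (c == '"') st = ST_C;
            else *out++ = c;                        /* , and \n are data here */
            break;
        case ST_C:                                  /* just saw a quote inside quotes */
            if (c == '"') { *out++ = c; st = ST_B; } /* doubled quote */
            else if (c == ',' || eol) end_field = 1;
            else return -1;                         /* e.g. "abc"x */
            break;
        }

        if (end_field) {
            if (n >= max_fields) return -1;
            *out++ = '\0';      /* never passes `in`: output is no longer than input */
            fields[n++] = start;
            start = out;
            st = ST_START;
            if (eol) return n;
        }
    }
}
```

Calling it with a writable copy of a line and an array of char pointers fills the array with one pointer per field. It does not strip a '\r' left over from Windows line endings, so that would need handling before the call.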

Last edited by NevemTeve; 06-19-2013 at 10:50 AM.
 
3 members found this post helpful.
Old 06-19-2013, 11:09 AM   #3
rtmistler
Moderator
 
Registered: Mar 2011
Location: USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 9,883
Blog Entries: 13

Rep: Reputation: 4931
I've done a lot of character and string parsing in C and C++. You just do it within the program and you use character pointers, and strstr().

The MAIN thing I've found out is that if you do:
Code:
sscanf(input_string, "%s,%s", frag_1, frag_2);
what you end up with is a return of 1 from sscanf, because %s reads up to the next whitespace and does not stop at commas. With an input that has no spaces, frag_1 swallows the whole line:
Quote:
frag_1[] = "this_was,my_test_string"
instead of
Quote:
frag_1[] = "this_was"
frag_2[] = "my_test_string"
.
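
For a simple two-field case, one alternative is a scanset such as %[^,], which does stop at a comma. A hedged sketch follows; the function name split_two and the 128-byte buffer size are assumptions for illustration. It does not handle quoted fields, and because %[ must match at least one character, an empty first field makes sscanf return 0.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: splits `input` at the first comma using scansets.
 * frag_1 and frag_2 must each have room for at least 128 bytes.
 * Returns the sscanf result: 2 when both fragments were read. */
int split_two(const char *input, char *frag_1, char *frag_2)
{
    assert(input != NULL && frag_1 != NULL && frag_2 != NULL);

    frag_1[0] = '\0';
    frag_2[0] = '\0';

    /* %127[^,]  : up to 127 characters that are not a comma
     * ","       : a literal comma
     * " "       : skip any whitespace after the comma
     * %127[^\n] : the rest of the line */
    return sscanf(input, "%127[^,], %127[^\n]", frag_1, frag_2);
}
```

With the input "this was, my test string" this should return 2 and fill the two fragments separately.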

Therefore I use strstr() (or strchr(), since the delimiter is a single character) to find the next comma or double quote, and keep flags or state variables to indicate whether I'm looking for the next comma, the end of the line, or a closing quote.

An example:
Code:
// input: "this is, my test string" - which ends with a \0
// Splits the input at the first comma into two fragments.
// Note: this simple split does not handle quoted fields.

#include <stdio.h>
#include <string.h>

static const char test_input[] = "this is, my test string";

void my_parse_function(void)
{
    const char *pStart, *p1, *p2;
    char frag_1[128], frag_2[128];
    size_t len;

    pStart = test_input;

    // strstr(haystack, needle): find the first comma
    if((p1 = strstr(pStart, ",")) != NULL) {
        len = (size_t)(p1 - pStart);
        if(len >= sizeof(frag_1))
            len = sizeof(frag_1) - 1; // Truncate rather than overflow
        memset(frag_1, 0, sizeof(frag_1));
        memcpy(frag_1, pStart, len);

        p1 = p1 + 1; // Point to next character beyond the comma
        // The second fragment ends at the next comma, or at the end of the string
        if((p2 = strstr(p1, ",")) == NULL)
            p2 = p1 + strlen(p1);
        len = (size_t)(p2 - p1);
        if(len >= sizeof(frag_2))
            len = sizeof(frag_2) - 1;
        memset(frag_2, 0, sizeof(frag_2));
        memcpy(frag_2, p1, len);

        printf("[%s] [%s]\n", frag_1, frag_2);
    }
}
I do a lot of parsing of NMEA formatted messages where not all of the content is required, however I need to get position fix attributes out of a comma separated line. The benefit I have is that all lines follow a syntax of starting with $ and ending with *<checksum>, therefore I check that the syntax and checksum are correct before parsing the comma separated attributes.
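
The checksum step can be sketched roughly as below. In NMEA 0183 the checksum is the XOR of every character between the $ and the *, written as two hex digits after the *. The function names here are assumptions for illustration, not taken from any particular library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the value of one hex digit, or -1 if `c` is not a hex digit. */
static int hex_val(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Hypothetical helper: returns 1 if `s` has the form "$<body>*HH" and HH
 * equals the XOR of all characters in <body>; returns 0 otherwise.
 * Anything after the two hex digits (such as "\r\n") is ignored. */
int nmea_checksum_ok(const char *s)
{
    const char *star;
    const char *p;
    unsigned char sum = 0;
    int hi, lo;

    assert(s != NULL);

    if (s[0] != '$')
        return 0;
    star = strchr(s, '*');
    if (star == NULL)
        return 0;

    hi = hex_val(star[1]);
    if (hi < 0)
        return 0;               /* also catches a string ending at '*' */
    lo = hex_val(star[2]);
    if (lo < 0)
        return 0;

    for (p = s + 1; p < star; p++)
        sum ^= (unsigned char)*p;

    return sum == (unsigned char)(hi * 16 + lo);
}
```

Only sentences that pass a check like this would then be handed to the comma splitting code.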
 
1 members found this post helpful.
Old 06-19-2013, 11:50 AM   #4
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,249

Rep: Reputation: 5323
Why don't you just use one of the CSV parsing libraries that are available for C or C++?

Here are two that came up on a google search:

http://libcsv.sourceforge.net/
http://code.google.com/p/csv-parser-cplusplus/
 
Old 06-19-2013, 10:29 PM   #5
jason_m
Member
 
Registered: Jun 2009
Posts: 33

Rep: Reputation: 12
Parsing CSV data can be a little more nuanced than it seems at first glance. You already addressed one issue: a comma being part of the field value. Here's another issue: are quotes allowed inside field values? A common convention for handling quotes is to prefix a quote with another quote:

Code:
this,is,valid,csv  ->  this | is | valid | csv
this,is,"valid, csv"  ->  this | is | valid, csv
this,"is ""valid"", csv"  ->  this | is "valid", csv
I second the state machine approach. If you want/(need) to allow for quotes, you'll need to re-work the state machine that NevemTeve laid out a bit.

Edit: My bad - The last line in the original post says quotes will be double-quoted. I read that too fast. AND the state machine from NevemTeve also handles that upon a second look. That's what I get for posting at night.
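
To make the doubling convention concrete, here is a small sketch of the reverse direction: producing a field that a parser following this convention would read back unchanged. The name csv_quote_field is an assumption for illustration, and the choice to quote only fields that contain a comma, quote, or line break is one reasonable policy among several.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: writes `field` into `out` as a CSV field.
 * If the field contains a comma, a double quote, or a line break, it is
 * wrapped in double quotes and every embedded quote is doubled.
 * Returns the length written, or (size_t)-1 if `out_size` is too small. */
size_t csv_quote_field(const char *field, char *out, size_t out_size)
{
    const char *p;
    size_t need;
    size_t n = 0;
    int needs_quotes;

    assert(field != NULL && out != NULL);

    needs_quotes = strpbrk(field, ",\"\r\n") != NULL;

    /* Work out the space required before writing anything. */
    need = strlen(field) + 1;               /* text plus terminating NUL */
    if (needs_quotes) {
        need += 2;                          /* opening and closing quote */
        for (p = field; *p != '\0'; p++)
            if (*p == '"')
                need++;                     /* each quote is written twice */
    }
    if (need > out_size)
        return (size_t)-1;

    if (needs_quotes)
        out[n++] = '"';
    for (p = field; *p != '\0'; p++) {
        if (*p == '"')
            out[n++] = '"';                 /* double the embedded quote */
        out[n++] = *p;
    }
    if (needs_quotes)
        out[n++] = '"';
    out[n] = '\0';
    return n;
}
```

Passing the third example's field value, is "valid", csv, should give back the quoted form shown on the left of that line.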

Last edited by jason_m; 06-19-2013 at 10:39 PM.
 
Old 06-20-2013, 03:14 AM   #6
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,369

Rep: Reputation: 2753
If you really want C/C++, then use a library; it'll be quicker to implement and robust (pick a good one).
FWIW, Perl has a module (or several) to do that and it runs on MS & *nix & ...
http://search.cpan.org/search?query=...3Acsv&mode=all
 
Old 06-20-2013, 03:15 AM   #7
devnull10
Member
 
Registered: Jan 2010
Location: Lancashire
Distribution: Slackware Stable
Posts: 572

Original Poster
Rep: Reputation: 120
Hi,
Thanks for all the suggestions above. I'll work on them over the weekend and see where I get to. I was hoping to keep it as simple as possible, and it looks like the FSM could be a relatively easy and pain-free option.
The users of the app I am writing will be using Excel to save as .csv, so the format will be in line with what Excel produces.
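
One practical detail, assuming the files are saved by Excel on Windows: lines will most likely end in \r\n, and the \r survives if the file is read in binary mode or on a non-Windows system. A small sketch of stripping the line ending before splitting a line follows; the function name strip_line_ending is an assumption for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: removes any trailing '\r' and '\n' characters from
 * `line` in place and returns the new length. */
size_t strip_line_ending(char *line)
{
    size_t len;

    assert(line != NULL);

    len = strlen(line);
    while (len > 0 && (line[len - 1] == '\n' || line[len - 1] == '\r'))
        line[--len] = '\0';
    return len;
}
```

This only suits line-by-line reading. A quoted field that contains a line break spans several physical lines, so a file with such fields would need to be parsed as one stream instead.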

The reason I had stayed away from the pre-written libraries is that they all seemed a little over-complicated for what I am doing. I am OK with programming in C. My background is in Java, but I am looking to use C/C++ so I can use Qt for the GUI (although I believe there is a Java port available).

Last edited by devnull10; 06-20-2013 at 03:18 AM.
 
  

