Parsing a .csv file

devnull10 · 06-19-2013, 10:13 AM

Hi,
I have a requirement to parse a .csv file (that is, load each "value" into a data structure, say an array) in either C or C++ (or optionally Java, though I'm trying to stay away from that).
Fields may contain commas, but any field that does will be wrapped in double quotes ("").
For example:

Code:

This,is,an,example of some,data
Some,lines,might contain,"commas, however",they will be quoted
A field may possible be,quoted,even,without,commas present
this,is,a,simple,record
If a field,contains,"a ""quote",then it ,is doubled up

Any suggestions or tips on this? Ideally I want to stay away from utilities like awk because the target app needs to run on Windows machines and ideally not have such dependencies.

NevemTeve · 06-19-2013, 10:49 AM

Create a simple parser. Basically it is a Finite State Machine with rules like these:

Code:

State InputChar Action
----- --------- ------
START \n        End of line, no (more) data in this line
START ,         Empty field
START "         Start of a field: goto State-B
START other     Start of a field: store character; goto State-A

A     \n        End of field and line
A     ,         End of field; goto START
A     other     store character

B     "         Goto State-C
B     other     store character

C     "         store character (a doubled quote); goto State-B
C     \n        End of field and line
C     ,         End of field; goto START
C     other     Invalid input
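
A minimal sketch of the table above in C, in case it helps to see it as code. The names csv_split_line and CSV_MAX_FIELD_LEN are hypothetical, the fixed-size field buffers are an assumption made to keep it short, and the end of the string is treated the same as \n. It follows the table literally, so a line ending in a comma does not report a final empty field, and a quote that is never closed is rejected as invalid input.

```c
#include <assert.h>
#include <stddef.h>

#define CSV_MAX_FIELD_LEN 256  /* hypothetical limit, includes the terminating \0 */

enum csv_state { ST_START, ST_A, ST_B, ST_C };

/* Splits one line into fields[0..max_fields-1].
 * Returns the number of fields, or -1 on invalid input or overflow. */
int csv_split_line(const char *line, char fields[][CSV_MAX_FIELD_LEN], int max_fields)
{
    enum csv_state state = ST_START;
    int nfields = 0;
    size_t len = 0;
    const char *p;

    assert(line != NULL && fields != NULL);

    for (p = line; ; p++) {
        char c = *p;
        int at_end = (c == '\n' || c == '\0');
        int store = 0, end_field = 0;

        switch (state) {
        case ST_START:
            if (at_end) return nfields;          /* no (more) data */
            if (c == ',') end_field = 1;         /* empty field */
            else if (c == '"') state = ST_B;     /* quoted field begins */
            else { store = 1; state = ST_A; }    /* plain field begins */
            break;
        case ST_A:
            if (at_end || c == ',') end_field = 1;
            else store = 1;
            break;
        case ST_B:
            if (c == '\0') return -1;            /* quote never closed */
            if (c == '"') state = ST_C;
            else store = 1;                      /* commas and \n are data here */
            break;
        case ST_C:
            if (c == '"') { store = 1; state = ST_B; }   /* doubled quote */
            else if (at_end || c == ',') end_field = 1;
            else return -1;                      /* e.g. "abc"x */
            break;
        }

        if (store) {
            if (nfields >= max_fields || len + 1 >= CSV_MAX_FIELD_LEN) return -1;
            fields[nfields][len++] = c;
        }
        if (end_field) {
            if (nfields >= max_fields) return -1;
            fields[nfields][len] = '\0';
            nfields++;
            len = 0;
            state = ST_START;
            if (at_end) return nfields;
        }
    }
}
```

Calling it on the last line of the sample data with an array of 8 fields should give 5 fields, the third being a "quote with the doubled quote collapsed to one. Files saved on Windows may carry a \r before the \n when read in binary mode; this sketch does not strip it.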

rtmistler · 06-19-2013, 11:09 AM

I've done a lot of character and string parsing in C and C++. You just do it within the program and you use character pointers, and strstr().

The MAIN thing I've found out is that if you do:

Code:

sscanf(input_string, "%s,%s", frag_1, frag_2);

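what you end up with is a return of 1 from the sscanf, because %s reads up to the next whitespace rather than up to the comma. For an input with no spaces, such as "this_was,my_test_string", frag_1 will contain all of:

Quote:

frag_1[] = "this_was,my_test_string"

instead of

Quote:

frag_1[] = "this_was"
frag_2[] = "my_test_string"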

Therefore I use strstr() (or strchr() for a single character) to find the next comma or double quote, and have flags or state variables to indicate whether I'm looking for the next comma, end of line, or quote.

An example:

Code:

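// input: "this is, my test string" - which ends with a \0
// Splits the input at the first comma into two fragments.

#include <stdio.h>
#include <string.h>

static const char test_input[] = "this is, my test string";

void my_parse_function(void)
{
    const char *pStart, *p1, *p2;
    char frag_1[128], frag_2[128];
    size_t len;

    pStart = test_input;
    memset(frag_1, 0, sizeof(frag_1));
    memset(frag_2, 0, sizeof(frag_2));

    // strstr() takes the string to search first, then the substring to find
    if((p1 = strstr(pStart, ",")) != NULL) {
        len = (size_t)(p1 - pStart);
        if(len >= sizeof(frag_1)) len = sizeof(frag_1) - 1; // never overflow the buffer
        memcpy(frag_1, pStart, len);
        p1++; // Point to next character beyond the comma

        // Second fragment ends at the next comma or, failing that, at the \0
        if((p2 = strstr(p1, ",")) == NULL) {
            p2 = p1 + strlen(p1);
        }
        len = (size_t)(p2 - p1);
        if(len >= sizeof(frag_2)) len = sizeof(frag_2) - 1;
        memcpy(frag_2, p1, len); // keeps the space that follows the comma
    }
    printf("frag_1 = \"%s\"\nfrag_2 = \"%s\"\n", frag_1, frag_2);
}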
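
The flag idea described above can also be stretched to cover quoted fields. What follows is only a sketch under that assumption, not the code I actually run: find_field_end is a hypothetical name, and it uses strchr() and strcspn() in place of strstr() because each search is for a single character. A single in_quotes flag records whether the next thing to look for is a closing quote or a delimiter.

```c
#include <assert.h>
#include <string.h>

/* Returns a pointer to the delimiter (',', '\n' or the terminating '\0')
 * that ends the field starting at p, or NULL if a quote is never closed. */
const char *find_field_end(const char *p)
{
    int in_quotes = 0;   /* flag: are we looking for a closing quote? */

    assert(p != NULL);
    for (;;) {
        if (in_quotes) {
            p = strchr(p, '"');           /* next quote ends the quoted run */
            if (p == NULL) return NULL;   /* ran off the end inside quotes */
            p++;
            in_quotes = 0;
        } else {
            p += strcspn(p, ",\"\n");     /* next comma, quote or newline */
            if (*p != '"') return p;      /* ',', '\n' or '\0' ends the field */
            p++;                          /* step over the opening quote */
            in_quotes = 1;
        }
    }
}
```

A doubled quote needs no special case here: the first quote of the pair switches the flag off and the second switches it straight back on. The caller still has to copy the field out and collapse the doubled quotes, which this sketch leaves alone.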

I do a lot of parsing of NMEA formatted messages where not all of the content is required, however I need to get position fix attributes out of a comma separated line. The benefit I have is that all lines follow a syntax of starting with $ and ending with *<checksum>, therefore I check that the syntax and checksum are correct before parsing the comma separated attributes.
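
For reference, the checksum check mentioned above can be sketched as follows. The NMEA checksum is the XOR of every character between the $ and the *, written as two hex digits after the *. The function name nmea_checksum_ok is hypothetical, and the sketch ignores anything after the two hex digits (normally a trailing \r\n).

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if sentence looks like "$...*HH" and HH equals the XOR of the
 * characters between '$' and '*', otherwise 0. */
int nmea_checksum_ok(const char *sentence)
{
    const char *star;
    const char *p;
    char hex[3];
    unsigned char sum = 0;

    assert(sentence != NULL);
    if (sentence[0] != '$') return 0;

    star = strchr(sentence, '*');
    if (star == NULL) return 0;

    /* Exactly two hex digits must follow the '*' */
    if (!isxdigit((unsigned char)star[1]) || !isxdigit((unsigned char)star[2]))
        return 0;

    for (p = sentence + 1; p < star; p++)
        sum ^= (unsigned char)*p;

    hex[0] = star[1];
    hex[1] = star[2];
    hex[2] = '\0';
    return (unsigned long)sum == strtoul(hex, NULL, 16);
}
```

As a small check, "$AB*03" passes because 'A' (0x41) XOR 'B' (0x42) is 0x03. Only once this returns 1 is it worth splitting the part between $ and * on commas.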

dugan · 06-19-2013, 11:50 AM

Why don't you just use one of the CSV parsing libraries that are available for C or C++?

Here are two that came up on a google search:

http://libcsv.sourceforge.net/
http://code.google.com/p/csv-parser-cplusplus/

jason_m · 06-19-2013, 10:29 PM

Parsing CSV data can be a little more nuanced than it seems at first glance. You already addressed one issue: a comma being part of the field value. Here's another: are quotes allowed inside field values? A common convention for handling quotes is to prefix a quote with another quote:

Code:

this,is,valid,csv  ->  this | is | valid | csv
this,is,"valid, csv"  ->  this | is | valid, csv
this,"is ""valid"", csv"  ->  this | is "valid", csv
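
To make the convention concrete, here is a rough sketch of the writing side, which turns a value from the right-hand column back into the quoted form on the left. The name csv_quote_field is hypothetical, and it always adds the surrounding quotes, which the original post says is allowed even when no commas are present.

```c
#include <assert.h>
#include <stddef.h>

/* Writes value into out wrapped in double quotes, doubling any embedded
 * quote. out has room for out_size bytes, including the terminating \0.
 * Returns the length written, or -1 if out is too small. */
int csv_quote_field(const char *value, char *out, size_t out_size)
{
    size_t n = 0;
    const char *p;

    assert(value != NULL && out != NULL);
    if (out_size < 3) return -1;             /* need room for "" and \0 */

    out[n++] = '"';
    for (p = value; *p != '\0'; p++) {
        size_t need = (*p == '"') ? 2 : 1;
        /* keep two bytes spare for the closing quote and the \0 */
        if (n + need + 2 > out_size) return -1;
        if (*p == '"') out[n++] = '"';       /* prefix a quote with a quote */
        out[n++] = *p;
    }
    out[n++] = '"';
    out[n] = '\0';
    return (int)n;
}
```

Feeding it the value from the third example line should reproduce the quoted field shown there, which makes it handy for round-trip testing a parser.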

I second the state machine approach. If you want/(need) to allow for quotes, you'll need to re-work the state machine that NevemTeve laid out a bit.

Edit: My bad - The last line in the original post says quotes will be double-quoted. I read that too fast. AND the state machine from NevemTeve also handles that upon a second look. That's what I get for posting at night.

chrism01 · 06-20-2013, 03:14 AM

If you really want C/C++, then use a library; it'll be quicker to implement and robust (pick a good one).
FWIW, Perl has a module (or several) to do that and it runs on MS & *nix & ...
http://search.cpan.org/search?query=...3Acsv&mode=all

devnull10 · 06-20-2013, 03:15 AM

Hi,
Thanks for all the suggestions above. I'll work on them over the weekend and see where I get to. I was hoping to keep it as simple as possible, and it looks like the FSM could be a relatively easy and pain-free option.
The users of the app I am writing will be using Excel to save as .csv, so the format will be in line with what Excel produces.

The reason I had stayed away from the pre-written libraries is that they all seemed a little over-complicated for what I am doing. I am OK with programming in C. My background is in Java, but I am looking to use C/C++ so I can use Qt for the GUI (although I believe there is a Java port available).