Hi,
I have a requirement to parse a .csv file (that is, load each value into a data structure, say an array) in either C or C++ (or optionally Java, though I am trying to stay away from that).
The file may contain commas inside fields; any field that contains one will be wrapped in double quotes.
For example:
Code:
This,is,an,example of some,data
Some,lines,might contain,"commas, however",they will be quoted
A field may possible be,quoted,even,without,commas present
this,is,a,simple,record
If a field,contains,"a ""quote",then it ,is doubled up
Any suggestions or tips on this? Ideally I want to stay away from utilities like awk because the target app needs to run on Windows machines and ideally not have such dependencies.
Create a simple parser. Basically it is a Finite State Machine with rules like these:
Code:
State  InputChar  Action
-----  ---------  ------
START  \n         End of line, no (more) data in this line
START  ,          Empty field
START  "          Start of a field: goto State-B
START  other      Start of a field: store character; goto State-A
A      \n         End of field and line
A      ,          End of field; goto START
A      other      store character
B      "          Goto State-C
B      other      store character
C      "          store character; goto State-B
C      \n         End of field and line
C      ,          End of field; goto START
C      other      Invalid input
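For what it's worth, the table above maps fairly directly onto C. What follows is only a minimal sketch, not the poster's own code: the name csv_split_line and its in-place, fixed-size interface are assumptions made for illustration. It handles one record that is already sitting in a writable buffer, and it departs from the table in one place: a trailing comma produces a final empty field.

```c
#include <assert.h>
#include <stddef.h>

/* States from the table above. */
enum csv_state { ST_START, ST_A, ST_B, ST_C };

/*
 * Hypothetical helper: split one CSV record held in `line` into fields.
 * Fields are unescaped in place (the output is never longer than the input)
 * and pointers to them are stored in fields[0..max_fields-1].
 * Returns the number of fields, or -1 on invalid input or too many fields.
 */
int csv_split_line(char *line, char **fields, int max_fields)
{
    enum csv_state st = ST_START;
    char *in = line;     /* next character to read */
    char *out = line;    /* next position to write */
    char *start = line;  /* start of the field being built */
    int n = 0;

    assert(line != NULL && fields != NULL);

    for (;;) {
        char c = *in++;
        int eol = (c == '\0' || c == '\n');

        if (st == ST_B) {                 /* inside a quoted field */
            if (c == '\0') return -1;     /* unterminated quote */
            if (c == '"') st = ST_C;
            else *out++ = c;              /* commas and newlines are data here */
            continue;
        }
        if (st == ST_C && c == '"') {     /* doubled quote: store one literal " */
            *out++ = '"';
            st = ST_B;
            continue;
        }
        if (eol || c == ',') {            /* end of field in START, A or C */
            if (eol && st == ST_START && n == 0) return 0;  /* empty line */
            if (n >= max_fields) return -1;
            *out++ = '\0';
            fields[n++] = start;
            start = out;
            if (eol) return n;
            st = ST_START;
            continue;
        }
        if (st == ST_C) return -1;        /* text after a closing quote */
        if (st == ST_START && c == '"') { /* opening quote */
            st = ST_B;
            continue;
        }
        *out++ = c;                       /* ordinary character */
        st = ST_A;
    }
}
```

The buffer has to be a writable char array rather than a string literal, because the fields are terminated in place. A quoted field that contains a line break only works if the whole record is in the buffer, so reading the file strictly line by line would need extra care.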
I've done a lot of character and string parsing in C and C++. You just do it within the program and you use character pointers, and strstr().
The MAIN thing I've found out is that if you do:
Code:
sscanf(input_string, "%s,%s", frag_1, frag_2);
what you end up with is a return of 1 from the sscanf, because %s stops at whitespace and never treats the comma as a delimiter. For the input "this was, my test string" you get:
Quote:
frag_1[] = "this"
instead of
Quote:
frag_1[] = "this was"
frag_2[] = "my test string"
Therefore I use strstr() (or strchr() for a single-character delimiter) to find the next comma or double quote, and keep flags or state variables to track whether I'm looking for the next comma, the end of the line, or a closing quote.
An example:
Code:
// input: "this is, my test string" - which ends with a \0
// Note I wrote live and didn't compile, please excuse any subtle syntax errors, they're not intentional
#include <string.h>

char test_input[] = "this is, my test string";

void my_parse_function(void)
{
    char *pStart, *p1, *p2, frag_1[128], frag_2[128];
    size_t len;

    memset(frag_1, 0, sizeof(frag_1));
    memset(frag_2, 0, sizeof(frag_2));
    pStart = test_input;
    // strstr() takes the string to search first, then the text to look for
    if ((p1 = strstr(pStart, ",")) != NULL) {
        // frag_1 is everything before the first comma, capped to fit the buffer
        len = (size_t)(p1 - pStart);
        if (len >= sizeof(frag_1)) len = sizeof(frag_1) - 1;
        memcpy(frag_1, pStart, len);
        p1++; // Point to next character beyond the comma
        // frag_2 runs to the next comma, or to the terminating \0 if there is none
        if ((p2 = strstr(p1, ",")) == NULL) {
            p2 = p1 + strlen(p1);
        }
        len = (size_t)(p2 - p1);
        if (len >= sizeof(frag_2)) len = sizeof(frag_2) - 1;
        memcpy(frag_2, p1, len);
    }
}
I do a lot of parsing of NMEA-formatted messages where not all of the content is required, but I need to get position fix attributes out of a comma-separated line. The benefit I have is that every line starts with $ and ends with *<checksum>, so I check that the syntax and checksum are correct before parsing the comma-separated attributes.
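As a rough illustration of that pre-check, here is a minimal sketch in C. It assumes the usual NMEA 0183 convention that the checksum is the XOR of every character between the $ and the *, written as two hex digits; the function names nmea_checksum_ok and hex_value are made up for this example and are not taken from the poster's code.

```c
#include <assert.h>
#include <stddef.h>

/* Convert one hex digit to its value, or -1 if it is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/*
 * Hypothetical helper: return 1 if `s` looks like "$<body>*HH", where HH is
 * the XOR of every byte of <body> written as two hex digits; 0 otherwise.
 * Anything after the two hex digits (such as \r\n) is ignored.
 */
int nmea_checksum_ok(const char *s)
{
    unsigned char sum = 0;
    int hi, lo;

    assert(s != NULL);

    if (*s++ != '$') return 0;            /* must start with $ */
    while (*s != '\0' && *s != '*')
        sum ^= (unsigned char)*s++;       /* XOR over the body */
    if (*s++ != '*') return 0;            /* must have a * before the checksum */

    hi = hex_value(s[0]);
    if (hi < 0) return 0;                 /* also catches a truncated line */
    lo = hex_value(s[1]);
    if (lo < 0) return 0;

    return sum == (unsigned char)(hi * 16 + lo);
}
```

Only lines that pass this check would then be handed to the comma splitting, which keeps the field parser from having to cope with truncated or corrupted input.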
Parsing CSV data can be a little more nuanced than it seems at first glance. You already addressed one issue: a comma being part of the field value. Here's another issue: are quotes allowed inside field values? A common convention for handling quotes is to prefix a quote with another quote:
Code:
this,is,valid,csv -> this | is | valid | csv
this,is,"valid, csv" -> this | is | valid, csv
this,"is ""valid"", csv" -> this | is "valid", csv
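Going in the other direction can make the convention concrete. The following is a minimal sketch of writing a single field in C under that convention; csv_quote_field is a hypothetical name, and the choice to quote only when the field contains a comma, a quote or a line break is an assumption rather than something stated in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: write `field` into `out` as one CSV field.
 * The field is wrapped in quotes only if it contains a comma, a quote or a
 * line break, and every embedded quote is doubled.
 * Returns the length written (excluding the \0), or -1 if `out` is too small.
 */
int csv_quote_field(const char *field, char *out, size_t out_size)
{
    size_t n = 0;
    int need_quotes;

    assert(field != NULL && out != NULL);
    need_quotes = strpbrk(field, ",\"\r\n") != NULL;

    if (need_quotes) {
        if (n + 1 >= out_size) return -1;
        out[n++] = '"';                        /* opening quote */
    }
    for (; *field != '\0'; field++) {
        size_t need = (*field == '"') ? 2 : 1;
        if (n + need >= out_size) return -1;   /* keep room for the \0 */
        if (*field == '"') out[n++] = '"';     /* double an embedded quote */
        out[n++] = *field;
    }
    if (need_quotes) {
        if (n + 1 >= out_size) return -1;
        out[n++] = '"';                        /* closing quote */
    }
    if (n >= out_size) return -1;
    out[n] = '\0';
    return (int)n;
}
```

Feeding it the value from the third line of the table, is "valid", csv, gives back the quoted form shown on the left-hand side of that line.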
I second the state machine approach. If you want (or need) to allow for quotes, you'll need to rework the state machine that NevemTeve laid out a bit.
Edit: My bad - The last line in the original post says quotes will be double-quoted. I read that too fast. AND the state machine from NevemTeve also handles that upon a second look. That's what I get for posting at night.
If you really want C/C++, then use a library; it'll be quicker to implement and robust (pick a good one).
FWIW, Perl has a module (or several) to do that and it runs on MS & *nix & ... http://search.cpan.org/search?query=...3Acsv&mode=all
Hi,
Thanks for all the suggestions above. I'll work on them over the weekend and see where I get to. I was hoping to keep it as simple as possible, and it looks like the FSM could be a relatively easy and pain-free option.
The users of the app I am writing will be using Excel to save as .csv, so the format will be in line with what Excel produces.
The reason I had stayed away from the pre-written libraries is that they all seemed a little over-complicated for what I am doing. I am OK with programming in C, although my background is in Java; I am looking to use C/C++ so I can use Qt for the GUI (although I believe there is a Java port available).