Linux - Kernel
This forum is for all discussion relating to the Linux kernel.
Hi,
I have a large CSV file which I need to split into multiple parts. A single record may contain line breaks, but every record ends with a recognizable pattern. Example below:
line1 = "1","This is multi line.
This is second line","123456"
line2 = "2","This is single line","123456"
When I use the -l option of split, I see a single record split across two files. So I would like to split into fixed-size parts (say 256MB), but the end of each part should be the end of a record (all records in the example above end with "123456"). We are on RHEL. Please let me know how to achieve this.
...and you have also been advised before that we are happy to help, but we aren't going to write your scripts for you. You need to post what YOU have written/done/tried and tell us where you're stuck. You also don't say what language you want to do this in... as noted, awk can do it, but so can a bash script, Perl, or Python. Personally, I'd use Perl, but that is preference only. From a script-logic standpoint, I'd check the first field coming in to see if it matched a certain pattern (such as being in double quotes, numeric, or the next in sequence), and if not, you'd know to append the data to the middle field. You could also simply count the commas, provided your data was sanitized.
There are MANY ways to do this... so now you need to go and TRY to do it for yourself. Again, if you're stuck, post your efforts here and we'll try to help. Otherwise, I'd suggest you look at any of the MANY tutorials and sample scripts you can find with a brief search.
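The record-detection logic described above can be sketched in a few lines. This is a minimal illustration (not anyone's actual script), assuming the constant `"123456"` trailer from the sample records marks the end of every logical record:

```python
BATCH_ID = '"123456"'  # assumed constant trailer, taken from the sample records

def join_records(lines, batch_id=BATCH_ID):
    """Yield logical records, joining physical lines until one ends with batch_id."""
    buf = []
    for line in lines:
        buf.append(line.rstrip("\n"))
        if buf[-1].endswith(batch_id):   # this line completes a record
            yield "\n".join(buf)
            buf = []
    if buf:                              # leftover partial record, if any
        yield "\n".join(buf)

sample = [
    '"1","This is multi line.\n',
    'This is second line","123456"\n',
    '"2","This is single line","123456"\n',
]
records = list(join_records(sample))
# records[0] spans two physical lines; records[1] is a single line
```

The same idea works in awk or Perl: accumulate lines in a buffer and only emit when the buffer ends with the trailer.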
Maybe my question was unclear. I can open the file, scan it line by line, check whether a line ends with a pattern, and then save it to a separate file. For that I can use awk or any other language (Perl, Python, etc.). That was not my question. My question was: Linux has a native utility called "split". Can it be made to split a file into multiple parts where each part ends on a logical record boundary?
For example, if I do
split -d -b 512m "large_file" large_file_part-
it splits large_file into multiple chunks of 512MB each. However, a single record gets written partly to one file and partly to the next. So my request was: can the same/similar command be used to perform the split without a single record being divided across two files?
In other words, we have one large CSV file which we load into a table. If the table has 10 columns, then each logical line in the file has 10 columns. If I load one single file, it takes a long time to load into the table. So what I am trying to do is split this large file into multiple files and load them in parallel. If a single 10-column line gets split into two files, with 5 columns in one file and the remaining 5 in the other, the load will fail because the record is incomplete.
Please let me know if my request is still unclear.
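What the poster is asking for (fixed-size parts that only roll over at a record boundary) is not something split can do, but it is a short script. A hedged sketch in Python, again assuming the constant `"123456"` trailer from the example; the output-name prefix mirrors the split command shown above:

```python
def split_by_size(path, max_bytes, batch_id='"123456"', prefix="large_file_part-"):
    """Write parts of roughly max_bytes each, starting a new part only at a
    record boundary (a physical line ending with batch_id)."""
    part = 0
    written = 0
    out = open(f"{prefix}{part:02d}", "w")
    buf = []                                         # physical lines of the current record
    with open(path) as f:
        for line in f:
            buf.append(line)
            if line.rstrip("\n").endswith(batch_id):  # record is complete
                rec = "".join(buf)
                buf = []
                if written and written + len(rec) > max_bytes:
                    out.close()                       # roll to the next part
                    part += 1
                    out = open(f"{prefix}{part:02d}", "w")
                    written = 0
                out.write(rec)
                written += len(rec)
    out.close()
    return part + 1                                   # number of parts written
```

Parts can exceed max_bytes by at most one record, and every part ends on a complete record, so each file loads cleanly.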
No, it can't, because split isn't designed/written to do that. It doesn't parse data, it does exactly what it says it does: splits a file into pieces. The only thing it can REMOTELY do, is have a custom newline separator, but that's not going to help in your case.
Parse your data correctly, and get everything on one line. Then you won't have to worry about it.
The CSV file is generated by querying a table. Basically we are extracting data from one database and loading it into another. So if a column in the table has multi-line data, it will come out across multiple lines when we extract it; we cannot put all the data on one line. Each record in the data file ends with a batch id, which is constant. That's why I am trying to find a method to split the file easily so that each part ends at a batch id. If there is no other way, I may have to write a custom program to perform this task.
To process a CSV file, use a tool designed to handle CSV files.
Here are three examples...
* xsv has split: "Split one CSV file into many CSV files of N chunks."
* csvtool has head: "take the first 'n' rows", and drop: "Drop the first 'rows' rows and return the rest (if any)."
* Python's csv module can be used to perform whatever logic you want.
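To illustrate the last option: Python's csv module parses quoted fields containing newlines correctly, so a chunk always ends on a complete logical record. A minimal sketch (function name, chunk size, and file names are illustrative):

```python
import csv

def split_csv(src, records_per_file, prefix="chunk-"):
    """Split a CSV whose quoted fields may contain newlines into files of
    at most records_per_file logical records each; returns the names written."""
    names = []
    out = writer = None
    with open(src, newline="") as f:        # newline="" lets csv handle \n inside fields
        for n, row in enumerate(csv.reader(f)):
            if n % records_per_file == 0:   # start a new chunk on a record boundary
                if out:
                    out.close()
                name = f"{prefix}{len(names):02d}.csv"
                names.append(name)
                out = open(name, "w", newline="")
                writer = csv.writer(out, quoting=csv.QUOTE_ALL)
            writer.writerow(row)
    if out:
        out.close()
    return names
```

Because splitting happens per parsed row rather than per byte, embedded newlines in the data are preserved and no record is ever cut in half.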
Sorry, that makes no sense at all. If you're querying a table, the data will, by default, come out on one line. The only way it won't is if your database contains newline/CRLF characters, and even then you can easily run the output through sed to remove them before writing the CSV file. Or sanitize your data first (as suggested) and remove them with a simple replace command.
And why even bother with a CSV file if you're going from one database to another? You can directly import the data easily.
We are moving across diverse platforms, from Oracle to a cloud DB, so a direct import won't work. The database data does have newline/CRLF characters, and we need to preserve them while porting the data, so the newlines cannot be removed or replaced.
Sorry, still no. Oracle and (assuming, since you don't say) Azure Cloud DB are both well supported, and BOTH have import tools that work with different data sources. You are doing things that could corrupt your data, and just making more work for yourself. Contact your support people and ask what THEY recommend; I can almost guarantee there are well-developed import and migration tools.
Past that, the answer to your question is still "No, you cannot tell split to break on a certain line".
Quote:
If you're querying a table, the data will, by default, come out on one line. The only way it won't, is if your database contains newlines/CRLF characters, and even then you can easily run that through sed to remove them, before writing the CSV file out. Or sanitize your data first (as suggested), and remove them with a replace command easily.
Newlines are a valid character for a database field to contain, and a valid character for the CSV file format to include - there is no issue with newlines if using appropriate tools, and they shouldn't be blindly removed.
Quote:
And why even bother with a CSV file, if you're going from one database to another??
If the source database has a robust CSV export process and the target has a robust CSV import process, then it's a valid approach.
The fact that the OP wants to split the files, presumably because the target database system can't do its own batching, does suggest that they should consider a different approach, such as generating appropriate SQL insert statements for the target system.
Oh, I just spotted this...
Quote:
Originally Posted by manickaraja
If I do one single file it is taking lot of time to load into table. So what I am trying to do is split this large file to multiple files and load in parallel.
I'm not familiar with any of multiple things that "Cloud DB" might refer to, but I would be surprised if trying to do multiple inserts to the same table at the same time is going to be any faster - I would generally expect the opposite to be true.
Whilst moving away from Oracle is definitely something most sane people would want to do, if the target system doesn't have a decent import process then maybe the wrong target system was chosen.
The cloud DB referred to here is Snowflake, and it doesn't have a native import. Snowflake is built on Amazon S3, and it has a COPY command which copies data from S3 in blocks. So multiple COPY commands against the same table can be run concurrently with no issue.
Ok, now that you've finally said what database you're going to, have you looked into migration utilities? Pentaho has a 30-day free trial, and it is designed to do exactly what you're after: Oracle-to-Snowflake migration. There are also OTHER tools that perform the same function, including some Python scripts you can download and use.
Again: the split command won't split a file on a specific character/pattern.