Old 03-06-2020, 11:20 AM   #1
manickaraja
Member
 
Registered: Sep 2003
Location: India
Posts: 36

Rep: Reputation: 0
query with split


Hi,
I have a large CSV file which I need to split into multiple parts. A record may contain line breaks, but every record ends with a recognizable pattern. Example provided below:

line1 = "1","This is multi line.
THis is second line","123456"
line2 = "2","This is single line","123456"

When I use the -l option of split, a single record can end up split across two files. So I would like to split by fixed size (say 256MB), but each output file should end at a record boundary (all records in the above case end with "123456"). Ours is RHEL. Please let me know how to achieve this.

Thanks,
Manick.
 
Old 03-06-2020, 12:06 PM   #2
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,041

Rep: Reputation: 7348
I would probably use awk, where you can set any record separator and can also split after every N records (for example).
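
For example, something along these lines (a rough sketch only, not tested against your data: the end-of-record pattern, the chunk size, and the output names are assumptions based on the sample in the first post):

Code:
awk -v recs_per_file=100000 '
    { buf = buf $0 ORS }                      # accumulate the physical lines of the current record
    /"[0-9]+"$/ {                             # this line ends a logical record
        if (n % recs_per_file == 0) {         # start a new output chunk every recs_per_file records
            if (out != "") close(out)
            out = sprintf("large_file_part-%03d", part++)
        }
        printf "%s", buf > out
        buf = ""
        n++
    }
' large_file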
 
Old 03-06-2020, 12:42 PM   #3
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,754

Rep: Reputation: 7983
Quote:
Originally Posted by manickaraja View Post
Hi,
I have a large CSV file which I need to split into multiple parts. A record may contain line breaks, but every record ends with a recognizable pattern. Example provided below:

line1 = "1","This is multi line.
THis is second line","123456"
line2 = "2","This is single line","123456"

When I use the -l option of split, a single record can end up split across two files. So I would like to split by fixed size (say 256MB), but each output file should end at a record boundary (all records in the above case end with "123456"). Ours is RHEL. Please let me know how to achieve this.
You've been asking about shell scripts for TEN YEARS now:
https://www.linuxquestions.org/quest...script-788655/

...and you have also been advised before that we are happy to help, but we aren't going to write your scripts for you. You need to post what YOU have written/done/tried and tell us where you're stuck. You also don't say what language you want to do this in...as noted, awk can do it, but so can a bash script, perl, or python. Personally, I'd use perl, but that is preference only. From a script-logic standpoint, I'd check the first variable coming in to see whether it matches a certain pattern (such as being in double quotes, numeric, or the next in sequence), and if not, you'd know to append the data to the middle variable. You could also simply count the commas, provided your data was sanitized.

There are MANY ways to do this...so now you need to go and TRY to do it for yourself. Again, if you're stuck, post your efforts here and we'll try to help. Otherwise, I'd suggest you look at any of the MANY tutorials and sample scripts you could find with a brief search.
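
As one illustration of that record-detection idea (a sketch only, and not exactly the first-field check described above: it assumes standard CSV quoting where embedded quotes are doubled, so a record is complete when its running double-quote count is even; yourfile.csv is a placeholder name):

Code:
awk '{
    buf = (buf == "" ? $0 : buf "\n" $0)   # rebuild the logical record across physical lines
    line = $0
    quotes += gsub(/"/, "&", line)         # add this line's double-quote count
    if (quotes % 2 == 0) {                 # even total => no open quoted field => record complete
        print buf                          # here you would write it to the current chunk
        buf = ""
        quotes = 0
    }
}' yourfile.csv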
 
Old 03-06-2020, 12:57 PM   #4
manickaraja
Member
 
Registered: Sep 2003
Location: India
Posts: 36

Original Poster
Rep: Reputation: 0
Maybe my question was unclear. I can open the file, scan it line by line, check whether a line ends with the pattern, and then write records to a separate file. For this I can use awk or any other language (perl, python, etc.). That was not my question. My question is that Linux has a native command called "split"; can it be extended to split a file into multiple parts where each part ends at a logical record boundary?

For example, if I do

split -d -b 512m "large_file" large_file_part-

it splits large_file into multiple 512MB chunks. However, a single record can be written partly to one file and partly to the next. So my question is whether the same or a similar command can be extended to perform the split without a single record being divided between two files.

In other words, we have one large CSV file which we load into a table. If the table has 10 columns, then each logical line in the file has 10 columns. Loading it as one single file takes a lot of time, so I am trying to split this large file into multiple files and load them in parallel. If a single 10-column line gets split across two files, with 5 columns in one file and the remaining 5 in the other, the load will fail because the record is incomplete.

Please let me know if my request is still unclear.

Thanks,
Manick
 
Old 03-06-2020, 12:59 PM   #5
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,754

Rep: Reputation: 7983
Quote:
Originally Posted by manickaraja View Post
Maybe my question was unclear. I can open the file, scan it line by line, check whether a line ends with the pattern, and then write records to a separate file. For this I can use awk or any other language (perl, python, etc.). That was not my question. My question is that Linux has a native command called "split"; can it be extended to split a file into multiple parts where each part ends at a logical record boundary?

For example, if I do

split -d -b 512m "large_file" large_file_part-

it splits large_file into multiple 512MB chunks. However, a single record can be written partly to one file and partly to the next. So my question is whether the same or a similar command can be extended to perform the split without a single record being divided between two files.

In other words, we have one large CSV file which we load into a table. If the table has 10 columns, then each logical line in the file has 10 columns. Loading it as one single file takes a lot of time, so I am trying to split this large file into multiple files and load them in parallel. If a single 10-column line gets split across two files, with 5 columns in one file and the remaining 5 in the other, the load will fail because the record is incomplete.
No, it can't, because split isn't designed/written to do that. It doesn't parse data; it does exactly what it says it does: splits a file into pieces. The only thing it can REMOTELY do is use a custom separator character instead of newline, but that's not going to help in your case.

Parse your data correctly, and get everything on one line. Then you won't have to worry about it.
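
If getting everything onto one line as an intermediate step is acceptable, here is a hedged sketch of that route (it assumes GNU coreutils split, which RHEL ships, that the byte \001 never occurs in the data, and the same end-of-record pattern as the sample in the first post):

Code:
# 1. Fold: keep a real newline only at end-of-record, use \001 as a placeholder elsewhere
awk '{ printf "%s%s", $0, (/"[0-9]+"$/ ? "\n" : "\001") }' large_file > flattened

# 2. Split at line (= record) boundaries, at most ~512MB per part (GNU split -C never breaks a line)
split -d -C 512m flattened large_file_part-

# 3. Unfold: turn the placeholder back into the original embedded newlines in each part
for f in large_file_part-*; do
    tr '\001' '\n' < "$f" > "$f.csv" && rm -f "$f"
done
rm -f flattened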
 
Old 03-06-2020, 01:26 PM   #6
manickaraja
Member
 
Registered: Sep 2003
Location: India
Posts: 36

Original Poster
Rep: Reputation: 0
The CSV file is generated by querying a table. Basically, we are extracting data from one database and loading it into another. So if a column in the table has multi-line data, it will come out across multiple lines when we extract it, and we cannot put all the data on one line. Each record in the data file ends with a batch id, which is constant. That's why I am trying to find an easy way to split the file so that each part ends with the batch id. If there is no other way, I may have to write a custom program to perform this task.
 
Old 03-06-2020, 01:28 PM   #7
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 3,628

Rep: Reputation: 2557

To process a CSV file, use a tool designed to handle CSV files.

Here are three examples (rough usage sketches follow the list)...
* xsv has split: "Split one CSV file into many CSV files of N chunks."
* csvtool has head: "take the first 'n' rows", and drop: "Drop the first 'rows' rows and return the rest (if any)."
* Python's csv module can be used to perform whatever logic you want.
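
Rough usage sketches for the first two (option names from memory, so check each tool's --help; the point is that both parse quoted multi-line fields correctly instead of counting raw lines):

Code:
# xsv: write chunks of 100000 CSV records into the parts/ directory
xsv split --size 100000 parts/ large_file.csv

# csvtool: peel off the first 100000 records, keep the remainder, repeat as needed
csvtool head 100000 large_file.csv > part-01.csv
csvtool drop 100000 large_file.csv > remainder.csv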

 
1 member found this post helpful.
Old 03-06-2020, 01:39 PM   #8
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,754

Rep: Reputation: 7983
Quote:
Originally Posted by manickaraja View Post
The CSV file is generated by querying a table. Basically, we are extracting data from one database and loading it into another. So if a column in the table has multi-line data, it will come out across multiple lines when we extract it, and we cannot put all the data on one line. Each record in the data file ends with a batch id, which is constant. That's why I am trying to find an easy way to split the file so that each part ends with the batch id. If there is no other way, I may have to write a custom program to perform this task.
Sorry, that makes no sense at all. If you're querying a table, the data will, by default, come out on one line. The only way it won't is if your database data contains newline/CRLF characters, and even then you can run it through sed to remove them before writing the CSV file out. Or sanitize your data first (as suggested) and remove them with a replace command.

And why even bother with a CSV file if you're going from one database to another?? You can import the data directly, and easily.
 
1 member found this post helpful.
Old 03-06-2020, 02:45 PM   #9
manickaraja
Member
 
Registered: Sep 2003
Location: India
Posts: 36

Original Poster
Rep: Reputation: 0
We are moving across different platforms, from Oracle to a cloud database, so a direct import won't work. The database data does have newline/CRLF characters, and we need to preserve them while porting the data, so the newlines cannot be removed or replaced.
 
Old 03-06-2020, 03:47 PM   #10
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,754

Rep: Reputation: 7983
Quote:
Originally Posted by manickaraja View Post
We are moving across different platforms, from Oracle to a cloud database, so a direct import won't work. The database data does have newline/CRLF characters, and we need to preserve them while porting the data, so the newlines cannot be removed or replaced.
Sorry, still no. Oracle and (assuming, since you don't say) Azure Cloud DB are both well supported, and BOTH have import tools to work with different data sources. You are doing things that could corrupt your data and that just make more work for yourself. Contact your support people and ask them what THEY recommend; I can almost guarantee there are well-developed import and migration tools.

Past that, the answer to your question is still "No, you cannot tell split to do something on a certain line".
 
Old 03-06-2020, 04:58 PM   #11
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 3,628

Rep: Reputation: 2557

Quote:
Originally Posted by TB0ne View Post
If you're querying a table, the data will, by default, come out on one line. The only way it won't is if your database data contains newline/CRLF characters, and even then you can run it through sed to remove them before writing the CSV file out. Or sanitize your data first (as suggested) and remove them with a replace command.
Newlines are a valid character for a database field to contain, and a valid character for the CSV file format to include - there is no issue with newlines if using appropriate tools, and they shouldn't be blindly removed.


Quote:
And why even bother with a CSV file, if you're going from one database to another??
If the source database has a robust CSV export process and the target has a robust CSV import process, then it's a valid approach.

The fact that the OP wants to split the files, presumably because the target database system can't do its own batching, does suggest that they should consider a different approach, such as generating appropriate SQL insert statements for the target system.


Oh, I just spotted this...
Quote:
Originally Posted by manickaraja View Post
Loading it as one single file takes a lot of time, so I am trying to split this large file into multiple files and load them in parallel.
I'm not familiar with any of the multiple things that "Cloud DB" might refer to, but I would be surprised if doing multiple inserts into the same table at the same time is any faster - I would generally expect the opposite to be true.

Whilst moving away from Oracle is definitely something most sane people would want to do, if the target system doesn't have a decent import process then maybe the wrong target system was chosen.

 
Old 03-06-2020, 08:17 PM   #12
manickaraja
Member
 
Registered: Sep 2003
Location: India
Posts: 36

Original Poster
Rep: Reputation: 0
The cloud DB referred to here is Snowflake, and it doesn't have a native import. Snowflake is built on Amazon S3 and has a COPY command which copies data from S3 in blocks, so multiple COPY commands against the same table can be run concurrently with no issue.
 
Old 03-07-2020, 10:03 AM   #13
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,754

Rep: Reputation: 7983
Quote:
Originally Posted by manickaraja View Post
The cloud DB referred to here is Snowflake, and it doesn't have a native import. Snowflake is built on Amazon S3 and has a COPY command which copies data from S3 in blocks, so multiple COPY commands against the same table can be run concurrently with no issue.
OK, now that you've finally said what database you're going to, have you looked into migration utilities? Pentaho has a 30-day free trial, and it is designed to do exactly what you're after: Oracle to Snowflake migration. There are also OTHER tools that perform the same function, including some Python scripts that you can download and use.

Again: the split command won't split a file on a specific character/pattern.
 
  

