File system optimized for very large file parallel HDD read
I copied the files to an empty HDD so they start from the beginning of the disk and are saved 100% contiguously, one after another. Read-ahead helped me a lot, so I can say it definitely works. I might test it in real use next week with a 64MB buffer and compare the speed against 32MB.
I have already tried cluster sizes of 32 and 64MB, but that has not helped me at all. It will stay as it is, though, since it's done.
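For what it's worth, read-ahead behavior can also be hinted per file from user space with posix_fadvise(2), in addition to the device-wide setting. A minimal sketch, assuming Linux; the file name and 1 MiB size are placeholders, not the real data set:

```python
import os

# Hypothetical example file, created here only so the sketch is runnable;
# in real use this would be one of the large files on the HDD.
path = "bigfile.bin"
with open(path, "wb") as f:
    f.write(b"\0" * (1024 * 1024))  # 1 MiB placeholder

fd = os.open(path, os.O_RDONLY)
try:
    # Hint that this descriptor will be read sequentially from start to end
    # (offset=0, length=0 means the whole file), so the kernel can grow
    # read-ahead for it.
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
    total = 0
    while True:
        chunk = os.read(fd, 64 * 1024)
        if not chunk:
            break
        total += len(chunk)
finally:
    os.close(fd)
os.remove(path)
print(total)  # prints 1048576
```

The hint only affects the one descriptor, so it is easy to A/B against the unhinted case on the same file.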
Yes, if you look at how ext4 allocates blocks, you will be surprised. Otherwise it is optimized for a multi-user, multi-tasking environment (parallel access to lots of small files), not for single-file access. https://www.kernel.org/doc/html/late...xt4/index.html
I wouldn't bet on that; the disk allocation strategy may be weird. But you can check this with 'hdparm --fibmap'.
It's quite fine: I monitor disk reads, and the HDD is 16TB with only ~3.5TB stored, and it gets maximum read speed for that part. It does range from 210 to 250MB/s at 7TB used, but the speed at offsets of 8TB+ is under 200MB/s on the third HDD.
one file:
filesystem blocksize 4096, begins at LBA 2048; assuming 512 byte sectors.
 byte_offset   begin_LBA     end_LBA  sectors
           0  4310435840  4310697983   262144
   134217728  4310697984  4318824447  8126464
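A quick sanity check on output like that: consecutive extents are contiguous on disk exactly when each extent's begin_LBA is the previous extent's end_LBA + 1. A small sketch using the numbers from the listing above:

```python
# Extents copied from the hdparm --fibmap output above:
# (byte_offset, begin_LBA, end_LBA, sectors)
extents = [
    (0,         4310435840, 4310697983,  262144),
    (134217728, 4310697984, 4318824447, 8126464),
]

def is_contiguous(extents):
    """True if every extent starts right after the previous one ends."""
    return all(prev[2] + 1 == cur[1]
               for prev, cur in zip(extents, extents[1:]))

print(is_contiguous(extents))  # True: the two extents form one run on disk

# Total file size, assuming 512-byte sectors as the header line says.
total_bytes = sum(e[3] for e in extents) * 512
print(total_bytes)  # 4294967296, i.e. exactly 4 GiB
```

So although hdparm reports two extents, they are back to back, which is the best case for a long sequential read.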
At first I had the data saved over the network in multiple tasks, and then it was a mess (reads would take over 50% longer); files were at the end of the disk even though less than 50% of the space was used. But cp or mv from disk1 to disk2 helped, at least once I formatted the 2nd HDD to be sure.
I just don't understand: you have tried a lot of things, but why don't you use an SSD? You could get much better results.
Great point.
The thing is that you cannot optimize a file system for parallel operation, because that is not how the hardware works. You can tune for best performance OF THE SYSTEM for your use case, but the real optimization is to make the hardware fit the intended function. To optimize for parallel I/O you need a storage controller and channels that can operate in parallel, or multiple controllers: and in any case accessing multiple storage devices that can be read independently and in parallel.
If you are not willing to modify the hardware, you are only tuning the software and system for the optimal performance THAT HARDWARE can achieve under that kind of use given your restrictions. (Which is still not a bad thing to do and may be sufficient for your needs. One can hope.)
Getting access that APPEARS parallel is achieved by loading as much as possible from the slow (nonparallel rotating rust) storage into faster and more parallel (RAM) storage. That is not achieved by changes directly to the file system, although a file system with good performance certainly helps. Changes to how you load from storage into RAM cache and buffers make a bigger difference, and of course there must be more than adequate RAM to hold all of the data you need to access in parallel.
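One concrete way to do that loading from user space is to ask the kernel to pull each file into the page cache ahead of time with POSIX_FADV_WILLNEED, before the parallel readers start. A sketch, assuming Linux; the temp files here stand in for the real data set:

```python
import os
import tempfile

def prefetch(paths):
    """Ask the kernel to start reading these files into the page cache."""
    for path in paths:
        fd = os.open(path, os.O_RDONLY)
        try:
            # offset=0, length=0 means "the whole file"; the kernel queues
            # the read-in asynchronously, so this loop returns quickly.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
        finally:
            os.close(fd)

# Demo on throwaway temp files standing in for the real large files.
paths = []
for _ in range(3):
    f = tempfile.NamedTemporaryFile(delete=False)
    f.write(b"x" * 4096)
    f.close()
    paths.append(f.name)

prefetch(paths)  # later reads of these files should hit RAM, not the disk
```

Note the hint is advisory: with inadequate RAM the kernel will simply evict the pages again, which is why the memory sizing below matters.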
Were I building a system for this kind of operation, it would involve a stack of 7 to 12 SSD devices on multiple channels in a RAID-5 array, because that would give you the fastest parallel performance. I would have to sit down and do the math on the memory, but twice what it would take to hold all of the buffered data would be a base, then operational memory atop that, and about 20% spare to reduce swapping. A smart engineer would then double that. (Compared to the production impact, even expensive RAM is cheap!)
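That sizing rule of thumb can be written out as a quick calculation; the input figures below are made-up placeholders for illustration, not recommendations:

```python
def ram_sizing_gib(buffered_data_gib, operational_gib):
    """Rule of thumb from above: base = 2x the buffered data,
    plus operational memory, plus 20% spare to reduce swapping,
    then double the whole thing."""
    base = 2 * buffered_data_gib
    with_spare = (base + operational_gib) * 1.20
    return 2 * with_spare

# Hypothetical workload: 32 GiB of buffered data, 8 GiB operational memory.
# (2*32 + 8) * 1.2 * 2 = roughly 172.8 GiB
print(ram_sizing_gib(32, 8))
```

The point of the worked number is just that the multipliers compound quickly, so the RAM budget ends up several times the raw data size.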