File system optimized for very large file parallel HDD read
I have 7 folders of 4 GB files, 3.5 TB in total. They need to be read as fast as possible by 7 nodes at the same time; it's a Spacemesh project. (The HDD is used only for this and nothing else. The files were copied onto it, so there is no fragmentation, and they sit at the start of the disk, where it is fastest.)
My idea is to set up a filesystem or mount option that forces the kernel to read chunks as large as possible before seeking the HDD to the next position.
So far I have tested:
ext4 with cluster sizes of 64 MB and 128 MB; I got similar results with both, so I am not sure it helped.
The total read time was about 9 h 30 min.
That means the HDD still seeks a lot, even though the data is at the start of the disk; reading the folders one by one would finish in ~3 h.
I am not sure how to tweak XFS; any advice is welcome. Anything a bit faster would be fine, as the whole read needs to complete in under 12 h.
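For reference, the ext4 bigalloc setup tested above can be reproduced roughly like this (the device name and mount point are placeholders; `-C` takes the cluster size in bytes):

```shell
# Hypothetical device; 67108864 bytes = 64 MiB clusters.
# bigalloc reduces extent fragmentation, but it does not by itself
# change how large the kernel's individual read requests are.
mkfs.ext4 -O bigalloc -C 67108864 /dev/sdX1
mount /dev/sdX1 /mnt/plots
```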
I don't think it can be done. If you allow concurrent access to these files, the head starts to dance. You can't really allocate contiguous space for each file on ext4 (and I don't think any filesystem guarantees that). By the way, do these files change?
In theory a whole track should be read at once, but since only logical (not physical) track/head/sector information is available, you can't really do it (you would have to implement a low-level disk driver for this).
If you really want to speed it up, just use multiple disks or an SSD.
It depends on your files, but you can probably compress them; then there is less to read, and you can decompress on the fly. Or use a filesystem with transparent compression, such as btrfs.
With XFS you might want to use direct I/O (no caching at all), but that won't help much on a slow disk.
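As a sketch of what direct I/O looks like from the command line (the file path is a placeholder), `dd` can bypass the page cache with `iflag=direct`:

```shell
# Stream one plot file with O_DIRECT in large blocks; nothing is cached,
# so every byte comes straight off the disk on each run.
dd if=/mnt/plots/node1/plot.bin of=/dev/null bs=64M iflag=direct
```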
The disk arm moves from where it is currently sitting to the next location accessed; it does not go back to the beginning of the disk between movements. Therefore the fastest place on the disk is the middle of the data. You optimize arm movement by placing the busiest file in the middle of the disk, with the least-accessed files at the beginning and the end.
If you are using an SSD, there is no disk arm and file placement is irrelevant. You should use SSDs; then the speed is limited by the transfer rate between the SSD and memory, and multiple SSDs with multiple transfer channels will be faster than a single channel.
ext4 doesn't stack files one after another from the beginning of the disk, so your data is neither contiguous nor necessarily arranged with the busiest file in the center. You can come somewhat closer to optimal placement by creating a series of small partitions and sorting the files into them by how busy they are.
For a spinning HDD the fastest read is at the start of the disk (the outermost tracks). I do not need seek speed; I want to avoid seeking, since all the data is read sequentially. The only issue is the 7 nodes each reading their group at the same time.
Yeah, I failed to make ext4 read a huge part in one go; the cluster size had no effect.
But
blockdev --setra 65536000 /dev/sdc
helped a lot, as it forced the kernel to read ahead at least 64 MB, so my total read time dropped from 9 h to ~4 h. That is great, since even reading the folders one by one I cannot get it under 3.5 h.
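The change above, with a check of the value before and after, looks like this (needs root and a real block device, so illustrative only; `/dev/sdc` as in the thread):

```shell
# Show the current read-ahead setting, raise it, then confirm it stuck.
blockdev --getra /dev/sdc
blockdev --setra 65536000 /dev/sdc
blockdev --getra /dev/sdc
```

Note the setting does not survive a reboot, so it has to be reapplied (or put in a boot script) before each run.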
Multiple disks in a RAID-6 array would be the fastest option I am aware of. An SSD might be an order of magnitude faster than spinning rust. Without changing the storage system, add memory: you want the largest I/O buffer you can get. If you have the option, tune the scheduler for the highest read-ahead volume possible. You want to load all the tracks into memory at the earliest opportunity, so that after those first reads everything that follows runs at RAM speed.
You can't do parallel I/O on a single disk; requests get queued and re-ordered by the scheduler, as per post #2.
You don't mention how you are testing, or how relevant that testing is to the task at hand. I haven't looked at Spacemesh, but if you're allowing multiple network clients to access (particularly update) a non-network-aware filesystem, you're likely looking at a broken filesystem in short order.
Given its ambit, hopefully Spacemesh handles all the I/O itself; that would mean direct I/O, but I didn't see any documentation on a quick search.
I'd be inclined to move each client's folder to a separate filesystem on a separate partition. That certainly doesn't solve all the issues a single disk presents, but it might ameliorate them somewhat.
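A rough sketch of that layout (untested; disk name, partition count, and the 512 GiB sizing are placeholders chosen to hold 7 folders of ~500 GB each):

```shell
# One partition per node folder on /dev/sdc, each with its own ext4.
# Start 1 GiB in to leave room for the GPT and keep partitions aligned.
parted -s /dev/sdc mklabel gpt
for i in 1 2 3 4 5 6 7; do
    start=$(( (i - 1) * 512 + 1 ))
    end=$(( i * 512 + 1 ))
    parted -s /dev/sdc mkpart "node$i" ext4 "${start}GiB" "${end}GiB"
    mkfs.ext4 "/dev/sdc$i"
done
```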
I know it doesn't solve everything, but I have a specific use case, and blockdev --setra 65536000 does work magic: reading 3.5 TB with 7 parallel tasks takes me some 4.5 h, which is almost the full speed this disk can do. It is essentially sequential reading, just with 7 nodes reading 7 different parts of the HDD; the point was to keep seeking to a minimum, and since I only need read speed, it worked fine.
What happens in reality is that the 7 tasks issue 7 jobs, and the kernel now reads ~64 MB before moving to the next job. As the data is laid out sequentially on disk, I lose only a little to seeking. I might gain more with 128 MB chunks, but this is fine, as all I need is for the 7 folders to be read in under 12 h.
You repeat yourself and do not answer the questions. The solution depends on the available RAM and other things we know nothing about.
How can RAM help with reading 3 TB of data at all (other than the RAM used to read ahead 64 MB, in my case)?
I said blockdev --setra 65536000 works quite fine for me. I do not need caching, as it is only a sequential read from point A to the end, just 7 tasks at once; the kernel seeking in small steps would be the issue.
Look at it like this: I have 7 movies, each 500 GB, and I need to stream them all at the same time. You only need RAM to read ahead, to minimize seeking between movie 1, movie 2, and so on.
It is not the kernel but the disk that is seeking. The kernel and filesystem only handle logical track/sector information, which is mapped to different physical locations, so you cannot avoid seeking. Also, with blockdev you are effectively setting up a kind of caching, which may be useful for the directories (to avoid re-reading them) but is otherwise pointless if you really want to read and transfer all 3.5 TB, and all of it at the same time.
Theoretically RAM size matters: for example, if you want to read all the files at once, the optimal buffer size that can be allocated to the read process(es) is about RAM/8. But you can decide.
Eh... actually the setra argument is in 512-byte sectors, not bytes, so you set up a ~32 GB read-ahead buffer; but if it works for you... I suspect it hits some internal limit at that size.
But I don't understand the reaction of the others. Yes, you still have to read the data, but with a large read-ahead buffer enabled it is read in bigger chunks, so the heads move from one file to another less frequently; hence the performance increase, logical and expected.
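To make the unit conversion concrete (plain shell arithmetic, no device access needed):

```shell
# --setra counts 512-byte sectors, so 65536000 sectors is ~31.25 GiB,
# not 64 MB; a true 64 MiB read-ahead would be 131072 sectors.
BYTES=$(( 65536000 * 512 ))
GIB=$(( BYTES / 1024 / 1024 / 1024 ))        # integer GiB
RA_64MIB=$(( 64 * 1024 * 1024 / 512 ))
echo "$BYTES bytes, ~$GIB GiB; 64 MiB = $RA_64MIB sectors"
```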
Yeah, I presumed so, since the block size is hard-coded and the kernel even refuses to mount if I change its logical size. In a way I only wanted to cut seeking down to the minimum for my specific need, as I generally have no idea how the kernel/filesystem decides to control the HDD, given that the HDD also has its own controller doing its own work.
The question is how these files were stored. In general, files are not stored in one huge run of contiguous sectors, so even a large read-ahead buffer will not avoid seeking.