Issue reading millions of small files from one directory
I have a requirement to read millions of small image files from one directory. Which filesystem / protocol should be used to improve read performance? XFS / ext4 is not working for millions of files and slows down. Is there a distributed filesystem, or any other filesystem, that will give the best read performance for a large number of small files? I am using Ubuntu 22.
Much more info needed:
What have you done so far?
What do you need to do with these files? Convert, delete, display, or something else?
How small is 'small'?
GUI or CLI?
We can't help you until you help us understand what you want to do.
To add to what others asked, you need to define "read" in this context. "Read" how? Do you mean you need to copy them? Back them up? Or do you mean they need to be 'read' and displayed on a web page? What is your actual goal?
Yes, it would be nice to know how you use those files; otherwise it is hard to speed up the process. It would also be nice to find the real bottleneck, but without details it is hard to say anything. And what do you mean by "slow"?
There is no silver bullet, all filesystems slow down when there are too many files in a single directory, that's why programs which store lots of small files typically create a balanced tree of subdirectories like e.g. squid does. Distributed filesystem will obviously do the same, only in a more roundabout way. That said, however, there are some filesystem tweaks which make filesystems perform better when huge directories are needed: enabling dir_index on ext4/3 or increasing directory block size on xfs (don't remember if it can be done on extN) will definitely help.
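For reference, the tweaks lvm_ mentions look roughly like this; the device names are placeholders for your own, the ext filesystem must be unmounted before running e2fsck, the XFS directory block size can only be chosen at mkfs time, and on most recent ext4 filesystems dir_index is already enabled by default.
Code:
# ext4/ext3: enable hashed directory indexes, then rebuild existing directories
tune2fs -O dir_index /dev/sdX1
e2fsck -fD /dev/sdX1            # run only on an unmounted filesystem

# XFS: larger directory ("naming") block size, chosen when the filesystem is made
mkfs.xfs -n size=65536 /dev/sdY1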
We might be able to provide better-targeted advice with better information, but the general points that have been made are valid. I just want to add that the underlying hardware plays a part.
I find EXT4 slows down later, and less, if it is on RAID-5 SSD or high-quality enterprise SAN storage. If you have a single disk your options are more limited. Under load, using an in-house database with a HUGE number of files of mixed sizes, EXT4 beat XFS and BTRFS on RAID-5 a few years ago, but significant improvements to all filesystems, and BTRFS in particular, have been made since my testing.
BTRFS starts out a bit slower than single-device EXT4 or XFS, but it does not degrade in the same way and might perform better in some cases.
For a variety of reasons, I would suggest that you re-structure this process to subdivide the "millions" of files. For example, the first three (say ...) characters of the filename could become a directory-name in which the file is stored.
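A minimal sketch of that idea, assuming the files sit flat in /data/images and can be moved in place; the path and the two-character prefix are only examples, and filenames containing newlines would need extra care.
Code:
#!/bin/bash
# Shard a flat directory into subdirectories named after the first two
# characters of each filename. find streams the names, so the whole
# multi-million-entry listing is never held in memory at once.
src=/data/images
find "$src" -maxdepth 1 -type f | while IFS= read -r path; do
    name=${path##*/}        # strip the leading directory
    shard=${name:0:2}       # first two characters of the filename
    mkdir -p "$src/$shard"
    mv -- "$path" "$src/$shard/"
done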
Also, to answer the questions:
I have to read small image files of 10 KB to 64 KB (mixed sizes) from the CLI in sequential mode; I do it for image processing. I have configured XFS, stored the data there, and read from it, but it is somehow slow. Also, when I mount it on multiple servers, sync issues and corruption occur.
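For illustration, a sequential CLI pass over such a directory might look something like this, where process_image is only a hypothetical stand-in for the real processing command and the path is an example; streaming names with find and batching them with xargs avoids building and sorting the whole listing the way a plain ls would.
Code:
find /data/images -maxdepth 1 -type f -print0 |
    xargs -0 -n 64 process_image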
Okay, you just introduced a new and IMPORTANT variable. When you say "mount it on multiple servers", I presume you might mean mounting a remote storage volume over the network, which then involves multiple I/O queues, a network path and protocol, and storage drivers of different types on BOTH ends of the connection.
Is it possible to just limit this to troubleshooting at the host that has the native storage directly attached without involving network traffic in any way?
Quote:
Originally Posted by lvm_
There is no silver bullet, all filesystems slow down when there are too many files in a single directory, that's why programs which store lots of small files typically create a balanced tree of subdirectories like e.g. squid does. Distributed filesystem will obviously do the same, only in a more roundabout way. That said, however, there are some filesystem tweaks which make filesystems perform better when huge directories are needed: enabling dir_index on ext4/3 or increasing directory block size on xfs (don't remember if it can be done on extN) will definitely help.
This is an age-old problem. A gazillion files in a single directory take a long time to wade through. I recall one OS in particular that would slow down dramatically if the directory was too large to be cached, forcing several physical I/O operations just to make the initial access to each file (which could be fragmented, so back to square one to find the next chunk). Even on Unix, having a directory with 50K+ files would make the nightly backups at one site slow down to a crawl when the backup software hit that directory.
Quote:
Originally Posted by purveshp
I have a requirement to read millions of small image files from one directory. Which filesystem / protocol should be used to improve read performance? XFS / ext4 is not working for millions of files and slows down. Is there a distributed filesystem, or any other filesystem, that will give the best read performance for a large number of small files? I am using Ubuntu 22.
Purvesh
"Millions" of files of any size in a single directory is going to be a challenge for any filesystem. But, in my experience, small files are the worst situation---processes are spending more time going back to the directory to retrieve more location information than they are reading from the little files. (I had a situation like this develop some years ago on a production database server and even doing something as simple as issuing "ls -l" in the directory chock full of small files would noticeably slow the system down.)
Questions:
* How big is the directory holding all those millions of files? (Just curious what "ls -dl parent-dir-of-all-those-files" returns; a quick way to check is sketched after these questions.)
* Do you have any control over the process that is chucking all these files into that subdirectory? If so...
* Would it be possible to re-engineer whatever process is creating all those files to place them in, and retrieve them from, multiple subdirectories?
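Something along these lines shows both the size of the directory itself and how many entries it holds, without the sorting and per-file stat() calls that make a plain "ls -l" crawl; the path is a placeholder.
Code:
ls -dl /data/images                               # size of the directory inode itself
find /data/images -maxdepth 1 -type f | wc -l     # entry count, streamed and unsorted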
"Millions" of files of any size in a single directory is going to be a challenge for any filesystem. But, in my experience, small files are the worst situation---processes are spending more time going back to the directory to retrieve more location information than they are reading from the little files.
The more information we get (in pieces, slowly), the more it strikes me that the system is poorly designed from the start. What the OP really needs is a total re-engineering of the system and its processing, for efficiency and reliability.
Quote:
(I had a situation like this develop some years ago on a production database server and even doing something as simple as issuing "ls -l" in the directory chock full of small files would noticeably slow the system down.)
I did redesign a system at one time. The company experts had done a wonderful proof-of-concept operation and proven the process, and it was beautiful: until it hit production data. At that point there were so many files that a simple 'ls' failed, and took MINUTES to fail! File operations that did not require parsing the filesystem worked fine; anything that had to cache, buffer, search, or list failed because the data overran the buffers. Simply segregating the files into different folders resolved the behavior, and sped up processing wonderfully. (And, now that I think about it, I doubt I ever got credit for that!)
Quote:
(And, now that I think about it, I doubt I ever got credit for that!)
I sympathise - you're not the only one to solve an issue and not be recognised. I'm sure there's a bunch of us.
You're probably right about a re-design and, e.g., multiple dirs - I was thinking along those lines myself, but without a lot more info from the OP, it's tricky.
...