Old 11-20-2022, 03:56 AM   #1
purveshp
LQ Newbie
 
Registered: Nov 2022
Posts: 3

Rep: Reputation: 0
Issue reading millions of small files from one directory


I have a requirement to read millions of small image files from one directory. Which filesystem / protocol should I use to improve read performance? XFS / Ext4 are not coping with millions of files and slow down. Is there a distributed filesystem, or any other filesystem, that will give the best read performance for a large number of small files? I am using Ubuntu 22.


Purvesh
 
Old 11-20-2022, 06:52 AM   #2
Keith Hedger
Senior Member
 
Registered: Jun 2010
Location: Wiltshire, UK
Distribution: Void, Linux From Scratch, Slackware64
Posts: 3,157

Rep: Reputation: 857
Much more info needed:
What have you done so far?
What do you need to do with these files? Convert, delete, display, or what?
How small is 'small'?
GUI or CLI?

We can't help you until you help us understand what you want to do.
 
Old 11-20-2022, 07:54 AM   #3
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,359
Blog Entries: 3

Rep: Reputation: 3767
Yes, more details would help.

What block size have you been using?
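For reference, here is a quick way to check (a rough sketch; the device and mount point below are placeholders for your own setup):

Code:
# ext4: print the filesystem block size (device path is an example)
sudo tune2fs -l /dev/sdb1 | grep 'Block size'

# XFS: the bsize fields show the data and directory block sizes
xfs_info /mnt/images

# filesystem-agnostic view of a mounted path
stat -f /mnt/images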
 
Old 11-20-2022, 08:48 AM   #4
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,754

Rep: Reputation: 7983
Quote:
Originally Posted by purveshp View Post
I have a requirement to read millions of small image files from one directory. Which filesystem / protocol should I use to improve read performance? XFS / Ext4 are not coping with millions of files and slow down. Is there a distributed filesystem, or any other filesystem, that will give the best read performance for a large number of small files? I am using Ubuntu 22.
To add to what others asked, you need to define "read" in this context. "Read" how? Do you mean you need to copy them? Back them up? Or do you mean they need to be 'read' and displayed on a web page? What is your actual goal?
 
Old 11-20-2022, 09:49 AM   #5
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,041

Rep: Reputation: 7348
Yes, it would be nice to know how you use those files; otherwise it is hard to speed up that process. It would also be nice to find the real bottleneck, but without details it is hard to say anything. Also, what do you mean by "slow"?
 
Old 11-20-2022, 10:02 AM   #6
lvm_
Member
 
Registered: Jul 2020
Posts: 984

Rep: Reputation: 348
There is no silver bullet: all filesystems slow down when there are too many files in a single directory, which is why programs that store lots of small files typically create a balanced tree of subdirectories, as squid does, for example. A distributed filesystem will obviously do the same, only in a more roundabout way. That said, there are some tweaks that make filesystems perform better when huge directories are needed: enabling dir_index on ext3/ext4, or increasing the directory block size on XFS (I don't remember if that can be done on extN), will definitely help.
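For example (a rough sketch; /dev/sdb1 is a placeholder, the filesystem must be unmounted for the ext commands, and the XFS directory block size can only be set when the filesystem is created):

Code:
# ext3/ext4: turn on hashed directory indexes, then rebuild them
# (dir_index is already the default on most modern ext4 systems)
sudo tune2fs -O dir_index /dev/sdb1
sudo e2fsck -fD /dev/sdb1

# XFS: a larger directory block size is a mkfs-time option only,
# so this recreates the filesystem and DESTROYS its contents
sudo mkfs.xfs -f -n size=8192 /dev/sdb1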
 
Old 11-20-2022, 10:28 AM   #7
wpeckham
LQ Guru
 
Registered: Apr 2010
Location: Continental USA
Distribution: Debian, Ubuntu, RedHat, DSL, Puppy, CentOS, Knoppix, Mint-DE, Sparky, VSIDO, tinycore, Q4OS, Manjaro
Posts: 5,767

Rep: Reputation: 2765
We might be able to provide better-targeted advice with better information; the general points that have been made are valid. I just want to add that the underlying hardware plays a part.

I find EXT4 slows down later, and less, if it is on RAID-5 SSD or high-quality enterprise SAN storage. If you have a single disk, your options are more limited. Under load, using an in-house database with a HUGE number of files of mixed sizes, EXT4 beat XFS and BTRFS on RAID-5 a few years ago, but significant improvements to all filesystems, and BTRFS in particular, have been made since my testing.

BTRFS starts out a bit slower than single-device EXT4 or XFS, but it does not degrade in the same way and might perform better in some cases.
 
Old 11-22-2022, 03:37 PM   #8
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,691
Blog Entries: 4

Rep: Reputation: 3947
For a variety of reasons, I would suggest that you re-structure this process to subdivide the "millions" of files. For example, the first three (say) characters of the filename could become the name of the directory in which the file is stored.
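A minimal shell sketch of that idea (the paths are placeholders, it assumes filenames are at least three characters long with no embedded newlines, and it should be rehearsed on a copy of the data first):

Code:
#!/bin/sh
# Fan a flat directory out into subdirectories named after the
# first three characters of each filename:
#   /data/flat/abc123.jpg  ->  /data/tree/abc/abc123.jpg
SRC=/data/flat
DST=/data/tree

find "$SRC" -maxdepth 1 -type f -printf '%f\n' |
while IFS= read -r name; do
    prefix=$(printf '%s' "$name" | cut -c1-3)
    mkdir -p "$DST/$prefix"
    mv "$SRC/$name" "$DST/$prefix/$name"
done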
 
Old 11-23-2022, 08:10 PM   #9
purveshp
LQ Newbie
 
Registered: Nov 2022
Posts: 3

Original Poster
Rep: Reputation: 0
Hi,

Thank you for the pointers.

To answer the questions: I have to read small image files, 10 KB to 64 KB in size (a mixed quantity), sequentially from the CLI, for image processing. I have configured XFS, stored the data on it, and read from it, but it is somehow slow. Also, when I mount it on multiple servers, sync issues and corruption do occur.
 
Old 11-23-2022, 09:45 PM   #10
wpeckham
LQ Guru
 
Registered: Apr 2010
Location: Continental USA
Distribution: Debian, Ubuntu, RedHat, DSL, Puppy, CentOS, Knoppix, Mint-DE, Sparky, VSIDO, tinycore, Q4OS, Manjaro
Posts: 5,767

Rep: Reputation: 2765
Quote:
Originally Posted by purveshp View Post
Hi,

Thank you for the pointers.

To answer the questions: I have to read small image files, 10 KB to 64 KB in size (a mixed quantity), sequentially from the CLI, for image processing. I have configured XFS, stored the data on it, and read from it, but it is somehow slow. Also, when I mount it on multiple servers, sync issues and corruption do occur.
Okay, you just introduced a new and IMPORTANT variable. When you say "mount it on multiple servers", I presume you might mean mounting a remote storage volume over the network, which then involves multiple I/O queues, a network path and protocol, and storage drivers of different types on BOTH ends of the connection.

Is it possible to limit this to troubleshooting on the host that has the native storage directly attached, without involving network traffic in any way?
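If so, a rough way to get a local baseline (the path is a placeholder, and dropping the cache needs root):

Code:
# Drop the page cache so the timing measures the disk, not RAM
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches

# Time a sequential read of every file in the directory
time find /data/images -maxdepth 1 -type f -exec cat {} + > /dev/null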
 
Old 11-23-2022, 10:19 PM   #11
!!!
Member
 
Registered: Jan 2017
Location: Fremont, CA, USA
Distribution: Trying any&ALL on old/minimal
Posts: 997

Rep: Reputation: 382
https://www.reddit.com/r/sysadmin/co..._to_glusterfs/
 
Old 11-23-2022, 10:37 PM   #12
rnturn
Senior Member
 
Registered: Jan 2003
Location: Illinois (SW Chicago 'burbs)
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,818

Rep: Reputation: 550
Quote:
Originally Posted by lvm_ View Post
There is no silver bullet: all filesystems slow down when there are too many files in a single directory, which is why programs that store lots of small files typically create a balanced tree of subdirectories, as squid does, for example. A distributed filesystem will obviously do the same, only in a more roundabout way. That said, there are some tweaks that make filesystems perform better when huge directories are needed: enabling dir_index on ext3/ext4, or increasing the directory block size on XFS (I don't remember if that can be done on extN), will definitely help.
This is an age-old problem. A gazillion files in a single directory take a long time to wade through. I recall one OS in particular that would slow down dramatically if the directory was too large to be cached, forcing several physical I/O operations just for the initial access to each file (which could be fragmented, so back to square one to find the next chunk). Even on Unix, having a directory with 50K+ files would make the nightly backups at one site slow to a crawl when the backup software hit that directory.
 
Old 11-23-2022, 11:09 PM   #13
rnturn
Senior Member
 
Registered: Jan 2003
Location: Illinois (SW Chicago 'burbs)
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,818

Rep: Reputation: 550
Quote:
Originally Posted by purveshp View Post
I have a requirement to read millions of small image files from one directory. Which filesystem / protocol should I use to improve read performance? XFS / Ext4 are not coping with millions of files and slow down. Is there a distributed filesystem, or any other filesystem, that will give the best read performance for a large number of small files? I am using Ubuntu 22.


Purvesh
"Millions" of files of any size in a single directory is going to be a challenge for any filesystem. But, in my experience, small files are the worst situation---processes are spending more time going back to the directory to retrieve more location information than they are reading from the little files. (I had a situation like this develop some years ago on a production database server and even doing something as simple as issuing "ls -l" in the directory chock full of small files would noticeably slow the system down.)

Questions:

* How big is the directory holding all those millions of files? (Just curious what "ls -dl parent-dir-of-all-those-files" returns; see the sketch after these questions.)

* Do you have any control over the process that is chucking all these files into that subdirectory? If so...

* Would it be possible to re-engineer whatever process is creating all those files so that it places them in, and retrieves them from, multiple subdirectories?
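On the first question, a couple of cheap ways to size up the directory without making things worse (the path is a placeholder; plain "ls" sorts its output, and "ls -l" also stats every entry, which is exactly what hurts at this scale):

Code:
# The directory inode's own size is a rough proxy for entry count
ls -dl /path/to/huge-dir

# -f skips sorting and stat calls, so this just streams the entries
ls -f /path/to/huge-dir | wc -l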

Good luck...
 
Old 11-24-2022, 10:35 AM   #14
wpeckham
LQ Guru
 
Registered: Apr 2010
Location: Continental USA
Distribution: Debian, Ubuntu, RedHat, DSL, Puppy, CentOS, Knoppix, Mint-DE, Sparky, VSIDO, tinycore, Q4OS, Manjaro
Posts: 5,767

Rep: Reputation: 2765
Quote:
Originally Posted by rnturn View Post
"Millions" of files of any size in a single directory is going to be a challenge for any filesystem. But, in my experience, small files are the worst situation---processes are spending more time going back to the directory to retrieve more location information than they are reading from the little files.
The more information we get (in pieces, slowly), the more it strikes me that the system was poorly designed from the start. What the OP really needs is a total re-engineering of the system and its processing, for efficiency and reliability.
Quote:
(I had a situation like this develop some years ago on a production database server, and even doing something as simple as issuing "ls -l" in the directory chock-full of small files would noticeably slow the system down.)
I did redesign a system like that at one time. The company experts had done a wonderful proof-of-concept operation and proven the process, and it was beautiful: until it hit production data. At that point there were so many files that a simple 'ls' failed, and took MINUTES to fail! File operations that did not require parsing the filesystem worked fine; anything that had to cache, buffer, search, or list failed because the data overran the buffers. Simply segregating the files into different folders resolved the behavior and sped up processing wonderfully. (And, now that I think about it, I doubt I ever got credit for that!)
 
Old 11-24-2022, 06:08 PM   #15
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,369

Rep: Reputation: 2753
Quote:
(And, now that I think about it, I doubt I ever got credit for that!)
I sympathise - you're not the only one to solve an issue and not be recognised. I'm sure there's a bunch of us.
You're probably right about a re-design, e.g. multiple dirs - I was thinking along those lines myself, but without a lot more info from the OP it's tricky.
...
 