[TriLUG] Files in a directory question

William Sutton william at trilug.org
Thu Feb 19 02:18:55 EST 2009


I'm not going to provide any empirical data (go ahead and laugh, Aaron), 
but I deal with file sets of this magnitude on a daily basis at $WORK. 
From a practical perspective, Winders chokes on anything much more than 
1000 files in a directory, and even when Explorer does finally render, it 
takes a while (think minutes, not seconds).

Our UNIX (Solaris, Linux) systems take almost as long, but are still 
somewhat manageable in the 10k file range.  Beyond that, performance 
degrades horribly.  I've literally spent hours splitting directories with 
200k+ files into subdirectories of 10k files each, using a combination of 
known file naming patterns, shell scripts, perl scripts, and ranges (i.e., 
move the first 10k here, the next 10k there, etc.); a rough sketch of that 
kind of split is below.
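
For illustration only, a minimal sketch of that kind of range-based split 
(the directory argument, script name, and 10k chunk size are placeholders, 
not the actual scripts I used at $WORK):

#!/usr/bin/env python
# split_bigdir.py - sketch: move the files in one huge directory into
# numbered subdirectories of at most CHUNK files each.  Names and the
# CHUNK value are illustrative.
import os
import shutil
import sys

CHUNK = 10000  # files per subdirectory

def split_directory(src):
    # Snapshot and sort the file names first, so subdirectories created
    # below never show up in the listing.
    names = sorted(n for n in os.listdir(src)
                   if os.path.isfile(os.path.join(src, n)))
    for i, name in enumerate(names):
        sub = os.path.join(src, 'part%03d' % (i // CHUNK))
        if not os.path.isdir(sub):
            os.mkdir(sub)
        shutil.move(os.path.join(src, name), os.path.join(sub, name))

if __name__ == '__main__':
    split_directory(sys.argv[1])

Run it as "python split_bigdir.py /path/to/huge/dir" and it fills 
part000, part001, ... in sorted-name order.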

Save yourself the pain by coming up with a good practice.  I'd suggest 
limiting your file-per-directory count to a maximum of 5k (unless you have 
to deal with Winders, in which case, I'd say 1k if you have to look at 
them on any sort of regular basis).

William Sutton

On Wed, 18 Feb 2009, Aaron Joyner wrote:

> Steve really hit the nail on the head when he suggested "I'd suggest
> you test this by writing a perl or ruby program..." (well, except for
> the language choices, but I digress).  I started to type up a message,
> holding forth with a description of inodes, what system calls are
> required to get what data for what types of operations, and
> theoretical descriptions of why certain types of operations will get
> fast, and certain types will get slow.  I even wrote a big long epic
> post about all the details, and then deleted it.  If I'm right, it
> won't do half as much as this message will.  If you really want to get
> a feel for this, play with the attached utilities, and see for
> yourself how different operations scale as the number of files in a
> directory increases.  If for some reason mailman eats the attachments,
> see http://www.joyner.ws/code/
>
> I'll include a description of these utilities.  There are two scripts:
> makefiles and benchmark.  makefiles creates a specified number of
> files (-n), named with numerically increasing filenames, starting at
> '1', filled with random data.  You can optionally specify the size of
> those files (-s, in bytes), or the directory it should create them in
> (pwd, by default).  benchmark runs an arbitrary command a specified
> number of times (-r, 100 by default), and prints out the average time
> it took to run the command.  If you don't specify a command, it does a
> set of various commands geared towards the original question in this
> thread.
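>
> If the attachments don't survive mailman, the heart of benchmark is just
> a timing loop.  A stripped-down sketch of the idea (not the real script;
> get that from the URL above) looks roughly like this:
>
> import os
> import subprocess
> import time
>
> def time_command(cmd, runs=100):
>     # Run cmd `runs` times with its output discarded and return the
>     # mean wall-clock seconds per run (Python overhead included).
>     devnull = open(os.devnull, 'w')
>     start = time.time()
>     for _ in range(runs):
>         subprocess.call(cmd, shell=True, stdout=devnull, stderr=devnull)
>     elapsed = time.time() - start
>     devnull.close()
>     return elapsed / runs
>
> for cmd in ('/bin/ls 32', '/bin/ls', '/bin/ls -l'):
>     print('Executing "%s" 100 times, mean %s seconds per run'
>           % (cmd, time_command(cmd)))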
>
> It's easiest to showcase this by example:
> $ makefiles -n 100; benchmark
> Created 100 files of 1024 bytes in 0.0151019096375 seconds.
> Executing "/bin/ls 32" 100 times, mean 0.00438050985336 seconds per run
> Executing "/bin/ls" 100 times, mean 0.00542480945587 seconds per run
> Executing "/bin/ls -l" 100 times, mean 0.00832781076431 seconds per run
> Executing "/bin/ls --color=always 32" 100 times, mean 0.00449644088745
> seconds per run
> Executing "/bin/ls --color=always" 100 times, mean 0.00568922996521
> seconds per run
> Executing "/bin/ls --color=always -l" 100 times, mean 0.00846176147461
> seconds per run
> Executing "/usr/bin/md5sum 32" 100 times, mean 0.00361362934113 seconds per run
>
> There's also a -s switch, for the shorter output, which looks like this:
> $ makefiles -n 100; benchmark -s
> Created 100 files of 1024 bytes in 0.01358294487 seconds.
>  0.004: /bin/ls 97
>  0.005: /bin/ls
>  0.008: /bin/ls -l
>  0.004: /bin/ls --color=always 97
>  0.006: /bin/ls --color=always
>  0.009: /bin/ls --color=always -l
>  0.004: /usr/bin/md5sum 97
>
> And a really abbreviated version suitable for gathering data to craft
> chart api URLs:
> $ makefiles -n 100; benchmark -m
> Created 100 files of 1024 bytes in 0.0098090171814 seconds.
>  0.005    0.013    0.036    0.004    0.014    0.017    0.004
>
>
> So, go forth, and test your own hypothesis about what's fast, and
> what's slow.  Use chart api to graph the results of your tests:
> http://chart.apis.google.com/chart?cht=lc&chs=300x125&chxt=x,x,y&chxl=0:|1|10|100|1000|10000|100000|1:|||number%20of%20files|||2:|0|0.04133414&chd=t:0.00504967,0.00506450,0.00529583,0.00792509,0.04133414&chds=0,0.04133414
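>
> A rough sketch of gluing the -m numbers into one of those URLs (the
> values are just the example run above; the axis labels from the full
> URL are left out for brevity):
>
> # Turn one row of "benchmark -m" output into a Google Chart API
> # line-chart URL.
> times = [0.005, 0.013, 0.036, 0.004, 0.014, 0.017, 0.004]
> url = ('http://chart.apis.google.com/chart?cht=lc&chs=300x125'
>        '&chd=t:' + ','.join('%.5f' % t for t in times)
>        + '&chds=0,%.5f' % max(times))
> print(url)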
>
> Then, post back some interesting data and analysis.  To get you
> started, here are some fun questions.  Anything not backed up with proof
> in the form of reproducible data will be summarily laughed off the
> thread.
>
> Will Steve Kuekes need to shard into directories?  Will he be
> satisfied with the performance of his app with 30k files in a single
> directory?  Do you need any additional information about his app in
> order to give an informed answer, and if so, what?  If he couldn't
> split up the data, and had to keep 10 million files in one directory,
> what techniques could he use to keep his app fast?  What must he avoid
> in that case?
>
> The only benefit is that you get smarter.  That is not to be underestimated.
> Aaron S. Joyner
>
> PS - In case anyone wants to get particular, yes, there's some *very*
> minimal Python overhead in the measurements made by 'benchmark'.  For
> our purposes, we're interested in how they compare, and since all of
> the measurements have the same Python overhead in the numbers, they
> still make for good relative comparisons.  For reference, the time
> results from the benchmark are very comparable with the output of the
> 'time' command for the same command.
> PPS - If you get any wise ideas about creating <dr evil> "1 million
> files" </dr evil> in a directory... be prepared for fun problems like:
> bash: /bin/rm: Argument list too long
> Resolving them is left as an exercise for the reader who was warned and
> didn't listen.
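>
> (A hint for that exercise: the limit is on the argument list the shell
> builds, so one workaround is to never build it at all -- for example:)
>
> import os
>
> # Unlink each regular file in the current directory one at a time,
> # so no million-name argument list is ever handed to /bin/rm.
> for name in os.listdir('.'):
>     if os.path.isfile(name):
>         os.unlink(name)
>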
> PPPS - See Jon, Python is good.
>
>
> On Sat, Feb 14, 2009 at 10:04 AM, Steve Litt <slitt at troubleshooters.com> wrote:
>> On Friday 13 February 2009 02:23:43 pm Steve Kuekes wrote:
>>> Fellow triluggers,
>>>
>>> Here's a question, how many files should I put in one directory on an
>>> ext3 file system?  I've got an app that needs a bunch of jpg image
>>> files.  Right now its ~8,000-10,000 but later it could be 2 or 3 times
>>> that many.  So for performance reasons, should I split these files up
>>> into multiple sub-directories.  I can split them up according to the
>>> names and put just 100 or 1000 in each directory so all the files that
>>> start with a,b,c,d are in one folder called abcd, etc.  or I can just
>>> put them all in one folder.
>>>
>>> I know that theoretically it's probably a very large file number limit
>>> before it breaks, but for practical performance reasons in accessing
>>> files in the folder how many is too many?
>>
>> This is just one man's opinion.
>>
>> Unless you access this directory very, very rarely, you've already passed the
>> point where you should make it into a directory tree rather than a single
>> directory. If I remember correctly, 10,000 files yields several
>> seconds of latency.
>>
>> I'd suggest you test this by writing a perl or ruby program that creates
>> 10,000 different-named small files. Then cat one of the files, and see how
>> long it takes to cat the thing. To give you an idea of how fast a file can
>> load when unencumbered by massive directories, my UMENU program, which has a
>> separate file for each submenu, displays before your keypressing finger has
>> left the key, even on files that aren't cached. On a human scale, it's
>> instant.
>>
>> HTH
>>
>> SteveT
>>
>> Steve Litt
>> Recession Relief Package
>> http://www.recession-relief.US
>>
>> --
>> TriLUG mailing list        : http://www.trilug.org/mailman/listinfo/trilug
>> TriLUG FAQ  : http://www.trilug.org/wiki/Frequently_Asked_Questions
>>
>


