[TriLUG] Files in a directory question

Aaron Joyner aaron at joyner.ws
Wed Feb 18 22:33:52 EST 2009


Steve really hit the nail on the head when he suggested "I'd suggest
you test this by writing a perl or ruby program..." (well, except for
the language choices, but I digress).  I started to type up a message,
holding forth with a description of inodes, what system calls are
required to get what data for what types of operations, and
theoretical descriptions of why certain types of operations will get
fast, and certain types will get slow.  I even wrote a big long epic
post about all the details, and then deleted it.  If I'm right, it
won't do half as much as this message will.  If you really want to get
a feel for this, play with the attached utilities, and see for
yourself how different operations scale as the number of files in a
directory increases.  If for some reason mailman eats the attachments,
see http://www.joyner.ws/code/

I'll include a description of these utilities.  There are two scripts,
makefiles and benchmark.  makefiles creates a specified number of
files (-n), named with numerically increasing filenames, starting at
'1', filled with random data.  You can optionally specify the size of
those files (-s, in bytes), or the directory it should create them in
(pwd, by default).  benchmark runs an arbitrary command a specified
number of times (-r, 100 by default), and prints out the average time
it took to run the command.  If you don't specify a command, it runs a
set of commands geared towards the original question in this thread.
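
In case mailman does eat the attachments and you can't reach the copies
at http://www.joyner.ws/code/, here's a rough sketch of the makefiles
idea so you can follow along.  This is NOT the attached script: the -d
flag is my invention here (the real script defaults to pwd), and the
details are simplified.

#!/usr/bin/env python
# Rough sketch of the makefiles idea, not the attached script: create
# -n files of -s random bytes each, named 1, 2, 3, ..., in a directory.
# The -d flag is hypothetical; the real script defaults to pwd.
import os
import time
from optparse import OptionParser

parser = OptionParser()
parser.add_option('-n', type='int', dest='count', default=100,
                  help='number of files to create')
parser.add_option('-s', type='int', dest='size', default=1024,
                  help='size of each file, in bytes')
parser.add_option('-d', dest='directory', default='.',
                  help='directory to create the files in')
opts, args = parser.parse_args()

start = time.time()
for i in range(1, opts.count + 1):
    f = open(os.path.join(opts.directory, str(i)), 'wb')
    f.write(os.urandom(opts.size))
    f.close()
print('Created %d files of %d bytes in %s seconds.'
      % (opts.count, opts.size, time.time() - start))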

It's easiest to showcase this by example:
$ makefiles -n 100; benchmark
Created 100 files of 1024 bytes in 0.0151019096375 seconds.
Executing "/bin/ls 32" 100 times, mean 0.00438050985336 seconds per run
Executing "/bin/ls" 100 times, mean 0.00542480945587 seconds per run
Executing "/bin/ls -l" 100 times, mean 0.00832781076431 seconds per run
Executing "/bin/ls --color=always 32" 100 times, mean 0.00449644088745
seconds per run
Executing "/bin/ls --color=always" 100 times, mean 0.00568922996521
seconds per run
Executing "/bin/ls --color=always -l" 100 times, mean 0.00846176147461
seconds per run
Executing "/usr/bin/md5sum 32" 100 times, mean 0.00361362934113 seconds per run

There's also a -s switch for shorter output, which looks like this:
$ makefiles -n 100; benchmark -s
Created 100 files of 1024 bytes in 0.01358294487 seconds.
  0.004: /bin/ls 97
  0.005: /bin/ls
  0.008: /bin/ls -l
  0.004: /bin/ls --color=always 97
  0.006: /bin/ls --color=always
  0.009: /bin/ls --color=always -l
  0.004: /usr/bin/md5sum 97

And a really abbreviated version (-m), suitable for gathering data to
craft Chart API URLs:
$ makefiles -n 100; benchmark -m
Created 100 files of 1024 bytes in 0.0098090171814 seconds.
  0.005    0.013    0.036    0.004    0.014    0.017    0.004


So, go forth, and test your own hypotheses about what's fast and
what's slow.  Use the Chart API to graph the results of your tests:
http://chart.apis.google.com/chart?cht=lc&chs=300x125&chxt=x,x,y&chxl=0:|1|10|100|1000|10000|100000|1:|||number%20of%20files|||2:|0|0.04133414&chd=t:0.00504967,0.00506450,0.00529583,0.00792509,0.04133414&chds=0,0.04133414
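
If you'd rather not hand-edit those URLs, gluing the -m numbers into a
chart URL is just string formatting.  A sketch, reusing the data and
chart parameters from the example URL above:

# Sketch: build a Chart API line-chart URL from a list of mean runtimes.
# The numbers and axis labels here are copied from the example URL above.
means = [0.00504967, 0.00506450, 0.00529583, 0.00792509, 0.04133414]
labels = ['1', '10', '100', '1000', '10000', '100000']
top = max(means)
url = ('http://chart.apis.google.com/chart?cht=lc&chs=300x125'
       '&chxt=x,x,y'
       '&chxl=0:|' + '|'.join(labels) +
       '|1:|||number%20of%20files|||2:|0|' + str(top) +
       '&chd=t:' + ','.join('%.8f' % m for m in means) +
       '&chds=0,' + str(top))
print(url)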

Then, post back some interesting data and analysis.  To get you
started, here are some fun questions.  Anything not backed up with proof
in the form of reproducible data will be summarily laughed off the
thread.

Will Steve Kuekes need to shard into directories?  Will he be
satisfied with the performance of his app with 30k files in a single
directory?  Do you need any additional information about his app in
order to give an informed answer, and if so, what?  If he couldn't
split up the data, and had to keep 10 million files in one directory,
what techniques could he use to keep his app fast?  What must he avoid
in that case?

The only benefit is that you get smarter.  That is not to be underestimated.
Aaron S. Joyner

PS - In case anyone wants to get particular, yes, there's some *very*
minimal Python overhead in the measurements made by 'benchmark'.  For
our purposes, we're interested in how the commands compare, and since
all of the measurements carry the same Python overhead, they
still make for good relative comparisons.  For reference, the time
results from the benchmark are very comparable with the output of the
'time' command for the same command.
PPS - If you get any wise ideas about creating <dr evil> "1 million
files" </dr evil> in a directory... be prepared for fun problems like:
bash: /bin/rm: Argument list too long
Resolving them is left as an exercise for the reader who was warned and
didn't listen.
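A hint for the warned-and-didn't-listen crowd: the limit is the shell's
argument list, built when it expands *, not anything in the filesystem.
Sidestep the shell and the problem goes away.  For example, in Python,
assuming the directory holds only the generated files:

import os
# Unlink each file directly; no shell, no argument list to overflow.
for name in os.listdir('.'):
    os.remove(name)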
PPPS - See Jon, Python is good.


On Sat, Feb 14, 2009 at 10:04 AM, Steve Litt <slitt at troubleshooters.com> wrote:
> On Friday 13 February 2009 02:23:43 pm Steve Kuekes wrote:
>> Fellow triluggers,
>>
>> Here's a question, how many files should I put in one directory on an
>> ext3 file system?  I've got an app that needs a bunch of jpg image
>> files.  Right now it's ~8,000-10,000 but later it could be 2 or 3 times
>> that many.  So for performance reasons, should I split these files up
>> into multiple sub-directories?  I can split them up according to the
>> names and put just 100 or 1000 in each directory so all the files that
>> start with a,b,c,d are in one folder called abcd, etc., or I can just
>> put them all in one folder.
>>
>> I know that theoretically it's probably a very large file number limit
>> before it breaks, but for practical performance reasons in accessing
>> files in the folder how many is too many?
>
> This is just one man's opinion.
>
> Unless you access this directory very, very rarely, you've already passed the
> point where you should make it into a directory tree rather than a single
> directory. If I remember correctly, 10,000 files yields several seconds of
> latency.
>
> I'd suggest you test this by writing a perl or ruby program that creates
> 10,000 different-named small files. Then cat one of the files, and see how
> long it takes to cat the thing. To give you an idea of how fast a file can
> load when unencumbered by massive directories, my UMENU program, which has a
> separate file for each submenu, displays before your keypressing finger has
> left the key, even on files that aren't cached. On a human scale, it's
> instant.
>
> HTH
>
> SteveT
>
> Steve Litt
> Recession Relief Package
> http://www.recession-relief.US
>

