[TriLUG] Data manipulation over Samba

Robert Dale robdale at gmail.com
Tue May 22 08:43:22 EDT 2007


On 5/22/07, Jim Tuttle <jtuttle at prairienet.org> wrote:
> So, this has been bothering me. I'm hoping someone has an answer and,
> perhaps, a reference.
>
> Ok, I'm running some python data processing scripts against an
> orthophoto collection residing on a disk array in the basement.  There
> are about 4,300 images each about 76MB.  There are several smaller files
> with each image.  Part of the processing includes copying each file to
> another partition on our 14TB ATABeast.  The question is this: Is any of
> this data moving over the network to my machine?
>
> The processing is taking forever.  215 images in 8 hours.  I wondered if
> the images are being read into memory by my machine then written to the
> other partition on the array.  I have this fantasy that python tells the
> processor on the disk array to do the copying, but I imagine that isn't
> true.  To make matters worse, there are several connections through
> which this data traverses.  The array is mounted via fiber channel to a
> Solaris cluster which offers it to a linux machine in the cube next to
> me via NFS and I'm mounting that via samba on my desktop.
>
> I could have and probably should have run this on the intermediate
> machine, but wasn't thinking last night.  Neither the ATABeast nor the
> Solaris cluster have python installed and that's a non-starter.

Yes, it's going over the network.  And if you're not keeping a local
copy for whatever processing you're doing, then you're pulling each
file across twice - once for the copy, and again for the processing.

Here's your current round-trip:
disk -> fiber channel -> solaris -> nfs -> linux -> samba -> desktop ->
samba -> linux -> nfs -> solaris -> fiber channel -> disk
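
To put rough numbers on that round trip, here is a back-of-the-envelope
sketch in Python using only the figures from your message.  It assumes
each image crosses the desktop's link twice (once in, once out) and
ignores the smaller companion files, so treat the result as an estimate
rather than a measurement.  The function name is made up for
illustration.

```python
# Figures quoted in the original message.
IMAGES_TOTAL = 4300
IMAGE_MB = 76
IMAGES_DONE = 215
HOURS_ELAPSED = 8


def transfer_estimate(images_total, image_mb, images_done, hours_elapsed, passes=2):
    """Return (observed MB/s through the desktop, projected total hours).

    passes=2 is an assumption: each image is read to the desktop once
    and written back once.  Use passes=3 if processing re-reads it.
    """
    seconds = hours_elapsed * 3600
    mb_moved = images_done * image_mb * passes
    rate = mb_moved / seconds
    # Linear extrapolation from the progress so far.
    projected_hours = images_total / images_done * hours_elapsed
    return rate, projected_hours


rate, hours = transfer_estimate(IMAGES_TOTAL, IMAGE_MB, IMAGES_DONE, HOURS_ELAPSED)
print(f"{rate:.2f} MB/s, about {hours:.0f} hours for the full set")
# prints: 1.13 MB/s, about 160 hours for the full set
```

At the current pace the whole collection would take on the order of a
week, which is why cutting hops out of the path matters.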

What I have to wonder is why you're not taking advantage of that
Solaris cluster.  Processing images is a perfect case for distributed,
parallel processing, and considering that the cluster sits right on the
fiber channel, it would be lightning fast.  I wouldn't be surprised if
your time was reduced to a few minutes (of course this depends on how
many machines, and of what type, are in your cluster ;-).

If you do a mv within the same mount point and/or partition, then it
would probably do the smart thing and issue a local mv command.
However, even on a local machine, when you move data across partitions,
it has to copy the data over, since each partition is a completely
different filesystem.
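
A minimal Python sketch of that distinction, in case it helps when
deciding where to run the script.  It compares the device IDs that
os.stat reports for the source and the destination directory to guess
whether a move can be a plain rename.  The helper names are
hypothetical, and on network mounts the device ID only reflects how the
client sees the mount, so the label is a prediction, not proof of what
happened on the server.

```python
import os
import shutil


def same_filesystem(src, dst_dir):
    """True if src and dst_dir report the same device ID (st_dev)."""
    return os.stat(src).st_dev == os.stat(dst_dir).st_dev


def relocate(src, dst_dir):
    """Move src into dst_dir; return (new path, expected kind of move)."""
    expected = "rename" if same_filesystem(src, dst_dir) else "copy"
    dst = os.path.join(dst_dir, os.path.basename(src))
    # shutil.move tries os.rename first and falls back to
    # copy-then-delete when the rename is not possible.
    shutil.move(src, dst)
    return dst, expected
```

Calling relocate("a/img.tif", "b") on two directories of the same
partition should report "rename"; across partitions it should report
"copy", and the data really does get read and rewritten.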

Looking at some packets, samba does issue a local move command.  But
if you copy a file to another file, even in the same directory, the
machine doing the copy (your desktop) reads all the data from the
source file and writes it back to the destination file, all over the
network.
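
If the job has to stay on the desktop, one way to at least avoid reading
each source file twice is to stage it on local disk, process the local
copy, and then write that same local copy out to the destination.  This
is only a sketch: the function name and the process callback are
placeholders for whatever your script actually does, and it assumes the
processing step does not modify the file (otherwise the modified copy is
what gets written out).

```python
import os
import shutil
import tempfile


def process_via_local_copy(src, dst_dir, process):
    """Read src over the network once, work on a local copy, write once.

    process is a hypothetical callback taking the local file path.
    """
    with tempfile.TemporaryDirectory() as scratch:
        local = os.path.join(scratch, os.path.basename(src))
        shutil.copyfile(src, local)      # the only read of the source
        result = process(local)          # runs against local disk
        # The only write to the destination; data only, no metadata.
        shutil.copyfile(local, os.path.join(dst_dir, os.path.basename(src)))
    return result
```

The scratch directory is removed when the with block exits, so local
disk usage stays at roughly one image at a time.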

-- 
Robert Dale


