[TriLUG] Website Directory Listing via HTTP?

Shane O'Donnell shaneodonnell at gmail.com
Fri Aug 26 12:53:51 EDT 2005


On 8/25/05, Tanner Lovelace <clubjuggler at gmail.com> wrote:
> On 8/25/05, Shane O'Donnell <shaneodonnell at gmail.com> wrote:
> > So here's the real use case...
> >
> > A web server, for which I have ONLY HTTP access, has an MP3
> > repository.  So does my laptop.  I want to be able to pull a complete
> > listing of available MP3s, including directory information (which
> > contains artist, album info) and compare it to the MP3s I have on my
> > laptop so I can determine what I need to push to the web server and
> > what is available that I don't have on my laptop.
> >
> > So is "a manual rsync via HTTP" a better description?
> 
> Shane,
> 
> Correct me if I'm wrong, but it appears you want to do
> two things.
> 
> 1. Copy any files you don't have from the http server to your laptop
> 2. Copy any files that are only on the laptop to the server.
> 
> With only http access I can't help you with #2. (I mean, if we only
> had a wheelbarrow, that would be something.)  But, with #1,
> you should be able to do it with wget.  Look up the docs for
> "mirroring" and for restricting what it gets by file extension.
> Then, as long as the directory layouts are the same, have
> it mirror and only get mp3 files.
> 

Ultimately, I want to have a single repository on a third server
(which I have full access to).

In the interim, I'm trying to figure out which files I can safely
delete from my laptop (because I have access to another copy
elsewhere).

This effort is about comparing the files that are available, not about
copying or downloading.  I've already used wget quite successfully to
get myself into my current state.  :-)  The problem is, I've since
cleaned up file and directory names, so wget won't serve its
"mirroring" function appropriately anymore.

So I'm trying to come up with text-file listings of everything that's
on the server (which is 180+GB or so) without having to download it
all.  The previous "links -dump" suggestion comes close, but doesn't
recurse.  wget recurses, but it either downloads the files or spits
back very verbose output that I'd rather not have to parse to hack
the info out of.  curl -l would do it, but only against an ftp server.
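
Something like this is roughly what I'm picturing, as a minimal Python
sketch rather than a finished tool.  It assumes the server publishes
ordinary auto-generated directory index pages, and BASE_URL is just a
placeholder; it only fetches the (small) index pages, never the MP3s
themselves:

  #!/usr/bin/env python3
  # Walk HTTP directory index pages and print the file paths they list,
  # without downloading the files themselves.
  from html.parser import HTMLParser
  from urllib.parse import urljoin
  from urllib.request import urlopen

  BASE_URL = "http://example.org/mp3/"   # placeholder repository root

  class LinkParser(HTMLParser):
      """Collect the href targets of all <a> tags on an index page."""
      def __init__(self):
          super().__init__()
          self.links = []
      def handle_starttag(self, tag, attrs):
          if tag == "a":
              for name, value in attrs:
                  if name == "href" and value:
                      self.links.append(value)

  def crawl(url):
      parser = LinkParser()
      with urlopen(url) as resp:
          parser.feed(resp.read().decode("utf-8", errors="replace"))
      for href in parser.links:
          target = urljoin(url, href)
          if not target.startswith(url):
              continue                    # skip parent-dir, sort and icon links
          if target.endswith("/"):
              crawl(target)               # recurse into subdirectories
          elif target.lower().endswith(".mp3"):
              print(target)               # listing only, nothing downloaded

  crawl(BASE_URL)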

Currently, it appears the easiest answer is a quick/dirty script
around links, manually recursing the directories in a loop.  However,
I also need to do basic auth, which wget handles well and links
doesn't appear to.
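
Presumably basic auth could be bolted onto the same sketch through
urllib's opener machinery, something along these lines (the
credentials are placeholders):

  from urllib.request import (HTTPBasicAuthHandler,
                              HTTPPasswordMgrWithDefaultRealm,
                              build_opener, install_opener)

  password_mgr = HTTPPasswordMgrWithDefaultRealm()
  password_mgr.add_password(None, BASE_URL, "user", "secret")
  install_opener(build_opener(HTTPBasicAuthHandler(password_mgr)))
  # later urlopen() calls in crawl() answer the 401 challenge automatically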

So, is a script around wget my only option?

And yes, by now it would have been easier to have just shut up and
written the script, but there's got to be a better way...

And I'm not left-handed, either.

Shane O.
