[geeks] script language advice
Nadine Miller
velociraptor at gmail.com
Sat Feb 2 12:12:07 CST 2008
Shannon Hendrix wrote:
> On Feb 1, 2008, at 10:53 PM, Nadine Miller wrote:
>
>> What language would the collective brain recommend for a script to
>> parse lines of up to 7500 chars in length? I'm leaning towards shell
>> or php since I've been doing a lot of tinkering with those of late,
>> and my perl is very weak.
>
> PHP is the Microsoft of programming languages, but it's your brain... :)
Well, I'm no genius at any scripting language, but PHP is pretty much a
necessity if you work with OSS web apps, which I've been doing lately
for my own sites. So its syntax is "fresh" in my mind. I haven't
touched any perl in a long time.
> Shell is very powerful, but also very slow and looks like line noise.
>
> Do you care what the names of the duplicate file listings are?
>
> The basic algorithm:
>
[snip pseudo code]
>> Aside from the line lengths, the biggest bear is that the filesystems
>> are fat32, so there's a lot of unusual characters (rsync choked on "?"
>> for example) and spaces in the file paths.
>
> How did you get such long lengths from fat32?
>
> I thought it had a 256 character total limit?
Based on your and Jonathan's responses, apparently I didn't make it
clear that I have the list of duplicates--fslint generated that already.
Now I need to parse the output of fslint to get down to one copy per file.
Each set of dupes is all on one line, thus the need to parse such long
lines. A pseudo-example:

    num of dupes * filesize /path/to/file/filename /path/to/file/filename2 /path/to/filename3 [...]
So, for a file that has 1632 copies (yes, even I boggled at that, and
I'm a pack rat :) ), even if the file's path is short, the line length
is humongous.
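For what it's worth, here's the direction I'm leaning: a minimal gawk
sketch, assuming the fields are space-separated, that the paths start at
$4 (my guess from the layout above), and that the first path on each
line is the keeper. Paths with embedded spaces would split into several
fields here, so I'd only use this to generate a candidate list for
review, never to delete directly:

    # Print every path after the first on each line as a deletion
    # candidate. Assumed layout: $1 = dupe count, $2 = "*",
    # $3 = file size, $4 = the copy to keep.
    gawk '{ for (i = 5; i <= NF; i++) print $i }' dupes.txt > candidates.txt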
I also have to be very careful, as this is me cleaning up my dad's
computer data--he passed away recently, and I'm the only one equipped to
handle this task. Unfortunately, his backup plan was "make copies
everywhere".
I guess I should have explicitly asked: what language and/or command
line utilities can parse lines of ~7500 chars without choking? Can
(g)awk handle a line that long?
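One way to find out is to just test it. This builds a 7500-character
line and checks whether gawk sees all of it (the {1..7500} brace
expansion assumes bash):

    # Build one 7500-char line, then ask gawk how long it thinks it is.
    printf 'x%.0s' {1..7500} > /tmp/longline
    echo >> /tmp/longline
    gawk '{ print length($0) }' /tmp/longline   # expect: 7500

My understanding is that gawk allocates records dynamically, so I'd
expect it to cope with lines far longer than this.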
Now that I think about the process though, I realize I'm trying to
reinvent the wheel. I could run fslint with the delete option on
smaller, related subsets of directories to remove duplicates before
running the tool over the entire fileset.
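Something like this loop, using fslint's command-line findup backend
instead of the GUI. The findup path is the usual install location on
Linux and the directory names are placeholders; I'd also check findup's
options before trusting any delete flag, so this version only collects
lists for review:

    # Run fslint's findup backend one subtree at a time, saving each
    # subtree's duplicate groups to its own list.
    FINDUP=/usr/share/fslint/fslint/findup
    for dir in /mnt/dad/photos /mnt/dad/docs /mnt/dad/music; do
        "$FINDUP" "$dir" > "dupes-$(basename "$dir").txt"
    done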
I had never used fslint before, so I wanted to verify the method it
uses to find dupes, since the documentation is lacking. I just
generated the list of dupes on the first run. I monitored it while it
was running, and it computes both MD5 and SHA-1 sums to determine
duplicate files. I don't think running the GUI over the entire fileset
is wise, since the box it's on is somewhat memory-restricted, and there
are in excess of 135K duplicate files.
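For the curious, the same checksum-then-group idea works with stock GNU
tools. This isn't fslint's exact pipeline, and it skips fslint's
size-comparison pass, but it shows the method:

    # Hash every file, sort by hash, then print every member of each
    # group whose first 32 chars (the MD5 hex digest) repeat.
    # Filenames containing newlines would confuse the output.
    find . -type f -print0 | xargs -0 md5sum | sort \
        | uniq -w32 --all-repeated=separate

The SHA-1 double-check fslint does makes sense, since matching MD5 sums
alone could in theory be a collision.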
Sorry for the noise.
=Nadine=