Re: Using database to find file doublettes in my computer

From: Eus
Subject: Re: Using database to find file doublettes in my computer
Date:
Msg-id: 849157.43436.qm@web37603.mail.mud.yahoo.com
In reply to: Using database to find file doublettes in my computer  (Lothar Behrens <lothar.behrens@lollisoft.de>)
List: pgsql-general
Hi Ho!

--- On Tue, 11/18/08, Lothar Behrens <lothar.behrens@lollisoft.de> wrote:

> Hi,
>
> I have a problem: I need to find, as fast as possible, files that
> are duplicates, i.e. identical, and also to identify those files
> that are not identical.
>
> My approach was to use dir /s and an awk script to convert the
> listing into an SQL script to be imported into a table. That done,
> I could start issuing queries.
>
> But how do I query for files so as to display a 'left / right
> view' for each file that exists in multiple places?
>
> I mean this:
>
> This File;Also here
> C:\some.txt;C:\backup\some.txt
> C:\some.txt;C:\backup1\some.txt
> C:\some.txt;C:\backup2\some.txt
>
> but I have only this list:
>
> C:\some.txt
> C:\backup\some.txt
> C:\backup1\some.txt
> C:\backup2\some.txt
>
>
> The reason for this is that I am faced with ECAD projects that
> have been copied around many times, and I have to identify which
> files are missing here and which files are there.
>
> So a manual approach is as follows:
>
> 1)   Identify one file (schematic1.sch) and see where copies of it
>      are.
> 2)   Compare the files of both directories and make a decision
>      about which files to use further.
> 3)   Determine conflicts, i.e. files that can't be copied together
>      for a cleanup.
>
> Are there any approaches or tools that could help?
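For the 'left / right' pairing itself, a self-join is one way to line each file up against its other copies. A minimal sketch, assuming a hypothetical table files(name, path) built from the dir /s listing:

```sql
-- Hypothetical table built from the "dir /s" listing:
--   CREATE TABLE files (name text, path text);
-- Pair every file with each other copy of the same name.
-- "a.path < b.path" keeps one ordered pair per couple and
-- prevents a file from being paired with itself.
SELECT a.path AS this_file,
       b.path AS also_here
FROM files AS a
JOIN files AS b
  ON a.name = b.name
 AND a.path < b.path
ORDER BY a.path, b.path;
```

Note that this matches on filename only; two files with the same name may still differ in content, so a checksum column would be needed to confirm that they are truly identical.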

I have also been in this kind of situation before, though I work under GNU/Linux.

1. At that time, I used `md5sum' to generate the fingerprint of all files in a given directory to be cleaned up.

2. Later, I created a simple Java program to group the names of all files that had the same fingerprint (i.e., MD5
hash).

3. I then deleted all but one of the files sharing the same MD5 hash, keeping the copy with a good filename (in my
case, the filenames couldn't be relied on for comparison, since copies differed by small additions like a date, the
author's name, and the like).

4. After that, I used my brain to find related files based on their filenames (e.g., `[2006-05-23] Jeff - x.txt' should
be the same as `Jenny - x.txt'). Of course, the Java program also helped me group the files that I thought were
related.

5. Next, I perused the related files to see whether most of their contents were the same. If so, I kept the latest one
based on the modification time.
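If the md5sum output is loaded back into the database, steps 1 and 2 above can be expressed as a single query. A sketch, assuming a hypothetical table hashes(hash, path) holding one row per file:

```sql
-- Hypothetical table loaded from `md5sum' output:
--   CREATE TABLE hashes (hash text, path text);
-- List only the files whose content exists more than once,
-- grouped by their MD5 fingerprint:
SELECT h.hash, h.path
FROM hashes AS h
JOIN (SELECT hash
      FROM hashes
      GROUP BY hash
      HAVING count(*) > 1) AS dup USING (hash)
ORDER BY h.hash, h.path;
```

Within each group (one hash), every file but the keeper is a candidate for deletion, since matching MD5 fingerprints mean the contents are almost certainly identical.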

> This is a very time-consuming job and I am searching for any
> solution that helps me save time :-)

Well, I think I saved a lot of time back then: I was able to eliminate about 7,000 of 15,000 files in about
two weeks.

> I know that those problems would not arise if the projects were
> well structured and kept in a version management system. But that
> isn't the case here :-)

I hope you employ such a system ASAP :-)

> Thanks
>
> Lothar

Best regards,

Eus (FSF member #4445)



In this digital era, where computing technology is pervasive,
your freedom depends on the software controlling those computing devices.

Join the free software movement today!
It is free as in freedom, not as in free beer!

Join: http://www.fsf.org/jf?referrer=4445



