Re: Using database to find file doublettes in my computer

Поиск
Список
Период
Сортировка
От Craig Ringer
Тема Re: Using database to find file doublettes in my computer
Дата
Msg-id 4925439A.2010601@postnewspapers.com.au
обсуждение исходный текст
Ответ на Re: Using database to find file doublettes in my computer  (Lothar Behrens <lothar.behrens@lollisoft.de>)
Список pgsql-general
Lothar Behrens wrote:

> I have expected a smaller amount of records due to the fact that for 4
> files each available 2 times (sum = 8) I
> have 8 records in ECADFiles, but must have 4 in the above result.
>
> So for an average of 2 doubles I expected half the files from
> ECADFiles, because one is exactly right and the other
> is on the left.

It's a combinatorial problem. If you have 3 duplicate files, eg:

INSERT INTO paths (path) VALUES
(E'C:\\path\\file1.txt'),
(E'C:\\path2\\file1.txt'),
(E'/path/file1.txt');

then the query process I described above will output the matches:

  C:\path\file1.txt       | C:\path2\file1.txt
  /path/file1.txt         | C:\path2\file1.txt
  /path/file1.txt         | C:\path\file1.txt

because while it avoids showing both (A,B) and (B,A) pairs, for any A, B
and C it'll show:

(A,B)
(A,C)
(B,C)

I've attached test SQL that does the above.

Presumably, you want to only show, say:

(A,B)
(A,C)

or maybe:

(filename, A)
(filename, B)
(filename, C)

If that's what you want, you need to work a little differently. The
attached SQL in dups_test2.sql shows one way you might do it, by
generating a list of files with duplicates then listing all the
locations each appears in. Again, you can do it without the temp table,
it'll probably just be slower. None of what I've written is particularly
fast anyway - it evaluates those regular expressions many more times
than should be required, for example.

--
Craig Ringer
BEGIN;

CREATE TABLE paths (
   path TEXT PRIMARY KEY
);

CREATE OR REPLACE FUNCTION get_filename(text, text) RETURNS VARCHAR AS $$
SELECT (regexp_split_to_array($1, $2))[array_upper(regexp_split_to_array($1, $2),1)];
$$ LANGUAGE SQL IMMUTABLE;

COMMENT ON FUNCTION get_filename(text, text) IS 'Extract filename part from path $1 using path separator $2';

CREATE OR REPLACE FUNCTION get_filename(text) RETURNS VARCHAR AS $$
SELECT get_filename($1, E'[/\\\\]');
$$ LANGUAGE SQL IMMUTABLE;

COMMENT ON FUNCTION get_filename(text, text) IS E'Extract filename part from path $1 using path separator / or \\';

INSERT INTO paths (path) VALUES
(E'C:\\path\\file1.txt'),
(E'C:\\path2\\file1.txt'),
(E'/path/file1.txt'),
(E'C:\\somewhere\\file2.txt'),
(E'/random/place/file2.txt'),
(E'/orphans/file3.blah');

COMMIT;
BEGIN;

SELECT
get_filename(path) AS fn, count(path) AS n
INTO TEMPORARY TABLE dup_files
FROM paths
GROUP BY get_filename(path)
HAVING count(path) > 1;

SELECT * FROM dup_files;

-- Creates UNIQUE index on PATH as well
ALTER TABLE dup_files ADD CONSTRAINT dup_files_path_pkey PRIMARY KEY (fn);

-- Now build your side-by-side table of duplicates:

SELECT p1.path, p2.path
FROM paths p1 INNER JOIN paths p2
     ON (get_filename(p1.path) = get_filename(p2.path))
WHERE EXISTS(SELECT 1 FROM dup_files WHERE fn = get_filename(p1.path))
  AND p1.path > p2.path
ORDER BY get_filename(p1.path), p1.path, p2.path;

ROLLBACK;
SELECT get_filename(path) AS fn, min(path) AS path, count(path) AS dupcount
INTO dups
FROM paths
GROUP BY get_filename(path)
HAVING count(path) > 1;

SELECT * FROM dups;

SELECT dups.fn, paths.path
FROM dups INNER JOIN paths ON (dups.fn = get_filename(paths.path));

В списке pgsql-general по дате отправления:

Предыдущее
От: Richard Huxton
Дата:
Сообщение: Re: Serial - last value
Следующее
От: Craig Ringer
Дата:
Сообщение: Re: Res: Res: Res: Archive files growth!!!