Re: Import large data set into a table and resolve duplicates?
| From | Paul A Jungwirth | 
|---|---|
| Subject | Re: Import large data set into a table and resolve duplicates? | 
| Date | |
| Msg-id | CA+renyVAexDNedKWgpGKEOXO3un_0MGCwFZPNUy5dkbf4kNepQ@mail.gmail.com | 
| In reply to | Re: Import large data set into a table and resolve duplicates? (Eugene Dzhurinsky &lt;jdevelop@gmail.com&gt;) | 
| List | pgsql-general | 
Hi Eugene,
> Now I need to import the patch into the database, and produce another file as
> - if the passed "series" field exists in the database, then return ID:series
> - otherwise insert a new row to the table and generate new ID and return ID:series
> for each row in the source file.
I think Francisco's approach is good, and I agree that ~200k rows is
hardly anything. My approach is similar but uses CTEs to combine a lot
of Francisco's queries into one. I still have a separate COPY command
though. (It'd be great if you could COPY into a CTE, but I guess
COPYing into a temporary table is pretty close.) Anyway, when I run
this on my machine, the import finishes in a few seconds:
# Makefile
database=dictionary
port=5432
words=/usr/share/dict/american-english
SHELL=/bin/bash
initial.txt:
  for i in {1..3}; do \
    cat "${words}" | while read line; do \
      echo $$i "$$line"; \
    done; \
  done > initial.txt
tables: initial.txt
  sudo su postgres -c 'psql -p ${port} ${database} -f tables.sql < initial.txt'
a.txt:
  for i in {1,4}; do \
    cat "${words}" | while read line; do \
      echo $$i "$$line"; \
    done; \
  done > a.txt
b.txt:
  for i in {4,5}; do \
    cat "${words}" | while read line; do \
      echo $$i "$$line"; \
    done; \
  done > b.txt
a: a.txt
  sudo su postgres -c 'psql -p ${port} ${database} -f import.sql < a.txt'
b: b.txt
  sudo su postgres -c 'psql -p ${port} ${database} -f import.sql < b.txt'
clean:
  rm -f initial.txt a.txt b.txt
.PHONY: tables a b clean
# tables.sql
DROP TABLE IF EXISTS dictionary;
CREATE TABLE dictionary (id SERIAL PRIMARY KEY, series VARCHAR NOT NULL);
\copy dictionary (series) from pstdin
CREATE UNIQUE INDEX idx_series ON dictionary (series);
# import.sql
CREATE TEMPORARY TABLE staging (
  series VARCHAR NOT NULL
);
\copy staging (series) from pstdin
CREATE INDEX idx_staging_series ON staging (series);
WITH already AS (
  SELECT  id, staging.series
  FROM    staging
  LEFT OUTER JOIN dictionary
  ON      dictionary.series = staging.series
),
adding as (
  INSERT INTO dictionary
  (series)
  SELECT  series::text
  FROM    already
  WHERE   id IS NULL
  RETURNING id, series
)
SELECT  id, series
FROM    adding
UNION
SELECT  id, series
FROM    already WHERE id IS NOT NULL
;
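For readers who want to see the logic of import.sql spelled out step by step, here is a minimal Python sketch of what the two CTEs do together (the function name and dict-based "tables" are illustrative stand-ins, not anything from this thread): look up each incoming series, assign a fresh id to the misses, and return id:series for every row.

```python
# Sketch of the import.sql semantics (hypothetical helper, not from the thread).
# 'dictionary' maps series -> id, like the dictionary table; 'staging' is the
# incoming batch, like the temporary staging table.
def import_series(dictionary, staging):
    # Stands in for the SERIAL sequence backing dictionary.id.
    next_id = max(dictionary.values(), default=0) + 1
    results = {}
    for series in staging:
        if series not in dictionary:
            # The "adding" CTE: INSERT the missing rows, RETURNING their new ids.
            dictionary[series] = next_id
            next_id += 1
        # The "already" half of the UNION: rows whose id was found (or just made).
        results[series] = dictionary[series]
    return results

existing = {"alpha": 1, "beta": 2}
print(import_series(existing, ["beta", "gamma"]))  # {'beta': 2, 'gamma': 3}
```

Of course the SQL version does this set-at-a-time inside one statement, which is why the whole ~200k-row import only takes a few seconds.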
Good luck!
Paul