Re: SELECT INTO large FKyed table is slow

From: Mladen Gogala
Subject: Re: SELECT INTO large FKyed table is slow
Date:
Msg-id: 4CF5259C.8000306@vmsinfo.com
In response to: SELECT INTO large FKyed table is slow  (Mario Splivalo <mario.splivalo@megafon.hr>)
Responses: Re: SELECT INTO large FKyed table is slow
List: pgsql-performance
I'm just back from vacation, so I apologize in advance if I missed
anything of importance. Here is something to consider:

Instead of using the statement you used to create the table, try the
following:

CREATE TABLE drones_history (
    drone_id integer not null,
    sample_id integer not null,
    drone_log_notice character varying,
    drone_temperature float,
    drone_pressure float,
    constraint drones_history_pk primary key (drone_id, sample_id),
    constraint drones_history_fk__samples foreign key (sample_id)
        references samples(sample_id),
    constraint drones_history_fk__drones foreign key (drone_id)
        references drones(drone_id) deferrable
);
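
By the way, if you would rather not recreate the table, you should be able
to just replace the existing foreign key with a deferrable one. Something
along these lines should do it (untested against your schema, so treat it
as a sketch):

ALTER TABLE drones_history
    DROP CONSTRAINT drones_history_fk__drones;
-- re-add the same FK, but make it deferrable so it can be checked at commit
ALTER TABLE drones_history
    ADD CONSTRAINT drones_history_fk__drones
    FOREIGN KEY (drone_id) REFERENCES drones(drone_id)
    DEFERRABLE INITIALLY IMMEDIATE;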


At the beginning of the load, defer all of the deferrable constraints by
issuing SET CONSTRAINTS ALL DEFERRED and running the COPY statement within
a transaction block, like this:

    scott=# begin;
    BEGIN
    Time: 0.203 ms
    scott=# set constraints all deferred;
    SET CONSTRAINTS
    Time: 0.201 ms
    scott=# copy test1 from '/tmp/test1.csv';
    COPY 100
    Time: 11.939 ms
    scott=# commit;
    ERROR:  insert or update on table "test1" violates foreign key
    constraint "fk_tst1_deptno"
    DETAIL:  Key (col1)=(1) is not present in table "dept".


Of course, that will require a complete rewrite of your load script,
because the errors will only be reported at commit time, and the
transaction will either fail as a whole or succeed as a whole. It's an
all-or-nothing situation. How frequently do you see records with an
incorrect drone_id? If that happens only once in a blue moon, you may need
no stinkin' foreign keys in the first place; instead, you could have a
batch job that flags all the records with an invalid drone_id.
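
Something like this, run periodically against the schema you posted, would
list the orphaned history rows (just a sketch, of course):

-- batch check instead of a FK: history rows whose drone_id is missing
SELECT h.drone_id, h.sample_id
  FROM drones_history h
  LEFT JOIN drones d ON d.drone_id = h.drone_id
 WHERE d.drone_id IS NULL;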

Furthermore, make sure that you have enough shared buffers to cache the
entire "drones" table. Also, run "strace" on the postgres process handling
your session and see whether the time is being spent writing WAL. If that
is what is slowing you down, you should consider buying an SSD or a
high-end disk drive. I have never had such a problem myself, but you
should also check whether pg_loader can do anything for you.
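
To see whether the entire "drones" table can fit, compare its on-disk size
with your shared_buffers setting, for example:

-- size of the drones table, indexes and TOAST included
SELECT pg_size_pretty(pg_total_relation_size('public.drones'));
-- configured buffer cache
SHOW shared_buffers;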

As far as speed is concerned, inserting with deferred foreign keys is
almost as fast as inserting without foreign keys:


scott=# alter table test1 drop constraint fk_tst1_deptno;
ALTER TABLE
Time: 16.219 ms
scott=# copy test1 from '/tmp/test1.csv';
COPY 100
Time: 10.418 ms

If you take a look at the example above, you will see that inserting
with a deferred FK took 11.939 milliseconds while inserting into the
same table without the FK took 10.418 milliseconds, a difference of about
1.5 milliseconds per 100 rows. Your timing of 2 seconds per 100 rows looks
suspiciously high. Methinks your problem is not just the foreign key;
something else must be devouring the time. You should set up a test
instance compiled with the "-g" option and do some profiling.
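
For reference, the toy tables behind the timings above were essentially
the following; the exact definitions are approximate, but they are enough
to repeat the comparison:

-- approximate reconstruction of the test setup used above
CREATE TABLE dept (
    deptno integer PRIMARY KEY
);
CREATE TABLE test1 (
    col1 integer,
    CONSTRAINT fk_tst1_deptno FOREIGN KEY (col1)
        REFERENCES dept(deptno) DEFERRABLE
);
-- then, in psql, time the load both ways:
--   \timing
--   BEGIN; SET CONSTRAINTS ALL DEFERRED;
--   COPY test1 FROM '/tmp/test1.csv'; COMMIT;
--   ALTER TABLE test1 DROP CONSTRAINT fk_tst1_deptno;
--   COPY test1 FROM '/tmp/test1.csv';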

Mario Splivalo wrote:
> The database for monitoring certain drone statuses is quite simple:
>
> CREATE TABLE samples (
>     sample_id integer not null primary key,
>     sample_timestamp timestamp not null default now()
> );
>
> CREATE TABLE drones (
>     drone_id integer not null primary key,
>     drone_log_notice character varying,
>     drone_coordinates point not null,
>     drone_temperature float,
>     drone_pressure float
> );
>
> CREATE TABLE drones_history (
>     drone_id integer not null,
>     sample_id integer not null,
>     drone_log_notice character varying,
>     drone_temperature float,
>     drone_pressure float,
>     constraint drones_history_pk primary key (drone_id, sample_id),
>     constraint drones_history_fk__samples foreign key (sample_id)
> references samples(sample_id),
>     constraint drones_history_fk__drones foreign key (drone_id) references
> drones(drone_id)
> );
>
> Every ten to twenty minutes I receive a CSV file with most of the drones'
> statuses. The CSV file includes data for new drones when they're put into
> use. When I receive new data I load the whole CSV file into the database,
> then call a stored procedure that 'deals' with that data.
>
> So far I have around 6000 samples, around 160k drones, and drones_history
> has around 25M rows.
>
> The CSV file contains around 15k-20k 'rows', mostly data about old
> drones. Every now and then (on every 5th - 10th CSV insert) there is
> data for around 1000-5000 new drones.
>
> Here is what I do in the stored procedure, after I've COPYed the data
> from the CSV into the temporary.drones table:
>
> First, inside the procedure, I create a temporary table that holds rows
> for the new drones:
>
> CREATE TEMPORARY TABLE tmpNew ON COMMIT DROP AS
> SELECT drone_id, log_notice, coord_x, coord_y, temp, press
> FROM temporary.drones WHERE NOT EXISTS (SELECT 1 FROM public.drones WHERE
> public.drones.drone_id = temporary.drones.drone_id);
>
> This is done in milliseconds, even if the count of new drones is
> large (I've tested it with 10k new drones, although in real-life operation
> I'd never get more than 5k new drones per CSV).
>
> INSERT INTO public.drones (drone_id, drone_log_notice, drone_coordinates,
> drone_temperature, drone_pressure)
> SELECT drone_id, log_notice, point(coord_x, coord_y) as drone_coordinates,
> temp, press FROM tmpNew;
> INSERT INTO public.drones_history (sample_id, drone_id,
> drone_log_notice, drone_temperature, drone_pressure)
> SELECT a_sample_id, drone_id, log_notice, temp, press FROM tmpNew;
>
> This is also done in milliseconds.
>
> Now, I 'update' the data for the existing drones, and fill in the history
> table for those drones. First I create a temporary table with just the
> changed rows:
>
> CREATE TEMPORARY TABLE tmpUpdate ON COMMIT DROP AS
> SELECT a_sample_id, t.drone_id, t.log_notice, t.temp, t.press
>     FROM temporary.drones t
>     JOIN public.drones p
>     ON t.drone_id = p.drone_id
> WHERE p.drone_log_notice != t.log_notice OR p.drone_temperature != t.temp
> OR p.drone_pressure != t.press;
>
> Now, that part is also fast. I usually have around 100-1000 drones that
> changed 'state', but sometimes even half of the drones change
> state (around 50k), and creation of the tmpUpdate takes no more than ten
> to twenty milliseconds.
>
> This is the slow part:
> INSERT INTO drones_history (sample_id, drone_id, drone_log_notice,
> drone_temperature, drone_pressure)
> SELECT * FROM tmpUpdate;
>
> For 100 rows this takes around 2 seconds. For 1000 rows this takes
> around 40 seconds. For 5000 rows this takes around 5 minutes.
> For 50k rows this takes around 30 minutes! This is where I start to lag
> behind, because I get a new CSV every 10 minutes or so.
>
> And the last part is to update the actual drones table:
> UPDATE public.drones p
> SET drone_log_notice = t.log_notice, drone_temperature = t.temp,
> drone_pressure = t.press
> FROM temporary.drones t
> WHERE t.drone_id = p.drone_id
> AND (t.log_notice != p.drone_log_notice OR t.temp != p.drone_temperature
> OR t.press != p.drone_pressure);
>
> This is also very fast; even when almost half the table is updated, the
> UPDATE takes around 10 seconds. Usually it's around 30-50 ms.
>
> The machine I'm doing this on has 4 GB of RAM and a dual-Xeon something
> (3 GHz). Two SAS drives in a mirror, capable of around 100 MB/s in
> sequential r/w (I know it means nothing, but just to get an idea).
>
> The database is around 2 GB in size (pg_database_size). When I dump/recreate
> the database I can speed things up a bit, but after half a day of
> operations the INSERTs are slow again.
> When I do a dump/restore of the database I get around 40-50 MB/sec
> reading/writing from the disk (COPYing data, PK/FK constraint creation),
> but when that INSERT gets stuck, io-wait goes sky-high - iostat shows
> that Postgres is mainly reading from the disks, around 800k/sec - 1024k/sec.
>
> I've set shared_buffers to 256M, work_mem to 96M, wal_buffers to 16M and
> checkpoint_segments to 16. I've turned off autovacuum; I do an
> analyze/vacuum after each insert job is done, after TRUNCATEing the
> temporary.drones table.
>
> Out of despair I tried to set fsync=off, but that gave me just a small
> performance improvement.
>
> When I remove the foreign key constraints (drones_history_fk__samples and
> drones_history_fk__drones) but leave the primary key on drones_history,
> then that INSERT, even for 50k rows, takes no more than a second.
>
> So, my question is - is there anything I can do to make INSERTs with FKs
> faster? Or, since all the reference checking is done inside the
> procedure for loading data, should I abandon those constraints entirely?
>
>     Mario
>
>
>


--

Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
http://www.vmsinfo.com
The Leader in Integrated Media Intelligence Solutions



