Re: BUG #13846: INSERT ONCONFLICTconsumessequencersonconflicts

Поиск
Список
Период
Сортировка
От Paul
Тема Re: BUG #13846: INSERT ONCONFLICTconsumessequencersonconflicts
Дата
Msg-id 572d0f00.e33a320a.57866.ffffdf07@mx.google.com
обсуждение исходный текст
Ответ на Re: BUG #13846: INSERT ON CONFLICTconsumessequencersonconflicts  (Paul <paul@salesintel.com>)
Список pgsql-bugs
Just for clarity on the real financial problem of usability,

In our situation, we have =E2=80=98event=E2=80=99 tables, and related =E2=
=80=98category=E2=80=99 tables; I=E2=80=99m simplifying and generalizing a =
bit. The event tables capture streams of activity. Each record has several =
columns that categorize each event. Sometimes new categories come along fro=
m the raw event data source, but several categories would never have more t=
han a couple hundred discreet values, if that. We also need to filter, grou=
p, and aggregate the events along some of those categories.

So structure is something like this (although with more category columns)

CREATE TABLE event (
  id BIGSERAL NOT NULL,
  cat_a_id SMALLINT NOT NULL,  /* would have to be BIGINT if using IOC */
  cat_b_id SMALLINT NOT NULL,  /* would have to be BIGINT if using IOC */
  measure FLOAT,
  CONSTRAINT pk_event PRIMARY KEY (id)
);
CREATE INDEX ix_event1 ON event USING BTREE(cat_a_id);
CREATE INDEX ix_event2 ON event USING BTREE(cat_b_id, cat_a_id);

CREATE TABLE cat_a (
  id SMALLSERIAL NOT NULL,  /* would have to be BIGSERIAL if using IOC */
  name TEXT NOT NULL,
  CONSTRAINT pk_cat_a PRIMARY KEY (id)
);
CREATE UNIQUE INDEX ux_cat_a ON EVENT USING BTREE(name);

CREATE TABLE cat_b (
  id SMALLSERIAL NOT NULL,  /* would have to be BIGSERIAL if using IOC */
  name TEXT NOT NULL,
  CONSTRAINT pk_cat_b PRIMARY KEY (id)
);
CREATE UNIQUE INDEX ux_cat_b ON EVENT USING BTREE(name);

Some =E2=80=98raw=E2=80=99 input source records might look like this
=E2=80=9Cso-reg=E2=80=9D, =E2=80=9Copen=E2=80=9D, 10
=E2=80=9Cso-reg=E2=80=9D, =E2=80=9Cstalled=E2=80=9D, 1
=E2=80=9Cno-reg=E2=80=9D, =E2=80=9Cstalled=E2=80=9D, 1
=E2=80=9Cunknown=E2=80=99, =E2=80=9Congoing=E2=80=9D, 1

As part of ingesting the raw event source into the event table, we do an up=
sert on the category tables, get the category=E2=80=99s id, then store the =
id in the respective category column in the event table.

Just in a few months, one of our event tables, having 9 category type colum=
ns, is already up over 200M records, and may hit a 1B in a few more months,=
 and maybe in a year or so over 4B.

If we had continued to use INSERT ON CONFLICT [IOC], the cat_a_id and cat_b=
_id columns in the event table would have to be changed from SMALLINT to BI=
GINT, just because the related id sequences from the cat_a and cat_b tables=
 are getting consumed on every insert into the event table, even though 99.=
9% of the time nothing is actually being inserted into the cat_a and cat_b =
tables.

We=E2=80=99re running our stuff in a cloud host, so we=E2=80=99re paying fo=
r memory, cpu, SSD storage, HDD storage, and network bandwidth (backups) on=
 a monthly basis.

So if using IOC, our one event table with 9 category columns at 200M record=
s goes from 9 * 2 =3D 18 bytes-per-record to  9 * 8 =3D 72 bytes-per-record=
. Or, from 3.6GB to 14.4GB! At a 1B events 18GB to 72GB!! And that doesn=E2=
=80=99t include the indexes on the event table for certain category columns=
 and combinations, and that=E2=80=99s just one table in one replica.

56GB extra means we can fit way less of the event rows into memory, it take=
s up way more expensive SSD space (and HDD for backup), uses way more bandw=
idth and just takes longer for backups and replication, and the CPU when do=
ing sorts and grouping etc has to compare lots more bits. This can add up t=
o thousands of dollars or more over time.

Also, if we had kept using IOC and made the category columns just INT, woul=
d we have had to convert 4B records from INT to BIGINT sometime next year. =
Ouch!!

So, are problem with using IOC isn=E2=80=99t at all technical, it=E2=80=99s=
 financial! Very expensive just to save a few lines of SQL=E2=80=A6

I=E2=80=99m kinda curious what the initial use cases for IOC were? If it ac=
tually was for UPSERTs, under what conditions? A huge amount of data is jus=
t events and categories; i.e. tweets, emails, sensor readings, purchase ord=
ers, etc., and region, person, device, product-brand, etc. Ingesting this k=
ind of data is a prime use case for UPSERTs.. Just curious what problem IOC=
 is solving?

Anyways, none of this is meant to be at all a knock on anyone contributing =
to PG.. It=E2=80=99s an absolutely amazing amount of high quality value for=
 free.. Thanks to all of you.

From: Paul
Sent: Friday, May 6, 2016 2:06 PM
To: David G. Johnston
Cc: Andres Freund; Peter Geoghegan; pgsql-bugs
Subject: RE: [BUGS] BUG #13846: INSERT ONCONFLICTconsumessequencersonconfli=
cts

Hi Dave,=20
Thanks for the informative reply. We absolutely agree it=E2=80=99s not a bu=
g in implementation, but maybe a bug in conceptualization with regard to us=
ability. Not knowing the initial motive for the feature, we assumed it was =
meant for an intuitively simpler form of the rather common upsert pattern, =
but it seems INSERT ON CONFLICT was meant for some other purpose that we do=
n=E2=80=99t quite understand. We stopped using and went back to a manual ap=
proach, which also works great because PostgreSQL is just kinda great =F0=
=9F=98=8A.

If any related enhancements where to be done, our =E2=80=98holy grail=E2=80=
=99 would be an implementation of the MERGE statement, like this, or this, =
or this=E2=80=A6 We can at least ask for it right?? =F0=9F=98=89

Thanks for all your guys=E2=80=99 hard work

-p

From: David G. Johnston
Sent: Friday, May 6, 2016 1:02 PM
To: Paul
Cc: Andres Freund; Peter Geoghegan; pgsql-bugs
Subject: Re: [BUGS] BUG #13846: INSERT ON CONFLICTconsumessequencersonconfl=
icts

On Wed, Jan 6, 2016 at 5:14 PM, Paul <paul@salesintel.com> wrote:
If I had a BEFORE INSERT trigger, it would only execute for records that we=
re actually going to be inserted. However, you=E2=80=99re now stating that =
the INSERT ON CONFLICT will execute a BEFORE INSERT trigger even for record=
s that are not inserted?? That doesn=E2=80=99t seem quite logical, and cont=
rary to =E2=80=98old fashioned=E2=80=99 upsert logic. Does this also mean t=
hat the BEFORE UPDATE trigger is always called as well, or is it never call=
ed?

=E2=80=8BThis seems to boil down to the two possible ways of manually imple=
menting UPSERT:

UPDATE, if not present, INSERT
INSERT, if failing, UPDATE

In the later the before insert trigger fires and influences whether the ins=
ert=E2=80=8B
=C2=A0
=E2=80=8Bfails.=C2=A0 In the former you are already pretty certain the inse=
rt will =E2=80=8Bsucceed because the UPDATE found no records.

We've implemented INSERT, if failing UPDATE.=C2=A0 The insert has to be att=
empted and right now there is no concept of targeted partial deferrability =
when constructing the record to be inserted.

To solve this situation it is likely that some form of "UPDATE ON MISSING I=
NSERT" would need to be designed.=C2=A0 The insert portion would specify "D=
EFAULT" for sequence columns and would execute nextval() only if the ON MIS=
SING portion is executed.

Fundamentally, the difference is that ON MISSING is considerably less compl=
icated than ON CONFLICT.=C2=A0 What is wanted here is an ON MISSING interpr=
etation but what we've implemented is ON CONFLICT.=C2=A0 It seems that the =
hackers are in agreement that our implementation of ON CONFLICT is consiste=
nt with its definition.=C2=A0 That it doesn't efficiently solve problems be=
tter handled by ON MISSING - while unfortunate - doesn't constitute a bug: =
only an opportunity for future enhancement.

David J.


=E2=80=8B

В списке pgsql-bugs по дате отправления:

Предыдущее
От: Paul
Дата:
Сообщение: Re: BUG #13846: INSERT ON CONFLICTconsumessequencersonconflicts
Следующее
От: Casey Wireman
Дата:
Сообщение: Re: BUG #14126: INSERT ON CONFLICT DO NOTHING auto increments serial primary key when no insert happens.