Discussion: ADD COLUMN ts tsvector GENERATED too slow
I have a beefy server (40+ worker processes, 40GB+ shared buffers) and a table holding (key text, text text) of around 50M rows. These are text fields extracted from 4-5 page PDFs each.

I'm adding the following generated column to keep up-to-date tsvectors:

ALTER TABLE docs_text ADD COLUMN ts tsvector GENERATED ALWAYS AS (to_tsvector('simple', left(text, 1048575))) STORED;

I expect this to be slow, but it's been running for 18hrs already and I certainly hope I've done something wrong and there's a smarter way. I thought about incremental updates and/or triggers, but a generated column is a cleaner solution.
On 06.07.22 10:42, Florents Tselai wrote:
> I have a beefy server (40+ worker processes, 40GB+ shared buffers) and a table holding (key text, text text) of around 50M rows.
> These are text fields extracted from 4-5 page PDFs each.
>
> I'm adding the following generated column to keep up-to-date tsvectors:
>
> ALTER TABLE docs_text ADD COLUMN ts tsvector GENERATED ALWAYS AS (to_tsvector('simple', left(text, 1048575))) STORED;
>
> I expect this to be slow, but it's been running for 18hrs already and I certainly hope I've done something wrong and there's a smarter way.

Maybe it's stuck on a lock? ALTER TABLE / ADD COLUMN requires an exclusive lock on the table.
> On 6 Jul 2022, at 12:38 PM, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
>
> Maybe it's stuck on a lock? ALTER TABLE / ADD COLUMN requires an exclusive lock on the table.

Looking at pg_locks and pg_stat_activity, it doesn't seem like it's stuck behind something like ACCESS SHARE or similar.
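For the archives, a query along these lines can show whether a session is waiting on a lock and which backends block it (a sketch using the standard catalog views; pg_blocking_pids() is available since PostgreSQL 9.6):

```sql
-- Show sessions currently waiting on a lock, and who blocks them.
SELECT a.pid,
       a.state,
       a.wait_event_type,
       a.wait_event,
       pg_blocking_pids(a.pid) AS blocked_by,
       left(a.query, 60)       AS query
FROM pg_stat_activity a
WHERE a.wait_event_type = 'Lock';
```

An empty result here is consistent with the ALTER TABLE doing real work rather than sitting behind another lock.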
Also, FWIW, looking at the CPU% and MEM% activity in top, it looks like it is doing data-crunching work.

> On 6 Jul 2022, at 12:48 PM, Florents Tselai <florents.tselai@gmail.com> wrote:
>
> Looking at pg_locks and pg_stat_activity, it doesn't seem like it's stuck behind something like ACCESS SHARE or similar.
On Wed, 6 Jul 2022 at 11:55, Florents Tselai <florents.tselai@gmail.com> wrote:
> Also, FWIW, looking at the CPU% and MEM% activity in top, it looks like it is doing data-crunching work.

How big is your table? From your query it seems you expect more than 1M-1 bytes per text value (the left(...)), but if you have very big text columns it may be spending a lot of time fully decompressing / reading them (I'm not sure whether left(...) on toasted values is optimized to stop after reading enough). Also, it has to rewrite a lot of data to insert the column, and if it takes some ms per row, which I would not rule out: 50M rows * 1 ms/row = 50k secs, roughly 13.9 hours for each ms of per-row cost. So at 2 ms per row (which may be right for reading a big row, calculating the vector and writing an even bigger row) it would take more than a day to finish, which would not surprise me given you are asking for a heavy thing.

If you have stopped it, I would try doing a 1000-row sample in a copied table to get a speed idea. Otherwise, with this query, I would normally monitor disk usage of the relation files as an indication of progress; I'm not sure there is anything else you could look at without disturbing it.

FWIW, I would consider high memory usage normal for this kind of query. High CPU would depend on what you call high, but it wouldn't surprise me if it has at least one CPU running flat out detoasting and building vectors; I do not know if ALTER TABLE can go parallel.

Francisco Olarte.
> On 6 Jul 2022, at 1:11 PM, Francisco Olarte <folarte@peoplecall.com> wrote:
>
> How big is your table? From your query it seems you expect more than
> 1M-1 bytes per text value (the left(...)), but if you have very big
> text columns it may be spending a lot of time fully decompressing /
> reading them. Also, it has to rewrite a lot of data to insert the
> column: 50M rows * 1 ms/row = 50k secs, roughly 13.9 hours for each ms
> of per-row cost, so at 2 ms per row it would take more than a day to
> finish.

50M+ rows, and IIRC pg_relation_size was north of 80GB or so.

> If you have stopped it, I would try doing a 1000-row sample in a copied
> table to get a speed idea.

Haven't stopped it, as I'm not convinced there's an alternative to just waiting for it to complete :/

> Otherwise, with this query, I would normally monitor disk usage of the
> relation files as an indication of progress; I'm not sure there is
> anything else you could look at without disturbing it.
>
> FWIW, I would consider high memory usage normal for this kind of query;
> it wouldn't surprise me if it has at least one CPU running flat out
> detoasting and building vectors. I do not know if ALTER TABLE can go
> parallel.
You're probably right, a lot of the CPU usage could be detoasting.
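Francisco's 1000-row sample test could be sketched like this (table and column names follow the thread; `docs_text_sample` is a hypothetical scratch table, and `\timing` is a psql metacommand):

```sql
-- Copy a small sample into a scratch table.
CREATE TABLE docs_text_sample AS
SELECT key, text FROM docs_text LIMIT 1000;

-- Time the same generated column on the sample; the elapsed time
-- reported by \timing can then be scaled up to 50M rows.
\timing on
ALTER TABLE docs_text_sample
  ADD COLUMN ts tsvector
  GENERATED ALWAYS AS (to_tsvector('simple', left(text, 1048575))) STORED;
```

Multiplying the per-row time by 50M gives a rough lower bound, since the full table also pays for index rebuilds and much larger TOAST traffic.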
> On 6 Jul 2022, at 1:11 PM, Francisco Olarte <folarte@peoplecall.com> wrote:
>
> If you have stopped it, I would try doing a 1000-row sample in a copied
> table to get a speed idea. Otherwise, with this query, I would
> normally monitor disk usage of the relation files as an indication of
> progress; I'm not sure there is anything else you could look at
> without disturbing it.

Actually, I monitored my disk usage and it was *definitely* working: it had already eaten up an additional 30% of my disk capacity. Thus, I'll have to fall back on my initial solution and use GIN indexes to get tsvectors on the fly.

> FWIW, I would consider high memory usage normal for this kind of query;
> it wouldn't surprise me if it has at least one CPU running flat out
> detoasting and building vectors. I do not know if ALTER TABLE can go
> parallel.

You're probably right, a lot of the CPU usage could be detoasting.

Thanks everyone for your comments.

You can consider this solved.
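The GIN-index fallback mentioned above, computing tsvectors on the fly instead of storing them, would look something like this (the index name is illustrative; the expression matches the one from the thread):

```sql
-- Expression index: the tsvector is computed at index build time and
-- on writes, but nothing extra is stored in the table rows themselves.
CREATE INDEX docs_text_ts_idx ON docs_text
  USING GIN (to_tsvector('simple', left(text, 1048575)));

-- Queries must repeat the exact same expression to use the index:
SELECT key
FROM docs_text
WHERE to_tsvector('simple', left(text, 1048575))
      @@ to_tsquery('simple', 'postgres');
```

The trade-off versus the generated column is that every search re-derives nothing at scan time when the index is used, but any query that does not match the indexed expression falls back to computing to_tsvector per row.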
On 2022-Jul-06, Florents Tselai wrote:
> Actually, I monitored my disk usage and it was *definitely* working:
> it had already eaten up an additional 30% of my disk capacity.

Adding a column like this requires creating a second copy of the table, copying all the contents from the old table (plus the new column) into the new one, then recreating all indexes from scratch on the new copy of the table. If you have a lot of indexes, this can be significant. Also, AFAIU all data has to be uncompressed on read, then compressed back on write.

Note: the 80 GB from pg_relation_size() does *not* include the size of TOAST data. You're copying a lot of additional data. See pg_table_size().

--
Álvaro Herrera         Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Some people acquire the bad habit of being unhappy" (M. A. Evans)
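The difference Álvaro points out can be checked directly with the standard size functions (table name follows the thread):

```sql
-- pg_relation_size:       main fork only (no TOAST, no indexes).
-- pg_table_size:          table including TOAST and auxiliary forks.
-- pg_total_relation_size: table + TOAST + all indexes.
SELECT pg_size_pretty(pg_relation_size('docs_text'))       AS heap_only,
       pg_size_pretty(pg_table_size('docs_text'))          AS with_toast,
       pg_size_pretty(pg_total_relation_size('docs_text')) AS with_indexes;
```

For a table of large text values, with_toast can dwarf heap_only, which is why the 80 GB figure understates the data the rewrite has to move.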
Hi Florents:

On Wed, 6 Jul 2022 at 12:35, Florents Tselai <florents.tselai@gmail.com> wrote:
> 50M+ rows, and IIRC pg_relation_size was north of 80GB or so.

Somebody already mentioned pg_table_size, but anyway this gives you 1.6KB per row, which I would suspect is extremely low given your PDF content and the 1MB truncation you preventively applied.

When working with this kind of size it really pays to do some good ole "back of the envelope" calculations and/or some tests with small batches. Bear in mind that if you have, say, 10KB per row (which I would label as really low for PDF content) you are going to have 500GB of data; if you manage to process this at 100MB per second you will need more than an hour (and I think I'm estimating really low).

FOS
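For reference, the back-of-the-envelope numbers from the thread work out as follows (a quick sketch; the per-row cost and bytes-per-row are the thread's assumptions, not measurements):

```python
rows = 50_000_000

# Per-row cost estimate: 1 ms of work per row.
seconds_at_1ms = rows * 0.001            # 50,000 s
hours_at_1ms = seconds_at_1ms / 3600     # ~13.9 hours per ms of per-row cost

# Volume estimate: 10 KB of text per row, streamed at 100 MB/s.
total_bytes = rows * 10 * 1024           # ~500 GB
hours_at_100mbs = total_bytes / (100 * 1024 * 1024) / 3600  # ~1.4 hours

print(f"{hours_at_1ms:.1f} h per ms-of-work per row")
print(f"{hours_at_100mbs:.1f} h just to stream the data once")
```

So even the pure I/O floor is over an hour, and each extra millisecond of per-row CPU (detoasting, to_tsvector, writing the bigger row back) adds roughly 14 hours, which makes an 18+ hour runtime unsurprising.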