Обсуждение: Index speeds up one row table (why)?

Поиск
Список
Период
Сортировка

Index speeds up one row table (why)?

От
Dave E Martin XXIII
Дата:
version: 7.3.2

Ok, not really sure if this a bug per se, but its non-intuitive, and it
goes against the advice stated in the user guide (page 150, "...there is
no plan that can beat sequentially fetching 1 page...")

I have an application that performs many inserts in a second (its doing
real-time data collection from other hardware), in the process of these
inserts, it is sometimes necessary to consult the following with:

select next_id from unique_ids where name=whatever for update;
update unique_ids set next_id=next_id+1 where name=whatever;
pass on value of old next_id to other code...

where unique_ids is:

create table unique_ids (
  name text not null,
  next_id bigint not null
) without oids;

Currently this table has one row in it, where name is 15 unicode
characters long. It would seem that there would be no need for an index
on name. However, doing:

create index unique_ids__name on unique_ids(name);

resulted in literally an order-of-magnatude increase in the speed of the
application. (it went from 10-20 seconds to handle approximately 30
records, to 1/2-3/4 second, and this was the only change). Presumably I
would have never discovered this had I remembered to declare name as a
primary key, which would have created the index. Experimenting around,
and doing a vacuum full without the index didn't make any difference (I
suspected that perhaps seq_scan had to go through a bunch of "dead"
records). For some reason, postgresql is significantly slower doing the
sequential scan than the index (I checked with explain and it is using
the index when its present) in spite of there only being one row.

So, it appears that there is some performance problem in the seq_scan
logic, or in caching (like, maybe its willing to cache an index, but it
always goes to the disk for a seq_scan? Even so, I would think the OS
would cache it.), or something really non-intuitive is happening that
should be documented (the present documentation implied that I should
*not* create that index, but doing so was a significant improvement).

p.s. You may be wondering why i'm not using serial or sequences. I need
this application to be database agnostic.

Re: Index speeds up one row table (why)?

От
Bruno Wolff III
Дата:
On Sat, May 31, 2003 at 00:14:18 -0600,
  Dave E Martin XXIII <postgresql-to.dave@dave.to> wrote:
> version: 7.3.2
>
> Currently this table has one row in it, where name is 15 unicode
> characters long. It would seem that there would be no need for an index
> on name. However, doing:

It probably has one visible row in it. If it can changed a lot, there
may be lots of deleted tuples in a row. That would explain why an
index scan speeds things up.

Re: Index speeds up one row table (why)?

От
Stephan Szabo
Дата:
On Sat, 31 May 2003, Dave E Martin XXIII wrote:

> select next_id from unique_ids where name=whatever for update;
> update unique_ids set next_id=next_id+1 where name=whatever;
> pass on value of old next_id to other code...
>
> where unique_ids is:
>
> create table unique_ids (
>   name text not null,
>   next_id bigint not null
> ) without oids;
>
> Currently this table has one row in it, where name is 15 unicode
> characters long. It would seem that there would be no need for an index
> on name. However, doing:
>
> create index unique_ids__name on unique_ids(name);
>
> resulted in literally an order-of-magnatude increase in the speed of the
> application. (it went from 10-20 seconds to handle approximately 30
> records, to 1/2-3/4 second, and this was the only change). Presumably I
> would have never discovered this had I remembered to declare name as a
> primary key, which would have created the index. Experimenting around,
> and doing a vacuum full without the index didn't make any difference (I
> suspected that perhaps seq_scan had to go through a bunch of "dead"
> records). For some reason, postgresql is significantly slower doing the
> sequential scan than the index (I checked with explain and it is using
> the index when its present) in spite of there only being one row.

It may be just be a question of plan choice, but we'd need to see explain
analyze output to really make a reasonable guess.

Re: Index speeds up one row table (why)?

От
Tom Lane
Дата:
Bruno Wolff III <bruno@wolff.to> writes:
> It probably has one visible row in it. If it can changed a lot, there
> may be lots of deleted tuples in a row. That would explain why an
> index scan speeds things up.

Right, every UPDATE on unique_ids generates a dead row, and a seqscan
has no alternative but to wade through them all.  When a unique index is
present, the indexscan code knows that after it's fetched one live tuple
there can be no more matching the same index key, and so it need not
keep examining index entries.  Furthermore, due to the way that btree
handles equal keys, it is likely (not certain, just likely) that
more-recent and hence more-likely-to-be-live tuples will be seen first.

However, the above-described optimization for unique keys is new in
7.3.*, and it's buggy.  It's disabled as of 7.3.3, so the performance
improvement you're seeing will go away as soon as you update (which you
should).  There's a fresh try at it in 7.4 CVS.

More-frequent vacuums would be a much more reliable solution, in any
case.  If you are updating the single row once a second, then a cron job
to vacuum (not full, just plain "vacuum") that particular table every
couple of minutes would not be a bad idea.  A hundred dead rows will
still fit in one disk block (unless there's lots more in the row than
you've mentioned), and as long as you can keep the table to one disk
block you shouldn't notice any performance degradation.

You might care to use contrib/pgstattuple to check out the contents of
the table, but I'm pretty sure what you'll find ...

            regards, tom lane

Re: Index speeds up one row table (why)?

От
Dave E Martin XXIII
Дата:
Tom Lane wrote:

>Bruno Wolff III <bruno@wolff.to> writes:
>
>
>>It probably has one visible row in it. If it can changed a lot, there
>>may be lots of deleted tuples in a row. That would explain why an
>>index scan speeds things up.
>>
>>
>
>Right, every UPDATE on unique_ids generates a dead row, and a seqscan
>has no alternative but to wade through them all.  When a unique index is
>
Speaking of which, since the row is locked with select for update (so it
can't be involved in any other transactions anyway) and the change
doesn't change the length of the row, can't it just be updated in-place,
or would that break something else? (pardon if this is answered already,
me thinks its time to go reread the todo's and the architecture
documents...)

Re: Index speeds up one row table (why)?

От
Dave E Martin XXIII
Дата:
Tom Lane Writes:

>Bruno Wolff III <bruno@wolff.to> writes:
>> It probably has one visible row in it. If it can changed a lot, there
>> may be lots of deleted tuples in a row. That would explain why an
>> index scan speeds things up.

>Right, every UPDATE on unique_ids generates a dead row, and a seqscan
>has no alternative but to wade through them all.  When a unique index is
>present, the indexscan code knows that after it's fetched one live tuple
...
>More-frequent vacuums would be a much more reliable solution,


The index I created wasn't unique (though it should have been), but
perhaps much of the same reasoning still applies.

Also, I could have swore I tried a vacuum, and it didn't make a
difference, although experimenting just now, it did. The data collection
rate is considerably slower at the moment though, so perhaps last time
the table simply quickly got "inefficient" very quickly again
during/immediately after the vacuum (or I wasn't where I thought I was
when I vacuumed). I'll have to experiment with this a bit more, when the
data generation is high again.

(ok, experimented a bit more just now)
Hm, it appears that degredation occurs with the index as well, I guess
at the time I created the index, it just initially did better because it
got to skip all the already dead rows at creation time: but this is
disturbing, I do a vacuum, and the access times are better, but still
horrible:

explain analyze select next_id from bigint_unique_ids where
table_name='CONNECTION_DATA';

 Index Scan using bigint_unique_ids__table_name on bigint_unique_ids
(cost=0.00..8.01 rows=1 width=8) (actual time=13.77..844.14 rows=1 loops=1)
   Index Cond: (table_name = 'CONNECTION_DATA'::text)
 Total runtime: 844.36 msec
(3 rows)

vacuum; -- takes about 10 minutes
VACUUM

explain analyze select next_id from bigint_unique_ids where
table_name='CONNECTION_DATA';
Index Scan using bigint_unique_ids__table_name on bigint_unique_ids
(cost=0.00..84.01 rows=1 width=8) (actual time=0.17..99.94 rows=1 loops=1)
   Index Cond: (table_name = 'CONNECTION_DATA'::text)
 Total runtime: 100.09 msec

vacuum; --takes about 2 minutes
Index Scan using bigint_unique_ids__table_name on bigint_unique_ids
(cost=0.00..179.01 rows=1 width=8) (actual time=0.45..219.05 rows=1 loops=1)
   Index Cond: (table_name = 'CONNECTION_DATA'::text)
 Total runtime: 219.20 msec

--ACK, worse, ran twice more, got 212.5 ms, and 394.39

vacuum bigint_unique_ids; -- try specific table only, takes about 5 seconds
Index Scan using bigint_unique_ids__table_name on bigint_unique_ids
(cost=0.00..163.01 rows=1 width=8) (actual time=0.23..143.59 rows=1 loops=1)
   Index Cond: (table_name = 'CONNECTION_DATA'::text)
 Total runtime: 143.72 msec

vacuum full bigint_unique_ids; -- try full, takes about 3 seconds.
Seq Scan on bigint_unique_ids  (cost=0.00..1.02 rows=1 width=8) (actual
time=0.10..0.10 rows=1 loops=1)
   Filter: (table_name = 'CONNECTION_DATA'::text)
 Total runtime: 0.25 msec

-- ah, much much much, better.

So apparently vacuum by itself isn't going to be sufficent, i'm going to
need vacuum fulls? Or if I do vacuum's often enough (that should allow
old rows to be overwritten?) will that do it? I'm a bit hazy on why
vacuum isn't doing just as well as vacuum full, I thought the only
difference was that full released space back to the operating system
(and presumably defragments existing data, but for one row, this
shouldn't matter?).

wait several minutes:

Seq Scan on bigint_unique_ids  (cost=0.00..39.01 rows=1 width=8) (actual
time=2.97..2.98 rows=1 loops=1)
   Filter: (table_name = 'CONNECTION_DATA'::text)
 Total runtime: 3.13 msec
reindex index bigint_unique_ids__table_name;
REINDEX

Index Scan using bigint_unique_ids__table_name on bigint_unique_ids
(cost=0.00..5.97 rows=1 width=8) (actual time=0.11..0.20 rows=1 loops=1)
   Index Cond: (table_name = 'CONNECTION_DATA'::text)
 Total runtime: 0.30 msec

It appears reindex has the same speed up effect. (and in this case made
it switch back from seq_scan to index scan).

Let me throw in this too, if its helpful:

vacuum verbose bigint_unique_ids;
INFO:  --Relation public.bigint_unique_ids--
INFO:  Index bigint_unique_ids__table_name: Pages 29; Tuples 1: Deleted
5354.
    CPU 0.01s/0.04u sec elapsed 0.05 sec.
INFO:  Removed 11348 tuples in 79 pages.
    CPU 0.00s/0.02u sec elapsed 0.02 sec.
INFO:  Pages 79: Changed 1, Empty 0; Tup 1: Vac 11348, Keep 0, UnUsed 0.
    Total CPU 0.03s/0.06u sec elapsed 0.14 sec.
INFO:  --Relation pg_toast.pg_toast_21592--
INFO:  Pages 0: Changed 0, Empty 0; Tup 0: Vac 0, Keep 0, UnUsed 0.
    Total CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM

vacuum full verbose bigint_unique_ids;
INFO:  --Relation public.bigint_unique_ids--
INFO:  Pages 79: Changed 1, reaped 79, Empty 0, New 0; Tup 1: Vac 297,
Keep/VTL 0/0, UnUsed 11157, MinLen 52, MaxLen 52; Re-using: Free/Avail.
Space 599716/22724; EndEmpty/Avail. Pages 76/3.
    CPU 0.01s/0.00u sec elapsed 0.01 sec.
INFO:  Index bigint_unique_ids__table_name: Pages 29; Tuples 1: Deleted 297.
    CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  Rel bigint_unique_ids: Pages: 79 --> 1; Tuple(s) moved: 1.
    CPU 0.00s/0.00u sec elapsed 0.02 sec.
INFO:  Index bigint_unique_ids__table_name: Pages 29; Tuples 1: Deleted 1.
    CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  --Relation pg_toast.pg_toast_21592--
INFO:  Pages 0: Changed 0, reaped 0, Empty 0, New 0; Tup 0: Vac 0,
Keep/VTL 0/0, UnUsed 0, MinLen 0, MaxLen 0; Re-using: Free/Avail. Space
0/0; EndEmpty/Avail. Pages 0/0.
    CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  Index pg_toast_21592_index: Pages 1; Tuples 0.
    CPU 0.00s/0.00u sec elapsed 0.01 sec.
VACUUM

Re: Index speeds up one row table (why)?

От
Rod Taylor
Дата:
> vacuum verbose bigint_unique_ids;
> INFO:  --Relation public.bigint_unique_ids--
> INFO:  Index bigint_unique_ids__table_name: Pages 29; Tuples 1: Deleted=
=20
> 5354.
>     CPU 0.01s/0.04u sec elapsed 0.05 sec.
> INFO:  Removed 11348 tuples in 79 pages.
>     CPU 0.00s/0.02u sec elapsed 0.02 sec.
> INFO:  Pages 79: Changed 1, Empty 0; Tup 1: Vac 11348, Keep 0, UnUsed 0.
>     Total CPU 0.03s/0.06u sec elapsed 0.14 sec.

Vacuum (regular, not full) frequently enough that the 'Pages' value
doesn't increase past 1 and you'll be fine.  A sequential scan on a very
small table is what you want to have.

In this particular case, vacuum removed over 11000 dead versions of the
tuple.

An 8 k page will hold approx 140 tuples based on your structure.  So,
for every ~100 updates you'll want to run vacuum (regular, not full) on
the table.

--=20
Rod Taylor <rbt@rbt.ca>

PGP Key: http://www.rbt.ca/rbtpub.asc

Re: Index speeds up one row table (why)?

От
Bruno Wolff III
Дата:
On Sat, May 31, 2003 at 16:56:56 -0600,
  Dave E Martin XXIII <postgresql-to.dave@dave.to> wrote:
>
> (ok, experimented a bit more just now)
> Hm, it appears that degredation occurs with the index as well, I guess
> at the time I created the index, it just initially did better because it
> got to skip all the already dead rows at creation time: but this is
> disturbing, I do a vacuum, and the access times are better, but still
> horrible:

You really don't want to use an index, so this probably doesn't matter
for the current application. The problem is that when data is inserted
into an index that just increases (or decreases) in value space from
deleted entries doesn't get reused. I believe this is fixed in 7.4.
This case would apply to indexes based on counters, dates or times
where new values are added and old values get deleted.

Re: Index speeds up one row table (why)?

От
Bruno Wolff III
Дата:
On Sat, May 31, 2003 at 17:17:38 -0600,
  Dave E Martin XXIII <postgresql-to.dave@dave.to> wrote:
> Speaking of which, since the row is locked with select for update (so it
> can't be involved in any other transactions anyway) and the change
> doesn't change the length of the row, can't it just be updated in-place,
> or would that break something else? (pardon if this is answered already,
> me thinks its time to go reread the todo's and the architecture
> documents...)

No. Select for update only blocks writers, not readers. This has important
performance advantages. You might want read the documentation on MVCC.
Tom Lane also has a copy of a presentation he made on the web somewhere.
I have it read it but it has gotten favorable mention on the lists before.

Re: Index speeds up one row table (why)?

От
Dave E Martin XXIII
Дата:
Rod Taylor wrote:

 >An 8 k page will hold approx 140 tuples based on your structure. So,
 >for every ~100 updates you'll want to run vacuum (regular, not full) on
 >the table

Alas, for this application, that means a vacuum once every 5 seconds or
so. I'll see if I can set up a separate little task to do that (I assume
at this rate, its better to just keep a connection open, than
setup/teardown). I don't suppose there is a way to get a trigger to do a
vacuum (which doesn't want to be in a transaction) (thinking it could
check for id mod 100=0 or something)? I also assume a few pages isn't
going to be that bad (just don't let it get to 11000 8).

Re: Index speeds up one row table (why)?

От
Bruno Wolff III
Дата:
On Sun, Jun 01, 2003 at 01:20:03 -0600,
  Dave E Martin XXIII <postgresql-to.dave@dave.to> wrote:
> Rod Taylor wrote:
>
> >An 8 k page will hold approx 140 tuples based on your structure. So,
> >for every ~100 updates you'll want to run vacuum (regular, not full) on
> >the table
>
> Alas, for this application, that means a vacuum once every 5 seconds or
> so. I'll see if I can set up a separate little task to do that (I assume
> at this rate, its better to just keep a connection open, than
> setup/teardown). I don't suppose there is a way to get a trigger to do a
> vacuum (which doesn't want to be in a transaction) (thinking it could
> check for id mod 100=0 or something)? I also assume a few pages isn't
> going to be that bad (just don't let it get to 11000 8).

Maybe you should reconsider how badly you want the app to be totally database
agnostic? Using a sequence might be less of a contortion than using vacuum
a few times a minute. You are likely to have similar performance issues
with other databases, so this section of code may not turn out to be very
portable in any case.

Re: Index speeds up one row table (why)?

От
Dave E Martin XXIII
Дата:
Bruno Wolff III wrote:

>Maybe you should reconsider how badly you want the app to be totally database
>agnostic? Using a sequence might be less of a contortion than using vacuum
>a few times a minute. You are likely to have similar performance issues
>with other databases, so this section of code may not turn out to be very
>portable in any case.
>
>
Maybe I can further abstract out the generate unique-id portion, Since
unique-id generation does seem to be a pretty common database extension
(for some reason...), and then provide a generic schema definition, and
a postgresql specific one (along with whatever others I can drum up).
The generic one will rely on the software to come up with the unique id
in the fashion I'm currently doing.

Speaking of which, is there a better way than what i'm currently doing
(when the database doesn't have any such support)? I've heard of one
method based on something like "select max(id)+1 from table" but this
seems error prone, at the very least, you'd have to have a unique index,
and be prepared to redo on failure, which could get messy if its a big
transaction, and frequent if there is a lot of concurrent inserting
going on.

Re: Index speeds up one row table (why)?

От
Peter Childs
Дата:
On Mon, 2 Jun 2003, Dave E Martin XXIII wrote:

> Bruno Wolff III wrote:
>
> >Maybe you should reconsider how badly you want the app to be totally database
> >agnostic? Using a sequence might be less of a contortion than using vacuum
> >a few times a minute. You are likely to have similar performance issues
> >with other databases, so this section of code may not turn out to be very
> >portable in any case.
> >
> >
> Maybe I can further abstract out the generate unique-id portion, Since
> unique-id generation does seem to be a pretty common database extension
> (for some reason...), and then provide a generic schema definition, and
> a postgresql specific one (along with whatever others I can drum up).
> The generic one will rely on the software to come up with the unique id
> in the fashion I'm currently doing.
>
> Speaking of which, is there a better way than what i'm currently doing
> (when the database doesn't have any such support)? I've heard of one
> method based on something like "select max(id)+1 from table" but this
> seems error prone, at the very least, you'd have to have a unique index,
> and be prepared to redo on failure, which could get messy if its a big
> transaction, and frequent if there is a lot of concurrent inserting
> going on.
>
    For a generic solution you could have an extra table that fed you
ids and update it every time you took a value. (Maybe a trigger could be
used?) Due to table locking during transactions no two concurrent
requested would get the same answer. Implementation could be
interesting but relatively simple.

BEGIN;
SELECT id from unqid where name='seq_name';
UPDATE unqid set id=id+1 where name='seq_name';
COMMIT;

Peter Childs