Discussion: Creating tons of tables to support a query

Creating tons of tables to support a query

From:
Jan Ploski
Date:
Hello,

I have apparently solved a certain performance problem and would
like to consult with other PostgreSQL users to figure out if I am
doing things the right way.

I have two tables: "section" and "message". Each message has
a "dateSent" timestamp and belongs to exactly one section, so I have
a foreign key sectionID in message. The task is to efficiently
figure out the number of messages that arrived into a section after
any given point in time. In other words, optimize this query:

select count(*) from message where sectionID = ? and dateSent > ?

At present, there are about 11,000 messages and 227 sections,
with messages distributed rather unevenly across sections.
The table "message" also contains a small number of private
messages, whose sectionID is null.
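
For reference, the relevant schema looks roughly like this (simplified
sketch; the primary keys are illustrative, only sectionID and dateSent
matter here):

    CREATE TABLE section (
        sectionID  serial PRIMARY KEY
    );

    CREATE TABLE message (
        messageID  serial PRIMARY KEY,
        sectionID  integer REFERENCES section (sectionID),  -- null for private messages
        dateSent   timestamp NOT NULL
    );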

There is an index on message.dateSent, which PostgreSQL decides not to
use for execution of the above query. This gives me a plan like this:

Aggregate  (cost=1419.26..1419.26 rows=1 width=0)
  ->  Seq Scan on message  (cost=0.00..1408.13 rows=4449 width=0)

...with execution time for 100 queries equal to 7.90 seconds.

If I set enable_seqscan=no, the same test takes 5.21 seconds,
which is still much too long.

To improve the performance, I originally decided to add a column
"msgnum" to the table "message", which would be incremented as
each message is inserted. To figure out the number of messages
that arrived between two points in time [t1,t2], I'd find the
lowest message number after t1 and the highest message number before
t2, and compute msgnum_high-msgnum_low+1. (When messages are deleted,
renumbering would have to occur.)

This alternate approach worked very well on a test table that
included only messages from a single section. I used queries such as:

select msgnum from message where dateSent > ? order by dateSent limit 1

and

select msgnum from message where dateSent < ? order by dateSent desc limit 1

However, when I added the condition "and sectionID = ?", the performance
dropped below that of the simple count(*) query that I mentioned first.

To work around that, I decided to create a message number tracking
table for each section (247 tables!), with an index on dateSent
for each table (247 indices!), and use dynamically composed queries.

Querying for a message count that way runs roughly 60 times faster
than the previous approach, which is excellent. However, I am concerned about
the need to keep these extra tables up-to-date (with triggers) when
inserting rows into "message", also worried about the overhead of
creating/dropping a table per section and about the overall lack of
elegance of my solution.
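
For illustration, one of these generated tables looks roughly like this
(simplified; the trigger body is only a sketch of the bookkeeping involved):

    CREATE TABLE scnt_9 (
        msgnum    integer NOT NULL,
        dateSent  timestamp NOT NULL
    );
    CREATE INDEX scnt_idx_9 ON scnt_9 (dateSent);

    -- Sketch of a maintenance trigger: append a numbered row whenever
    -- a message lands in section 9.
    CREATE FUNCTION scnt_9_insert() RETURNS opaque AS '
    BEGIN
        IF NEW.sectionID = 9 THEN
            INSERT INTO scnt_9 (msgnum, dateSent)
                SELECT coalesce(max(msgnum), 0) + 1, NEW.dateSent
                FROM scnt_9;
        END IF;
        RETURN NEW;
    END;
    ' LANGUAGE 'plpgsql';

    CREATE TRIGGER scnt_9_trig AFTER INSERT ON message
        FOR EACH ROW EXECUTE PROCEDURE scnt_9_insert();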

I am wondering in particular why an index on message(sectionID, dateSent)
does not make these queries comparably fast:

    select msgnum from message where
        sectionID = ? and
        dateSent > ?
        order by dateSent
        limit 1;

    select msgnum from scnt_9 where
        dateSent > ?
        order by dateSent
        limit 1;

(scnt_9 is a lookup table which only creates msgnums for messages
with sectionID == 9)

I would be grateful for your advice. Is anyone else creating hundreds
of lookup tables to cope with performance problems? Is this an issue
which would be nicely solved by partitioning tables (or indices) by
column value in a commercial RDBMS like Oracle?

Best regards -
Jan Ploski


Re: Creating tons of tables to support a query

From:
Martijn van Oosterhout
Date:
On Sun, Sep 08, 2002 at 11:58:44PM +0200, Jan Ploski wrote:
> Hello,
>
> I am wondering in particular why an index on message(sectionID, dateSent)
> does not make these queries comparably fast:
>
>     select msgnum from message where
>         sectionID = ? and
>         dateSent > ?
>         order by dateSent
>         limit 1;
>
>     select msgnum from scnt_9 where
>         dateSent > ?
>         order by dateSent
>         limit 1;
>
> (scnt_9 is a lookup table which only creates msgnums for messages
> with sectionID == 9)
>

Can you send the results of EXPLAIN ANALYZE for both of those queries? Thanks.

--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.

Re: Creating tons of tables to support a query

From:
Bruno Wolff III
Date:
On Sun, Sep 08, 2002 at 23:58:44 +0200,
  Jan Ploski <jpljpl@gmx.de> wrote:
>
> select count(*) from message where sectionID = ? and dateSent > ?

If the dateSent is the same for each query, you should probably be doing
it in one query. Something like:
select sectionID, count(*) from message where dateSent > ? group by sectionID

Re: Creating tons of tables to support a query

From:
Jan Ploski
Date:
On Mon, Sep 09, 2002 at 09:36:17AM +1000, Martijn van Oosterhout wrote:
> On Sun, Sep 08, 2002 at 11:58:44PM +0200, Jan Ploski wrote:
> > Hello,
> >
> > I am wondering in particular why an index on message(sectionID, dateSent)
> > does not make these queries comparably fast:
> >
> >     select msgnum from message where
> >         sectionID = ? and
> >         dateSent > ?
> >         order by dateSent
> >         limit 1;
> >
> >     select msgnum from scnt_9 where
> >         dateSent > ?
> >         order by dateSent
> >         limit 1;
> >
> > (scnt_9 is a lookup table which only creates msgnums for messages
> > with sectionID == 9)
> >
>
> Can you send the results of EXPLAIN ANALYZE for both those queries. Thanks.

Martijn,

I can't run EXPLAIN ANALYZE (using 7.1.3), but here are the results of
EXPLAIN:

Limit  (cost=1677.74..1677.74 rows=1 width=10)
  ->  Sort  (cost=1677.74..1677.74 rows=4449 width=10)
        ->  Seq Scan on message  (cost=0.00..1408.13 rows=4449 width=10)

Limit  (cost=0.00..0.05 rows=1 width=12)
  ->  Index Scan using scnt_idx_9 on scnt_9  (cost=0.00..234.47 rows=4661 width=

The fast query is obviously doing less work. Any ideas why?

Thank you -
JPL


Re: Creating tons of tables to support a query

From:
Jan Ploski
Date:
On Sun, Sep 08, 2002 at 06:51:55PM -0500, Bruno Wolff III wrote:
> On Sun, Sep 08, 2002 at 23:58:44 +0200,
>   Jan Ploski <jpljpl@gmx.de> wrote:
> >
> > select count(*) from message where sectionID = ? and dateSent > ?
>
> If the dateSent is the same for each query, you should probably be doing
> it in one query. Something like:
> select sectionID, count(*) from message where dateSent > ? group by sectionID

Bruno,

The dateSent is not the same for each section, though it may be the same
for several sections (in my app, sections belong to forums, and dateSent
is the same for all sections of the same forum). Anyway, I would like to avoid
counting by traversing a huge number of rows, and I think that is what the
backend does behind the scenes for such a query.

-JPL


Re: Creating tons of tables to support a query

From:
Bruno Wolff III
Date:
On Mon, Sep 09, 2002 at 01:54:08 +0200,
  Jan Ploski <jpljpl@gmx.de> wrote:
>
> The dateSent is not same for each section, though it may be same
> for several sections (in my app, sections belong to forums, and dateSent
> is same for all sections of the same forum). Anyway, I would like to avoid
> counting by traversing a huge number of rows, and I think it's what the
> backend does behind the scenes for such a query.

Well, it sounds like group by won't buy you all that much.
However, counts aren't saved anywhere so you are going to need to
look at each row that satisfies the condition of the query.

You might want to double check that the two timestamps
(the one in the table and the constant used for a particular query)
are of the same type. If not, you may need a cast to get postgres to use
the index.
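
For example, if dateSent is a timestamp but the constant is supplied as
text or as a date, an explicit cast may help (illustrative values):

    select count(*) from message
        where sectionID = 9
          and dateSent > '2002-09-01 00:00:00'::timestamp;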

Re: Creating tons of tables to support a query

From:
Stephan Szabo
Date:
On Sun, 8 Sep 2002, Jan Ploski wrote:

> I am in particular wondering, why an index on message(sectionID, dateSent)
> does not make these queries comparably fast:
>
>     select msgnum from message where
>         sectionID = ? and
>         dateSent > ?
>         order by dateSent
>         limit 1;

I don't think that'll use an index on (sectionID, dateSent) for the sort
step.  I think an index on (dateSent,sectionID) might be, however.
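
That is, something like this (index name illustrative):

    CREATE INDEX message_date_section_idx ON message (dateSent, sectionID);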




Re: Creating tons of tables to support a query

From:
Jan Ploski
Date:
On Sun, Sep 08, 2002 at 07:49:32PM -0700, Stephan Szabo wrote:
> On Sun, 8 Sep 2002, Jan Ploski wrote:
>
> > I am wondering in particular why an index on message(sectionID, dateSent)
> > does not make these queries comparably fast:
> >
> >     select msgnum from message where
> >         sectionID = ? and
> >         dateSent > ?
> >         order by dateSent
> >         limit 1;
>
> I don't think that'll use an index on (sectionID, dateSent) for the sort
> step.  I think an index on (dateSent,sectionID) might be, however.

Stephan,

Indeed, my mistake. With an index on (dateSent,sectionID), the plan becomes:

Limit  (cost=0.00..2.36 rows=1 width=10)
  ->  Index Scan using test_idx2 on message  (cost=0.00..10482.08 rows=4449 width=10)

Alas, this does not help me further. I did two tests:

Test 1: Section 9 contained 5143 messages.
Test 2: Section 241 contained 0 messages.

The timing results (for 5000 queries) are:

1. Using index on message(dateSent, sectionID):      11 seconds
   Using index on scnt_9(dateSent):                  17 seconds

2. Using index on message(dateSent, sectionID):     320 seconds
   Using index on scnt_241(dateSent):                 2 seconds


The problem is that (apparently?) the whole (dateSent, sectionID) index
must be scanned in the second test, while the scnt_241 index simply
contains no values and yields quick results.

Since the auxiliary tables speed things up so much and behave well
for sections with few messages, I tend to believe that this is the
way to go for me. Two questions remain open: what kind of overhead
do I incur by creating that many tables (hundreds, maybe thousands
in the future)? And, second, since there is no support for PL/pgSQL
"execute select ... into" in 7.1.3, I need to collect results
via a temporary table. Is this kind of "execute" statement
implemented in the newest version of PostgreSQL yet?

Take care -
JPL


Re: Creating tons of tables to support a query

From:
Richard Huxton
Date:
On Monday 09 Sep 2002 11:52 am, Jan Ploski wrote:
> Indeed, my mistake. With an index on (dateSent,sectionID), the plan
> becomes:
>
> Limit  (cost=0.00..2.36 rows=1 width=10)
>   ->  Index Scan using test_idx2 on message  (cost=0.00..10482.08 rows=4449
> width=10)
>
> Alas, this does not help me further. I did two tests:
>
> Test 1: Section 9 contained 5143 messages.
> Test 2: Section 241 contained 0 messages.
>
> The timing results (for 5000 queries) are:
>
> 1. Using index on message(dateSent, sectionID):      11 seconds
>    Using index on scnt_9(dateSent):                  17 seconds
>
> 2. Using index on message(dateSent, sectionID):     320 seconds
>    Using index on scnt_241(dateSent):                 2 seconds
>
>
> The problem is that (apparently?) the whole (dateSent, sectionID) index
> must be scanned in the second test, while the scnt_241 index simply
> contains no values and yields quick results.

Have you considered using partial indexes? You can set up something like:

CREATE INDEX msg_idx_9 ON message (dateSent) WHERE sectionID=9

One of these for each section you have - this should give you the indexing
advantage without the overhead of separate tables. The feature is
non-standard AFAIK and is covered in section 7.8 of the manual.
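
Since writing a couple hundred of these by hand is no fun, you can generate
them from the section table and run the output as a script (sketch only):

    -- Emit one CREATE INDEX statement per section; save the result to a
    -- file and replay it with psql.
    SELECT 'CREATE INDEX msg_idx_' || sectionID
           || ' ON message (dateSent) WHERE sectionID = ' || sectionID || ';'
    FROM section;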

- Richard Huxton

Re: Creating tons of tables to support a query

From:
Jan Ploski
Date:
On Mon, Sep 09, 2002 at 02:10:47PM +0100, Richard Huxton wrote:
> On Monday 09 Sep 2002 11:52 am, Jan Ploski wrote:
> > The problem is that (apparently?) the whole (dateSent, sectionID) index
> > must be scanned in the second test, while the scnt_241 index simply
> > contains no values and yields quick results.
>
> Have you considered using partial indexes? You can set up something like:
>
> CREATE INDEX msg_idx_9 ON message (dateSent) WHERE sectionID=9
>
> For each section you have - this should allow for the indexing advantage
> without the overhead of separate tables. This feature is non-standard AFAIK
> and is covered in section 7.8 of the manual.

Hello Richard,

thanks for your hint. I did not know about this feature and it sounds like
a perfect fit for the task. Having created the partial index for section
241 (the empty one), I ran into another problem.

EXPLAIN ANALYZE
    select msgnum from message where sectionID = 241 order by dateSent desc limit 1;

Limit  (cost=0.00..56.43 rows=1 width=12) (actual time=0.03..0.03 rows=0 loops=1)
  ->  Index Scan Backward using message_pidx0_241 on message  (cost=0.00..56.43 rows=0 width=12) (actual time=0.02..0.02 rows=0 loops=1)
Total runtime: 0.14 msec

Looks fine. But: 500 iterations of the following code

    loop
        select msgnum from message into v_cnt where sectionID = 241 order by dateSent desc limit 1;
    end loop

report execution time 00:00:00.03179, while 500 iterations of the following (which
is closer to what I need):

    v_sid := 241;

    loop
        select msgnum from message into v_cnt where sectionID = v_sid order by dateSent desc limit 1;
    end loop

take a whopping 00:00:31.442402. Needless to say, I don't have this problem with
select msgnum from scnt_241 into v_cnt order by dateSent desc limit 1. But I'd really
like to drop all these extra tables.

How can I find out what is going on behind the scenes?

Regards -
JPL


Re: Creating tons of tables to support a query

From:
S Dawalt
Date:
Stephan Szabo said:

>
> On Sun, 8 Sep 2002, Jan Ploski wrote:
>
> > I am wondering in particular why an index on message(sectionID, dateSent)
> > does not make these queries comparably fast:
> >
> >     select msgnum from message where
> >         sectionID = ? and
> >         dateSent > ?
> >         order by dateSent
> >         limit 1;
>
> I don't think that'll use an index on (sectionID, dateSent) for the sort
> step.  I think an index on (dateSent,sectionID) might be, however.
>

  I know I've read this before on the list (probably several times).  But
either my skull is too thick or the topic too abstract; why is no index used
for (sectionID, dateSent) while one on (dateSent, sectionID) is?  They are the same
columns, just reversed.  I don't see why that would make a difference.
Is there some rule-of-thumb for determining when an index is used and when
it isn't, rather than trial and error using EXPLAIN?

  Shane



Re: Creating tons of tables to support a query

From:
Tom Lane
Date:
Jan Ploski <jpljpl@gmx.de> writes:
> On Sun, Sep 08, 2002 at 07:49:32PM -0700, Stephan Szabo wrote:
>> On Sun, 8 Sep 2002, Jan Ploski wrote:
>>> I am wondering in particular why an index on message(sectionID, dateSent)
>>> does not make these queries comparably fast:
>>>
>>> select msgnum from message where
>>> sectionID = ? and
>>> dateSent > ?
>>> order by dateSent
>>> limit 1;
>>
>> I don't think that'll use an index on (sectionID, dateSent) for the sort
>> step.  I think an index on (dateSent,sectionID) might be, however.

> Alas, this does not help me further. I did two tests:

Yes, it makes sense that for a little-used section that way wouldn't be
very efficient.  I would suggest that you want to use an index on
(sectionID, dateSent), and that the way to make the system do the
right thing is

select msgnum from message where
sectionID = ? and
dateSent > ?
order by sectionID, dateSent
limit 1;

Without the extra ORDER BY clause, the planner is not smart enough to
see that the requested ordering actually matches the index ordering.
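
Concretely, with an index like this one (name illustrative):

    CREATE INDEX message_section_date_idx ON message (sectionID, dateSent);

the ORDER BY on both columns matches the index ordering exactly, so the
indexscan can stop after producing the first row.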

Another possible gotcha is that depending on datatype details the
planner might be using only one of the two index columns.  As far
as I noticed, you didn't tell us the exact column datatypes or the
exact form in which the comparison values are supplied?

            regards, tom lane

Re: Creating tons of tables to support a query

From:
Richard Huxton
Date:
On Monday 09 Sep 2002 3:22 pm, Jan Ploski wrote:
[snipped - trying partial indexes]
>  But: 500 iterations of the following code
>
>     loop
>         select msgnum from message into v_cnt where sectionID = 241
>             order by dateSent desc limit 1;
>     end loop
>
> report execution time 00:00:00.03179, while 500 iterations of the following
> (which is closer to what I need):
>
>     v_sid := 241;
>
>     loop
>         select msgnum from message into v_cnt where sectionID = v_sid
>             order by dateSent desc limit 1;
>     end loop
>
> take a whopping 00:00:31.442402. Needless to say, I don't have this problem
> with select msgnum from scnt_241 into v_cnt order by dateSent desc limit 1.
> But I'd really like to drop all these extra tables.
>
> How can I find out what is going on behind the scenes?

Look at the section on logging/debugging in your postgresql.conf file (also
the SET/SHOW commands and the manuals).

However, if that code is plpgsql then the query plan is fixed at compile
time, which might well mean that it isn't using the partial indexes.
Normally you'd use EXECUTE to get around this, but you can't do that with
SELECT...INTO.

There are only two options I can think of at the moment:
1. Rewrite the plpgsql as an sql function/plperl etc (anything dynamic)
2. Take a step back and keep a separate table of message counts per
minute/hour/whatever and keep it up to date with triggers (see the sketch
below).
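
A rough sketch of option 2 (all names illustrative; this also assumes FOUND
is set by UPDATE, which may depend on your release):

    CREATE TABLE msg_counts (
        sectionID  integer NOT NULL,
        bucket     timestamp NOT NULL,   -- e.g. the hour a message arrived in
        n          integer NOT NULL DEFAULT 0,
        PRIMARY KEY (sectionID, bucket)
    );

    CREATE FUNCTION bump_msg_count() RETURNS opaque AS '
    BEGIN
        UPDATE msg_counts SET n = n + 1
            WHERE sectionID = NEW.sectionID
              AND bucket = date_trunc(''hour'', NEW.dateSent);
        IF NOT FOUND THEN
            INSERT INTO msg_counts (sectionID, bucket, n)
                VALUES (NEW.sectionID, date_trunc(''hour'', NEW.dateSent), 1);
        END IF;
        RETURN NEW;
    END;
    ' LANGUAGE 'plpgsql';

    CREATE TRIGGER msg_count_trig AFTER INSERT ON message
        FOR EACH ROW EXECUTE PROCEDURE bump_msg_count();

Counting messages since a given time then means summing a handful of rows
per section instead of scanning the messages themselves.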

Something smarter might occur to me later or (more likely) to someone else,
but I've got to finish something off at the moment.

HTH

- Richard Huxton

Re: Creating tons of tables to support a query

From:
Jan Ploski
Date:
On Mon, Sep 09, 2002 at 11:41:38AM -0400, Tom Lane wrote:
> Yes, it makes sense that for a little-used section that way wouldn't be
> very efficient.  I would suggest that you want to use an index on
> (sectionID, dateSent), and that the way to make the system do the
> right thing is
>
> select msgnum from message where
> sectionID = ? and
> dateSent > ?
> order by sectionID, dateSent
> limit 1;
>
> Without the extra ORDER BY clause, the planner is not smart enough to
> see that the requested ordering actually matches the index ordering.

Tom,

thank you for the advice. Now I get performance comparable to that of
using a partial index with "workarounds", i.e.

select msgnum from message where
    sectionID=9 and
    dateSent>'2000-11-12 02:05:35.94'
    order by sectionID,dateSent limit 1;

works equally fast with an index on message(sectionID, dateSent)
as with a partial index on message where sectionID = 9.

When executed on the empty section 241, the statement is way
faster than the partial-index-based solution (0.67 seconds vs 8.5 seconds),
probably because I had to resort to trickery to make the latter work
reasonably at all.

What I still cannot grasp is why

select msgnum into v_cnt from message where sectionID = 241
    order by dateSent desc limit 1;

is so much faster than

v_sid := 241;
select msgnum into v_cnt from message where sectionID = v_sid
    order by dateSent desc limit 1;

as I mentioned in an earlier message. In fact, to get the partial index
solution up to decent performance, I had to write something like

delete from tmp;
execute
    ''insert into tmp select msgnum from message where sectionID = '' ||
    v_sid || '' order by dateSent desc limit 1'';
select max(msgnum) into v_cnt from tmp;

Following Richard's suggestion, I turned query debugging on in
postgresql.conf. What jumps out at me in the log is

Shared blocks:        132 read      (sectionID = 241)
vs
Shared blocks:        1517 read     (sectionID = v_sid)

in postgres usage stats output after executing the query.

When I wrap the select statement in the awkward execute,
I get "152 read", which is still much better than "1517 read",
if that (apart from execution time) can be taken as an indicator
of performance. Do you have any clues?

> Another possible gotcha is that depending on datatype details the
> planner might be using only one of the two index columns.  As far
> as I noticed, you didn't tell us the exact column datatypes or the
> exact form in which the comparison values are supplied?

The column types are integer for sectionID and timestamp for dateSent.
I am passing parameters of these types into a PL/pgSQL procedure, which then
executes a "select into" with these parameters in the where clause.

Take care -
JPL


Re: Creating tons of tables to support a query

From:
Tom Lane
Date:
Jan Ploski <jpljpl@gmx.de> writes:
> What I still cannot grasp is why

> select msgnum into v_cnt from message where sectionID = 241
>     order by dateSent desc limit 1;

> is so much faster than

> v_sid := 241;
> select msgnum into v_cnt from message where sectionID = v_sid
>     order by dateSent desc limit 1;

The latter cannot use a partial index because the sectionID value is
a parameter, not a literal constant.  The system has no way to
know that the SELECT won't be re-executed with a different value of
v_sid, so it can't generate a query plan that relies on the specific
value of v_sid.  Thus, no partial-index-using plan will be produced.

You can get around that by judicious use of EXECUTE, because it doesn't
cache a query plan.  But I see no need to; the partial-index approach is
going to be inferior to a correctly used single index anyway, because
the sheer number of indexes will bog things down (especially updates).
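
For completeness, the EXECUTE route would look something like this inside
plpgsql (sketch; assumes r is declared RECORD and that your release supports
FOR ... IN EXECUTE):

    -- Building the statement as a string makes the literal value
    -- visible to the planner at execution time:
    FOR r IN EXECUTE ''select msgnum from message where sectionID = ''
            || v_sid || '' order by dateSent desc limit 1''
    LOOP
        v_cnt := r.msgnum;
    END LOOP;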

>> Another possible gotcha is that depending on datatype details the
>> planner might be using only one of the two index columns.  As far
>> as I noticed, you didn't tell us the exact column datatypes or the
>> exact form in which the comparison values are supplied?

> The column types are integer for sectionID and timestamp for dateSent.
> I am passing parameters of these types into a PL/pgSQL procedure, which then
> executes a "select into" with these parameters in the where clause.

That should be okay.  People tend to get burnt with int2 and int8
columns ...
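
The classic trap is that a bare integer constant compared against an int2
or int8 column keeps the planner away from the index unless you cast
(hypothetical table t with an int8 column id8):

    select * from t where id8 = 42;         -- may refuse the index
    select * from t where id8 = 42::int8;   -- can use the index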

            regards, tom lane

Re: Creating tons of tables to support a query

From:
Tom Lane
Date:
S Dawalt <shane.dawalt@wright.edu> writes:
> select msgnum from message where
> sectionID = ? and
> dateSent > ?
> order by dateSent
> limit 1;
>>
>> I don't think that'll use an index on (sectionID, dateSent) for the sort
>> step.  I think an index on (dateSent,sectionID) might be, however.

>   I know I've read this before on the list (probably several times).  But
> either my skull is too thick or the topic too abstract; why is no index used
> for (sectionID, dateSent) while one on (dateSent, sectionID) is?

The issue is whether the indexscan satisfies the ORDER BY condition or
just the WHERE conditions.  If the planner thinks it needs both an
indexscan and a subsequent SORT step, it is much less likely to choose
the indexscan-based plan --- and rightfully so in this case, since the
LIMIT doesn't help if you have to sort the data before you know which
is the single output row you should return.  That is,
    LIMIT
        INDEXSCAN
can be a very cheap plan, but
    LIMIT
        SORT
            INDEXSCAN
is not likely to be cheap, because the LIMIT helps not at all for
aborting the indexscan or the sort short of completion.

Now, you know and I know that given the constraint "WHERE sectionID = ?"
it would actually be okay to pretend that indexscanning an index on
(sectionID, dateSent) yields data ordered simply by dateSent.  The
planner will not currently make that deduction, however, and so you have
to help it along by asking for your data "ORDER BY sectionID,
dateSent".  The system is able to match that to the sort ordering of the
two-column index and realize that it needs no SORT step.

            regards, tom lane

Re: Creating tons of tables to support a query

От
Martijn van Oosterhout
Дата:
On Mon, Sep 09, 2002 at 11:13:01AM -0400, S Dawalt wrote:
>
> Stephan Szabo said:
>
> >
> > On Sun, 8 Sep 2002, Jan Ploski wrote:
> >
> > > I am wondering in particular why an index on message(sectionID, dateSent)
> > > does not make these queries comparably fast:
> > >
> > >     select msgnum from message where
> > >         sectionID = ? and
> > >         dateSent > ?
> > >         order by dateSent
> > >         limit 1;
> >
> > I don't think that'll use an index on (sectionID, dateSent) for the sort
> > step.  I think an index on (dateSent,sectionID) might be, however.
> >
>
>   I know I've read this before on the list (probably several times).  But
> either my skull is too thick or the topic too abstract; why is no index used
> for (sectionID, dateSent) while one on (dateSent, sectionID) is?  They are the same
> columns, just reversed.  I don't see why that would make a difference.
> Is there some rule-of-thumb for determining when an index is used and when
> it isn't, rather than trial and error using EXPLAIN?

Hmm, take out the order by. How long does it take then? How about trying:

select * from (select msgnum, dateSent
         from message where
         sectionID = ? and
         dateSent > ?) as sub      -- the subquery needs an alias
    order by dateSent limit 1;

maybe that will force the plan you want.
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.