Discussion: Sequences vs statistics


Sequences vs statistics

From:
Herbey Eric Zepeda
Date:
Hi, I'm in the process of deciding which of the following to use:

- one sequence per table, holding that table's row count, or
- the collected statistics and functions such as pg_relation_size

This is my use case:

I have several hundred tables in one database whose size, or number of
records, I would like to monitor constantly.

When a certain threshold is met, I will not allow any further inserts
into that table.
Detecting that the threshold has been met can be approximate.

How much will I affect the performance of the server if I choose to query
the system statistics data, say, every 5 seconds, as opposed to just reading
the last_value of a sequence that gets updated every time a row is added to
its corresponding table?
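
Concretely, assuming a table called my_table and a per-table counter
sequence called my_table_rowcount_seq (placeholder names), the two options
would look roughly like this:

-- option 1: read the dedicated per-table counter sequence
SELECT last_value FROM my_table_rowcount_seq;

-- option 2: use the collected statistics and size functions
SELECT pg_relation_size('my_table');
SELECT n_live_tup FROM pg_stat_user_tables WHERE relname = 'my_table';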

Thank you



Re: Sequences vs statistics

От
David Johnston
Дата:
Herbey Eric Zepeda wrote
> Hi, I'm in the process of deciding which of the following to use:
>
> - one sequence per table, holding that table's row count, or
> - the collected statistics and functions such as pg_relation_size
>
> This is my use case:
>
> I have several hundred tables in one database whose size, or number of
> records, I would like to monitor constantly.
>
> When a certain threshold is met, I will not allow any further inserts
> into that table.
> Detecting that the threshold has been met can be approximate.
>
> How much will I affect the performance of the server if I choose to query
> the system statistics data, say, every 5 seconds, as opposed to just
> reading the last_value of a sequence that gets updated every time a row is
> added to its corresponding table?
>
> Thank you

Assuming you are using "serial" primary keys on these tables (or you add one
if you are not, though it wouldn't have to be a PK), why not add a check
constraint on that column so that the serial values cannot be larger than a
given number?
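
A minimal sketch of that idea, reusing the my_table placeholder and an
arbitrary cap of 100000:

CREATE TABLE my_table (
    id      serial PRIMARY KEY,
    payload text,
    -- inserts fail once the serial has handed out ids above the cap;
    -- rolled-back inserts still consume ids and deletes do not free them,
    -- so the cap is only approximate
    CONSTRAINT my_table_id_cap CHECK (id <= 100000)
);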

I get the feeling I'm handing over a loaded gun without any idea how you plan
to use it...

The use case is one I have not encountered before.  Sequences have
well-documented limitations.  Where do you plan on doing this monitoring?
How large are your thresholds going to be?  Do you plan to allow deleting
data from these tables, thus making the sequence considerably less accurate?
And if enough records are deleted, do you want to re-enable insertions?

Maybe someone can answer the performance comparison question, but if you
really care you should likely set up a test case yourself.  Given the
unusualness of your need, I doubt anything official has been done in this
area, so you may just get "gut feel" answers from some of the more
knowledgeable people on the list.

I'm guessing you are doing multi-tenant data hosting and want to impose some
form of quota on your users, with each user having a dedicated set of tables.
If so (or even if not), it is generally recommended to be open about "why"
you are doing something and not just say what you are trying to do.  You'll
get much better help the more of the big picture you provide.  If so, disk
space is likely more important than record count.  For many tables you can
calculate a reasonable ratio of rows to size, but for some tables (document
storage, for instance) that is impossible and you will have to monitor system
attributes and not just a sequence counter.
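
As a rough sketch of that kind of monitoring, one query over the catalogs
and size functions gives you both numbers per table; keep in mind that
reltuples is only an estimate refreshed by ANALYZE/autovacuum:

SELECT n.nspname || '.' || c.relname   AS table_name,
       pg_total_relation_size(c.oid)   AS total_bytes,
       c.reltuples::bigint             AS estimated_rows
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY total_bytes DESC;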

I guess hard limits to avoid denial-of-service attacks (or mistakes) have
some merit, but I would have an alternative set of thresholds, and a
corresponding monitoring system, to issue soft warnings.  And note that
filling up a table is only one of many vectors that need consideration
(again, this is said without knowing why you feel you need to institute this
limit).

So, in the end, you should probably just figure out how to monitor the system
(catalogs and function calls) and, only if the performance hit becomes too
much, consider ways to get more resources - likely via some kind of sharding
rather than a faster way to monitor.

David J.





--
View this message in context: http://postgresql.1045698.n5.nabble.com/Sequences-vs-statistics-tp5776207p5776211.html
Sent from the PostgreSQL - admin mailing list archive at Nabble.com.


Re: Sequences vs statistics

From:
Elliot
Date:
On 2013-10-28 23:56, David Johnston wrote:
> So, in the end, you should probably just figure out how to monitor the
> system (catalogs and function calls) and, only if the performance hit
> becomes too much, consider ways to get more resources - likely via some
> kind of sharding rather than a faster way to monitor.
>
> David J.

Just some ideas here. You could maybe have a generic trigger function
that looks up the statistics using TG_RELID (so that the same
table-agnostic function can be used across all tables for which you want
this monitoring) and rejects inserts if the estimated row count is over
some threshold. You could also raise a notice if the row count is below
that threshold but above some secondary threshold, to enable warnings
before hitting your limit. It probably wouldn't hurt to raise a similar
notice for inserts that are blocked altogether, so that you have a way of
identifying, outside of the database, when that happens.
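
A minimal sketch of such a trigger, assuming a hard cap of 100000 rows and
a warning threshold of 90000 (both placeholders), reading the n_live_tup
estimate from pg_stat_user_tables:

CREATE OR REPLACE FUNCTION enforce_row_cap() RETURNS trigger AS $$
DECLARE
    est_rows bigint;
BEGIN
    -- estimated live row count for the table the trigger fired on
    SELECT n_live_tup INTO est_rows
    FROM pg_stat_user_tables
    WHERE relid = TG_RELID;

    IF est_rows >= 100000 THEN        -- hard cap (placeholder)
        RAISE NOTICE 'row cap reached on %, insert skipped', TG_RELID::regclass;
        RETURN NULL;                  -- skip the row; INSERT reports 0 rows
    ELSIF est_rows >= 90000 THEN      -- soft warning threshold (placeholder)
        RAISE NOTICE 'row count on % is approaching the cap', TG_RELID::regclass;
    END IF;

    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER row_cap_check
    BEFORE INSERT ON my_table         -- repeat for each monitored table
    FOR EACH ROW EXECUTE PROCEDURE enforce_row_cap();

Since TG_RELID identifies whichever table the trigger fired on, the same
function can be attached to every table you want to cap.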

Your use case is definitely going to play a big part in the decision
making, though. It's pretty easy to bypass the restrictions this approach
gives you (and probably any approach): if your limit were 100 rows, how do
you stop a query that's trying to insert 10,000 rows in one shot? The
row-count estimate won't reflect rows from a statement that is still
running. You might couple this with some post hoc checking using something
like the check_postgres Nagios plugin
(http://bucardo.org/check_postgres/check_postgres.pl.html). Depending on
what you can get away with, you might also consider not using the trigger
for automatic insert prevention at all and instead just use the plugin to
watch table sizes; especially for newer developers coming on board later,
it can be confusing why an insert claims that zero rows were successfully
inserted when a trigger blocks it.

To address your actual question, unless you've got a relatively large
number of tables in your database, looking up statistics values probably
isn't going to be too expensive. Similarly, checking the current value of
a sequence probably isn't too expensive either, but as was pointed out,
that approach can fall out of sync with reality pretty quickly (if you
were to delete rows or manually alter the sequence to a different value).
I'd emphasize maintainability over performance in deciding how to do this.
I think the simplest approach would just be using check_postgres to watch
table sizes, warn if an estimated row count is over a certain threshold,
and notify critical if it's over an even higher threshold.
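
To illustrate that drift, using the my_table and my_table_rowcount_seq
placeholders from earlier:

-- after deleting rows, the counter sequence keeps its old value
SELECT last_value FROM my_table_rowcount_seq;  -- still counts every insert ever made
SELECT count(*) FROM my_table;                 -- the actual row count is now lower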