Discussion: How to check a table's content efficiently? With LIMIT and OFFSET?
Hi, I’d like to monitor a table in a read-only Postgres database by writing a PL/pgSQL function. The database gets periodically overwritten by a mirroring load process. The success criterion is that the table contains at least some records (actually more than 100000). The first idea which comes to one’s mind is probably a count(*): SELECT CASE WHEN (count(*) > 10000) THEN 'yes' ELSE 'no' END FROM planet_osm_point; But my goal is to consume as few Postgres server resources as possible. So my idea is to retrieve an arbitrary record, like this: SELECT CASE WHEN (osm_id::text > '') THEN 'yes' ELSE 'no' END FROM planet_osm_point LIMIT 1 OFFSET 1000000; Does anyone have an even better idea? Yours, Stefan
Hi, That's my solution candidate: CREATE OR REPLACE FUNCTION isnotempty() RETURNS boolean AS ' SELECT (count(*) = 1) FROM (SELECT osm_id FROM planet_osm_point LIMIT 1 OFFSET 100000) tmp ' LANGUAGE SQL; Yours, Stefan 2011/5/28 Stefan Keller <sfkeller@gmail.com>: > Hi, > > I’d like to monitor a table in a read-only Postgres database by > writing a PL/pgSQL function. > The database gets periodically overwritten by a mirroring load process. > The success criterion is that the table contains at least some records > (actually more than 100000). > > The first idea which comes to one’s mind is probably a count(*): > > SELECT CASE WHEN (count(*) > 10000) THEN 'yes' ELSE 'no' END > FROM planet_osm_point; > > But my goal is to consume as few Postgres server resources as possible. > So my idea is to retrieve an arbitrary record, like this: > > SELECT CASE WHEN (osm_id::text > '') THEN 'yes' ELSE 'no' END > FROM planet_osm_point LIMIT 1 OFFSET 1000000; > > Does anyone have an even better idea? > > Yours, Stefan >
On 05/29/2011 05:45 AM, Stefan Keller wrote: > Hi, > > That's my solution candidate: > > CREATE OR REPLACE FUNCTION isnotempty() RETURNS boolean AS ' > SELECT (count(*) = 1) > FROM (SELECT osm_id FROM planet_osm_point LIMIT 1 OFFSET 100000) tmp > ' LANGUAGE SQL; LIMIT and OFFSET are often no more efficient than count(*). You're still likely to need a full table scan. Here's how I'd do it: I'd ANALYZE the table, then check the table statistics to see that they looked to be within reasonable bounds. That way you not only check the import, but in the process you ensure the statistics used by the query planner are up to date. Since ANALYZE only tests a sampling of records it does pretty much what you want, something that it's not so easy to do in SQL. -- Craig Ringer
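[A minimal sketch of the approach Craig describes, assuming the table name and acceptance threshold from the thread; the `loaded_ok` alias is illustrative:]

```sql
-- Refresh the planner statistics for just this table, then check the
-- row-count estimate they produced. reltuples is an estimate maintained
-- by ANALYZE/VACUUM, not an exact count -- good enough for a sanity check.
ANALYZE planet_osm_point;

SELECT CASE WHEN reltuples > 100000 THEN 'yes' ELSE 'no' END AS loaded_ok
FROM pg_class
WHERE relname = 'planet_osm_point';
```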
Hi Craig Thanks for the answer. I also thought about this. You mean something like this? SELECT reltuples FROM pg_class WHERE relname = 'mytable'; 182820 (rows) That seems reasonably fast compared to count(*). But I'm hesitating to use ANALYZE for two reasons: 1. It's very slow: it repeatedly takes 59000 ms on my machine. 2. There's an autovacuum background process which already does the job, doesn't it? Yours, Stefan 2011/5/29 Craig Ringer <craig@postnewspapers.com.au>: > On 05/29/2011 05:45 AM, Stefan Keller wrote: >> >> Hi, >> >> That's my solution candidate: >> >> CREATE OR REPLACE FUNCTION isnotempty() RETURNS boolean AS ' >> SELECT (count(*) = 1) >> FROM (SELECT osm_id FROM planet_osm_point LIMIT 1 OFFSET 100000) tmp >> ' LANGUAGE SQL; > > LIMIT and OFFSET are often no more efficient than count(*). You're still > likely to need a full table scan. > > Here's how I'd do it: I'd ANALYZE the table, then check the table statistics > to see that they looked to be within reasonable bounds. That way you not > only check the import, but in the process you ensure the statistics used by > the query planner are up to date. Since ANALYZE only tests a sampling of > records it does pretty much what you want, something that it's not so easy > to do in SQL. > > -- > Craig Ringer >
On 29 May 2011, at 19:45, Stefan Keller wrote: > But I'm hesitating to use ANALYZE for two reasons: > 1. It's very slow: it repeatedly takes 59000 ms on my machine. ANALYZE on a single table takes 59s?!? That's _really_ long. How big is that table (it has about 180k rows, you did provide that information, but that's not much at all) and how many indexes are on it? Are you sure you're not overburdening your hardware in some way? Or are you in fact talking about a different command? For example, ANALYZE (without specifying a table) or VACUUM ANALYZE <table>? > 2. There's an autovacuum background process which already does the > job, doesn't it? Yes, but in its own time. If you know there has been a batch of inserts/deletes you might as well run analyse immediately on that table. Also, on this mailing-list people don't appreciate it if you top-post. It makes the context hard to decipher and sometimes even makes it difficult to give an accurate answer because the information people want to refer to is far separated from the bit where they're trying to reply to something you said/asked. Remember, people aren't here for your sake. > 2011/5/29 Craig Ringer <craig@postnewspapers.com.au>: >> On 05/29/2011 05:45 AM, Stefan Keller wrote: >>> >>> Hi, >>> >>> That's my solution candidate: >>> >>> CREATE OR REPLACE FUNCTION isnotempty() RETURNS boolean AS ' >>> SELECT (count(*) = 1) >>> FROM (SELECT osm_id FROM planet_osm_point LIMIT 1 OFFSET 100000) tmp >>> ' LANGUAGE SQL; >> >> LIMIT and OFFSET are often no more efficient than count(*). You're still >> likely to need a full table scan. >> >> Here's how I'd do it: I'd ANALYZE the table, then check the table statistics >> to see that they looked to be within reasonable bounds. That way you not >> only check the import, but in the process you ensure the statistics used by >> the query planner are up to date. 
Since ANALYZE only tests a sampling of >> records it does pretty much what you want, something that it's not so easy >> to do in SQL. >> >> -- >> Craig Ringer >> > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general > Alban Hertroys -- Screwing up is an excellent way to attach something to the ceiling.
Hi Alban On 2011/5/29 Alban Hertroys wrote: > On 29 May 2011, at 19:45, Stefan Keller wrote: > >> But I'm hesitating to use ANALYZE for two reasons: >> 1. It's very slow: it repeatedly takes 59000 ms on my machine. > > ANALYZE on a single table takes 59s?!? That's _really_ long. How big is that table (it has about 180k rows, you did provide that information, but that's not much at all) and how many indexes are on it? Are you sure you're not overburdening your hardware in some way? > > Or are you in fact talking about a different command? For example, ANALYZE (without specifying a table) or VACUUM ANALYZE <table>? You are right: I used ANALYZE (without specifying a table). But this still takes about 1 to 3 sec which is about 100 times slower than SELECT (count(*) = 1) FROM (SELECT osm_id FROM planet_osm_point LIMIT 1 OFFSET 100000) tmp; or SELECT reltuples FROM pg_class WHERE relname = 'planet_osm_point'; >> 2. There's an autovacuum background process which already does the >> job, doesn't it? > > Yes, but in its own time. If you know there has been a batch of inserts/deletes you might as well run analyse immediately on that table. My table is a read-only table after all. That's another reason why I'm reluctant to use ANALYZE <table>. > Also, on this mailing-list people don't appreciate it if you top-post. It makes the context > hard to decipher and sometimes even makes it difficult to give an accurate answer > because the information people want to refer to is far separated from the bit where > they're trying to reply to something you said/asked. Remember, people aren't here for your sake. Thanks for the hint, which I didn't know about: Is this really still part of this elderly USENET netiquette here? 
Yours, Stefan >> 2011/5/29 Craig Ringer <craig@postnewspapers.com.au>: >>> On 05/29/2011 05:45 AM, Stefan Keller wrote: >>>> >>>> Hi, >>>> >>>> That's my solution candidate: >>>> >>>> CREATE OR REPLACE FUNCTION isnotempty() RETURNS boolean AS ' >>>> SELECT (count(*) = 1) >>>> FROM (SELECT osm_id FROM planet_osm_point LIMIT 1 OFFSET 100000) tmp >>>> ' LANGUAGE SQL; >>> >>> LIMIT and OFFSET are often no more efficient than count(*). You're still >>> likely to need a full table scan. >>> >>> Here's how I'd do it: I'd ANALYZE the table, then check the table statistics >>> to see that they looked to be within reasonable bounds. That way you not >>> only check the import, but in the process you ensure the statistics used by >>> the query planner are up to date. Since ANALYZE only tests a sampling of >>> records it does pretty much what you want, something that it's not so easy >>> to do in SQL. >>> >>> -- >>> Craig Ringer >>> > > Alban Hertroys > > -- > Screwing up is an excellent way to attach something to the ceiling.
On 05/30/2011 05:55 AM, Stefan Keller wrote: > Hi Alban > > On 2011/5/29 Alban Hertroys wrote: >> On 29 May 2011, at 19:45, Stefan Keller wrote: >> >>> But I'm hesitating to use ANALYZE for two reasons: >>> 1. It's very slow: it repeatedly takes 59000 ms on my machine. >> >> ANALYZE on a single table takes 59s?!? That's _really_ long. How big is that table (it has about 180k rows, you did provide that information, but that's not much at all) and how many indexes are on it? Are you sure you're not overburdening your hardware in some way? >> >> Or are you in fact talking about a different command? For example, ANALYZE (without specifying a table) or VACUUM ANALYZE <table>? > > You are right: I used ANALYZE (without specifying a table). But this > still takes about 1 to 3 sec which is about 100 times slower than Hmm, ok. I would've expected ANALYZE on the table to be much, much faster than your LIMIT ... OFFSET query. If the LIMIT ... OFFSET approach works better for you, use that. > SELECT (count(*) = 1) FROM (SELECT osm_id FROM planet_osm_point LIMIT > 1 OFFSET 100000) tmp; Instead of (count(*)=1) try writing: SELECT EXISTS(SELECT osm_id FROM planet_osm_point LIMIT 1 OFFSET 100000); (untested, but I'm pretty sure that's right). Just a readability tweak; it shouldn't make any real difference in performance. > SELECT reltuples FROM pg_class WHERE relname = 'planet_osm_point'; That is using cached statistics. If you have just done a batch update then it is *not* trustworthy without running ANALYZE tablename; first. -- Craig Ringer
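[Folded back into Stefan's function, Craig's EXISTS tweak might look like this -- a sketch only, reusing the function name, table, and threshold from the thread:]

```sql
-- EXISTS expresses the intent directly: "is there a row past the 100000th?"
-- It returns a boolean without any counting, though Postgres must still
-- read past the first 100000 rows to satisfy the OFFSET.
CREATE OR REPLACE FUNCTION isnotempty() RETURNS boolean AS $$
    SELECT EXISTS (
        SELECT 1 FROM planet_osm_point LIMIT 1 OFFSET 100000
    );
$$ LANGUAGE SQL;

-- Usage:
-- SELECT isnotempty();
```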
On Sun, May 29, 2011 at 4:55 PM, Stefan Keller <sfkeller@gmail.com> wrote: > >>> 2. There's an autovacuum background process which already does the >>> job, doesn't it? >> >> Yes, but in its own time. If you know there has been a batch of inserts/deletes you might as well run analyse immediately on that table. > > My table is a read-only table after all. > That's another reason why I'm reluctant to use ANALYZE <table>. > sorry, i don't follow that... why do you think that a read-only table doesn't need an ANALYZE? -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: Soporte y capacitación de PostgreSQL
Hi Jaime 2011/5/30 Jaime Casanova <jaime@2ndquadrant.com> wrote: > On Sun, May 29, 2011 at 4:55 PM, Stefan Keller <sfkeller@gmail.com> wrote: >> >>>> 2. There's an autovacuum background process which already does the >>>> job, doesn't it? >>> >>> Yes, but in its own time. If you know there has been a batch of inserts/deletes you might as well run analyse immediately on that table. >> >> My table is a read-only table after all. >> That's another reason why I'm reluctant to use ANALYZE <table>. >> > > sorry, i don't follow that... why do you think that a read-only table > doesn't need an ANALYZE? Thanks for joining the discussion. I'm only reluctant to do an ANALYZE as part of a periodic (hourly) "check table contents" function. Such an ANALYZE command is already included in the periodic (nightly) update script which mirrors OpenStreetMap data. BTW: I've asked before for the best parameters over at pgsql-performance ("How to configure a read-only database server?", 19 April 2011) and any hints are still welcome. Yours, Stefan
On 29 May 2011, at 23:55, Stefan Keller wrote: > Hi Alban > > On 2011/5/29 Alban Hertroys wrote: >> On 29 May 2011, at 19:45, Stefan Keller wrote: >> >>> But I'm hesitating to use ANALYZE for two reasons: >>> 1. It's very slow: it repeatedly takes 59000 ms on my machine. >> >> ANALYZE on a single table takes 59s?!? That's _really_ long. How big is that table (it has about 180k rows, you did provide that information, but that's not much at all) and how many indexes are on it? Are you sure you're not overburdening your hardware in some way? >> >> Or are you in fact talking about a different command? For example, ANALYZE (without specifying a table) or VACUUM ANALYZE <table>? > > You are right: I used ANALYZE (without specifying a table). But this > still takes about 1 to 3 sec which is about 100 times slower than > > SELECT (count(*) = 1) FROM (SELECT osm_id FROM planet_osm_point LIMIT > 1 OFFSET 100000) tmp; > or Still 1 to 3 seconds? It should be faster than a full table-scan, as it just takes samples across the table. Perhaps you have many indexes, or some of an uncommon type? Or is your system I/O-bound perhaps? It does need to write those statistics to a system table at some point, in which case an I/O-bound system won't perform very well of course. > SELECT reltuples FROM pg_class WHERE relname = 'planet_osm_point'; That will only work if you want to know the total amount of rows in the table. If you need to know how many rows will match a specific WHERE-clause, this falls on its behind. For cases like that you can use the output of EXPLAIN <query>, but that will not be very accurate since it uses statistical information about value distribution and such of your data (which is gathered by ANALYSE). If you really only need the total number of records and if you're indeed inserting/deleting in batches, then it's probably best to create statement-level INSERT/DELETE triggers (that could call the same function). 
I've done this in the past using a row-level trigger, but my data came in live. For batches of data it's probably more efficient to call a statement-level trigger once per query than a row-level one for each row. I think you can obtain the number of modified rows from GET DIAGNOSTICS, off the top of my head. >>> 2. There's an autovacuum background process which already does the >>> job, doesn't it? >> >> Yes, but in its own time. If you know there has been a batch of inserts/deletes you might as well run analyse immediately on that table. > > My table is a read-only table after all. > That's another reason why I'm reluctant to use ANALYZE <table>. You probably won't need to run it as often as every time you need to know the number of rows in it. If the data doesn't change, then the row-count in the statistics won't either. You probably do want to run this after a batch-INSERT/DELETE, or your row-counts will be inaccurate until auto-vacuum comes along. >> Also, on this mailing-list people don't appreciate it if you top-post. It makes the context >> hard to decipher and sometimes even makes it difficult to give an accurate answer >> because the information people want to refer to is far separated from the bit where >> they're trying to reply to something you said/asked. Remember, people aren't here for your sake. > > Thank you for the hint, which I didn't know: > Is this really still part of this elderly USENET netiquette here? There's nothing elderly about it. If you're communicating with a large number of people at once, the requirements change. You don't want to make it difficult on people to follow a thread they possibly didn't follow earlier, or they either won't bother to answer or they only pick up the latest bit of the thread. In both cases the chances that their answers will be irrelevant are quite significant, provided they even do reply. In the end it has little to do with style and much more with common sense. 
The format used here (as well as in USENET) is more suitable for mailing lists. Alban Hertroys -- If you can't see the forest for the trees, cut the trees and you'll see there is no forest.
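[The trigger-maintained row count Alban suggests upthread could be sketched roughly like this. This is the row-level variant he says he used; the counter table and trigger names are illustrative assumptions, and statement-level triggers of this era cannot see the affected rows directly:]

```sql
-- Cache the row count in a one-row side table, kept current by triggers.
CREATE TABLE planet_osm_point_rowcount (n bigint NOT NULL);
INSERT INTO planet_osm_point_rowcount
    SELECT count(*) FROM planet_osm_point;

CREATE OR REPLACE FUNCTION track_rowcount() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE planet_osm_point_rowcount SET n = n + 1;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE planet_osm_point_rowcount SET n = n - 1;
    END IF;
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER planet_osm_point_count
    AFTER INSERT OR DELETE ON planet_osm_point
    FOR EACH ROW EXECUTE PROCEDURE track_rowcount();

-- The monitoring check then becomes a cheap single-row read:
-- SELECT n > 100000 FROM planet_osm_point_rowcount;
```

For a bulk-loaded table this trades a small per-row cost at load time for an exact, instant count at check time.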