Discussion: [rfc] overhauling pgstat.stat
Hi, I'm considering overhauling pgstat.stat, and would like to know how many people are interested in this topic. As you may know, this file can be hundreds of MB in size, because pgstat.stat holds all access statistics in each database, and it needs to read/write an entire pgstat.stat frequently. As a result, pgstat.stat often generates massive I/O operations, particularly when having a large number of tables in the database. To support multi-tenancy or just a large number of tables (up to 10k tables in a single database), I think pgstat.stat needs to be overhauled. I think using heap and btree in pgstat.stat would be preferred, to reduce read/write and to allow updating access statistics for specific tables in the pgstat.stat file. Is this good for us? Any comments or suggestions? Regards, -- Satoshi Nagayasu <snaga@uptime.jp> Uptime Technologies, LLC. http://www.uptime.jp
On Wed, Sep 4, 2013 at 6:40 AM, Satoshi Nagayasu <snaga@uptime.jp> wrote: > Hi, > > I'm considering overhauling pgstat.stat, and would like to know how many > people are interested in this topic. > > As you may know, this file can be hundreds of MB in size, because > pgstat.stat holds all access statistics in each database, and it needs > to read/write an entire pgstat.stat frequently. > > As a result, pgstat.stat often generates massive I/O operations, > particularly when having a large number of tables in the database. > > To support multi-tenancy or just a large number of tables (up to 10k > tables in a single database), I think pgstat.stat needs to be overhauled. > > I think using heap and btree in pgstat.stat would be preferred to reduce > read/write and to allow updating access statistics for specific tables > in the pgstat.stat file. > > Is this good for us? Hi, Nice thought. I/O reduction in pgstat can be really helpful. I am trying to think of our aim here. Would we be looking to split pgstat per table, so that the I/O write happens for only a portion of pgstat? Or reduce the I/O in general? If the latter, how would using a BTree help us? I would rather go for a range tree or something. But again, I may be completely wrong. Please elaborate a bit more on the solution we are trying to achieve. It seems really interesting. Regards, Atri -- Regards, Atri l'apprenant
Satoshi Nagayasu wrote: > As you may know, this file can be hundreds of MB in size, because > pgstat.stat holds all access statistics in each database, and it needs > to read/write an entire pgstat.stat frequently. > > As a result, pgstat.stat often generates massive I/O operations, > particularly when having a large number of tables in the database. We already changed it: commit 187492b6c2e8cafc5b39063ca3b67846e8155d24 Author: Alvaro Herrera <alvherre@alvh.no-ip.org> Date: Mon Feb 18 17:56:08 2013 -0300 Split pgstat file in smaller pieces We now write one file per database and one global file, instead of having the whole thing in a single huge file. This reduces the I/O that must be done when partial data is required -- which is all the time, because each process only needs information on its own database anyway. Also, the autovacuum launcher does not need data about tables and functions in each database; having the global stats for all DBs is enough. Catalog version bumped because we have a new subdir under PGDATA. Author: Tomas Vondra. Some rework by Álvaro Testing by Jeff Janes Other discussion by Heikki Linnakangas, Tom Lane. I don't oppose further tweaking, of course, but I wonder if you are considering these changes. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi, (2013/09/04 13:07), Alvaro Herrera wrote: > Satoshi Nagayasu wrote: > >> As you may know, this file can be hundreds of MB in size, because >> pgstat.stat holds all access statistics in each database, and it needs >> to read/write an entire pgstat.stat frequently. >> >> As a result, pgstat.stat often generates massive I/O operations, >> particularly when having a large number of tables in the database. > > We already changed it: > > commit 187492b6c2e8cafc5b39063ca3b67846e8155d24 > Author: Alvaro Herrera <alvherre@alvh.no-ip.org> > Date: Mon Feb 18 17:56:08 2013 -0300 > > Split pgstat file in smaller pieces Thanks for the comments. I forgot to mention that. Yes, we have already split the single pgstat.stat file into several pieces. However, we still need to read/write a large amount of statistics data when we have a large number of tables in a single database, or multiple databases being accessed. Right? I think the issue here is that it is necessary to write/read statistics data even when it has not actually changed. So, I'm wondering how we can minimize read/write operations on these statistics data files using heap and btree. Regards, > > commit 187492b6c2e8cafc5b39063ca3b67846e8155d24 > Author: Alvaro Herrera <alvherre@alvh.no-ip.org> > Date: Mon Feb 18 17:56:08 2013 -0300 > > Split pgstat file in smaller pieces > > We now write one file per database and one global file, instead of > having the whole thing in a single huge file. This reduces the I/O that > must be done when partial data is required -- which is all the time, > because each process only needs information on its own database anyway. > Also, the autovacuum launcher does not need data about tables and > functions in each database; having the global stats for all DBs is > enough. > > Catalog version bumped because we have a new subdir under PGDATA. > > Author: Tomas Vondra. Some rework by Álvaro > Testing by Jeff Janes > Other discussion by Heikki Linnakangas, Tom Lane. 
> > > I don't oppose further tweaking, of course, but I wonder if you are > considering these changes. > -- Satoshi Nagayasu <snaga@uptime.jp> Uptime Technologies, LLC. http://www.uptime.jp
Hi, (2013/09/04 12:52), Atri Sharma wrote: > On Wed, Sep 4, 2013 at 6:40 AM, Satoshi Nagayasu <snaga@uptime.jp> wrote: >> Hi, >> >> I'm considering overhauling pgstat.stat, and would like to know how many >> people are interested in this topic. >> >> As you may know, this file can be hundreds of MB in size, because >> pgstat.stat holds all access statistics in each database, and it needs >> to read/write an entire pgstat.stat frequently. >> >> As a result, pgstat.stat often generates massive I/O operations, >> particularly when having a large number of tables in the database. >> >> To support multi-tenancy or just a large number of tables (up to 10k >> tables in a single database), I think pgstat.stat needs to be overhauled. >> >> I think using heap and btree in pgstat.stat would be preferred to reduce >> read/write and to allow updating access statistics for specific tables >> in the pgstat.stat file. >> >> Is this good for us? > > Hi, > > Nice thought. I/O reduction in pgstat can be really helpful. > > I am trying to think of our aim here. Would we be looking to split > pgstat per table, so that the I/O write happens for only a portion of > pgstat? Or reduce the I/O in general? I prefer the latter. Under the current implementation, DBAs need to split a single database into many smaller databases, taking the access locality of the tables into account. That's difficult, and the locality could change in the future. And splitting the statistics data into many files (per table, for example) would cause another performance issue when collecting/showing statistics at once. Just my guess though. So, I'm looking for a new way to reduce I/O for the statistics data in general. Regards, > > If the latter, how would using a BTree help us? I would rather go for a > range tree or something. But again, I may be completely wrong. > > Please elaborate a bit more on the solution we are trying to > achieve. It seems really interesting. 
> > Regards, > > Atri > > -- Satoshi Nagayasu <snaga@uptime.jp> Uptime Technologies, LLC. http://www.uptime.jp
Sent from my iPad On 04-Sep-2013, at 10:54, Satoshi Nagayasu <snaga@uptime.jp> wrote: > Hi, > > (2013/09/04 12:52), Atri Sharma wrote: >> On Wed, Sep 4, 2013 at 6:40 AM, Satoshi Nagayasu <snaga@uptime.jp> wrote: >>> Hi, >>> >>> I'm considering overhauling pgstat.stat, and would like to know how many >>> people are interested in this topic. >>> >>> As you may know, this file could be handreds of MB in size, because >>> pgstat.stat holds all access statistics in each database, and it needs >>> to read/write an entire pgstat.stat frequently. >>> >>> As a result, pgstat.stat often generates massive I/O operation, >>> particularly when having a large number of tables in the database. >>> >>> To support multi-tenancy or just a large number of tables (up to 10k >>> tables in single database), I think pgstat.stat needs to be overhauled. >>> >>> I think using heap and btree in pgstat.stat would be preferred to reduce >>> read/write and to allow updating access statistics for specific tables >>> in pgstat.stat file. >>> >>> Is this good for us? >> >> Hi, >> >> Nice thought. I/O reduction in pgstat can be really helpful. >> >> I am trying to think of our aim here. Would we be looking to split >> pgstat per table, so that the I/O write happens for only a portion of >> pgstat? Or reduce the I/O in general? > > I prefer the latter. > > Under the current implementation, DBA need to split single database > into many smaller databases with considering access locality of the > tables. It's difficult and could be change in future. > > And splitting the statistics data into many files (per table, > for example) would cause another performance issue when > collecting/showing statistics at once. Just my guess though. > > So, I'm looking for a new way to reduce I/O for the statistics data > in general. > > Regards, > >> >> If the later, how would using BTree help us? I would rather go for a >> range tree or something. But again, I may be completely wrong. 
>> >> Please elaborate a bit more on the solution we are trying to >> achieve. It seems really interesting. >> >> Regards, >> >> Atri >> >> > > Right, thanks. How would using heap and BTree help here? Are we looking at a priority queue which supports the main storage system of the stats? Regards, Atri
(2013/09/04 15:23), Atri Sharma wrote: > > > Sent from my iPad > > On 04-Sep-2013, at 10:54, Satoshi Nagayasu <snaga@uptime.jp> wrote: > >> Hi, >> >> (2013/09/04 12:52), Atri Sharma wrote: >>> On Wed, Sep 4, 2013 at 6:40 AM, Satoshi Nagayasu <snaga@uptime.jp> wrote: >>>> Hi, >>>> >>>> I'm considering overhauling pgstat.stat, and would like to know how many >>>> people are interested in this topic. >>>> >>>> As you may know, this file could be handreds of MB in size, because >>>> pgstat.stat holds all access statistics in each database, and it needs >>>> to read/write an entire pgstat.stat frequently. >>>> >>>> As a result, pgstat.stat often generates massive I/O operation, >>>> particularly when having a large number of tables in the database. >>>> >>>> To support multi-tenancy or just a large number of tables (up to 10k >>>> tables in single database), I think pgstat.stat needs to be overhauled. >>>> >>>> I think using heap and btree in pgstat.stat would be preferred to reduce >>>> read/write and to allow updating access statistics for specific tables >>>> in pgstat.stat file. >>>> >>>> Is this good for us? >>> >>> Hi, >>> >>> Nice thought. I/O reduction in pgstat can be really helpful. >>> >>> I am trying to think of our aim here. Would we be looking to split >>> pgstat per table, so that the I/O write happens for only a portion of >>> pgstat? Or reduce the I/O in general? >> >> I prefer the latter. >> >> Under the current implementation, DBA need to split single database >> into many smaller databases with considering access locality of the >> tables. It's difficult and could be change in future. >> >> And splitting the statistics data into many files (per table, >> for example) would cause another performance issue when >> collecting/showing statistics at once. Just my guess though. >> >> So, I'm looking for a new way to reduce I/O for the statistics data >> in general. >> >> Regards, >> >>> >>> If the later, how would using BTree help us? 
I would rather go for a >>> range tree or something. But again, I may be completely wrong. >>> >>> Please elaborate a bit more on the solution we are trying to >>> achieve. It seems really interesting. >>> >>> Regards, >>> >>> Atri >>> >>> >> >> > > Right, thanks. > > How would using heap and BTree help here? Are we looking at a priority queue which supports the main storage system of the stats? For example, when you read only a single block from your table, then you need to write all values in your database statistics next. It often generates a large amount of I/O. However, if random access is allowed in the statistics, you can update only a single record for the specific table which you read. It would be less than 100 bytes for each table. I have no idea how a priority queue could work here so far. However, if the statistics were overhauled, PostgreSQL would be able to host a much larger number of customers more easily. Regards, -- Satoshi Nagayasu <snaga@uptime.jp> Uptime Technologies, LLC. http://www.uptime.jp
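The contrast Satoshi describes - rewriting the whole statistics file versus overwriting one small record in place - can be sketched roughly as follows. The struct layout, field names, and file handling are purely illustrative assumptions, not the actual pgstat.stat on-disk format:

```c
#include <stdio.h>

/* Hypothetical fixed-size per-table stats entry (~100 bytes);
 * this is NOT the real pgstat.stat record layout. */
typedef struct StatsEntry
{
    unsigned int tableoid;          /* which table this entry describes */
    long long    n_tuples_fetched;
    long long    n_blocks_read;
    char         padding[76];       /* pad to a fixed record size */
} StatsEntry;

/* Overwrite a single entry in place: seek to its slot and rewrite
 * ~100 bytes, instead of rewriting the entire statistics file. */
int
update_entry_in_place(const char *path, long slot, const StatsEntry *entry)
{
    FILE *f = fopen(path, "r+b");

    if (f == NULL)
        return -1;
    if (fseek(f, slot * (long) sizeof(StatsEntry), SEEK_SET) != 0 ||
        fwrite(entry, sizeof(StatsEntry), 1, f) != 1)
    {
        fclose(f);
        return -1;
    }
    fclose(f);
    return 0;
}
```

With a fixed record size, updating the stats for one table costs a seek plus a write of under a hundred bytes, regardless of how many other tables the database holds.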
Satoshi, * Satoshi Nagayasu (snaga@uptime.jp) wrote: > (2013/09/04 13:07), Alvaro Herrera wrote: > >We already changed it: > > > > commit 187492b6c2e8cafc5b39063ca3b67846e8155d24 > > Author: Alvaro Herrera <alvherre@alvh.no-ip.org> > > Date: Mon Feb 18 17:56:08 2013 -0300 > > > > Split pgstat file in smaller pieces > > Thanks for the comments. I forgot to mention that. > > Yes, we have already split single pgstat.stat file into > several pieces. > > However, we still need to read/write large amount of statistics > data when we have a large number of tables in single database > or multiple databases being accessed. Right? Would simply also splitting per tablespace help? > I think the issue here is that it is necessary to write/read > statistics data even it's not actually changed. > > So, I'm wondering how we can minimize read/write operations > on these statistics data files with using heap and btree. It does sound like an interesting idea to use heap/btree instead but I wonder about the effort involved, particularly around coordinating access. We wouldn't want to end up introducing additional contention points by doing this.. Thanks, Stephen
Sent from my iPad On 04-Sep-2013, at 15:39, Satoshi Nagayasu <snaga@uptime.jp> wrote: > (2013/09/04 15:23), Atri Sharma wrote: >> >> >> Sent from my iPad >> >> On 04-Sep-2013, at 10:54, Satoshi Nagayasu <snaga@uptime.jp> wrote: >> >>> Hi, >>> >>> (2013/09/04 12:52), Atri Sharma wrote: >>>> On Wed, Sep 4, 2013 at 6:40 AM, Satoshi Nagayasu <snaga@uptime.jp> wrote: >>>>> Hi, >>>>> >>>>> I'm considering overhauling pgstat.stat, and would like to know how many >>>>> people are interested in this topic. >>>>> >>>>> As you may know, this file could be handreds of MB in size, because >>>>> pgstat.stat holds all access statistics in each database, and it needs >>>>> to read/write an entire pgstat.stat frequently. >>>>> >>>>> As a result, pgstat.stat often generates massive I/O operation, >>>>> particularly when having a large number of tables in the database. >>>>> >>>>> To support multi-tenancy or just a large number of tables (up to 10k >>>>> tables in single database), I think pgstat.stat needs to be overhauled. >>>>> >>>>> I think using heap and btree in pgstat.stat would be preferred to reduce >>>>> read/write and to allow updating access statistics for specific tables >>>>> in pgstat.stat file. >>>>> >>>>> Is this good for us? >>>> >>>> Hi, >>>> >>>> Nice thought. I/O reduction in pgstat can be really helpful. >>>> >>>> I am trying to think of our aim here. Would we be looking to split >>>> pgstat per table, so that the I/O write happens for only a portion of >>>> pgstat? Or reduce the I/O in general? >>> >>> I prefer the latter. >>> >>> Under the current implementation, DBA need to split single database >>> into many smaller databases with considering access locality of the >>> tables. It's difficult and could be change in future. >>> >>> And splitting the statistics data into many files (per table, >>> for example) would cause another performance issue when >>> collecting/showing statistics at once. Just my guess though. 
>>> >>> So, I'm looking for a new way to reduce I/O for the statistics data >>> in general. >>> >>> Regards, >>> >>>> >>>> If the latter, how would using a BTree help us? I would rather go for a >>>> range tree or something. But again, I may be completely wrong. >>>> >>>> Please elaborate a bit more on the solution we are trying to >>>> achieve. It seems really interesting. >>>> >>>> Regards, >>>> >>>> Atri >> >> Right, thanks. >> >> How would using heap and BTree help here? Are we looking at a priority queue which supports the main storage system of the stats? > > For example, when you read only a single block from your table, > then you need to write all values in your database statistics next. > It often generates a large amount of I/O. > > However, if random access is allowed in the statistics, you can > update only a single record for the specific table which you read. > It would be less than 100 bytes for each table. > > I have no idea how a priority queue could work here so far. > However, if the statistics were overhauled, PostgreSQL would be able > to host a much larger number of customers Ah, now I get it. Thanks a ton for the detailed explanation. Yes, a BTree will sufficiently isolate per-table stats here and allow for random access. Another thing I can think of is having a write-back cache which could probably be used as a buffer before the actual stats write. I am just musing here though. Regards, Atri
2013/9/4 Atri Sharma <atri.jiit@gmail.com>
Sent from my iPad
On 04-Sep-2013, at 15:39, Satoshi Nagayasu <snaga@uptime.jp> wrote:
> (2013/09/04 15:23), Atri Sharma wrote:
>>
>>
>> Sent from my iPad
>>
>> On 04-Sep-2013, at 10:54, Satoshi Nagayasu <snaga@uptime.jp> wrote:
>>
>>> Hi,
>>>
>>> (2013/09/04 12:52), Atri Sharma wrote:
>>>> On Wed, Sep 4, 2013 at 6:40 AM, Satoshi Nagayasu <snaga@uptime.jp> wrote:
>>>>> Hi,
>>>>>
>>>>> I'm considering overhauling pgstat.stat, and would like to know how many
>>>>> people are interested in this topic.
>>>>>
>>>>> As you may know, this file can be hundreds of MB in size, because
>>>>> pgstat.stat holds all access statistics in each database, and it needs
>>>>> to read/write an entire pgstat.stat frequently.
>>>>>
>>>>> As a result, pgstat.stat often generates massive I/O operation,
>>>>> particularly when having a large number of tables in the database.
>>>>>
>>>>> To support multi-tenancy or just a large number of tables (up to 10k
>>>>> tables in single database), I think pgstat.stat needs to be overhauled.
>>>>>
>>>>> I think using heap and btree in pgstat.stat would be preferred to reduce
>>>>> read/write and to allow updating access statistics for specific tables
>>>>> in pgstat.stat file.
>>>>>
>>>>> Is this good for us?
>>>>
>>>> Hi,
>>>>
>>>> Nice thought. I/O reduction in pgstat can be really helpful.
>>>>
>>>> I am trying to think of our aim here. Would we be looking to split
>>>> pgstat per table, so that the I/O write happens for only a portion of
>>>> pgstat? Or reduce the I/O in general?
>>>
>>> I prefer the latter.
>>>
>>> Under the current implementation, DBA need to split single database
>>> into many smaller databases with considering access locality of the
>>> tables. It's difficult and could be change in future.
>>>
>>> And splitting the statistics data into many files (per table,
>>> for example) would cause another performance issue when
>>> collecting/showing statistics at once. Just my guess though.
>>>
>>> So, I'm looking for a new way to reduce I/O for the statistics data
>>> in general.
>>>
>>> Regards,
>>>
>>>>
>>>> If the latter, how would using a BTree help us? I would rather go for a
>>>> range tree or something. But again, I may be completely wrong.
>>>>
>>>> Please elaborate a bit more on the solution we are trying to
>>>> achieve. It seems really interesting.
>>>>
>>>> Regards,
>>>>
>>>> Atri
>>
>> Right, thanks.
>>
>> How would using heap and BTree help here? Are we looking at a priority queue which supports the main storage system of the stats?
>
> For example, when you read only a single block from your table,
> then you need to write all values in your database statistics next.
> It often generates large amount of i/o operation.
>
> However, if random access is allowed in the statistics, you can
> update only a single record for the specific table which you read.
> It would be less than 100 bytes for each table.
>
> I have no idea about how a priority queue can work here so far.
> However, if the statistics is overhauled, PostgreSQL would be able
> to host a much larger number of customers
Ah, now I get it. Thanks a ton for the detailed explanation.
Yes, a BTree will sufficiently isolate per table stats here and allow for random access.
Another thing I can think of is having a write back cache which could probably be used for a buffer before the actual stats write. I am just musing here though.
we very successfully use a tmpfs volume for the pgstat files (we use a backport of the multiple-statfile change from 9.3 to 9.1)
Regards
Pavel
Regards,
Atri
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 4.9.2013 07:24, Satoshi Nagayasu wrote: > Hi, > > (2013/09/04 12:52), Atri Sharma wrote: >> On Wed, Sep 4, 2013 at 6:40 AM, Satoshi Nagayasu <snaga@uptime.jp> >> wrote: >>> Hi, >>> >>> I'm considering overhauling pgstat.stat, and would like to know >>> how many people are interested in this topic. >>> >>> As you may know, this file could be handreds of MB in size, >>> because pgstat.stat holds all access statistics in each database, >>> and it needs to read/write an entire pgstat.stat frequently. >>> >>> As a result, pgstat.stat often generates massive I/O operation, >>> particularly when having a large number of tables in the >>> database. >>> >>> To support multi-tenancy or just a large number of tables (up to >>> 10k tables in single database), I think pgstat.stat needs to be >>> overhauled. >>> >>> I think using heap and btree in pgstat.stat would be preferred to >>> reduce read/write and to allow updating access statistics for >>> specific tables in pgstat.stat file. >>> >>> Is this good for us? >> >> Hi, >> >> Nice thought. I/O reduction in pgstat can be really helpful. >> >> I am trying to think of our aim here. Would we be looking to split >> pgstat per table, so that the I/O write happens for only a portion >> of pgstat? Or reduce the I/O in general? > > I prefer the latter. > > Under the current implementation, DBA need to split single database > into many smaller databases with considering access locality of the > tables. It's difficult and could be change in future. > > And splitting the statistics data into many files (per table, for > example) would cause another performance issue when > collecting/showing statistics at once. Just my guess though. > > So, I'm looking for a new way to reduce I/O for the statistics data > in general. Hi, as one of the authors of the 9.3 patch (per database stats), I planned to work on this a bit more in the 9.4 cycle. So a few comments / ideas. I'm not entirely sure splitting the stats per table would be that bad. 
After all, we already have per-relation data files, so either the users already have serious problems (and this won't make it noticeably worse) or it will work fine. But I'm not saying it's the right choice either. My idea was to keep the per-database stats, but allow some sort of "random" access - updating / deleting the records in place, adding records etc. The simplest way I could think of was adding a simple "index" - a mapping of OID to position in the stat file, i.e. a simple array of (oid, offset) pairs, stored in oid.stat.index or something like that. This would make it quite simple to access an existing record: (1) get the position from the index, (2) read sizeof(Entry) bytes from the file, (3) if it's an update, just overwrite the bytes; for a delete, set an is_deleted flag (which needs to be added to all entries); or read all the records (just read the whole file, as today). regards Tomas
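Tomas's (oid, offset) mapping could look something like this minimal sketch. The struct and the lookup are assumptions about what such an index might contain, not a worked-out on-disk format:

```c
#include <stddef.h>

/* Hypothetical index entry: maps a table OID to the byte offset of its
 * record in the per-database stat file (the oid.stat.index idea). */
typedef struct StatIndexEntry
{
    unsigned int oid;
    long         offset;
} StatIndexEntry;

/*
 * Binary-search an index sorted by OID; returns the file offset of the
 * record, or -1 if the OID has no record yet.
 */
long
stat_index_lookup(const StatIndexEntry *idx, size_t n, unsigned int oid)
{
    size_t lo = 0;
    size_t hi = n;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (idx[mid].oid < oid)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo < n && idx[lo].oid == oid)
        return idx[lo].offset;
    return -1;
}
```

With the offset in hand, step (2) becomes a single read of sizeof(Entry) bytes at that position and step (3) a single overwrite; a delete only flips the is_deleted flag instead of compacting the file.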
On 4.9.2013 14:43, Pavel Stehule wrote: > > > > 2013/9/4 Atri Sharma <atri.jiit@gmail.com <mailto:atri.jiit@gmail.com>> > > > > Sent from my iPad > > On 04-Sep-2013, at 15:39, Satoshi Nagayasu <snaga@uptime.jp > <mailto:snaga@uptime.jp>> wrote: > > > (2013/09/04 15:23), Atri Sharma wrote: > >> > >> > >> Sent from my iPad > >> > >> On 04-Sep-2013, at 10:54, Satoshi Nagayasu <snaga@uptime.jp > <mailto:snaga@uptime.jp>> wrote: > >> > >>> Hi, > >>> > >>> (2013/09/04 12:52), Atri Sharma wrote: > >>>> On Wed, Sep 4, 2013 at 6:40 AM, Satoshi Nagayasu > <snaga@uptime.jp <mailto:snaga@uptime.jp>> wrote: > >>>>> Hi, > >>>>> > >>>>> I'm considering overhauling pgstat.stat, and would like to > know how many > >>>>> people are interested in this topic. > >>>>> > >>>>> As you may know, this file could be handreds of MB in size, > because > >>>>> pgstat.stat holds all access statistics in each database, and > it needs > >>>>> to read/write an entire pgstat.stat frequently. > >>>>> > >>>>> As a result, pgstat.stat often generates massive I/O operation, > >>>>> particularly when having a large number of tables in the database. > >>>>> > >>>>> To support multi-tenancy or just a large number of tables (up > to 10k > >>>>> tables in single database), I think pgstat.stat needs to be > overhauled. > >>>>> > >>>>> I think using heap and btree in pgstat.stat would be preferred > to reduce > >>>>> read/write and to allow updating access statistics for > specific tables > >>>>> in pgstat.stat file. > >>>>> > >>>>> Is this good for us? > >>>> > >>>> Hi, > >>>> > >>>> Nice thought. I/O reduction in pgstat can be really helpful. > >>>> > >>>> I am trying to think of our aim here. Would we be looking to split > >>>> pgstat per table, so that the I/O write happens for only a > portion of > >>>> pgstat? Or reduce the I/O in general? > >>> > >>> I prefer the latter. 
> >>> > >>> Under the current implementation, DBA need to split single database > >>> into many smaller databases with considering access locality of the > >>> tables. It's difficult and could be change in future. > >>> > >>> And splitting the statistics data into many files (per table, > >>> for example) would cause another performance issue when > >>> collecting/showing statistics at once. Just my guess though. > >>> > >>> So, I'm looking for a new way to reduce I/O for the statistics data > >>> in general. > >>> > >>> Regards, > >>> > >>>> > >>>> If the later, how would using BTree help us? I would rather go > for a > >>>> range tree or something. But again, I may be completely wrong. > >>>> > >>>> Please elaborate a bit more on the solution we are trying to > >>>> achieve.It seems really interesting. > >>>> > >>>> Regards, > >>>> > >>>> Atri > >> > >> Right,thanks. > >> > >> How would using heap and BTree help here? Are we looking at a > priority queue which supports the main storage system of the stats? > > > > For example, when you read only a single block from your table, > > then you need to write all values in your database statistics next. > > It often generates large amount of i/o operation. > > > > However, if random access is allowed in the statistics, you can > > update only as single record for the specific table which you read. > > It would be less than 100 bytes for each table. > > > > I have no idea about how a priority queue can work here so far. > > However, if the statistics is overhauled, PostgreSQL would be able > > to host a much larger number of customers > > > Ah, now I get it. Thanks a ton for the detailed explanation. > > Yes, a BTree will sufficiently isolate per table stats here and > allow for random access. > > Another thing I can think of is having a write back cache which > could probably be used for a buffer before the actual stats write. I > am just musing here though. 
> > > we very successfully use a tmpfs volume for pgstat files (use a backport > of multiple statfiles from 9.3 to 9.1 It works quite well as long as you have the objects (tables, indexes, functions) spread across multiple databases. Once you have one database with a very large number of objects, tmpfs is not as effective. It's going to help with stats I/O, but it's not going to help with high CPU usage (you're reading and parsing the stat files over and over) and every rewrite creates a copy of the file. So if you have 400MB stats, you will need 800MB tmpfs + some slack (say, 200MB). That means you'll use ~1GB tmpfs although 400MB would be just fine. And this 600MB won't be used for page cache etc. OTOH, it's true that if you have that many objects, 600MB of RAM is not going to help you anyway. Tomas
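The sizing arithmetic above (a rewrite temporarily needs a second full-size copy of the stats file, plus some slack) can be written down explicitly; the 400MB/200MB figures are just the example from Tomas's message:

```c
/* tmpfs must hold the old stats file plus the full-size temporary copy
 * written during a rewrite, plus some slack for growth.
 * E.g. 400MB stats + 200MB slack -> ~1GB tmpfs. */
long
tmpfs_needed_mb(long stats_mb, long slack_mb)
{
    return 2 * stats_mb + slack_mb;
}
```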
>> we very successfully use a tmpfs volume for pgstat files (use a backport
>> of multiple statfiles from 9.3 to 9.1)
>
> It works quite well as long as you have the objects (tables, indexes,
> functions) spread across multiple databases. Once you have one database
> with a very large number of objects, tmpfs is not as effective.
>
> It's going to help with stats I/O, but it's not going to help with high
> CPU usage (you're reading and parsing the stat files over and over) and
> every rewrite creates a copy of the file. So if you have 400MB stats,
> you will need 800MB tmpfs + some slack (say, 200MB). That means you'll
> use ~1GB tmpfs although 400MB would be just fine. And this 600MB won't
> be used for page cache etc.
>
> OTOH, it's true that if you have that many objects, 600MB of RAM is not
> going to help you anyway.
>
> Tomas

And just an idea - can we use a database for storing these files? It could
use unlogged tables. A second idea - have one bgworker act as a persistent
in-memory key-value store and hold the data in memory with some
optimizations - using anti-caching and similar in-memory database features.

Pavel
On 4.9.2013 14:13, Stephen Frost wrote: > * Satoshi Nagayasu (snaga@uptime.jp) wrote: > >> Yes, we have already split single pgstat.stat file into several >> pieces. >> >> However, we still need to read/write a large amount of statistics >> data when we have a large number of tables in a single database or >> multiple databases being accessed. Right? > > Would simply also splitting per tablespace help? I don't think that's a good solution. Forcing the users to use tablespaces just to minimize the stats overhead is not that far from forcing them to use multiple databases. Tablespaces have disadvantages too, so I'd suggest looking for a solution meeting the "just works" criterion. >> I think the issue here is that it is necessary to write/read >> statistics data even if it's not actually changed. >> >> So, I'm wondering how we can minimize read/write operations on >> these statistics data files by using heap and btree. > > It does sound like an interesting idea to use heap/btree instead but > I wonder about the effort involved, particularly around coordinating > access. We wouldn't want to end up introducing additional contention > points by doing this.. Yes, this is definitely an important thing to keep in mind. Maintaining the "index" (no matter what kind of index will be used) will cause some contention, but I believe it will be manageable with a bit of effort. Tomas
Tomas Vondra wrote: > My idea was to keep the per-database stats, but allow some sort of > "random" access - updating / deleting the records in place, adding > records etc. The simplest way I could think of was adding a simple > "index" - a mapping of OID to position in the stat file. > > I.e. a simple array of (oid, offset) pairs, stored in oid.stat.index or > something like that. This would make it quite simple to access existing > record > > 1: get position from the index > 2: read sizeof(Entry) from the file > 3: if it's update, just overwrite the bytes, for delete set isdeleted > flag (needs to be added to all entries) > > or reading all the records (just read the whole file as today). Sounds reasonable. However, I think the index should be a real index, i.e. have a tree structure that can be walked down, not just a plain array. If you have a 400 MB stat file, then you must have about 4 million tables, and you will not want to scan such a large array every time you want to find an entry. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 4.9.2013 20:59, Alvaro Herrera wrote: > Tomas Vondra wrote: > >> My idea was to keep the per-database stats, but allow some sort of >> "random" access - updating / deleting the records in place, adding >> records etc. The simplest way I could think of was adding a simple >> "index" - a mapping of OID to position in the stat file. >> >> I.e. a simple array of (oid, offset) pairs, stored in oid.stat.index or >> something like that. This would make it quite simple to access existing >> record >> >> 1: get position from the index >> 2: read sizeof(Entry) from the file >> 3: if it's update, just overwrite the bytes, for delete set isdeleted >> flag (needs to be added to all entries) >> >> or reading all the records (just read the whole file as today). > > Sounds reasonable. However, I think the index should be a real index, > i.e. have a tree structure that can be walked down, not just a plain > array. If you have a 400 MB stat file, then you must have about 4 > million tables, and you will not want to scan such a large array every > time you want to find an entry. I was thinking about a sorted array, so a bisection would be a simple and fast way to search. New items could be added to another small unsorted array (say, 1000 elements) and this would be extended and resorted only when this small one gets full. Tomas
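The scheme Tomas describes - a sorted (oid, offset) array searched by bisection, plus a small unsorted overflow array that is merged and resorted only when it fills up - might look roughly like this. A Python sketch of the data structure only, under assumed sizes; the real thing would live in C in the stats collector:

```python
import bisect

class StatIndex:
    """Maps a table OID to its byte offset in the stat file.

    Lookups bisect the sorted main arrays; new entries go into a small
    unsorted overflow list that is merged in only when it gets full.
    """
    OVERFLOW_LIMIT = 1000  # assumed size of the small unsorted array

    def __init__(self):
        self.oids = []      # sorted OIDs
        self.offsets = []   # file offsets, parallel to self.oids
        self.overflow = []  # recent (oid, offset) pairs, unsorted

    def insert(self, oid, offset):
        self.overflow.append((oid, offset))
        if len(self.overflow) >= self.OVERFLOW_LIMIT:
            self._merge()

    def _merge(self):
        """Fold the overflow list into the sorted arrays and resort."""
        pairs = sorted(list(zip(self.oids, self.offsets)) + self.overflow)
        self.oids = [o for o, _ in pairs]
        self.offsets = [off for _, off in pairs]
        self.overflow = []

    def lookup(self, oid):
        # Check the small overflow list first (linear, but bounded);
        # newest entry wins.
        for o, off in reversed(self.overflow):
            if o == oid:
                return off
        i = bisect.bisect_left(self.oids, oid)
        if i < len(self.oids) and self.oids[i] == oid:
            return self.offsets[i]
        return None
```

A lookup is O(log n) plus at most OVERFLOW_LIMIT comparisons, and the expensive resort is amortized over a thousand inserts - which is the trade-off against a real tree index being discussed here.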
(2013/09/05 3:50), Pavel Stehule wrote: > > we very successfully use a tmpfs volume for pgstat files (use a > backport > > of multiple statfiles from 9.3 to 9.1) > > It works quite well as long as you have the objects (tables, indexes, > functions) spread across multiple databases. Once you have one database > with a very large number of objects, tmpfs is not as effective. > > It's going to help with stats I/O, but it's not going to help with high > CPU usage (you're reading and parsing the stat files over and over) and > every rewrite creates a copy of the file. So if you have 400MB stats, > you will need 800MB tmpfs + some slack (say, 200MB). That means you'll > use ~1GB tmpfs although 400MB would be just fine. And this 600MB won't > be used for page cache etc. > > OTOH, it's true that if you have that many objects, 600MB of RAM is not > going to help you anyway. > > And just an idea - can we use a database for storing these files? It could > use unlogged tables. A second idea - have one bgworker act as a persistent > in-memory key-value store and hold the data in memory with some > optimizations - using anti-caching and similar in-memory database features. Yeah, I'm interested in this idea too. If the stat collector has a dedicated connection to the backend in order to store statistics into dedicated tables, we can easily take advantage of index (btree, or hash?) and heap storage. Is this worth trying? Regards, -- Satoshi Nagayasu <snaga@uptime.jp> Uptime Technologies, LLC. http://www.uptime.jp
(2013/09/05 3:59), Alvaro Herrera wrote: > Tomas Vondra wrote: > >> My idea was to keep the per-database stats, but allow some sort of >> "random" access - updating / deleting the records in place, adding >> records etc. The simplest way I could think of was adding a simple >> "index" - a mapping of OID to position in the stat file. >> >> I.e. a simple array of (oid, offset) pairs, stored in oid.stat.index or >> something like that. This would make it quite simple to access existing >> record >> >> 1: get position from the index >> 2: read sizeof(Entry) from the file >> 3: if it's update, just overwrite the bytes, for delete set isdeleted >> flag (needs to be added to all entries) >> >> or reading all the records (just read the whole file as today). > > Sounds reasonable. However, I think the index should be a real index, > i.e. have a tree structure that can be walked down, not just a plain > array. If you have a 400 MB stat file, then you must have about 4 > million tables, and you will not want to scan such a large array every > time you want to find an entry. I thought an array structure at first. But, for now, I think we should have a real index for the statistics data because we already have several index storages, and it will allow us to minimize read/write operations. BTW, what kind of index would be preferred for this purpose? btree or hash? If we use btree, do we need "range scan" thing on the statistics tables? I have no idea so far. Regards, -- Satoshi Nagayasu <snaga@uptime.jp> Uptime Technologies, LLC. http://www.uptime.jp
Sent from my iPad On 05-Sep-2013, at 8:58, Satoshi Nagayasu <snaga@uptime.jp> wrote: > (2013/09/05 3:59), Alvaro Herrera wrote: >> Tomas Vondra wrote: >> >>> My idea was to keep the per-database stats, but allow some sort of >>> "random" access - updating / deleting the records in place, adding >>> records etc. The simplest way I could think of was adding a simple >>> "index" - a mapping of OID to position in the stat file. >>> >>> I.e. a simple array of (oid, offset) pairs, stored in oid.stat.index or >>> something like that. This would make it quite simple to access existing >>> record >>> >>> 1: get position from the index >>> 2: read sizeof(Entry) from the file >>> 3: if it's update, just overwrite the bytes, for delete set isdeleted >>> flag (needs to be added to all entries) >>> >>> or reading all the records (just read the whole file as today). >> >> Sounds reasonable. However, I think the index should be a real index, >> i.e. have a tree structure that can be walked down, not just a plain >> array. If you have a 400 MB stat file, then you must have about 4 >> million tables, and you will not want to scan such a large array every >> time you want to find an entry. > > I thought an array structure at first. > > But, for now, I think we should have a real index for the > statistics data because we already have several index storages, > and it will allow us to minimize read/write operations. > > BTW, what kind of index would be preferred for this purpose? > btree or hash? > > If we use btree, do we need "range scan" thing on the statistics > tables? I have no idea so far. > The thing I am interested in is range scan. That is the reason I wish to explore range tree usage here, maybe as a secondary index. Regards, Atri
Satoshi Nagayasu wrote: > But, for now, I think we should have a real index for the > statistics data because we already have several index storages, > and it will allow us to minimize read/write operations. > > BTW, what kind of index would be preferred for this purpose? > btree or hash? I find it hard to get excited about using the AM interface for this purpose. To me it makes a lot more sense to have separate, much simpler code. We don't need any transactionality, user defined types, user defined operators, or anything like that. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Sep 5, 2013 at 10:59 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Satoshi Nagayasu wrote: > >> But, for now, I think we should have a real index for the >> statistics data because we already have several index storages, >> and it will allow us to minimize read/write operations. >> >> BTW, what kind of index would be preferred for this purpose? >> btree or hash? > > I find it hard to get excited about using the AM interface for this > purpose. To me it makes a lot more sense to have separate, much > simpler code. We don't need any transactionality, user defined types, > user defined operators, or anything like that. +1. But, would not rewriting a lot of existing functionalities potentially lead to points of contention and/or much more effort? Regards, Atri -- Regards, Atri l'apprenant
On 5.9.2013 09:36, Atri Sharma wrote: > On Thu, Sep 5, 2013 at 10:59 AM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: >> Satoshi Nagayasu wrote: >> >>> But, for now, I think we should have a real index for the >>> statistics data because we already have several index storages, >>> and it will allow us to minimize read/write operations. >>> >>> BTW, what kind of index would be preferred for this purpose? >>> btree or hash? >> >> I find it hard to get excited about using the AM interface for >> this purpose. To me it makes a lot more sense to have separate, >> much simpler code. We don't need any transactionality, user >> defined types, user defined operators, or anything like that. > > +1. > > But, would not rewriting a lot of existing functionalities > potentially lead to points of contention and/or much more effort? Don't forget the stats are written only by the postmaster, all the regular backends only read it (and eventually send updates back). But yes, contention might be a problem, because there will have to be some kind of locking that is not needed now when the postmaster is writing a fresh copy into a new file. But I think we need to implement something and then measure this. Because even with the contention, the overall performance might actually improve. I'd vote to try a simple approach first - adding some simple array 'index' allowing fast access to particular records etc. At least that was my plan. But feel free to implement something more advanced (e.g. a BTree storage) and we can compare the results. Tomas
On 5.9.2013 07:29, Alvaro Herrera wrote: > Satoshi Nagayasu wrote: > >> But, for now, I think we should have a real index for the >> statistics data because we already have several index storages, and >> it will allow us to minimize read/write operations. >> >> BTW, what kind of index would be preferred for this purpose? btree >> or hash? > > I find it hard to get excited about using the AM interface for this > purpose. To me it makes a lot more sense to have separate, much > simpler code. We don't need any transactionality, user defined > types, user defined operators, or anything like that. +1 to these concerns. And I think using regular tables might actually cause more harm than benefits. For example, let's say we have a large database with many objects (which is the aim of this thread) with high activity - sessions accessing objects, i.e. updating many "rows" in the stats tables. Now, the stats table is likely to bloat thanks to the MVCC copy-on-update. Not a good thing, and it might easily happen that the price for maintenance of the table will be much higher than what we saved. There are probably other similar scenarios. Tomas
Sent from my iPad On 08-Sep-2013, at 4:27, Tomas Vondra <tv@fuzzy.cz> wrote: > On 5.9.2013 09:36, Atri Sharma wrote: >> On Thu, Sep 5, 2013 at 10:59 AM, Alvaro Herrera >> <alvherre@2ndquadrant.com> wrote: >>> Satoshi Nagayasu wrote: >>> >>>> But, for now, I think we should have a real index for the >>>> statistics data because we already have several index storages, >>>> and it will allow us to minimize read/write operations. >>>> >>>> BTW, what kind of index would be preferred for this purpose? >>>> btree or hash? >>> >>> I find it hard to get excited about using the AM interface for >>> this purpose. To me it makes a lot more sense to have separate, >>> much simpler code. We don't need any transactionality, user >>> defined types, user defined operators, or anything like that. >> >> +1. >> >> But, would not rewriting a lot of existing functionalities >> potentially lead to points of contention and/or much more effort? > > Don't forget the stats are written only by the postmaster, all the > regular backends only read it (and eventually send updates back). But > yes, contention might be a problem, because there will have to be some > kind of locking that is not needed now when the postmaster is writing a > fresh copy into a new file. > > But I think we need to implement something and then measure this. > Because even with the contention, the overall performance might > actually improve. > > I'd vote to try a simple approach first - adding some simple array > 'index' allowing fast access to particular records etc. At least that > was my plan. But feel free to implement something more advanced (e.g. a > BTree storage) and we can compare the results. > > +1 on the simple implementation and measure approach. My focus here is to identify what kind of queries we expect to be serving from the stats. I think someone mentioned range queries upthread. I feel we should be looking at range trees as a secondary index, if not the primary storage for pgstat. Regards, Atri
Sent from my iPad On 08-Sep-2013, at 4:27, Tomas Vondra <tv@fuzzy.cz> wrote: > On 5.9.2013 09:36, Atri Sharma wrote: >> On Thu, Sep 5, 2013 at 10:59 AM, Alvaro Herrera >> <alvherre@2ndquadrant.com> wrote: >>> Satoshi Nagayasu wrote: >>> >>>> But, for now, I think we should have a real index for the >>>> statistics data because we already have several index storages, >>>> and it will allow us to minimize read/write operations. >>>> >>>> BTW, what kind of index would be preferred for this purpose? >>>> btree or hash? >>> >>> I find it hard to get excited about using the AM interface for >>> this purpose. To me it makes a lot more sense to have separate, >>> much simpler code. We don't need any transactionality, user >>> defined types, user defined operators, or anything like that. >> >> +1. >> >> But, would not rewriting a lot of existing functionalities >> potentially lead to points of contention and/or much more effort? > > Don't forget the stats are written only by the postmaster, all the > regular backends only read it (and eventually send updates back). But > yes, contention might be a problem, because there will have to be some > kind of locking that is not needed now when the postmaster is writing a > fresh copy into a new file. > > But I think we need to implement something and then measure this. > Because even with the contention, the overall performance might > actually improve. > > I'd vote to try a simple approach first - adding some simple array > 'index' allowing fast access to particular records etc. At least that > was my plan. But feel free to implement something more advanced (e.g. a > BTree storage) and we can compare the results. Another thing I would want to explore is if we could somehow prioritise the more frequently accessed records to reduce their access times even more. I am thinking on the lines of self organising lists. I am not sure if and how it would be possible to implement this idea of self organising in a BTree or any other tree though. 
Regards, Atri
On Tue, Sep 3, 2013 at 10:09 PM, Satoshi Nagayasu <snaga@uptime.jp> wrote: > Hi, > > > (2013/09/04 13:07), Alvaro Herrera wrote: >> >> Satoshi Nagayasu wrote: >> >>> As you may know, this file could be hundreds of MB in size, because >>> pgstat.stat holds all access statistics in each database, and it needs >>> to read/write an entire pgstat.stat frequently. >>> >>> As a result, pgstat.stat often generates massive I/O operation, >>> particularly when having a large number of tables in the database. >> >> >> We already changed it: > >> >> commit 187492b6c2e8cafc5b39063ca3b67846e8155d24 >> Author: Alvaro Herrera <alvherre@alvh.no-ip.org> >> Date: Mon Feb 18 17:56:08 2013 -0300 >> >> Split pgstat file in smaller pieces > > Thanks for the comments. I forgot to mention that. > > Yes, we have already split single pgstat.stat file into > several pieces. > > However, we still need to read/write a large amount of statistics > data when we have a large number of tables in a single database > or multiple databases being accessed. Right? Do you have a test case for measuring this? I vaguely remember from when I was testing the split patch, that I thought that after that improvement the load that was left was so low that there was little point in optimizing it further. Cheers, Jeff
On 8.9.2013 23:04, Jeff Janes wrote: > On Tue, Sep 3, 2013 at 10:09 PM, Satoshi Nagayasu <snaga@uptime.jp> > wrote: >> Hi, >> >> >> (2013/09/04 13:07), Alvaro Herrera wrote: >>> >>> Satoshi Nagayasu wrote: >>> >>>> As you may know, this file could be handreds of MB in size, >>>> because pgstat.stat holds all access statistics in each >>>> database, and it needs to read/write an entire pgstat.stat >>>> frequently. >>>> >>>> As a result, pgstat.stat often generates massive I/O operation, >>>> particularly when having a large number of tables in the >>>> database. >>> >>> >>> We already changed it: >> >>> >>> commit 187492b6c2e8cafc5b39063ca3b67846e8155d24 Author: Alvaro >>> Herrera <alvherre@alvh.no-ip.org> Date: Mon Feb 18 17:56:08 >>> 2013 -0300 >>> >>> Split pgstat file in smaller pieces >> >> Thanks for the comments. I forgot to mention that. >> >> Yes, we have already split single pgstat.stat file into several >> pieces. >> >> However, we still need to read/write large amount of statistics >> data when we have a large number of tables in single database or >> multiple databases being accessed. Right? > > Do you have a test case for measuring this? I vaguely remember from > when I was testing the split patch, that I thought that after that > improvement the load that was left was so low that there was little > point in optimizing it further. This is actually a pretty good point. Creating a synthetic test case is quite simple - just create 1.000.000 tables in a single database, but I'm wondering if it's actually realistic. Do we have a real-world example where the current "one stat file per db" is not enough? The reason why I worked on the split patch is that our application is slightly crazy and creates a lot of tables (+ indexes) on the fly, and as we have up to a thousand databases on each host, we often ended up with a huge stat file. 
Splitting the stat file improved that considerably, although that's partially because we have the stats on a tmpfs, so I/O is not a problem, and the CPU overhead is negligible thanks to splitting the stats per database. But AFAIK there are operating systems where creating a filesystem in RAM is not that simple - e.g. Windows. In such cases even a moderate number of objects may be a significant issue I/O-wise. But then again, I can't really think of a reasonable system creating that many objects in a single database (except for e.g. a shared database using schemas instead of databases). Tomas
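The synthetic test case mentioned above - a single database stuffed with a huge number of tables - is easy to script. A sketch that just emits the DDL for psql to run; the table names, column layout, and count are arbitrary choices for illustration:

```python
def make_ddl(n_tables, prefix="stats_test"):
    """Yield CREATE TABLE statements for a synthetic many-table database."""
    for i in range(n_tables):
        yield ("CREATE TABLE %s_%06d (id serial PRIMARY KEY, "
               "payload text);" % (prefix, i))

# Usage idea: write the statements to a file and feed them to psql, e.g.
#   python gen_tables.py > tables.sql && psql testdb -f tables.sql
# For the scenario in this thread, n_tables would be on the order of
# 100k-1M; touching each table afterwards forces stats entries for all
# of them.
sample = list(make_ddl(3))
```

Note that merely creating the tables is not enough to grow pgstat.stat - each table needs at least some access activity before the collector tracks it.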
(2013/09/09 8:19), Tomas Vondra wrote: > On 8.9.2013 23:04, Jeff Janes wrote: >> On Tue, Sep 3, 2013 at 10:09 PM, Satoshi Nagayasu <snaga@uptime.jp> >> wrote: >>> Hi, >>> >>> >>> (2013/09/04 13:07), Alvaro Herrera wrote: >>>> >>>> Satoshi Nagayasu wrote: >>>> >>>>> As you may know, this file could be hundreds of MB in size, >>>>> because pgstat.stat holds all access statistics in each >>>>> database, and it needs to read/write an entire pgstat.stat >>>>> frequently. >>>>> >>>>> As a result, pgstat.stat often generates massive I/O operation, >>>>> particularly when having a large number of tables in the >>>>> database. >>>> >>>> >>>> We already changed it: >>> >>>> >>>> commit 187492b6c2e8cafc5b39063ca3b67846e8155d24 Author: Alvaro >>>> Herrera <alvherre@alvh.no-ip.org> Date: Mon Feb 18 17:56:08 >>>> 2013 -0300 >>>> >>>> Split pgstat file in smaller pieces >>> >>> Thanks for the comments. I forgot to mention that. >>> >>> Yes, we have already split single pgstat.stat file into several >>> pieces. >>> >>> However, we still need to read/write a large amount of statistics >>> data when we have a large number of tables in a single database or >>> multiple databases being accessed. Right? >> >> Do you have a test case for measuring this? I vaguely remember from >> when I was testing the split patch, that I thought that after that >> improvement the load that was left was so low that there was little >> point in optimizing it further. > > This is actually a pretty good point. Creating a synthetic test case is > quite simple - just create 1.000.000 tables in a single database, but > I'm wondering if it's actually realistic. Do we have a real-world > example where the current "one stat file per db" is not enough? I have several assumptions for that. - Single shared database contains thousands of customers. - Each customer has hundreds of tables and indexes. - Customers are separated by schemas (namespaces) in a single database. 
- Application server uses connection pooling for performance reasons. - Workload (locality in the table access) cannot be predicted. Looks reasonable? > The reason why I worked on the split patch is that our application is > slightly crazy and creates a lot of tables (+ indexes) on the fly, and > as we have up to a thousand databases on each host, we often ended up > with a huge stat file. > > Splitting the stat file improved that considerably, although that's > partially because we have the stats on a tmpfs, so I/O is not a problem, > and the CPU overhead is negligible thanks to splitting the stats per > database. I agree that splitting a single large database into several pieces, like thousands of tiny databases, could be an option in some cases. However, what I intend here is eliminating those limitations on database design. In fact, when considering connection pooling, splitting a database is not a good idea, because AFAIK, many connection poolers manage connections per database. So, I'd like to support 100k tables in a single database. Any comments? Regards, > But AFAIK there are operating systems where creating a filesystem in RAM > is not that simple - e.g. Windows. In such cases even a moderate number > of objects may be a significant issue I/O-wise. But then again, I can't > really think of a reasonable system creating that many objects in a > single database (except for e.g. a shared database using schemas instead > of databases). > > Tomas > > -- Satoshi Nagayasu <snaga@uptime.jp> Uptime Technologies, LLC. http://www.uptime.jp
On Sat, Sep 7, 2013 at 6:57 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > Don't forget the stats are written only by the postmaster, all the > regular backends only read it (and eventually send updates back). The postmaster, or the stats collector? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 9 September 2013, 18:50, Robert Haas wrote: > On Sat, Sep 7, 2013 at 6:57 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> Don't forget the stats are written only by the postmaster, all the >> regular backends only read it (and eventually send updates back). > > The postmaster, or the stats collector? Stats collector, of course. I meant to point out that the write activity comes from a single dedicated process, which may not be that obvious, and I somehow managed to name the incorrect one. Tomas