Обсуждение: Best practices to manage custom statistics
Hi,
     I'm checking if there's a best way to obtain stastistics based on
my database tables
Here's the scenario.
First of all, technical details:
- Postgresql 9.1, Ubuntu 12 on a 4 core, 32 GB machine with 600 GB disk
migrating to Postgresql 9.5.3, Debian 8 on a 8-core, 52 GB machine with
2 TB disk.
- 350 databases, 350 users, every user connects to his own database and
his teammates' (max 10 in total) so each user can connect to max 10
databases at a time
My application needs to achieve a certain number of statistics (how many
records are in a certain state, how many are in another state) to send
back to user.
This is obtained, at the moment, with a select count(*) from ..... (that
involves 4 joins on 4 tables) to be run run every 20 secs from each
client connected to the cluster (ATM about 650 clients configured, about
200 concurrent) to each database it has rights to connect.
I noticed that in some cases, especially when working with not-so-small
datasets (200k rows x 95 cols), and sometines returning not-so-small
datasets (10k rows) the query performs not so well, but the worst thing
is that it raises overall server load (I/O) and bandwidth usage. While
bandwidth is not a problem (I have spikes at 20 Mbps while "normal"
traffic speed is at about 3Mbps, but I have 1 Gbps available), server
load *is* a main problem, because in high-access periods
(summer/holidays) I see my server load go up to 22-25 on a 4-core
machine, and users call complaining for timeouts and slowness.
Even if I'm migrating to a better instance, I'm still trying to
"normalize" this feature.
I can start looking at indices (I'm not quite sure that those fields in
WHERE clause are all indexed), but I don't think it would boost its
performance.
I thought about having a table, say, 'tbl_counters', like this
CREATE TABLE tbl_counters{
uuid coduser,
int counter1,
int counter2,
....
int counterx
};
updated by trigger (when a value in a table is INSERTed/UPDATEd/DELETEd
it fires a function that increments/decrements values for counter x at
user y).
Just to avoid eventual trigger skipping its update, one time a day (say
at 2 am) a cron performs the above "monster query" for every database
and adjusts, if necessary, counter values for each user reflecting real
values.
In your experience, would this approach help me lower server load?
Are there any other approach I can try?
If more details are needed, just ask.
Thanks in advance and sorry for the long message (but I had to explain
such a complex thing)
Moreno.-
			
		On 11/08/2016 12:13 PM, Moreno Andreo wrote:
> Hi,
>     I'm checking if there's a best way to obtain stastistics based on my
> database tables
>
> Here's the scenario.
> First of all, technical details:
> - Postgresql 9.1, Ubuntu 12 on a 4 core, 32 GB machine with 600 GB disk
> migrating to Postgresql 9.5.3, Debian 8 on a 8-core, 52 GB machine with
> 2 TB disk.
> - 350 databases, 350 users, every user connects to his own database and
> his teammates' (max 10 in total) so each user can connect to max 10
> databases at a time
>
>
> My application needs to achieve a certain number of statistics (how many
> records are in a certain state, how many are in another state) to send
> back to user.
> This is obtained, at the moment, with a select count(*) from ..... (that
> involves 4 joins on 4 tables) to be run run every 20 secs from each
> client connected to the cluster (ATM about 650 clients configured, about
> 200 concurrent) to each database it has rights to connect.
>
> I noticed that in some cases, especially when working with not-so-small
> datasets (200k rows x 95 cols), and sometines returning not-so-small
> datasets (10k rows) the query performs not so well, but the worst thing
> is that it raises overall server load (I/O) and bandwidth usage. While
> bandwidth is not a problem (I have spikes at 20 Mbps while "normal"
> traffic speed is at about 3Mbps, but I have 1 Gbps available), server
> load *is* a main problem, because in high-access periods
> (summer/holidays) I see my server load go up to 22-25 on a 4-core
> machine, and users call complaining for timeouts and slowness.
>
> Even if I'm migrating to a better instance, I'm still trying to
> "normalize" this feature.
> I can start looking at indices (I'm not quite sure that those fields in
> WHERE clause are all indexed), but I don't think it would boost its
> performance.
>
> I thought about having a table, say, 'tbl_counters', like this
>
> CREATE TABLE tbl_counters{
> uuid coduser,
> int counter1,
> int counter2,
> ....
> int counterx
> };
> updated by trigger (when a value in a table is INSERTed/UPDATEd/DELETEd
> it fires a function that increments/decrements values for counter x at
> user y).
> Just to avoid eventual trigger skipping its update, one time a day (say
> at 2 am) a cron performs the above "monster query" for every database
> and adjusts, if necessary, counter values for each user reflecting real
> values.
>
> In your experience, would this approach help me lower server load?
> Are there any other approach I can try?
Instead of pushing why not pull. In other words do the users really
check/need the statistics every 20 secs? Given that you say exact is not
important over the course of day, why not create a mechanism for the
user to poll the database when they need the information.
>
> If more details are needed, just ask.
>
> Thanks in advance and sorry for the long message (but I had to explain
> such a complex thing)
> Moreno.-
>
>
>
--
Adrian Klaver
adrian.klaver@aklaver.com
			
		Sorry for late reply... i'm in some quite rough days.... Il 08/11/2016 21:28, Adrian Klaver ha scritto: > On 11/08/2016 12:13 PM, Moreno Andreo wrote: >> [...] >> >> In your experience, would this approach help me lower server load? >> Are there any other approach I can try? > > Instead of pushing why not pull. Excuse me Adrian, but I can't get what you mean by not pushing but pulling. We are now pulling data from clients for about everything we need... what I'd like to do is either - the database pushes updates when needed, or - the client pulls data from database, but querying a reduced dataset (just a row values with all values for that user and not thousands of rows) > In other words do the users really check/need the statistics every 20 > secs? Ideally, I need a real value to be read when that value changes. But on Earth I'm happy with a consistent value (If it should be 800 and it reads 799 it's not an earthquake) at least on a regular basis. This means that if there's no activity, we will be uselessly polling the database, so here's why I thought about "pushing" data from backend to client, that would be the nearest to ideal solution. > Given that you say exact is not important over the course of day, why > not create a mechanism for the user to poll the database when they > need the information. This is what we did in the past. The result was that users _did not_ update values (clicking an "update" button) and made disasters working with "old" data (they forgot to do it, they didn't want to do it because "it's another click, I waste my time", and so many, even stupid, excuses... but they're the customers, they pay, and here we say that "customer is always right") So we changed: now we check for values and for data (not every 20 but 60 seconds... I just checked the right value). I need something that's lighter for the DB backend, at least for the values procedure. If we had only a database, I think that queries and datasets would be stuck in cache, so response times would be faster. With more than 350 databases, that's not possible (or we have to grow RAM size to values very big...) I've also thought about using LISTEN/NOTIFY to send value updates to client only when needed, but with NPgSQL I read that we need to keep an open connection, and that's not a good idea AFAIK. Thanks Moreno > >> >> If more details are needed, just ask. >> >> Thanks in advance and sorry for the long message (but I had to explain >> such a complex thing) >> Moreno.- >> >> >> > >
On 11/15/2016 07:39 AM, Moreno Andreo wrote: > Sorry for late reply... i'm in some quite rough days.... > > Il 08/11/2016 21:28, Adrian Klaver ha scritto: >> On 11/08/2016 12:13 PM, Moreno Andreo wrote: >>> [...] >>> >>> In your experience, would this approach help me lower server load? >>> Are there any other approach I can try? >> >> Instead of pushing why not pull. > Excuse me Adrian, but I can't get what you mean by not pushing but pulling. > We are now pulling data from clients for about everything we need... > what I'd like to do is either > - the database pushes updates when needed, Pushes updates of what, the statistics you talking about or other data? > or > - the client pulls data from database, but querying a reduced dataset > (just a row values with all values for that user and not thousands of rows) This confuses me given from your OP: " 350 databases, 350 users, every user connects to his own database and his teammates' (max 10 in total) so each user can connect to max 10 databases at a time" "This is obtained, at the moment, with a select count(*) from ..... (that involves 4 joins on 4 tables) to be run run every 20 secs from each client connected to the cluster (ATM about 650 clients configured, about 200 concurrent) to each database it has rights to connect." So does the user need only their data or do they need the other users data also? >> In other words do the users really check/need the statistics every 20 >> secs? > Ideally, I need a real value to be read when that value changes. But on > Earth I'm happy with a consistent value (If it should be 800 and it > reads 799 it's not an earthquake) at least on a regular basis. This > means that if there's no activity, we will be uselessly polling the > database, so here's why I thought about "pushing" data from backend to > client, that would be the nearest to ideal solution. >> Given that you say exact is not important over the course of day, why >> not create a mechanism for the user to poll the database when they >> need the information. > > This is what we did in the past. The result was that users _did not_ > update values (clicking an "update" button) and made disasters working > with "old" data (they forgot to do it, they didn't want to do it because How can they be working with 'old' data? The queries you are running are compiling stats on data that exist at the time they are run and at any point in time between stats runs the user is working with current data regardless of what the last stats say. > "it's another click, I waste my time", and so many, even stupid, > excuses... but they're the customers, they pay, and here we say that > "customer is always right") Except when they are wrong:) Still been there. > > So we changed: now we check for values and for data (not every 20 but > 60 seconds... I just checked the right value). I need something that's > lighter for the DB backend, at least for the values procedure. If we had > only a database, I think that queries and datasets would be stuck in > cache, so response times would be faster. With more than 350 databases, > that's not possible (or we have to grow RAM size to values very big...) > > I've also thought about using LISTEN/NOTIFY to send value updates to > client only when needed, but with NPgSQL I read that we need to keep an > open connection, and that's not a good idea AFAIK. > > Thanks > Moreno > >> >>> >>> If more details are needed, just ask. >>> >>> Thanks in advance and sorry for the long message (but I had to explain >>> such a complex thing) >>> Moreno.- >>> >>> >>> >> >> > > > > -- Adrian Klaver adrian.klaver@aklaver.com
Il 15/11/2016 18:19, Adrian Klaver ha scritto: > On 11/15/2016 07:39 AM, Moreno Andreo wrote: >> Sorry for late reply... i'm in some quite rough days.... >> >> Il 08/11/2016 21:28, Adrian Klaver ha scritto: >>> On 11/08/2016 12:13 PM, Moreno Andreo wrote: >>>> [...] >>>> >>>> In your experience, would this approach help me lower server load? >>>> Are there any other approach I can try? >>> >>> Instead of pushing why not pull. >> Excuse me Adrian, but I can't get what you mean by not pushing but >> pulling. >> We are now pulling data from clients for about everything we need... >> what I'd like to do is either >> - the database pushes updates when needed, > > Pushes updates of what, the statistics you talking about or other data? Statistics, modified by "something internal". I just need a number (okay, it's not just one, say about six, for now) > >> or >> - the client pulls data from database, but querying a reduced dataset >> (just a row values with all values for that user and not thousands of >> rows) > > This confuses me given from your OP: > > " 350 databases, 350 users, every user connects to his own database and > his teammates' (max 10 in total) so each user can connect to max 10 > databases at a time" > > "This is obtained, at the moment, with a select count(*) from ..... (that > involves 4 joins on 4 tables) to be run run every 20 secs from each > client connected to the cluster (ATM about 650 clients configured, about > 200 concurrent) to each database it has rights to connect." > > So does the user need only their data or do they need the other users > data also? they can be able to download also their group mates' data (stats, numbers), if they want. Numbers can be confusing because evey user can have more than one workplace. > >>> In other words do the users really check/need the statistics every 20 >>> secs? >> Ideally, I need a real value to be read when that value changes. But on >> Earth I'm happy with a consistent value (If it should be 800 and it >> reads 799 it's not an earthquake) at least on a regular basis. This >> means that if there's no activity, we will be uselessly polling the >> database, so here's why I thought about "pushing" data from backend to >> client, that would be the nearest to ideal solution. >>> Given that you say exact is not important over the course of day, why >>> not create a mechanism for the user to poll the database when they >>> need the information. >> >> This is what we did in the past. The result was that users _did not_ >> update values (clicking an "update" button) and made disasters working >> with "old" data (they forgot to do it, they didn't want to do it because > > How can they be working with 'old' data? The queries you are running > are compiling stats on data that exist at the time they are run and at > any point in time between stats runs the user is working with current > data regardless of what the last stats say. Since we are on ADO.NET (with Npgsql) and we don't keep connections open (query, fill a structure and disconnect), in the time slice between two updates they will have both data and stats that are not "real" in that moment... I'll try to make an example You and me are teammates and work everyone at his place (several miles away). You read data from my database and get both stats and work data. Suddenly I make a change. This change won't be visible to you until you refresh data (that was pressing "Update" button, now it's automated every 60 secs). Hope It's clearer now... >> "it's another click, I waste my time", and so many, even stupid, >> excuses... but they're the customers, they pay, and here we say that >> "customer is always right") > > Except when they are wrong:) Still been there. Don't tell me.... :-) > >> >> So we changed: now we check for values and for data (not every 20 but >> 60 seconds... I just checked the right value). I need something that's >> lighter for the DB backend, at least for the values procedure. If we had >> only a database, I think that queries and datasets would be stuck in >> cache, so response times would be faster. With more than 350 databases, >> that's not possible (or we have to grow RAM size to values very big...) >> >> I've also thought about using LISTEN/NOTIFY to send value updates to >> client only when needed, but with NPgSQL I read that we need to keep an >> open connection, and that's not a good idea AFAIK. >> >> Thanks >> Moreno >> >>> >>>> >>>> If more details are needed, just ask. >>>> >>>> Thanks in advance and sorry for the long message (but I had to explain >>>> such a complex thing) >>>> Moreno.- >>>> >>>> >>>> >>> >>> >> >> >> >> > >
On 11/23/2016 05:24 AM, Moreno Andreo wrote: > Il 15/11/2016 18:19, Adrian Klaver ha scritto: >> On 11/15/2016 07:39 AM, Moreno Andreo wrote: >>> Sorry for late reply... i'm in some quite rough days.... >>> >>> Il 08/11/2016 21:28, Adrian Klaver ha scritto: >>>> On 11/08/2016 12:13 PM, Moreno Andreo wrote: >>>>> [...] >>>>> >>>>> In your experience, would this approach help me lower server load? >>>>> Are there any other approach I can try? >>>> >> >> So does the user need only their data or do they need the other users >> data also? > they can be able to download also their group mates' data (stats, > numbers), if they want. Numbers can be confusing because evey user can > have more than one workplace. >> >> >> How can they be working with 'old' data? The queries you are running >> are compiling stats on data that exist at the time they are run and at >> any point in time between stats runs the user is working with current >> data regardless of what the last stats say. > Since we are on ADO.NET (with Npgsql) and we don't keep connections open > (query, fill a structure and disconnect), in the time slice between two > updates they will have both data and stats that are not "real" in that > moment... > I'll try to make an example > You and me are teammates and work everyone at his place (several miles > away). You read data from my database and get both stats and work data. > Suddenly I make a change. This change won't be visible to you until you > refresh data (that was pressing "Update" button, now it's automated > every 60 secs). > Hope It's clearer now... > Aah, I get it now. You are refreshing forms every 60 seconds over 350 users each with their own database. Actually worse then that as there is user overlap over databases(up to 10 per user), so the same database can be hit multiple times at a given refresh. Seems physics is at work here as you have already pointed out. Namely fetching all that data at regular intervals taxes the bandwith as well as the CPU/storage. High levels solutions that come to mind to spread the load out: 1) Replication to spread data across multiple machines. Or just split the databases over multiple non-replicated Postgres instances on separate machines 2) Caching results from each individual database so subsequent calls for the information do not touch the database. You already touched on this with your counter table. 3) Spreading out the refresh interval. Not sure if the 60 second interval is synced across users. Still maybe setting different refresh intervals and/or changing it on the fly when load increases. A back off equation so to speak. >> >> Except when they are wrong:) Still been there. > Don't tell me.... :-) >> >>>>> >>>>> Thanks in advance and sorry for the long message (but I had to explain >>>>> such a complex thing) >>>>> Moreno.- -- Adrian Klaver adrian.klaver@aklaver.com
Hi Adrian,
     First of all: now I've seen that not all fields touched by WHERE
clauses in queries are indexed. Time to try some indices and EXPLAIN a
bit....
(I must be blind... I've checked it multiple times....)
Scroll down for the rest of the thread.....
Il 23/11/2016 16:59, Adrian Klaver ha scritto:
> On 11/23/2016 05:24 AM, Moreno Andreo wrote:
>> Il 15/11/2016 18:19, Adrian Klaver ha scritto:
>>> On 11/15/2016 07:39 AM, Moreno Andreo wrote:
>>>> Sorry for late reply... i'm in some quite rough days....
>>>>
>>>> Il 08/11/2016 21:28, Adrian Klaver ha scritto:
>>>>> On 11/08/2016 12:13 PM, Moreno Andreo wrote:
>>>>>> [...]
>>>>>>
>>>>>> In your experience, would this approach help me lower server load?
>>>>>> Are there any other approach I can try?
>>>>>
>
>>>
>>> So does the user need only their data or do they need the other users
>>> data also?
>> they can be able to download also their group mates' data (stats,
>> numbers), if they want. Numbers can be confusing because evey user can
>> have more than one workplace.
>>>
>
>>>
>>> How can they be working with 'old' data? The queries you are running
>>> are compiling stats on data that exist at the time they are run and at
>>> any point in time between stats runs the user is working with current
>>> data regardless of what the last stats say.
>> Since we are on ADO.NET (with Npgsql) and we don't keep connections open
>> (query, fill a structure and disconnect), in the time slice between two
>> updates they will have both data and stats that are not "real" in that
>> moment...
>> I'll try to make an example
>> You and me are teammates and work everyone at his place (several miles
>> away). You read data from my database and get both stats and work data.
>> Suddenly I make a change. This change won't be visible to you until you
>> refresh data (that was pressing "Update" button, now it's automated
>> every 60 secs).
>> Hope It's clearer now...
>>
>
> Aah, I get it now. You are refreshing forms every 60 seconds over 350
> users each with their own database. Actually worse then that as there
> is user overlap over databases(up to 10 per user), so the same
> database can be hit multiple times at a given refresh. Seems physics
> is at work here as you have already pointed out. Namely fetching all
> that data at regular intervals taxes the bandwith as well as the
> CPU/storage.
Exactly. At the moment the bottleneck is I/O (running on a VM over a
RAID-5 with 15kRPM SAS), in the next weeks I'm up to prepare a test
server with SSDs ("attached SSD" on Google Cloud Platform), that's
pretty much expensive, but data sheets point out a very good IOPS rate
(rising as size, and price, rises). CPU is not a problem (almost never
over 70-80%, average is about 40% over a 4 core server)
>
> High levels solutions that come to mind to spread the load out:
>
> 1) Replication to spread data across multiple machines.
> Or just split the databases over multiple non-replicated Postgres
> instances on separate machines
Already in place, but only for owner's database. Other databases can't
be "downloaded" essentially for privacy matters.
>
> 2) Caching results from each individual database so subsequent calls
> for the information do not touch the database. You already touched on
> this with your counter table.
Yes, and I'm all ears on suggestions on what to be done and if there are
hidden caveats...
I heard that stored procedures (and triggers, I suppose) are faster to
execute than the same "operation" coming from outside because the
backend has not to translate it, and in this case can be executed only
when needed
>
> 3) Spreading out the refresh interval. Not sure if the 60 second
> interval is synced across users. Still maybe setting different refresh
> intervals and/or changing it on the fly when load increases. A back
> off equation so to speak.
I'm afraid that if we have a moment when load is extremely high (I
experienced top measuring 45 (5-minutes basis) on a 4-core machine, it
was even hard to type in putty console!!), we won't have any upgrade.
Yes, in these conditions we're still not having upgrades, but my goal is
to find a way to reduce overall query load to have a good (V-)hardware
design so I can have better performance with lower cost.
"Pushing" data would be the best way... but I need to have a look to
LISTEN/NOTIFY, and what it means with Npgsql and JDBC (one thing I
didn't specify is that between user and database we have a web server,
in LAN with PgSQL server, hosting web services for the user to interact
with database. Maybe web server can "proxy" connections (keep them open)
and the listen/notify method can be achieved (yes, it's a bit of hard
work to rewrite all connection policies, but if it's necessary...)
I think that a combination of the last two should be the best.
Tell me what you think.....
>
>>>
>>> Except when they are wrong:) Still been there.
>> Don't tell me.... :-)
>>>
>
>>>>>>
>>>>>> Thanks in advance and sorry for the long message (but I had to
>>>>>> explain
>>>>>> such a complex thing)
>>>>>> Moreno.-
>
>
>
			
		On 11/24/2016 09:59 AM, Moreno Andreo wrote:
> Hi Adrian,
>     First of all: now I've seen that not all fields touched by WHERE
>>
>> Aah, I get it now. You are refreshing forms every 60 seconds over 350
>> users each with their own database. Actually worse then that as there
>> is user overlap over databases(up to 10 per user), so the same
>> database can be hit multiple times at a given refresh. Seems physics
>> is at work here as you have already pointed out. Namely fetching all
>> that data at regular intervals taxes the bandwith as well as the
>> CPU/storage.
> Exactly. At the moment the bottleneck is I/O (running on a VM over a
> RAID-5 with 15kRPM SAS), in the next weeks I'm up to prepare a test
> server with SSDs ("attached SSD" on Google Cloud Platform), that's
> pretty much expensive, but data sheets point out a very good IOPS rate
> (rising as size, and price, rises). CPU is not a problem (almost never
> over 70-80%, average is about 40% over a 4 core server)
>>
>> High levels solutions that come to mind to spread the load out:
>>
>> 1) Replication to spread data across multiple machines.
>> Or just split the databases over multiple non-replicated Postgres
>> instances on separate machines
> Already in place, but only for owner's database. Other databases can't
> be "downloaded" essentially for privacy matters.
Not sure I follow, a user can see their database and up to 9 other users
databases. Not seeing how replication would be any less 'private' then
that, especially if the databases are replicated to machines the company
owns.
>>
>> 2) Caching results from each individual database so subsequent calls
>> for the information do not touch the database. You already touched on
>> this with your counter table.
> Yes, and I'm all ears on suggestions on what to be done and if there are
> hidden caveats...
The usual I would guess:
Storage for the cached data.
Maintaining the cache.
"There are only two hard things in Computer Science: cache invalidation
and naming things."
     -- Phil Karlton
> I heard that stored procedures (and triggers, I suppose) are faster to
> execute than the same "operation" coming from outside because the
> backend has not to translate it, and in this case can be executed only
> when needed
Well that and that less data has to flow back and forth across a
network. In your case you said bandwidth is less of on an issue then
disk I/0 on the server. In-database procedures are still going to
require I/O on the server.
>>
>> 3) Spreading out the refresh interval. Not sure if the 60 second
>> interval is synced across users. Still maybe setting different refresh
>> intervals and/or changing it on the fly when load increases. A back
>> off equation so to speak.
> I'm afraid that if we have a moment when load is extremely high (I
> experienced top measuring 45 (5-minutes basis) on a 4-core machine, it
> was even hard to type in putty console!!), we won't have any upgrade.
Is there a predictability(even a rough one) about when the load spikes
occur? Something you could use to spread the load out.
> Yes, in these conditions we're still not having upgrades, but my goal is
> to find a way to reduce overall query load to have a good (V-)hardware
> design so I can have better performance with lower cost.
> "Pushing" data would be the best way... but I need to have a look to
> LISTEN/NOTIFY, and what it means with Npgsql and JDBC (one thing I
> didn't specify is that between user and database we have a web server,
> in LAN with PgSQL server, hosting web services for the user to interact
?
http://www.memcached.org/
http://redis.io/
> with database. Maybe web server can "proxy" connections (keep them open)
> and the listen/notify method can be achieved (yes, it's a bit of hard
> work to rewrite all connection policies, but if it's necessary...)
>
> I think that a combination of the last two should be the best.
> Tell me what you think.....
I think there are a lot of moving parts to this and more pop up with
each post:) I would not even know how to start to compose a useful
answer to what is a complex problem that also looks to be in the process
of fairly major hardware changes. All I can suggest is that you create a
test setup and start doing some incremental changes, using some of the
suggestions already provided, with tests to measure whether the changes
actually are a benefit.
>
>>
>>>>
>>>> Except when they are wrong:) Still been there.
>>> Don't tell me.... :-)
>>>>
>>
>>>>>>>
>>>>>>> Thanks in advance and sorry for the long message (but I had to
>>>>>>> explain
>>>>>>> such a complex thing)
>>>>>>> Moreno.-
>>
>>
>>
>
>
>
>
--
Adrian Klaver
adrian.klaver@aklaver.com