Обсуждение: Extremely slow to establish connection when user has a high number of roles
Extremely slow to establish connection when user has a high number of roles
От
Michal Charemza
Дата:
Hi,
We're running PostgreSQL as essentially a data warehouse, and we have a few thousand roles, which are used to grant permissions on a table-by-table basis to a few thousand users, so a user would typically have say between 1 and 2 thousand roles. There is also quite a lot of "churn" in terms of tables being created/removed, and permissions changed.
The issue is that we're hitting a strange performance problem on connection. Sometimes it can take ~25 to 40 seconds just to connect, although it's often way quicker. There seems to be no middle ground - never have I seen a connection take between 0.5 and 25 seconds for example. We suspect it's related to the number of roles the connecting user has (including via other roles), because if we remove all roles but one from the connecting user (the one that grants connection permissions), connecting is always virtually instantaneous.
The closest issue that I can find that's similar is https://www.postgresql.org/message-id/flat/CAGvXd3OSMbJQwOSc-Tq-Ro1CAz%3DvggErdSG7pv2s6vmmTOLJSg%40mail.gmail.com, which reports that GRANT role is slow with a high number of roles - but in our case, it's connecting that's the problem, before (as far as we can tell) even one query is run. The database is busy, say up to 60-80% on a 16 VCPU machine - even if it's a "good amount" below 100%, the issue occurs.
Is there anything we can do to investigate (or hopefully fix!) the issue?
Thanks,
Michal
------
We would like to connect to the database - expect it to connect in less than 1 second, but sometimes 25 - 40s.
PostgreSQL version number you are running:
PostgreSQL version number you are running:
PostgreSQL 14.10 on aarch64-unknown-linux-gnu, compiled by aarch64-unknown-linux-gnu-gcc (GCC) 9.5.0, 64-bit
How you installed PostgreSQL:
How you installed PostgreSQL:
Via AWS/Amazon Aurora
Changes made to the settings in the postgresql.conf file
Changes made to the settings in the postgresql.conf file
In attached CSV file
Operating system and version:
Operating system and version:
Unknown
What program you're using to connect to PostgreSQL:
What program you're using to connect to PostgreSQL:
Python + SQLAlchemy, psql, or also via Amazon Quicksight (Unsure which client they use under the hood, but it surfaces connection timeout errors, which we suspect is due to the issue described above)
Is there anything relevant or unusual in the PostgreSQL server logs?:
Is there anything relevant or unusual in the PostgreSQL server logs?:
No
For questions about any kind of error:
N/A
For questions about any kind of error:
N/A
Вложения
Re: Extremely slow to establish connection when user has a high number of roles
От
Tomas Vondra
Дата:
On 4/20/24 13:55, Michal Charemza wrote: > Hi, > > We're running PostgreSQL as essentially a data warehouse, and we have a few > thousand roles, which are used to grant permissions on a table-by-table > basis to a few thousand users, so a user would typically have say between 1 > and 2 thousand roles. There is also quite a lot of "churn" in terms of > tables being created/removed, and permissions changed. > > The issue is that we're hitting a strange performance problem on > connection. Sometimes it can take ~25 to 40 seconds just to connect, > although it's often way quicker. There seems to be no middle ground - never > have I seen a connection take between 0.5 and 25 seconds for example. We > suspect it's related to the number of roles the connecting user has > (including via other roles), because if we remove all roles but one from > the connecting user (the one that grants connection permissions), > connecting is always virtually instantaneous. > I tried a couple simple setups with many roles (user with many roles granted directly and with many roles granted through other roles), but I've been unable to reproduce this. > The closest issue that I can find that's similar is > https://www.postgresql.org/message-id/flat/CAGvXd3OSMbJQwOSc-Tq-Ro1CAz%3DvggErdSG7pv2s6vmmTOLJSg%40mail.gmail.com, > which reports that GRANT role is slow with a high number of roles - but in > our case, it's connecting that's the problem, before (as far as we can > tell) even one query is run. The database is busy, say up to 60-80% on a 16 > VCPU machine - even if it's a "good amount" below 100%, the issue occurs. > > Is there anything we can do to investigate (or hopefully fix!) the issue? > A reproducer would be great - a script that creates user/roles, and triggers the long login time would allow us to investigate that. Another option would be to get a perf profile from the process busy with logging the user in - assuming it's CPU-intensive, and not (e.g.) some sort of locking issue. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Michal Charemza <michal@charemza.name> writes: > The issue is that we're hitting a strange performance problem on > connection. Sometimes it can take ~25 to 40 seconds just to connect, > although it's often way quicker. There seems to be no middle ground - never > have I seen a connection take between 0.5 and 25 seconds for example. We > suspect it's related to the number of roles the connecting user has > (including via other roles), because if we remove all roles but one from > the connecting user (the one that grants connection permissions), > connecting is always virtually instantaneous. It's not very clear what you mean by "sometimes". Is the slowness reproducible for a particular user and role configuration, or does it seem to come and go by itself? As Tomas said, a self-contained reproduction script would be very helpful for looking into this. > The closest issue that I can find that's similar is > https://www.postgresql.org/message-id/flat/CAGvXd3OSMbJQwOSc-Tq-Ro1CAz%3DvggErdSG7pv2s6vmmTOLJSg%40mail.gmail.com, > which reports that GRANT role is slow with a high number of roles - but in > our case, it's connecting that's the problem, before (as far as we can > tell) even one query is run. That specific problem is (we think) new in v16, but the root cause is an inefficient lookup mechanism that has been there a long time. Maybe you have found a usage pattern that exposes its weakness in older branches. If so, we could consider back-patching 14e991db8 further than v16 ... but I don't plan to take any risk there without concrete evidence that it'd improve things. regards, tom lane
Re: Extremely slow to establish connection when user has a high number of roles
От
Vijaykumar Jain
Дата:
On Sat, Apr 20, 2024, 5:25 PM Michal Charemza <michal@charemza.name> wrote:
Hi,We're running PostgreSQL as essentially a data warehouse, and we have a few thousand roles, which are used to grant permissions on a table-by-table basis to a few thousand users, so a user would typically have say between 1 and 2 thousand roles. There is also quite a lot of "churn" in terms of tables being created/removed, and permissions changed.The issue is that we're hitting a strange performance problem on connection. Sometimes it can take ~25 to 40 seconds just to connect, although it's often way quicker
can you rule out system catalog bloat ?
Re: Extremely slow to establish connection when user has a high number of roles
От
Michal Charemza
Дата:
Tom Lane <tgl@sss.pgh.pa.us> writes:
> It's not very clear what you mean by "sometimes". Is the slowness
reproducible for a particular user and role configuration, or doesit seem to come and go by itself?
Ah it's more come and go by itself - as in one connection takes 30 seconds, then the next say 0.06s. It's happened for every user we've tried. Even more anecdotally, I would say it happens more when the database is busy in terms of tables being dropped/created and permissions changing.
Also: realise we did have one user that had directly was a member of several thousand roles, but indirectly several million. It would sometimes take 10 minutes for that user to connect. We've since changed that to one role, and that user connects fine now.
> As Tomas said, a self-contained reproduction script would be very
helpful for looking into this.Have tried... but alas it seems fine in anything but the production environment. My closest attempt is attached to at least it show in more detail how our system is setup, but it always works fine for me locally.
I am wondering - what happens on connection? What catalogue tables does PostgreSQL check and how? What's allowed to happen concurrently and what isn't? If I knew, maybe I could come up with a reproduction script that does reproduce the issue?
Вложения
Re: Extremely slow to establish connection when user has a high number of roles
От
Michal Charemza
Дата:
Vijaykumar Jain <vijaykumarjain.github@gmail.com> writes:
> can you rule out system catalog bloat ? I don't know! I've now run the query from https://wiki.postgresql.org/wiki/Show_database_bloat just just on pg_catalog, results attached
On Sat, Apr 20, 2024 at 3:52 PM Vijaykumar Jain <vijaykumarjain.github@gmail.com> wrote:
On Sat, Apr 20, 2024, 5:25 PM Michal Charemza <michal@charemza.name> wrote:Hi,We're running PostgreSQL as essentially a data warehouse, and we have a few thousand roles, which are used to grant permissions on a table-by-table basis to a few thousand users, so a user would typically have say between 1 and 2 thousand roles. There is also quite a lot of "churn" in terms of tables being created/removed, and permissions changed.The issue is that we're hitting a strange performance problem on connection. Sometimes it can take ~25 to 40 seconds just to connect, although it's often way quickercan you rule out system catalog bloat ?
Вложения
Re: Extremely slow to establish connection when user has a high number of roles
От
Frits Hoogland
Дата:
Michael, can you validate if this is consistently happening for the first connection after database cluster startup?
Frits
Op 20 apr 2024 om 04:55 heeft Michal Charemza <michal@charemza.name> het volgende geschreven:
Hi,We're running PostgreSQL as essentially a data warehouse, and we have a few thousand roles, which are used to grant permissions on a table-by-table basis to a few thousand users, so a user would typically have say between 1 and 2 thousand roles. There is also quite a lot of "churn" in terms of tables being created/removed, and permissions changed.The issue is that we're hitting a strange performance problem on connection. Sometimes it can take ~25 to 40 seconds just to connect, although it's often way quicker. There seems to be no middle ground - never have I seen a connection take between 0.5 and 25 seconds for example. We suspect it's related to the number of roles the connecting user has (including via other roles), because if we remove all roles but one from the connecting user (the one that grants connection permissions), connecting is always virtually instantaneous.The closest issue that I can find that's similar is https://www.postgresql.org/message-id/flat/CAGvXd3OSMbJQwOSc-Tq-Ro1CAz%3DvggErdSG7pv2s6vmmTOLJSg%40mail.gmail.com, which reports that GRANT role is slow with a high number of roles - but in our case, it's connecting that's the problem, before (as far as we can tell) even one query is run. The database is busy, say up to 60-80% on a 16 VCPU machine - even if it's a "good amount" below 100%, the issue occurs.Is there anything we can do to investigate (or hopefully fix!) the issue?Thanks,Michal------A description of what you are trying to achieve and what results you expect.:We would like to connect to the database - expect it to connect in less than 1 second, but sometimes 25 - 40s.
PostgreSQL version number you are running:PostgreSQL 14.10 on aarch64-unknown-linux-gnu, compiled by aarch64-unknown-linux-gnu-gcc (GCC) 9.5.0, 64-bit
How you installed PostgreSQL:Via AWS/Amazon Aurora
Changes made to the settings in the postgresql.conf fileIn attached CSV file
Operating system and version:Unknown
What program you're using to connect to PostgreSQL:Python + SQLAlchemy, psql, or also via Amazon Quicksight (Unsure which client they use under the hood, but it surfaces connection timeout errors, which we suspect is due to the issue described above)
Is there anything relevant or unusual in the PostgreSQL server logs?:No
For questions about any kind of error:
N/A<server_configuration.csv>
Re: Extremely slow to establish connection when user has a high number of roles
От
Michal Charemza
Дата:
Frits Hoogland <frits.hoogland@gmail.com> writes:
> Michael, can you validate if this is consistently happening for the first connection after database cluster startup?
Hmmm... it'll be tricky and need some planning. It might even be impossible since this is on AWS Aurora, and I think AWS connects in regularly as part of a heartbeat check(?)
But: do you mean any connection, or just a user that hasn't connected yet after the restart? A user that hasn't connected yet would be much easier.
Michal Charemza <michal@charemza.name> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> It's not very clear what you mean by "sometimes". Is the slowness > reproducible for a particular user and role configuration, or does > it seem to come and go by itself? > Ah it's more come and go by itself - as in one connection takes 30 seconds, > then the next say 0.06s. It's happened for every user we've tried. Even > more anecdotally, I would say it happens more when the database is busy in > terms of tables being dropped/created and permissions changing. OK, that pretty much eliminates the idea that it's a new manifestation of the catcache-inefficiency problem. Vijaykumar may well have the right idea, that it's a form of catalog bloat. Do you do bulk permissions updates that might affect thousands of role memberships at once? > Also: realise we did have one user that had directly was a member of > several thousand roles, but indirectly several million. It would sometimes > take 10 minutes for that user to connect. We've since changed that to one > role, and that user connects fine now. Interesting --- but even for that user, it was sometimes fast to connect? > I am wondering - what happens on connection? What catalogue tables does > PostgreSQL check and how? What's allowed to happen concurrently and what > isn't? If I knew, maybe I could come up with a reproduction script that > does reproduce the issue? Well, it's going to be looking to see that the user has CONNECT privileges on the target database. If that db doesn't have public connect privileges, but only grants CONNECT to certain roles, then we'll have to test whether the connecting user is a member of those roles --- which involves looking into pg_auth_members and possibly even doing recursive searches there. For the sort of setup you're describing with thousands of role grants (pg_auth_members entries) it's not hard to imagine that search being rather expensive. What remains to be explained is how come it's only expensive sometimes. The catalog-bloat idea comes from thinking about how Postgres handles row updates. There will be multiple physical copies (row versions) of any recently-updated row, and this is much more expensive to scan than a static situation with only one live row version. First just because we have to look at more than one copy, and second because testing whether that copy is the live version is noticeably more expensive if it's recent than once it's older than the xmin horizon, and third because if we are the first process to scan it since it became dead-to-everybody then it's our responsibility to mark it as dead-to-everybody, so that we have to incur additional I/O to do that. A plausible idea for particular connection attempts being slow is that they came in just as a whole lot of pg_auth_members entries became dead-to-everybody, and hence they were unlucky enough to get saddled with a whole lot of that hint-bit-updating work. (This also nicely explains why the next attempt isn't slow: the work's been done.) But this is only plausible if you regularly do actions that cause a lot of pg_auth_members entries to be updated at the same time. So we still don't have good insight into that, and your test script isn't shedding any light. A couple of other thoughts: * I don't think your test script would show a connection-slowness problem even if there was one to be shown, because you forgot to revoke the PUBLIC connect privilege on the postgres database. I'm fairly sure that if that exists it's always noticed first, bypassing the need for any role membership tests. So please confirm whether your production database has revoked PUBLIC connect privilege. * It could be that the problem is not associated with the database's connect privilege, but with role membership lookups triggered by pg_hba.conf entries. Do you have any entries there that require testing membership (i.e. the role column is not "all")? regards, tom lane
Re: Extremely slow to establish connection when user has a high number of roles
От
Michal Charemza
Дата:
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Do you do bulk> permissions updates that might affect thousands of role memberships at
> once?
So we do regularly update role memberships - essentially a sync from a separate database, and some could well have happened just before connections, but it's more in the tens at a time at most, not thousands... Or at least, that's what I thought. It sounds like it would be good to see if it's doing more. It'll take some time for me to figure this out though...
> I'm fairly sure that if that exists it's always noticed first,
> bypassing the need for any role membership tests. So please> confirm whether your production database has revoked PUBLIC
> connect privilege.
I realised that in fact we hadn't revoked this. So it sounds like whatever the issue, it's not about checking if the user has the CONNECT privilege?
> It could be that the problem is not associated with the
> database's connect privilege, but with role membership lookups> triggered by pg_hba.conf entries. Do you have any entries there
> that require testing membership (i.e. the role column is not
> "all")?
Running `select * from pg_hba_file_rules` it looks like the user column is always {all} or {rdsadmin}
Thanks,
Michal
Michal Charemza <michal@charemza.name> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> I'm fairly sure that if that exists it's always noticed first, >> bypassing the need for any role membership tests. So please >> confirm whether your production database has revoked PUBLIC >> connect privilege. > I realised that in fact we hadn't revoked this. So it sounds like whatever > the issue, it's not about checking if the user has the CONNECT privilege? Yeah. I double-checked the code (see aclmask()), and it will detect holding a privilege via PUBLIC before it performs any role membership searches. So whatever is happening, it's not that lookup. >> It could be that the problem is not associated with the >> database's connect privilege, but with role membership lookups >> triggered by pg_hba.conf entries. Do you have any entries there >> that require testing membership (i.e. the role column is not >> "all")? > Running `select * from pg_hba_file_rules` it looks like the user column is > always {all} or {rdsadmin} You'll need to look closer and figure out which of the HBA rules is being used for the slow connection attempts. If it's {rdsadmin} then that would definitely involve a role membership search. If it's {all} then we're back to square one. A different line of thought could be that the slow connections are slow because they are waiting on a lock that some other process has got and is in no hurry to release. It would be worth trying to capture the contents of the pg_locks view (and I guess also pg_stat_activity) while one of these sessions is stuck, if you can reproduce it often enough to make that feasible. regards, tom lane