Обсуждение: Extremely slow to establish connection when user has a high number of roles

Поиск
Список
Период
Сортировка

Extremely slow to establish connection when user has a high number of roles

От
Michal Charemza
Дата:
Hi,

We're running PostgreSQL as essentially a data warehouse, and we have a few thousand roles, which are used to grant permissions on a table-by-table basis to a few thousand users, so a user would typically have say between 1 and 2 thousand roles. There is also quite a lot of "churn" in terms of tables being created/removed, and permissions changed.

The issue is that we're hitting a strange performance problem on connection. Sometimes it can take ~25 to 40 seconds just to connect, although it's often way quicker. There seems to be no middle ground - never have I seen a connection take between 0.5 and 25 seconds for example. We suspect it's related to the number of roles the connecting user has (including via other roles), because if we remove all roles but one from the connecting user (the one that grants connection permissions), connecting is always virtually instantaneous.

The closest issue that I can find that's similar is https://www.postgresql.org/message-id/flat/CAGvXd3OSMbJQwOSc-Tq-Ro1CAz%3DvggErdSG7pv2s6vmmTOLJSg%40mail.gmail.com, which reports that GRANT role is slow with a high number of roles - but in our case, it's connecting that's the problem, before (as far as we can tell) even one query is run. The database is busy, say up to 60-80% on a 16 VCPU machine - even if it's a "good amount" below 100%, the issue occurs.

Is there anything we can do to investigate (or hopefully fix!) the issue?

Thanks,

Michal

------

A description of what you are trying to achieve and what results you expect.:
We would like to connect to the database - expect it to connect in less than 1 second, but sometimes 25 - 40s.

PostgreSQL version number you are running:
PostgreSQL 14.10 on aarch64-unknown-linux-gnu, compiled by aarch64-unknown-linux-gnu-gcc (GCC) 9.5.0, 64-bit

How you installed PostgreSQL:
Via AWS/Amazon Aurora

Changes made to the settings in the postgresql.conf file
In attached CSV file

Operating system and version:
Unknown

What program you're using to connect to PostgreSQL:
Python + SQLAlchemy, psql, or also via Amazon Quicksight (Unsure which client they use under the hood, but it surfaces connection timeout errors, which we suspect is due to the issue described above)
 
Is there anything relevant or unusual in the PostgreSQL server logs?:
No
 
For questions about any kind of error:
N/A
Вложения

Re: Extremely slow to establish connection when user has a high number of roles

От
Tomas Vondra
Дата:
On 4/20/24 13:55, Michal Charemza wrote:
> Hi,
> 
> We're running PostgreSQL as essentially a data warehouse, and we have a few
> thousand roles, which are used to grant permissions on a table-by-table
> basis to a few thousand users, so a user would typically have say between 1
> and 2 thousand roles. There is also quite a lot of "churn" in terms of
> tables being created/removed, and permissions changed.
> 
> The issue is that we're hitting a strange performance problem on
> connection. Sometimes it can take ~25 to 40 seconds just to connect,
> although it's often way quicker. There seems to be no middle ground - never
> have I seen a connection take between 0.5 and 25 seconds for example. We
> suspect it's related to the number of roles the connecting user has
> (including via other roles), because if we remove all roles but one from
> the connecting user (the one that grants connection permissions),
> connecting is always virtually instantaneous.
> 

I tried a couple simple setups with many roles (user with many roles
granted directly and with many roles granted through other roles), but
I've been unable to reproduce this.

> The closest issue that I can find that's similar is
> https://www.postgresql.org/message-id/flat/CAGvXd3OSMbJQwOSc-Tq-Ro1CAz%3DvggErdSG7pv2s6vmmTOLJSg%40mail.gmail.com,
> which reports that GRANT role is slow with a high number of roles - but in
> our case, it's connecting that's the problem, before (as far as we can
> tell) even one query is run. The database is busy, say up to 60-80% on a 16
> VCPU machine - even if it's a "good amount" below 100%, the issue occurs.
> 
> Is there anything we can do to investigate (or hopefully fix!) the issue?
> 

A reproducer would be great - a script that creates user/roles, and
triggers the long login time would allow us to investigate that.

Another option would be to get a perf profile from the process busy with
logging the user in - assuming it's CPU-intensive, and not (e.g.) some
sort of locking issue.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Michal Charemza <michal@charemza.name> writes:
> The issue is that we're hitting a strange performance problem on
> connection. Sometimes it can take ~25 to 40 seconds just to connect,
> although it's often way quicker. There seems to be no middle ground - never
> have I seen a connection take between 0.5 and 25 seconds for example. We
> suspect it's related to the number of roles the connecting user has
> (including via other roles), because if we remove all roles but one from
> the connecting user (the one that grants connection permissions),
> connecting is always virtually instantaneous.

It's not very clear what you mean by "sometimes".  Is the slowness
reproducible for a particular user and role configuration, or does
it seem to come and go by itself?

As Tomas said, a self-contained reproduction script would be very
helpful for looking into this.

> The closest issue that I can find that's similar is
> https://www.postgresql.org/message-id/flat/CAGvXd3OSMbJQwOSc-Tq-Ro1CAz%3DvggErdSG7pv2s6vmmTOLJSg%40mail.gmail.com,
> which reports that GRANT role is slow with a high number of roles - but in
> our case, it's connecting that's the problem, before (as far as we can
> tell) even one query is run.

That specific problem is (we think) new in v16, but the root cause
is an inefficient lookup mechanism that has been there a long time.
Maybe you have found a usage pattern that exposes its weakness in
older branches.  If so, we could consider back-patching 14e991db8
further than v16 ... but I don't plan to take any risk there without
concrete evidence that it'd improve things.

            regards, tom lane



Re: Extremely slow to establish connection when user has a high number of roles

От
Vijaykumar Jain
Дата:


On Sat, Apr 20, 2024, 5:25 PM Michal Charemza <michal@charemza.name> wrote:
Hi,

We're running PostgreSQL as essentially a data warehouse, and we have a few thousand roles, which are used to grant permissions on a table-by-table basis to a few thousand users, so a user would typically have say between 1 and 2 thousand roles. There is also quite a lot of "churn" in terms of tables being created/removed, and permissions changed.

The issue is that we're hitting a strange performance problem on connection. Sometimes it can take ~25 to 40 seconds just to connect, although it's often way quicker

can you rule out system catalog bloat ? 

Re: Extremely slow to establish connection when user has a high number of roles

От
Michal Charemza
Дата:
Tom Lane <tgl@sss.pgh.pa.us> writes:
> It's not very clear what you mean by "sometimes".  Is the slowness
reproducible for a particular user and role configuration, or does
it seem to come and go by itself?

Ah it's more come and go by itself - as in one connection takes 30 seconds, then the next say 0.06s. It's happened for every user we've tried. Even more anecdotally, I would say it happens more when the database is busy in terms of tables being dropped/created and permissions changing.

Also: realise we did have one user that had directly was a member of several thousand roles, but indirectly several million. It would sometimes take 10 minutes for that user to connect. We've since changed that to one role, and that user connects fine now.

> As Tomas said, a self-contained reproduction script would be very
helpful for looking into this.

Have tried... but alas it seems fine in anything but the production environment. My closest attempt is attached to at least it show in more detail how our system is setup, but it always works fine for me locally.

I am wondering - what happens on connection? What catalogue tables does PostgreSQL check and how? What's allowed to happen concurrently and what isn't? If I knew, maybe I could come up with a reproduction script that does reproduce the issue?
Вложения

Re: Extremely slow to establish connection when user has a high number of roles

От
Michal Charemza
Дата:
Vijaykumar Jain <vijaykumarjain.github@gmail.com> writes:
> can you rule out system catalog bloat ?

I don't know! I've now run the query from https://wiki.postgresql.org/wiki/Show_database_bloat just just on pg_catalog, results attached

On Sat, Apr 20, 2024 at 3:52 PM Vijaykumar Jain <vijaykumarjain.github@gmail.com> wrote:


On Sat, Apr 20, 2024, 5:25 PM Michal Charemza <michal@charemza.name> wrote:
Hi,

We're running PostgreSQL as essentially a data warehouse, and we have a few thousand roles, which are used to grant permissions on a table-by-table basis to a few thousand users, so a user would typically have say between 1 and 2 thousand roles. There is also quite a lot of "churn" in terms of tables being created/removed, and permissions changed.

The issue is that we're hitting a strange performance problem on connection. Sometimes it can take ~25 to 40 seconds just to connect, although it's often way quicker

can you rule out system catalog bloat ? 

Вложения

Re: Extremely slow to establish connection when user has a high number of roles

От
Frits Hoogland
Дата:
Michael, can you validate if this is consistently happening for the first connection after database cluster startup?

Frits

Op 20 apr 2024 om 04:55 heeft Michal Charemza <michal@charemza.name> het volgende geschreven:


Hi,

We're running PostgreSQL as essentially a data warehouse, and we have a few thousand roles, which are used to grant permissions on a table-by-table basis to a few thousand users, so a user would typically have say between 1 and 2 thousand roles. There is also quite a lot of "churn" in terms of tables being created/removed, and permissions changed.

The issue is that we're hitting a strange performance problem on connection. Sometimes it can take ~25 to 40 seconds just to connect, although it's often way quicker. There seems to be no middle ground - never have I seen a connection take between 0.5 and 25 seconds for example. We suspect it's related to the number of roles the connecting user has (including via other roles), because if we remove all roles but one from the connecting user (the one that grants connection permissions), connecting is always virtually instantaneous.

The closest issue that I can find that's similar is https://www.postgresql.org/message-id/flat/CAGvXd3OSMbJQwOSc-Tq-Ro1CAz%3DvggErdSG7pv2s6vmmTOLJSg%40mail.gmail.com, which reports that GRANT role is slow with a high number of roles - but in our case, it's connecting that's the problem, before (as far as we can tell) even one query is run. The database is busy, say up to 60-80% on a 16 VCPU machine - even if it's a "good amount" below 100%, the issue occurs.

Is there anything we can do to investigate (or hopefully fix!) the issue?

Thanks,

Michal

------

A description of what you are trying to achieve and what results you expect.:
We would like to connect to the database - expect it to connect in less than 1 second, but sometimes 25 - 40s.

PostgreSQL version number you are running:
PostgreSQL 14.10 on aarch64-unknown-linux-gnu, compiled by aarch64-unknown-linux-gnu-gcc (GCC) 9.5.0, 64-bit

How you installed PostgreSQL:
Via AWS/Amazon Aurora

Changes made to the settings in the postgresql.conf file
In attached CSV file

Operating system and version:
Unknown

What program you're using to connect to PostgreSQL:
Python + SQLAlchemy, psql, or also via Amazon Quicksight (Unsure which client they use under the hood, but it surfaces connection timeout errors, which we suspect is due to the issue described above)
 
Is there anything relevant or unusual in the PostgreSQL server logs?:
No
 
For questions about any kind of error:
N/A
<server_configuration.csv>

Re: Extremely slow to establish connection when user has a high number of roles

От
Michal Charemza
Дата:
Frits Hoogland <frits.hoogland@gmail.com> writes:
> Michael, can you validate if this is consistently happening for the first connection after database cluster startup?

Hmmm... it'll be tricky and need some planning. It might even be impossible since this is on AWS Aurora, and I think AWS connects in regularly as part of a heartbeat check(?)

But: do you mean any connection, or just a user that hasn't connected yet after the restart? A user that hasn't connected yet would be much easier.
Michal Charemza <michal@charemza.name> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> It's not very clear what you mean by "sometimes".  Is the slowness
> reproducible for a particular user and role configuration, or does
> it seem to come and go by itself?

> Ah it's more come and go by itself - as in one connection takes 30 seconds,
> then the next say 0.06s. It's happened for every user we've tried. Even
> more anecdotally, I would say it happens more when the database is busy in
> terms of tables being dropped/created and permissions changing.

OK, that pretty much eliminates the idea that it's a new manifestation
of the catcache-inefficiency problem.  Vijaykumar may well have the
right idea, that it's a form of catalog bloat.  Do you do bulk
permissions updates that might affect thousands of role memberships at
once?

> Also: realise we did have one user that had directly was a member of
> several thousand roles, but indirectly several million. It would sometimes
> take 10 minutes for that user to connect. We've since changed that to one
> role, and that user connects fine now.

Interesting --- but even for that user, it was sometimes fast to
connect?

> I am wondering - what happens on connection? What catalogue tables does
> PostgreSQL check and how? What's allowed to happen concurrently and what
> isn't? If I knew, maybe I could come up with a reproduction script that
> does reproduce the issue?

Well, it's going to be looking to see that the user has CONNECT
privileges on the target database.  If that db doesn't have public
connect privileges, but only grants CONNECT to certain roles, then
we'll have to test whether the connecting user is a member of those
roles --- which involves looking into pg_auth_members and possibly
even doing recursive searches there.  For the sort of setup you're
describing with thousands of role grants (pg_auth_members entries)
it's not hard to imagine that search being rather expensive.  What
remains to be explained is how come it's only expensive sometimes.

The catalog-bloat idea comes from thinking about how Postgres handles
row updates.  There will be multiple physical copies (row versions)
of any recently-updated row, and this is much more expensive to scan
than a static situation with only one live row version.  First just
because we have to look at more than one copy, and second because
testing whether that copy is the live version is noticeably more
expensive if it's recent than once it's older than the xmin horizon,
and third because if we are the first process to scan it since it
became dead-to-everybody then it's our responsibility to mark it as
dead-to-everybody, so that we have to incur additional I/O to do that.
A plausible idea for particular connection attempts being slow is that
they came in just as a whole lot of pg_auth_members entries became
dead-to-everybody, and hence they were unlucky enough to get saddled
with a whole lot of that hint-bit-updating work.  (This also nicely
explains why the next attempt isn't slow: the work's been done.)

But this is only plausible if you regularly do actions that cause a
lot of pg_auth_members entries to be updated at the same time.
So we still don't have good insight into that, and your test script
isn't shedding any light.

A couple of other thoughts:

* I don't think your test script would show a connection-slowness
problem even if there was one to be shown, because you forgot to
revoke the PUBLIC connect privilege on the postgres database.
I'm fairly sure that if that exists it's always noticed first,
bypassing the need for any role membership tests.  So please
confirm whether your production database has revoked PUBLIC
connect privilege.

* It could be that the problem is not associated with the
database's connect privilege, but with role membership lookups
triggered by pg_hba.conf entries.  Do you have any entries there
that require testing membership (i.e. the role column is not
"all")?

            regards, tom lane



Re: Extremely slow to establish connection when user has a high number of roles

От
Michal Charemza
Дата:
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Do you do bulk
> permissions updates that might affect thousands of role memberships at
> once?

So we do regularly update role memberships - essentially a sync from a separate database, and some could well have happened just before connections, but it's more in the tens at a time at most, not thousands... Or at least, that's what I thought. It sounds like it would be good to see if it's doing more. It'll take some time for me to figure this out though...

> I'm fairly sure that if that exists it's always noticed first,
> bypassing the need for any role membership tests.  So please
> confirm whether your production database has revoked PUBLIC
> connect privilege.

I realised that in fact we hadn't revoked this. So it sounds like whatever the issue, it's not about checking if the user has the CONNECT privilege?

> It could be that the problem is not associated with the
> database's connect privilege, but with role membership lookups
> triggered by pg_hba.conf entries.  Do you have any entries there
> that require testing membership (i.e. the role column is not
> "all")?

Running `select * from pg_hba_file_rules` it looks like the user column is always {all} or {rdsadmin}

Thanks,

Michal

Michal Charemza <michal@charemza.name> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> I'm fairly sure that if that exists it's always noticed first,
>> bypassing the need for any role membership tests.  So please
>> confirm whether your production database has revoked PUBLIC
>> connect privilege.

> I realised that in fact we hadn't revoked this. So it sounds like whatever
> the issue, it's not about checking if the user has the CONNECT privilege?

Yeah.  I double-checked the code (see aclmask()), and it will detect
holding a privilege via PUBLIC before it performs any role membership
searches.  So whatever is happening, it's not that lookup.

>> It could be that the problem is not associated with the
>> database's connect privilege, but with role membership lookups
>> triggered by pg_hba.conf entries.  Do you have any entries there
>> that require testing membership (i.e. the role column is not
>> "all")?

> Running `select * from pg_hba_file_rules` it looks like the user column is
> always {all} or {rdsadmin}

You'll need to look closer and figure out which of the HBA rules is
being used for the slow connection attempts.  If it's {rdsadmin}
then that would definitely involve a role membership search.
If it's {all} then we're back to square one.

A different line of thought could be that the slow connections
are slow because they are waiting on a lock that some other
process has got and is in no hurry to release.  It would be
worth trying to capture the contents of the pg_locks view
(and I guess also pg_stat_activity) while one of these sessions
is stuck, if you can reproduce it often enough to make that
feasible.

            regards, tom lane