Обсуждение: could not open relation with OID

Поиск
Список
Период
Сортировка

could not open relation with OID

От
Ben Chobot
Дата:
We do a lot of queries per day, over a lot of hosts, all of which are on 12.9. We've recently started doing a better job at analyzing our db logs and have found that, a few times a day, every day, we see some of our queries fail with errors like:

could not open relation with OID 201940279

In the cases we've examined so far, the failed query succeeds just fine when we run it manually. The failed query also had run on an async streaming replica, and the primary has completed at least one autovacuum since the failure. I don't know if either of those two facts are relevant, but I'm not sure what else to blame. The internet seems to want to blame issues like this on temp tables, which makes sense, but in our case, most of the queries that are failing this way are simple PK scans, which then fall back to the table to pull all the columns. The tables themselves are small in row count - although some values are likely TOASTed - so I would be surprised if anything is spilling to disk for sorting, which might have counted as a temp table enough to give such an error.

This is a minuscule failure percentage, so replicating it is going to be hard, but it is still breaking for reasons I don't understand, and so I'd like to fix it. Has anybody else seen this, or have an ideas of what to look at?

Other things we've considered:
    - we run pg_repack, which certainly seems like it could make an error like this, but we see this error in places and times that pg_repack isn't currently running
    - although all our servers are currently on 12.9, I don't think this is a new error for us. I believe we might have seen it on previous minor versions of 12 and probably on 9.5 as well.
    - our filesystem is xfs and seems reliable. I would expect that if it was a filesystem level error, it would not be so transient. We do, occasionally, expand our filesystems, but not at all the times we've seen this error.

Re: could not open relation with OID

От
Michael Paquier
Дата:
On Wed, Jan 26, 2022 at 05:30:01PM -0800, Ben Chobot wrote:
> We do a lot of queries per day, over a lot of hosts, all of which are on
> 12.9. We've recently started doing a better job at analyzing our db logs and
> have found that, a few times a day, every day, we see some of our queries
> fail with errors like:
>
> could not open relation with OID 201940279
>
> In the cases we've examined so far, the failed query succeeds just fine when
> we run it manually. The failed query also had run on an async streaming
> replica, and the primary has completed at least one autovacuum since the
> failure. I don't know if either of those two facts are relevant, but I'm not
> sure what else to blame. The internet seems to want to blame issues like
> this on temp tables, which makes sense, but in our case, most of the queries
> that are failing this way are simple PK scans, which then fall back to the
> table to pull all the columns. The tables themselves are small in row count
> - although some values are likely TOASTed - so I would be surprised if
> anything is spilling to disk for sorting, which might have counted as a temp
> table enough to give such an error.

Do those OIDs point to some specific relations?  It should be easy
enough to guess to which pg_class entry they point to, especially if
you have a persistent schema, and it these are indeed temporary
entries or not depending on their pg_class.relnamespace.

> This is a minuscule failure percentage, so replicating it is going to be
> hard, but it is still breaking for reasons I don't understand, and so I'd
> like to fix it. Has anybody else seen this, or have an ideas of what to look
> at?

I don't recall seeing such reports recently.

> Other things we've considered:
>     - we run pg_repack, which certainly seems like it could make an error
> like this, but we see this error in places and times that pg_repack isn't
> currently running

It could also take time for the issue to show up, depending on the
state of the relcache.
--
Michael

Вложения

Re: could not open relation with OID

От
Ben Chobot
Дата:
Michael Paquier wrote on 1/26/22 9:14 PM:
On Wed, Jan 26, 2022 at 05:30:01PM -0800, Ben Chobot wrote:
Other things we've considered:
    - we run pg_repack, which certainly seems like it could make an error
like this, but we see this error in places and times that pg_repack isn't
currently running
It could also take time for the issue to show up, depending on the
state of the relcache.

So.... tell me more about stale relcaches? It turns out I was totally wrong and this is being driven by pg_repack. I can even make it happen pretty easily:

1. Put some data in a table with a single btree index on a primary db.
2. Set up streaming replication to a secondary db.
3. In a loop on the primary, have pg_repack repack the indices of that table. (the -x flag)
4. In a loop on the secondary, have psql query the secondary db for an indexed value of that table.

When I do this with replication, I can get the OID error consistently within 30 minutes. Without replication, I've been unable to get it to happen after 2 hours.

Given that this fails much faster on the secondary than the primary (where it has yet to fail at all) I'm leaning towards a postgres bug, but I'm happy to do more research to point the blame at something pg_repack is doing, if you could point me at a thing to research.