Обсуждение: Hanging query on a fresh restart

Поиск
Список
Период
Сортировка

Hanging query on a fresh restart

От
Jason Jho
Дата:
Hi,

We using Postgres 9.3.10 on Amazon RDS and running into some strange behavior that has been tough to track down and debug (partially due to the limited admin access from RDS).

We're running a read-only query that normally takes ~10-15 min., but also runs concurrently with several other intensive queries (these queries themselves, finish).  

On one particular day, this query hung for many hours and even while we killed pids for running queries and any locks granted, the query would never return.  Also no hints of blocking processes.  After some digging through some I/O metrics, we didn't see any memory issues or unusual spikes that would lead us to believe that we're running low on resources.

There is 1 caveat, however:  there was a different schema that contained a day-old copy of data that isn't normally present when the hang started to occur.  However, since these are completely different schema namespaces with no crossovers in the queries themselves, I don't see how this is relevant.


 1) We ended up doing a full reboot of the RDS instance and ran the query again, this time, no other queries are running off of a fresh boot-up (no competing locks or transactions).  The query continued to hang.

 2) We then ran pg_dump to snapshot the current data and did a full pg_restore (after dropping all schemas) of an older dataset where we knew this query would run successfully.  As expected, the query ran fine.

 3) We then dropped all schemas again and pg_restored the previous dataset that was causing the query to hang, and then to my surprise, the query ran just fine.  No hangs.  

We thought this might be possibly due to some internal vacuuming, but this is unlikely since there are no real concurrent reads or updates happening.  Auto-vacuum is also on with default settings.

What is the most confusing part in all of this is why a DROP SCHEMA CASCADE and a fresh pg_restore would somehow fix the problem.  Even a fresh reboot didn't fix it.

Any ideas??

Re: Hanging query on a fresh restart

От
Jim Nasby
Дата:
On 11/10/15 3:42 PM, Jason Jho wrote:
> On one particular day, this query hung for many hours and even while we
> killed pids for running queries and any locks granted, the query would
> never return.  Also no hints of blocking processes.  After some digging
> through some I/O metrics, we didn't see any memory issues or unusual
> spikes that would lead us to believe that we're running low on resources.

Did IO stats indicate IO was happening? Did you see a pegged CPU running
the query?

> There is 1 caveat, however:  there was a different schema that contained
> a day-old copy of data that isn't normally present when the hang started
> to occur.  However, since these are completely different schema
> namespaces with no crossovers in the queries themselves, I don't see how
> this is relevant.

If search_path wasn't what you thought it was you could have easily been
running against the wrong set of tables.

> We thought this might be possibly due to some internal vacuuming, but
> this is unlikely since there are no real concurrent reads or updates
> happening.  Auto-vacuum is also on with default settings.

There are other reasons why autovacuum could kick in, notably to prevent
transaction ID wraparound.

> What is the most confusing part in all of this is why a DROP SCHEMA
> CASCADE and a fresh pg_restore would somehow fix the problem.  Even a
> fresh reboot didn't fix it.

Without more info we're stuck guessing. You might try submitting a
ticket with amazon, especially if you can reproduce this.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


Re: Hanging query on a fresh restart

От
Kevin Grittner
Дата:
On Friday, November 13, 2015 3:41 PM, Jim Nasby <Jim.Nasby@BlueTreble.com> wrote:
> On 11/10/15 3:42 PM, Jason Jho wrote:

>> We using Postgres 9.3.10

>> What is the most confusing part in all of this is why a DROP
>> SCHEMA CASCADE and a fresh pg_restore would somehow fix the
>> problem. Even a fresh reboot didn't fix it.
>
> Without more info we're stuck guessing. You might try submitting
> a ticket with amazon, especially if you can reproduce this.

There have been occasional reports of corrupted indexes causing
endless loops which could cause these symptoms if one core was
pegged at 100% during the incident.  There are many possible causes
for such corruption -- see:

http://rhaas.blogspot.com/2012/03/why-is-my-database-corrupted.html

That said, there was a long-standing bug in btree index page
deletion (which could only happen during vacuum or autovacuum)
which was fixed in 9.4:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=efada2b8e920adfdf7418862e939925d2acd1b89

It was pretty hard to hit, and normally wouldn't cause these
symptoms, but if there was a "perfect storm" of events before the
problem self-corrected, I think it might possibly lead to this.  If
we could somehow confirm that this old bug was the cause, it might
justify pushing this patch back into older branches.  As the commit
message said:

| This bug is old, all supported versions are affected, but this patch is too
| big to back-patch (and changes the WAL record formats of related records).
| We have not heard any reports of the bug from users, so clearly it's not
| easy to bump into. Maybe backpatch later, after this has had some field
| testing.

Did you make a filesystem-level copy of the data directory?  If so,
the first step in checking this theory would be to restore a copy
and reindex all indexes used by the problem query to see if that
fixes it.  If it does, close examination of the corrupted index
might provide clues about how the corruption occurred.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company