Discussion: BUG #19411: libpq 16.x exhibits a memory leak when connections are repeatedly created and destroyed
From: PG Bug reporting form
The following bug has been logged on the website:
Bug reference: 19411
Logged by: Matt Carter
Email address: mcarter@twosigma.com
PostgreSQL version: 16.7
Operating system: Debian GNU/Linux 12 with kernel 6.12.51-ts1-amd64
Description:
This report is for the libpq client library for PostgreSQL.
I noticed the issue because my Apache Airflow task supervisor processes were
experiencing memory growth of ~600 KB/min. When running hundreds of tasks,
my 32 GB machine OOM-crashed every few hours. I traced the problem to when
I upgraded libpq from version 13 to 16. Varying the Python version and the
PostgreSQL server version didn't make a difference. When I switched to psycopg3,
the leak rate dropped dramatically.
Confirmed leaking:
libpq 16.4
libpq 16.7
Confirmed NOT leaking:
libpq 13.11
My environment for the below reproducer is:
Client libpq: 16.4 or 16.7 (runtime version)
Python: 3.10.16 or 3.11.11
psycopg2: 2.9.10
SQLAlchemy: 1.4.46
PostgreSQL Server: 13.11 OR 16.7 (both affected equally)
Platform: Linux, K8s containers
Reliable reproducer:
```
#!/usr/bin/env python3
"""
Reproduces an ~800 KB/min memory leak in libpq 16.x.
MUST monitor with 'ps aux', NOT Python's resource module!
"""
import datetime
import glob
import os
import subprocess
from time import sleep

from sqlalchemy import Column, Integer, types, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import scoped_session, sessionmaker
from sqlalchemy.pool import NullPool

# Verify the libpq version psycopg2 was compiled against
import psycopg2
print(f"psycopg2 compiled against libpq: {psycopg2.__libpq_version__}")

# Check the ACTUAL runtime libpq (critical!); glob works for any CPython version
so_path = glob.glob(os.path.join(os.path.dirname(psycopg2.__file__),
                                 "_psycopg.cpython-*.so"))[0]
result = subprocess.run(["ldd", so_path], capture_output=True, text=True)
for line in result.stdout.split('\n'):
    if 'libpq.so' in line:
        print(f"ACTUAL runtime libpq: {line.split('=>')[1].split('(')[0].strip()}")

print(f"\nMonitor this PID: {os.getpid()}")
print("Run in another terminal: watch -n 10 'ps -p <PID> -o pid,rss,vsz,cmd'")
print()

# Set up SQLAlchemy with NullPool (creates/destroys a connection per transaction)
Base = declarative_base()

class TestTable(Base):
    __tablename__ = "test_leak_table"
    id = Column(Integer, primary_key=True)
    updated_at = Column(types.DateTime(timezone=True))

obj = TestTable()
engine = create_engine(
    "postgresql://your-server/your-database",
    poolclass=NullPool,  # Critical: no pooling, create/destroy each time
    encoding="utf-8",
)
Session = scoped_session(sessionmaker(
    autocommit=False,
    autoflush=False,
    bind=engine,
    expire_on_commit=False,
))
session = Session()
session.add(obj)

# Monitor memory using ps aux (CRITICAL - internal monitoring doesn't show the leak!)
def get_rss_kb():
    """Get RSS from ps aux - DO NOT use resource.getrusage()!"""
    pid = os.getpid()
    result = subprocess.run(["ps", "aux"], capture_output=True, text=True)
    for line in result.stdout.split('\n'):
        if str(pid) in line and 'python' in line:
            return int(line.split()[5])  # RSS is the 6th column
    return None

print("Starting 10-minute test...")
print(f"{'Time(s)':<8} | {'Iteration':<10} | {'RSS(KB)':<10} | {'Growth(KB)':<10}")
print("-" * 55)

start_time = datetime.datetime.now()
baseline_rss = None
for i in range(1, 601):  # 600 iterations = 10 minutes
    # Update and commit (creates/destroys a connection with NullPool)
    obj.updated_at = datetime.datetime.now(datetime.timezone.utc)
    session.commit()
    # Report every 30 iterations
    if i % 30 == 0:
        rss = get_rss_kb()
        if rss:
            if i == 30:
                baseline_rss = rss
                growth = 0
            else:
                growth = rss - baseline_rss
            elapsed = (datetime.datetime.now() - start_time).total_seconds()
            print(f"{int(elapsed):<8} | {i:<10} | {rss:<10} | {growth:<10}",
                  flush=True)
    sleep(1)

# Final results
final_rss = get_rss_kb()
total_elapsed = (datetime.datetime.now() - start_time).total_seconds()
growth_total = final_rss - baseline_rss
leak_rate = growth_total / (total_elapsed / 60)
print()
print("=" * 60)
print(f"Baseline RSS (t=30s): {baseline_rss} KB")
print(f"Final RSS (t=600s):   {final_rss} KB")
print(f"Total Growth:         {growth_total} KB")
print(f"Leak Rate:            {leak_rate:.1f} KB/min")
print("=" * 60)
```
The DB set-up for this reproducer is simple:
```
CREATE TABLE test_leak_table (
id SERIAL PRIMARY KEY,
updated_at TIMESTAMP WITH TIME ZONE
);
INSERT INTO test_leak_table (updated_at) VALUES (NOW());
```
When I run this with libpq 13.x, I get:
Baseline: ~41,000 KB
Final: ~41,200 KB
Growth: ~200 KB (stabilizes)
Leak Rate: ~14 KB/min
But when I run it with libpq 16.x, I reproduce the leak behavior:
Baseline: ~42,000 KB
Final: ~52,000 KB
Growth: ~10,000 KB (continuous linear growth)
Leak Rate: ~800 KB/min
Notes:
1. PostgreSQL server version does NOT matter
PG 13.11: ~797 KB/min average
PG 16.7: ~802 KB/min average
Difference: <1% (within measurement error)
2. Leak is linear and continuous
Grows ~530 KB every 30 seconds
No stabilization
No sawtooth pattern
Suggests unbounded memory accumulation
3. Leak is in C library memory
The leak is:
- Visible in `ps aux` RSS (the operating system's view)
- NOT visible in Python's resource.getrusage() (the application's view)
- NOT any Python object (I checked with the Python VM's `gc.get_objects()`)
This indicates the leak is in libpq's C-level allocations, not the Python
heap (see the sketch after these notes).
4. The leak requires connection churn. In particular, it needs:
- NullPool (no connection pooling)
- Repeated connection create/destroy
- UPDATE + COMMIT operations each iteration
Note that the only psycopg2 version I tested with was 2.9.10, as it is the
only version available in my test environment.
Also, I only tested with SQLAlchemy version 1.4.46, as v2 would require a
rewrite due to an incompatible API.
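To make note 3 concrete, here is one way to compare the two views (a
sketch, assuming Linux; reading /proc/self/status avoids parsing `ps`
output, and tracemalloc tracks only Python-heap allocations):
```
import tracemalloc

def rss_kb_from_proc():
    """OS view: resident set size in KB, from /proc/self/status (Linux)."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return None

tracemalloc.start()
# ... run the connection-churn loop from the reproducer here ...
py_current, _ = tracemalloc.get_traced_memory()
print(f"OS RSS:      {rss_kb_from_proc()} KB")     # grows with the leak
print(f"Python heap: {py_current / 1024:.0f} KB")  # stays flat
```
If the OS RSS grows while the Python-heap figure stays flat, the extra
memory is being allocated below the Python level, e.g. in libpq or
another C library.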
Workarounds:
Use libpq 13.x (~43x improvement: 14-22 KB/min vs ~800 KB/min).
Use connection pooling to reduce connection churn (see the sketch below).
Use psycopg3.
Recycle processes frequently.
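For the pooling workaround, a minimal sketch using SQLAlchemy's default
QueuePool (the numbers are illustrative, not tuned):
```
from sqlalchemy import create_engine

# The default QueuePool reuses connections instead of opening and
# closing one per transaction, avoiding the per-connection churn above.
engine = create_engine(
    "postgresql://your-server/your-database",
    pool_size=5,        # keep up to 5 persistent connections
    max_overflow=2,     # allow 2 extra connections under load
    pool_recycle=1800,  # proactively recycle connections every 30 minutes
)
```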
PG Bug reporting form <noreply@postgresql.org> writes:
> This report is for the libpq client library for PostgreSQL.
> I noticed the issue because my Apache Airflow task supervisor processes were
> experiencing memory growth of ~600 KB/min. When running hundreds of tasks,
> my 32 GB machine OOM-crashed every few hours. I traced the problem to when
> I upgraded libpq from version 13 to 16. Varying Python version, and
> PostgreSQL DB version didn't make a difference. When I updated to psycopg3,
> the leak rate dropped dramatically.
I think there are way too many moving parts here, and too few
configuration details, to allow assigning blame confidently.
I tried making a simple C test program that just connected, issued
BEGIN/UPDATE/COMMIT, and disconnected in a tight loop. I see zero
leakage with that. So either the leak isn't actually in libpq, or
there's some critical environmental factor you didn't mention.
Plausible such factors include connection type and authentication
method. (For example, years ago libpq did leak memory while using
GSSAPI encryption.) I tried both regular and SSL connections but
didn't really push hard on that, since I'd just be guessing blindly
about what your setup is.
regards, tom lane
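For reference, a rough Python/ctypes equivalent of the loop Tom
describes (his actual test was a C program): this sketch drives libpq
directly, with no psycopg2 in the path, reusing the placeholder
connection string and table from the reproducer above, and assumes a
system libpq that ctypes can find:
```
import ctypes
import ctypes.util

# Load the runtime libpq directly; no psycopg2 involved.
libname = ctypes.util.find_library("pq")  # e.g. "libpq.so.5"
libpq = ctypes.CDLL(libname)
libpq.PQconnectdb.restype = ctypes.c_void_p
libpq.PQconnectdb.argtypes = [ctypes.c_char_p]
libpq.PQstatus.argtypes = [ctypes.c_void_p]
libpq.PQexec.restype = ctypes.c_void_p
libpq.PQexec.argtypes = [ctypes.c_void_p, ctypes.c_char_p]
libpq.PQclear.argtypes = [ctypes.c_void_p]
libpq.PQfinish.argtypes = [ctypes.c_void_p]

CONNECTION_OK = 0  # first value of libpq's ConnStatusType enum

for _ in range(10_000):
    conn = libpq.PQconnectdb(b"postgresql://hostname/database")
    if libpq.PQstatus(conn) != CONNECTION_OK:
        libpq.PQfinish(conn)
        raise RuntimeError("connection failed")
    for sql in (b"BEGIN",
                b"UPDATE test_leak_table SET updated_at = now()",
                b"COMMIT"):
        libpq.PQclear(libpq.PQexec(conn, sql))  # free each PGresult
    libpq.PQfinish(conn)  # destroy the connection each iteration
```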
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think there are way too many moving parts here, and too few
> configuration details, to allow assigning blame confidently.
> I tried making a simple C test program that just connected, issued
> BEGIN/UPDATE/COMMIT, and disconnected in a tight loop. I see zero
> leakage with that. So either the leak isn't actually in libpq, or
> there's some critical environmental factor you didn't mention.
>
> Plausible such factors include connection type and authentication
> method. (For example, years ago libpq did leak memory while using
> GSSAPI encryption.) I tried both regular and SSL connections but
> didn't really push hard on that, since I'd just be guessing blindly
> about what your setup is.

Tom,

Thank you for taking the time to test this and for the feedback. Your C
test showing no leak suggests the issue is specific to how psycopg2 uses
libpq, not libpq itself.

I apologize for not including enough environmental details. I used
Kerberos/GSSAPI with SSL (TLS 1.2 connections). My connection string
was: "postgresql://hostname/database" (no password, Kerberos auth).

Your mention of "years ago libpq did leak memory while using GSSAPI
encryption" is interesting because we ARE using GSSAPI/Kerberos
authentication. Given that your C test showed no leak, this appears to
be specific to GSSAPI authentication, or perhaps to how psycopg2 uses
libpq with GSSAPI.

I can test with non-GSSAPI authentication to try to isolate that
variable. I can also create a pure psycopg2 reproducer (without
SQLAlchemy), and test whether disabling GSSAPI encryption (but keeping
GSSAPI auth) changes the behavior. Would testing with GSSAPI
authentication help narrow this down? I can also report this to the
psycopg2 project if you think it's their issue.

Thanks again for your help,
Matt
Matt Carter <Matt.Carter@twosigma.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think there are way too many moving parts here, and too few configuration details,
>> to allow assigning blame confidently.
> Thank you for taking the time to test this and for the feedback. Your C
> test showing no leak suggests the issue is specific to how psycopg2 uses
> libpq, not libpq itself. I apologize for not including enough
> environmental details. I used Kerberos/GSSAPI with SSL (TLS 1.2
> connections). My connection string was: "postgresql://hostname/database"
> (no password, Kerberos auth).
> Your mention of "years ago libpq did leak memory while using GSSAPI
> encryption" is interesting because we ARE using GSSAPI/Kerberos
> authentication.
Interesting. I wondered about GSSAPI, but spinning up such an
environment is more work than I wanted to do on speculation.
> I can test with non-GSSAPI authentication to try to isolate that
> variable. I can also create a pure psycopg2 reproducer (without
> SQLAlchemy). I can also test whether disabling GSSAPI encryption (but
> keeping GSSAPI auth) changes the behavior. Would testing with GSSAPI
> authentication help narrow this down? I can also report this to the
> psycopg2 project if you think it's their issue.
Please try varying the connection type and encryption. I do suspect
this may be psycopg2's fault, but we lack enough data to pin blame
as yet.
regards, tom lane
Hi,

On 2026-02-17 13:47:49 -0500, Tom Lane wrote:
> Matt Carter <Matt.Carter@twosigma.com> writes:
> > Thank you for taking the time to test this and for the feedback. Your C
> > test showing no leak suggests the issue is specific to how psycopg2 uses
> > libpq, not libpq itself. I apologize for not including enough
> > environmental details. I used Kerberos/GSSAPI with SSL (TLS 1.2
> > connections). My connection string was: "postgresql://hostname/database"
> > (no password, Kerberos auth).
> > Your mention of "years ago libpq did leak memory while using GSSAPI
> > encryption" is interesting because we ARE using GSSAPI/Kerberos
> > authentication.
>
> Interesting. I wondered about GSSAPI, but spinning up such an
> environment is more work than I wanted to do on speculation.

Heh, understandable...

> > I can test with non-GSSAPI authentication to try to isolate that
> > variable. I can also create a pure psycopg2 reproducer (without
> > SQLAlchemy). I can also test whether disabling GSSAPI encryption (but
> > keeping GSSAPI auth) changes the behavior. Would testing with GSSAPI
> > authentication help narrow this down? I can also report this to the
> > psycopg2 project if you think it's their issue.
>
> Please try varying the connection type and encryption. I do suspect
> this may be psycopg2's fault, but we lack enough data to pin blame
> as yet.

Matt, you could try analyzing the memory usage with heaptrack. It tends
to be pretty good at finding leaks even in uninstrumented builds, as
long as enough debug symbols for a backtrace are available. Often enough
it'll pinpoint where the leak is coming from quite easily (but note that
it will report some constant-sized leaks that are "intentional").

Greetings,

Andres
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Please try varying the connection type and encryption.

Good idea. Here's what I found:

Test | Configuration                      | libpq | SQLAlchemy | Leak rate
-----+------------------------------------+-------+------------+-----------
  1  | Original (SQLAlchemy + NullPool)   | 16.7  | yes        | 803 KB/min
  2  | Pure psycopg2 (default)            | 16.7  | no         | 801 KB/min
  3  | Pure psycopg2 + gssencmode=disable | 16.7  | no         | 858 KB/min
  4  | Pure psycopg2 + SSL only           | 16.7  | no         | 861 KB/min
  5  | Pure psycopg2                      | 13.11 | no         |  17 KB/min

So it seems none of these changes avoids the leak:
- removing SQLAlchemy
- disabling GSSAPI encryption
- using SSL only
- changing the PostgreSQL server version from 16 to 13

The only changes I found that avoid the leak are:
- changing the libpq version from 16 to 13, or
- changing the psycopg version from 2 to 3.

Tom, since your C test showed no leak, the issue is likely in how
psycopg2 calls libpq, not pure libpq itself. I guess I should report
this to the psycopg2 project.
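(A minimal pure-psycopg2 churn loop of the kind used in tests 2-5 might
look like the following sketch; the DSN is a placeholder, and
gssencmode/sslmode are the standard libpq connection options varied
between the tests:)
```
import psycopg2

# Placeholder DSN; append ?gssencmode=disable or ?sslmode=require to
# reproduce tests 3 and 4 respectively.
DSN = "postgresql://hostname/database"

for _ in range(600):
    conn = psycopg2.connect(DSN)      # open a fresh connection (no pooling)
    with conn, conn.cursor() as cur:  # psycopg2: BEGIN ... COMMIT on success
        cur.execute("UPDATE test_leak_table SET updated_at = now()")
    conn.close()                      # and tear it down again
```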
Matt Carter wrote:
> Tom, since your C test showed no leak, the issue is likely in how
> psycopg2 calls libpq, not pure libpq itself.

FYI, I have reported the issue to the psycopg project:
https://github.com/psycopg/psycopg2/issues/1827

Thank you for your help, Tom. Your test results helped show that this is
likely a psycopg2 issue.

Cheers,
Matt