Обсуждение: contrib/pg_stat_tcpinfo
Attached is pg_stat_tcpinfo, an heavy work in progress, Linux-only
netstat/ss-like extension for showing detailed information about TCP
connections based on information from the kernel itself. It's the
result of some frustration and some corridor talks during recent
PgConfEU, where we agreed with Andres that it would be cool to have
this in core to support and resolve problems much more quickly (in
spite of this being a Linux-only extension for a start). As it stands
it can help troubleshooting advanced performance problems like:
* basic connection problems (e.g. imagine being able to see that
postgres_fdw PID is stuck in SYN_SENT)
* bandwidth problems where [logical/physical] replication and/or
backup (pg_basebackup) are not keeping up due any of the above: full
sendQ/recvQ, random packet drops on connection, too high RTT (WAN),
network jitter, etc.
* tracking down reasons for stuck or reset-by-peer TCP connections due
to wrongly firewalls silently killing idle connections (it's visible
via not TCP keepalive not being activated in time, and later stuck in
timer retransmits)
* high latency outliers for SyncRep too (e.g. via
lastsnd/lastrcv/lastack timers)
* probably some more
One could argue that while netstat/ss provide the necessary
information, it's often much harder than necessary to combine the
information with postgres information (like which connections are for
replication, wait events, correlation to other DB activity). The idea
is to have it in core, so it's available always.
Sample use (psql issuing COPY to postgres_fdw to the same DB over
127.0.0.1 -- 3 connections are visible: 1 for postgres_fdw itself, and
two sides of the network socket [this happens only on localhost])
postgres=# select
pid, application_name, substring(query,1,40) q,
wait_event, src_addr, src_port, dst_addr, dst_port,
recvq, sendq,
t.state, tcpinfo->>'rtt' as rtt, tcpinfo->>'rwnd_limited' as rwnd_limited,
pg_size_pretty(cast(tcpinfo->>'delivery_rate' as bigint)) as
peak_delivery_rate,
substring(tcpinfo::text,1,64) || '...' as substr
from pg_stat_get_tcpinfo() t join pg_stat_activity a using (pid)
where
t.state ='ESTABLISHED' and pid <> pg_backend_pid();
pid | application_name | q |
wait_event | src_addr | src_port | dst_addr | dst_port |
recvq | sendq | state | rtt | rwnd_limited |
peak_delivery_rate | substr
-------+------------------+------------------------------------------+----------------------+-----------+----------+-----------+----------+-------+-------+-------------+-------+--------------+--------------------+---------------------------------------------------------------------
81279 | psql | COPY (SELECT bindata FROM fdw_bandwidth_ |
PostgresFdwGetResult | 127.0.0.1 | 1234 | 127.0.0.1 | 52612 |
0 | 0 | ESTABLISHED | 5.092 | 0 | 422 MB
| {"ato": 40000, "rto": 206000, "rtt": 5.092, "lost": 0, "pmtu": 6...
81279 | psql | COPY (SELECT bindata FROM fdw_bandwidth_ |
PostgresFdwGetResult | 127.0.0.1 | 59562 | 127.0.0.1 | 1234 |
0 | 0 | ESTABLISHED | 0.038 | 0 | 2604 MB
| {"ato": 40000, "rto": 201000, "rtt": 0.038, "lost": 0, "pmtu": 6...
81282 | postgres_fdw | FETCH 100 FROM c1 |
BuffileWrite | 127.0.0.1 | 1234 | 127.0.0.1 | 59562 |
0 | 0 | ESTABLISHED | 0.380 | 1735000 | 833 MB
| {"ato": 40000, "rto": 201000, "rtt": 0.380, "lost": 0, "pmtu": 6...
(3 rows)
Some early feedback about direction in order to bring this into core
would be appreciated. State of stuff:
1. Andres is pushing for supporting UNIX domain sockets here, but I'm
not sure if it is really worth the effort (and it would trigger new
naming problem;)) and primarily making the code even more complex.
IMHO the netlinksock_diag API is already convoluted and adding AF_UNIX
would make it even less readable.
2. IPv6 works, but wasn't tested much.
3. Biggest TODO left is probably properly formatting the information
based on struct tcpinfo variables (just like ss(1) does, so keeping
the same unit/formatting)
4. Patch/tests are missing intentionally as I would like first to
stabilize the outputs/naming/code first.
5. [security] Should this be available to pg_monitor/pg_read_all_stats
or just to superuser?
6. [security] Should this return info about all TCP connections or
just the UID of the postmaster?
-J.
Вложения
On Mon, Nov 3, 2025 at 3:09 PM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote: > > Attached is pg_stat_tcpinfo, an heavy work in progress, Linux-only > netstat/ss-like extension for showing detailed information about TCP > connections based on information from the kernel itself. [..] > Some early feedback about direction in order to bring this into core > would be appreciated. State of stuff: > > 1. Andres is pushing for supporting UNIX domain sockets here, but I'm > not sure if it is really worth the effort (and it would trigger new > naming problem;)) and primarily making the code even more complex. > IMHO the netlinksock_diag API is already convoluted and adding AF_UNIX > would make it even less readable. > 2. IPv6 works, but wasn't tested much. > 3. Biggest TODO left is probably properly formatting the information > based on struct tcpinfo variables (just like ss(1) does, so keeping > the same unit/formatting) > 4. Patch/tests are missing intentionally as I would like first to > stabilize the outputs/naming/code first. > 5. [security] Should this be available to pg_monitor/pg_read_all_stats > or just to superuser? > 6. [security] Should this return info about all TCP connections or > just the UID of the postmaster? v2 attached with tiny fixes and little more code readability (for dumping struct tcpinfo *) -J.
Вложения
On 11/7/25 11:36, Jakub Wartak wrote: > On Mon, Nov 3, 2025 at 3:09 PM Jakub Wartak > <jakub.wartak@enterprisedb.com> wrote: >> >> Attached is pg_stat_tcpinfo, an heavy work in progress, Linux-only >> netstat/ss-like extension for showing detailed information about TCP >> connections based on information from the kernel itself. > [..] > I personally don't remember ever needing this kind of visibility into TCP connections, but I'm also not doing all that much direct support recently. And even in the past I personally didn't need to look at TCP-level details all that much, the problems were usually elsewhere. But maybe it's very useful in practice, don't know. >> Some early feedback about direction in order to bring this into core >> would be appreciated. State of stuff: >> Well, it's an extension in contrib. Is that sufficiently "in core"? Do you expect this to be used in PROD environments, or more in DEV? >> 1. Andres is pushing for supporting UNIX domain sockets here, but I'm >> not sure if it is really worth the effort (and it would trigger new >> naming problem;)) and primarily making the code even more complex. >> IMHO the netlinksock_diag API is already convoluted and adding AF_UNIX >> would make it even less readable. No idea. For most real-world production systems the client is probably on a different machine, so won't use UNIX sockets. Not always, but how often do UNIX sockets have network-like problems for this visibility to be helpful? Also, how much work / extra code would it be to support UNIX sockets? >> 2. IPv6 works, but wasn't tested much. >> 3. Biggest TODO left is probably properly formatting the information >> based on struct tcpinfo variables (just like ss(1) does, so keeping >> the same unit/formatting) I don't follow? Why do you want to format data the way "ss" does? >> 4. Patch/tests are missing intentionally as I would like first to >> stabilize the outputs/naming/code first. >> 5. [security] Should this be available to pg_monitor/pg_read_all_stats >> or just to superuser? I suppose making this superuser-only would be against the effort to not require superuser privileges, so using roles seems like the way to go. The nature of the data seems very similar to pg_stat_activity (i.e. info about current connections), so I'd suggest to use pg_read_all_stats. It might even use an approach similar to pg_stat_get_activity(), which shows some fields to everyone, and the role is required only for fields with sensitive information. >> 6. [security] Should this return info about all TCP connections or >> just the UID of the postmaster? > Not sure if I understand the question, but IMHO this should return only info about connections opened by Postgres. Not for connections about TCP connections opened by other stuff running on the server. regards -- Tomas Vondra
Hi, On 2025-11-08 00:17:30 +0100, Tomas Vondra wrote: > On 11/7/25 11:36, Jakub Wartak wrote: > > On Mon, Nov 3, 2025 at 3:09 PM Jakub Wartak > > <jakub.wartak@enterprisedb.com> wrote: > >> > >> Attached is pg_stat_tcpinfo, an heavy work in progress, Linux-only > >> netstat/ss-like extension for showing detailed information about TCP > >> connections based on information from the kernel itself. > > [..] > > > > I personally don't remember ever needing this kind of visibility into > TCP connections, but I'm also not doing all that much direct support > recently. And even in the past I personally didn't need to look at > TCP-level details all that much, the problems were usually elsewhere. > > But maybe it's very useful in practice, don't know. FWIW, I've needed this many times. Without the TCP information it's very hard to figure out why higher latency connections aren't keeping up - is it packet loss, it it too small network buffers on the sending/receiving side, is it the remote side not keeping up on the CPU level... > >> Some early feedback about direction in order to bring this into core > >> would be appreciated. State of stuff: > >> > > Well, it's an extension in contrib. Is that sufficiently "in core"? Do > you expect this to be used in PROD environments, or more in DEV? I'd expect it to be mainly in prod, although not exclusively. > >> 2. IPv6 works, but wasn't tested much. > >> 3. Biggest TODO left is probably properly formatting the information > >> based on struct tcpinfo variables (just like ss(1) does, so keeping > >> the same unit/formatting) > > I don't follow? Why do you want to format data the way "ss" does? Yea, I don't get that either - IMO ss's format is just about the worst possible format. It's very hard to parse and doesn't seem to have advantages making up for that. Greetings, Andres Freund
On Sat, Nov 8, 2025 at 12:17 AM Tomas Vondra <tomas@vondra.me> wrote:
Hi Tomas, thanks for responding!
> On 11/7/25 11:36, Jakub Wartak wrote:
> > On Mon, Nov 3, 2025 at 3:09 PM Jakub Wartak
> > <jakub.wartak@enterprisedb.com> wrote:
> >>
> >> Attached is pg_stat_tcpinfo, an heavy work in progress, Linux-only
> >> netstat/ss-like extension for showing detailed information about TCP
> >> connections based on information from the kernel itself.
> > [..]
> >
>
> I personally don't remember ever needing this kind of visibility into
> TCP connections, but I'm also not doing all that much direct support
> recently. And even in the past I personally didn't need to look at
> TCP-level details all that much, the problems were usually elsewhere.
>
> But maybe it's very useful in practice, don't know.
Well, as stated earlier, such tooling was required each time for
latency problem with SyncRep or bandwidth problem with anything else.
> >> Some early feedback about direction in order to bring this into core
> >> would be appreciated. State of stuff:
> >>
>
> Well, it's an extension in contrib. Is that sufficiently "in core"? Do
> you expect this to be used in PROD environments, or more in DEV?
Contrib/* for some reason sounds to me like being already in core and
good enough (core project "PostgreSQL" not core like "src/").
Technically it's just about having stuff ready to be used quickly when
it's needed (banks, fintech, k8) rather than getting into debates IF
it can be installed, IF it is being supported and so on. Having access
to such more advanced debugging/correlation utilities makes a better
impression about its maturity (time to resolution is lower). It's
almost always about PROD-like systems.
> >> 1. Andres is pushing for supporting UNIX domain sockets here, but I'm
> >> not sure if it is really worth the effort (and it would trigger new
> >> naming problem;)) and primarily making the code even more complex.
> >> IMHO the netlinksock_diag API is already convoluted and adding AF_UNIX
> >> would make it even less readable.
>
> No idea. For most real-world production systems the client is probably
> on a different machine, so won't use UNIX sockets. Not always, but how
> often do UNIX sockets have network-like problems for this visibility to
> be helpful?
Yes, same here, never got into problems with domain sockets
performance, but Andres wanted to have that visibility probably to
test local pgbench/basebackups runs, but in such development-like
scenarios `ss -moe` does not show that many interesting performance
metrics there (it's just skmem):
Netid State Recv-Q Send-Q
Local Address:Port Peer Address:Port Process
u_str ESTAB 0 0
/var/run/postgresql/.s.PGSQL.5432 86934 * 112798
skmem:(r0,rb212992,t0,tb212992,f0,w0,o0,bl0,d0) <-> ino:3532 dev:0/27
So my line of thinking would be to not clutter the code more than
necessary to add something that's already easily available (on PG
devs/commiters laptops), but would be rareley used in the field.
> Also, how much work / extra code would it be to support UNIX sockets?
+150-200 lines more - maybe some generalization of some routines to
handle AF_UNIX and AF_UNIX (and that would potentially pave the road
to also support other sock_diag data, e.g. udp, but I dont want to add
that too)
> >> 3. Biggest TODO left is probably properly formatting the information
> >> based on struct tcpinfo variables (just like ss(1) does, so keeping
> >> the same unit/formatting)
>
> I don't follow? Why do you want to format data the way "ss" does?
Hm, no strong feelings there, I just wanted to have the same
information. I don't mind formatting it any other way. Today in v2
that's OBJ (One Big Jsonb):
postgres=# select pid, tcpinfo->'rtt' as RTT, jsonb_pretty(tcpinfo)
from pg_stat_tcpinfo limit 1;
-[ RECORD 1 ]+---------------------------------------------
pid | 13019
rtt | 23.445
jsonb_pretty | { +
| "ato": 40.000, +
| "rto": 224000, +
| "rtt": 23.445, +
| "lost": 0, +
| "pmtu": 1500, +
| "skmem": { +
| "optmem": 0, +
| "rcvbuf": 1239696, +
| "sndbuf": 87040, +
| "fwd_alloc": 3136, +
| "rmem_alloc": 960, +
| "wmem_alloc": 0, +
| "wmem_queued": 0 +
| }, +
| "state": 8, +
| "timer": "(off,0min0sec,0)", +
[..60 lines more..]
Any better ideas? It's liteally about dumping out struct tcpinfo (and
could be OS-dependent - for far future) - so that's why I've picked up
Json, just to to have this flexibility long term.
> >> 4. Patch/tests are missing intentionally as I would like first to
> >> stabilize the outputs/naming/code first.
> >> 5. [security] Should this be available to pg_monitor/pg_read_all_stats
> >> or just to superuser?
>
> I suppose making this superuser-only would be against the effort to not
> require superuser privileges, so using roles seems like the way to go.
> The nature of the data seems very similar to pg_stat_activity (i.e. info
> about current connections), so I'd suggest to use pg_read_all_stats. It
> might even use an approach similar to pg_stat_get_activity(), which
> shows some fields to everyone, and the role is required only for fields
> with sensitive information.
Thanks.
> >> 6. [security] Should this return info about all TCP connections or
> >> just the UID of the postmaster?
> >
>
> Not sure if I understand the question, but IMHO this should return only
> info about connections opened by Postgres. Not for connections about TCP
> connections opened by other stuff running on the server.
Yes, I'm more on this option too. Ok let's do it that way.
-J.
On Sat, Nov 8, 2025 at 1:18 AM Andres Freund <andres@anarazel.de> wrote:
Hi Andres!
>> But maybe it's very useful in practice, don't know.
> FWIW, I've needed this many times. Without the TCP information it's very hard
> to figure out why higher latency connections aren't keeping up - is it packet
> loss, it it too small network buffers on the sending/receiving side, is it the
> remote side not keeping up on the CPU level...
Pretty much the same!
> > I don't follow? Why do you want to format data the way "ss" does?
>
> Yea, I don't get that either - IMO ss's format is just about the worst
> possible format. It's very hard to parse and doesn't seem to have advantages
> making up for that.
Well yes, I've probably used the wrong expression there ("format as
the ss does"). Technically `ss` might return multiple lines and lacks
extraction of specific hosts/ports (well one can use filters, but...),
but here we have normal fields like src/dst/srcport/dstport + we can
apply WHERE on extracted JSON columns (->). I think it's way better in
pg_stat_tcpinfo even today than in ss, but I could adapt further based
on some feedback. E.g. rather than appending Vegas/BBR info like `ss`
does, we could put nested structure, just like I've did with "skmem",
so technically it's the same data as in ss, but format would not be
exactly the same, but use similar-looking names, so that someone
familiar with `ss` would know how to use it.
Still , it is still PoC, and I'm not sure if the community is open to
accept (or reject it).
-J.