Обсуждение: contrib/pg_stat_tcpinfo

Поиск
Список
Период
Сортировка

contrib/pg_stat_tcpinfo

От
Jakub Wartak
Дата:
Attached is pg_stat_tcpinfo, an heavy work in progress, Linux-only
netstat/ss-like extension for showing detailed information about TCP
connections based on information from the kernel itself. It's the
result of some frustration and some corridor talks during recent
PgConfEU, where we agreed with Andres that it would be cool to have
this in core to support and resolve problems much more quickly (in
spite of this being a Linux-only extension for a start). As it stands
it can help troubleshooting advanced performance problems like:
* basic connection problems (e.g. imagine being able to see that
postgres_fdw PID is stuck in SYN_SENT)
* bandwidth problems where [logical/physical] replication and/or
backup (pg_basebackup) are not keeping up due any of the above: full
sendQ/recvQ, random packet drops on connection, too high RTT (WAN),
network jitter, etc.
* tracking down reasons for stuck or reset-by-peer TCP connections due
to wrongly firewalls silently killing idle connections (it's visible
via not TCP keepalive not being activated in time, and later stuck in
timer retransmits)
* high latency outliers for SyncRep too (e.g. via
lastsnd/lastrcv/lastack timers)
* probably some more

One could argue that while netstat/ss provide the necessary
information, it's often much harder than necessary to combine the
information with postgres information (like which connections are for
replication, wait events, correlation to other DB activity). The idea
is to have it in core, so it's available always.

Sample use (psql issuing COPY to postgres_fdw to the same DB over
127.0.0.1 -- 3 connections are visible: 1 for postgres_fdw itself, and
two sides of the network socket [this happens only on localhost])

postgres=# select
    pid, application_name, substring(query,1,40) q,
    wait_event, src_addr, src_port, dst_addr, dst_port,
    recvq, sendq,
    t.state, tcpinfo->>'rtt' as rtt, tcpinfo->>'rwnd_limited' as rwnd_limited,
    pg_size_pretty(cast(tcpinfo->>'delivery_rate' as bigint)) as
peak_delivery_rate,
    substring(tcpinfo::text,1,64) || '...' as substr
from pg_stat_get_tcpinfo() t join pg_stat_activity a using (pid)
where
    t.state ='ESTABLISHED' and pid <> pg_backend_pid();
  pid  | application_name |                    q                     |
     wait_event      | src_addr  | src_port | dst_addr  | dst_port |
recvq | sendq |    state    |  rtt  | rwnd_limited |
peak_delivery_rate |                               substr

-------+------------------+------------------------------------------+----------------------+-----------+----------+-----------+----------+-------+-------+-------------+-------+--------------+--------------------+---------------------------------------------------------------------
 81279 | psql             | COPY (SELECT bindata FROM fdw_bandwidth_ |
PostgresFdwGetResult | 127.0.0.1 |     1234 | 127.0.0.1 |    52612 |
  0 |     0 | ESTABLISHED | 5.092 | 0            | 422 MB
| {"ato": 40000, "rto": 206000, "rtt": 5.092, "lost": 0, "pmtu": 6...
 81279 | psql             | COPY (SELECT bindata FROM fdw_bandwidth_ |
PostgresFdwGetResult | 127.0.0.1 |    59562 | 127.0.0.1 |     1234 |
  0 |     0 | ESTABLISHED | 0.038 | 0            | 2604 MB
| {"ato": 40000, "rto": 201000, "rtt": 0.038, "lost": 0, "pmtu": 6...
 81282 | postgres_fdw     | FETCH 100 FROM c1                        |
BuffileWrite         | 127.0.0.1 |     1234 | 127.0.0.1 |    59562 |
  0 |     0 | ESTABLISHED | 0.380 | 1735000      | 833 MB
| {"ato": 40000, "rto": 201000, "rtt": 0.380, "lost": 0, "pmtu": 6...
(3 rows)

Some early feedback about direction in order to bring this into core
would be appreciated. State of stuff:

1. Andres is pushing for supporting UNIX domain sockets here, but I'm
not sure if it is really worth the effort (and it would trigger new
naming problem;)) and primarily making the code even more complex.
IMHO the netlinksock_diag API is already convoluted and adding AF_UNIX
would make it even less readable.
2. IPv6 works, but wasn't tested much.
3. Biggest TODO left is probably properly formatting the information
based on struct tcpinfo variables (just like ss(1) does, so keeping
the same unit/formatting)
4. Patch/tests are missing intentionally as I would like first to
stabilize the outputs/naming/code first.
5. [security] Should this be available to pg_monitor/pg_read_all_stats
or just to superuser?
6. [security] Should this return info about all TCP connections or
just the UID of the postmaster?

-J.

Вложения

Re: contrib/pg_stat_tcpinfo

От
Jakub Wartak
Дата:
On Mon, Nov 3, 2025 at 3:09 PM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
>
> Attached is pg_stat_tcpinfo, an heavy work in progress, Linux-only
> netstat/ss-like extension for showing detailed information about TCP
> connections based on information from the kernel itself.
[..]

> Some early feedback about direction in order to bring this into core
> would be appreciated. State of stuff:
>
> 1. Andres is pushing for supporting UNIX domain sockets here, but I'm
> not sure if it is really worth the effort (and it would trigger new
> naming problem;)) and primarily making the code even more complex.
> IMHO the netlinksock_diag API is already convoluted and adding AF_UNIX
> would make it even less readable.
> 2. IPv6 works, but wasn't tested much.
> 3. Biggest TODO left is probably properly formatting the information
> based on struct tcpinfo variables (just like ss(1) does, so keeping
> the same unit/formatting)
> 4. Patch/tests are missing intentionally as I would like first to
> stabilize the outputs/naming/code first.
> 5. [security] Should this be available to pg_monitor/pg_read_all_stats
> or just to superuser?
> 6. [security] Should this return info about all TCP connections or
> just the UID of the postmaster?

v2 attached  with tiny fixes and little more code readability (for
dumping struct tcpinfo *)

-J.

Вложения

Re: contrib/pg_stat_tcpinfo

От
Tomas Vondra
Дата:
On 11/7/25 11:36, Jakub Wartak wrote:
> On Mon, Nov 3, 2025 at 3:09 PM Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
>>
>> Attached is pg_stat_tcpinfo, an heavy work in progress, Linux-only
>> netstat/ss-like extension for showing detailed information about TCP
>> connections based on information from the kernel itself.
> [..]
> 

I personally don't remember ever needing this kind of visibility into
TCP connections, but I'm also not doing all that much direct support
recently. And even in the past I personally didn't need to look at
TCP-level details all that much, the problems were usually elsewhere.

But maybe it's very useful in practice, don't know.

>> Some early feedback about direction in order to bring this into core
>> would be appreciated. State of stuff:
>>

Well, it's an extension in contrib. Is that sufficiently "in core"? Do
you expect this to be used in PROD environments, or more in DEV?

>> 1. Andres is pushing for supporting UNIX domain sockets here, but I'm
>> not sure if it is really worth the effort (and it would trigger new
>> naming problem;)) and primarily making the code even more complex.
>> IMHO the netlinksock_diag API is already convoluted and adding AF_UNIX
>> would make it even less readable.

No idea. For most real-world production systems the client is probably
on a different machine, so won't use UNIX sockets. Not always, but how
often do UNIX sockets have network-like problems for this visibility to
be helpful?

Also, how much work / extra code would it be to support UNIX sockets?

>> 2. IPv6 works, but wasn't tested much.
>> 3. Biggest TODO left is probably properly formatting the information
>> based on struct tcpinfo variables (just like ss(1) does, so keeping
>> the same unit/formatting)

I don't follow? Why do you want to format data the way "ss" does?

>> 4. Patch/tests are missing intentionally as I would like first to
>> stabilize the outputs/naming/code first.
>> 5. [security] Should this be available to pg_monitor/pg_read_all_stats
>> or just to superuser?

I suppose making this superuser-only would be against the effort to not
require superuser privileges, so using roles seems like the way to go.
The nature of the data seems very similar to pg_stat_activity (i.e. info
about current connections), so I'd suggest to use pg_read_all_stats. It
might even use an approach similar to pg_stat_get_activity(), which
shows some fields to everyone, and the role is required only for fields
with sensitive information.

>> 6. [security] Should this return info about all TCP connections or
>> just the UID of the postmaster?
> 

Not sure if I understand the question, but IMHO this should return only
info about connections opened by Postgres. Not for connections about TCP
connections opened by other stuff running on the server.


regards

-- 
Tomas Vondra




Re: contrib/pg_stat_tcpinfo

От
Andres Freund
Дата:
Hi,

On 2025-11-08 00:17:30 +0100, Tomas Vondra wrote:
> On 11/7/25 11:36, Jakub Wartak wrote:
> > On Mon, Nov 3, 2025 at 3:09 PM Jakub Wartak
> > <jakub.wartak@enterprisedb.com> wrote:
> >>
> >> Attached is pg_stat_tcpinfo, an heavy work in progress, Linux-only
> >> netstat/ss-like extension for showing detailed information about TCP
> >> connections based on information from the kernel itself.
> > [..]
> >
>
> I personally don't remember ever needing this kind of visibility into
> TCP connections, but I'm also not doing all that much direct support
> recently. And even in the past I personally didn't need to look at
> TCP-level details all that much, the problems were usually elsewhere.
>
> But maybe it's very useful in practice, don't know.

FWIW, I've needed this many times. Without the TCP information it's very hard
to figure out why higher latency connections aren't keeping up - is it packet
loss, it it too small network buffers on the sending/receiving side, is it the
remote side not keeping up on the CPU level...


> >> Some early feedback about direction in order to bring this into core
> >> would be appreciated. State of stuff:
> >>
>
> Well, it's an extension in contrib. Is that sufficiently "in core"? Do
> you expect this to be used in PROD environments, or more in DEV?

I'd expect it to be mainly in prod, although not exclusively.


> >> 2. IPv6 works, but wasn't tested much.
> >> 3. Biggest TODO left is probably properly formatting the information
> >> based on struct tcpinfo variables (just like ss(1) does, so keeping
> >> the same unit/formatting)
>
> I don't follow? Why do you want to format data the way "ss" does?

Yea, I don't get that either - IMO ss's format is just about the worst
possible format. It's very hard to parse and doesn't seem to have advantages
making up for that.

Greetings,

Andres Freund



Re: contrib/pg_stat_tcpinfo

От
Jakub Wartak
Дата:
On Sat, Nov 8, 2025 at 12:17 AM Tomas Vondra <tomas@vondra.me> wrote:

Hi Tomas, thanks for responding!

> On 11/7/25 11:36, Jakub Wartak wrote:
> > On Mon, Nov 3, 2025 at 3:09 PM Jakub Wartak
> > <jakub.wartak@enterprisedb.com> wrote:
> >>
> >> Attached is pg_stat_tcpinfo, an heavy work in progress, Linux-only
> >> netstat/ss-like extension for showing detailed information about TCP
> >> connections based on information from the kernel itself.
> > [..]
> >
>
> I personally don't remember ever needing this kind of visibility into
> TCP connections, but I'm also not doing all that much direct support
> recently. And even in the past I personally didn't need to look at
> TCP-level details all that much, the problems were usually elsewhere.
>
> But maybe it's very useful in practice, don't know.

Well, as stated earlier, such tooling was required each time for
latency problem with SyncRep or bandwidth problem with anything else.

> >> Some early feedback about direction in order to bring this into core
> >> would be appreciated. State of stuff:
> >>
>
> Well, it's an extension in contrib. Is that sufficiently "in core"? Do
> you expect this to be used in PROD environments, or more in DEV?

Contrib/* for some reason sounds to me like being already in core and
good enough (core project "PostgreSQL" not core like "src/").
Technically it's just about having stuff ready to be used quickly when
it's needed (banks, fintech, k8) rather than getting into debates IF
it can be installed, IF it is being supported and so on. Having access
to such more advanced debugging/correlation utilities makes a better
impression about its maturity (time to resolution is lower). It's
almost always about PROD-like systems.

> >> 1. Andres is pushing for supporting UNIX domain sockets here, but I'm
> >> not sure if it is really worth the effort (and it would trigger new
> >> naming problem;)) and primarily making the code even more complex.
> >> IMHO the netlinksock_diag API is already convoluted and adding AF_UNIX
> >> would make it even less readable.
>
> No idea. For most real-world production systems the client is probably
> on a different machine, so won't use UNIX sockets. Not always, but how
> often do UNIX sockets have network-like problems for this visibility to
> be helpful?

Yes, same here, never got into problems with domain sockets
performance, but Andres wanted to have that visibility probably to
test local pgbench/basebackups runs, but in such development-like
scenarios `ss -moe` does not show that many interesting performance
metrics there (it's just skmem):

Netid State      Recv-Q Send-Q
Local Address:Port           Peer Address:Port  Process
u_str ESTAB      0      0
/var/run/postgresql/.s.PGSQL.5432 86934                     * 112798
skmem:(r0,rb212992,t0,tb212992,f0,w0,o0,bl0,d0) <-> ino:3532 dev:0/27

So my line of thinking would be to not clutter the code more than
necessary to add something that's already easily available (on PG
devs/commiters laptops), but would be rareley used in the field.

> Also, how much work / extra code would it be to support UNIX sockets?

+150-200 lines more - maybe some generalization of some routines to
handle AF_UNIX and AF_UNIX (and that would potentially pave the road
to also support other sock_diag data, e.g. udp, but I dont want to add
that too)

> >> 3. Biggest TODO left is probably properly formatting the information
> >> based on struct tcpinfo variables (just like ss(1) does, so keeping
> >> the same unit/formatting)
>
> I don't follow? Why do you want to format data the way "ss" does?

Hm, no strong feelings there, I just wanted to have the same
information. I don't mind formatting it any other way. Today in v2
that's OBJ (One Big Jsonb):

postgres=# select pid, tcpinfo->'rtt' as RTT, jsonb_pretty(tcpinfo)
from pg_stat_tcpinfo limit 1;
-[ RECORD 1 ]+---------------------------------------------
pid          | 13019
rtt          | 23.445
jsonb_pretty | {                                           +
             |     "ato": 40.000,                          +
             |     "rto": 224000,                          +
             |     "rtt": 23.445,                          +
             |     "lost": 0,                              +
             |     "pmtu": 1500,                           +
             |     "skmem": {                              +
             |         "optmem": 0,                        +
             |         "rcvbuf": 1239696,                  +
             |         "sndbuf": 87040,                    +
             |         "fwd_alloc": 3136,                  +
             |         "rmem_alloc": 960,                  +
             |         "wmem_alloc": 0,                    +
             |         "wmem_queued": 0                    +
             |     },                                      +
             |     "state": 8,                             +
             |     "timer": "(off,0min0sec,0)",            +
[..60 lines more..]

Any better ideas? It's liteally about dumping out struct tcpinfo (and
could be OS-dependent - for far future) - so that's why I've picked up
Json, just to to have this flexibility long term.

> >> 4. Patch/tests are missing intentionally as I would like first to
> >> stabilize the outputs/naming/code first.
> >> 5. [security] Should this be available to pg_monitor/pg_read_all_stats
> >> or just to superuser?
>
> I suppose making this superuser-only would be against the effort to not
> require superuser privileges, so using roles seems like the way to go.
> The nature of the data seems very similar to pg_stat_activity (i.e. info
> about current connections), so I'd suggest to use pg_read_all_stats. It
> might even use an approach similar to pg_stat_get_activity(), which
> shows some fields to everyone, and the role is required only for fields
> with sensitive information.

Thanks.

> >> 6. [security] Should this return info about all TCP connections or
> >> just the UID of the postmaster?
> >
>
> Not sure if I understand the question, but IMHO this should return only
> info about connections opened by Postgres. Not for connections about TCP
> connections opened by other stuff running on the server.

Yes, I'm more on this option too. Ok let's do it that way.

-J.



Re: contrib/pg_stat_tcpinfo

От
Jakub Wartak
Дата:
On Sat, Nov 8, 2025 at 1:18 AM Andres Freund <andres@anarazel.de> wrote:

Hi Andres!

>> But maybe it's very useful in practice, don't know.

> FWIW, I've needed this many times. Without the TCP information it's very hard
> to figure out why higher latency connections aren't keeping up - is it packet
> loss, it it too small network buffers on the sending/receiving side, is it the
> remote side not keeping up on the CPU level...

Pretty much the same!

> > I don't follow? Why do you want to format data the way "ss" does?
>
> Yea, I don't get that either - IMO ss's format is just about the worst
> possible format. It's very hard to parse and doesn't seem to have advantages
> making up for that.

Well yes, I've probably used the wrong expression there ("format as
the ss does"). Technically `ss` might return multiple lines and lacks
extraction of specific hosts/ports (well one can use filters, but...),
but here we have normal fields like src/dst/srcport/dstport + we can
apply WHERE on extracted JSON columns (->). I think it's way better in
pg_stat_tcpinfo even today than in ss, but I could adapt further based
on some feedback. E.g. rather than appending Vegas/BBR info like `ss`
does, we could put nested structure, just like I've did with "skmem",
so technically it's the same data as in ss, but format would not be
exactly the same, but use similar-looking names, so that someone
familiar with `ss` would know how to use it.

Still , it is still PoC, and I'm not sure if the community is open to
accept (or reject it).

-J.