Discussion: POC: Extension for adding distributed tracing - pg_tracing


POC: Extension for adding distributed tracing - pg_tracing

From:
Anthonin Bonnefoy
Date:

Hi hackers,

Currently, most distributed tracing solutions only provide details from the client driver's point of view. We have no visibility into what happens once the query reaches the database or how it is processed.

The goal of this project is to provide distributed tracing within the database itself, as an extension, to give more insight into query execution. The idea and architecture are heavily inspired by pg_stat_statements and auto_explain.

I have a working prototype of a pg_tracing extension and wanted some feedback on the design and architecture.

Some definitions (paraphrased from the OpenTelemetry docs: https://opentelemetry.io/docs/concepts/signals/traces/):

A trace groups multiple spans and a span represents a single specific operation. It is defined by:

- a trace_id

- a span_id (randomly generated int64)

- a name

- a parent id (a span can have children)

- a start timestamp

- a duration

- attributes (key/value metadata)

We will use attributes to propagate the query's instrumentation details: buffer usage, JIT, WAL usage, first tuple, plan cost...
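
To make this concrete, here is a rough sketch of what a span could look like as a C struct (illustrative only; the actual layout in the patch differs):

```c
#include "postgres.h"
#include "executor/instrument.h"	/* BufferUsage, WalUsage */

/* Illustrative sketch of a span, not the patch's actual definition */
typedef struct Span
{
	uint64		trace_id[2];	/* 128-bit trace id shared by all spans of a trace */
	uint64		span_id;		/* randomly generated int64 identifying this span */
	uint64		parent_id;		/* span_id of the parent span */
	int			name_offset;	/* span name, stored in an external file (see below) */
	TimestampTz start;			/* start of the operation */
	uint64		duration_ns;	/* duration in nanoseconds */
	/* attributes: instrumentation counters attached as key/value metadata */
	BufferUsage buffers;
	WalUsage	wal;
} Span;
```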

Triggering a trace:

===============

We rely on https://google.github.io/sqlcommenter/ to propagate trace information.

A query with sqlcommenter will look like: /*dddbs='postgres.db',traceparent='00-00000000000000000000000000000009-0000000000000005-01'*/ select 1;

The traceparent fields are detailed in https://www.w3.org/TR/trace-context/#traceparent-header-field-values

The three fields are:

00000000000000000000000000000009: trace id

0000000000000005: parent id

01: trace flags (01 == sampled)
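
For illustration, parsing the traceparent value boils down to slicing fixed-width hex fields (a standalone sketch, not the extension's actual code):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Parse a 16-character hex slice into a uint64_t. */
static uint64_t
hex16_to_u64(const char *s)
{
	char		buf[17];

	memcpy(buf, s, 16);
	buf[16] = '\0';
	return strtoull(buf, NULL, 16);
}

/* Layout: version(2) '-' trace-id(32) '-' parent-id(16) '-' flags(2) */
static bool
parse_traceparent(const char *tp, uint64_t *trace_id_high,
				  uint64_t *trace_id_low, uint64_t *parent_id, bool *sampled)
{
	if (strlen(tp) != 55 || tp[2] != '-' || tp[35] != '-' || tp[52] != '-')
		return false;
	*trace_id_high = hex16_to_u64(tp + 3);
	*trace_id_low = hex16_to_u64(tp + 19);
	*parent_id = hex16_to_u64(tp + 36);
	*sampled = (strtoul(tp + 53, NULL, 16) & 0x01) != 0;	/* 01 == sampled */
	return true;
}
```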

If the query is sampled, we instrument the query and generate spans for the following operations:

- Query Span: The top span for a query. It is created after extracting the trace_id from traceparent, or to represent a nested query. We always have at least one and may have more with nested queries.

- Planner: We track the time spent in the planner and report the planner counters

- Executor: We trace the different steps of the Executor: Start, Run, Finish and End

- Node Span: Created from the planstate. The name is extracted from the node type (IndexScan, SeqScan). For the operation name, we generate something similar to EXPLAIN output.


Storing spans:

===============

Spans are stored in a shared memory buffer. The buffer size is fixed and, if it fills up, no further spans are generated until the buffer is emptied.

Spans contain several pieces of variable-sized text: names, parameter values...

To avoid keeping those in the shared memory, we store them in an external file (similar to pg_stat_statements query text).
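
The idea, following pg_stat_statements, is to keep only a fixed-size reference in shared memory and append the actual text to the file; roughly (a sketch, with locking and error handling omitted):

```c
/* Sketch: append a span's text payload to the external query file and
 * keep only a fixed-size reference in the shared memory span. */
static void
append_span_text(FILE *qfile, Span *span, const char *text)
{
	size_t		len = strlen(text) + 1; /* include the NUL terminator */

	span->name_offset = (int) ftell(qfile);
	fwrite(text, 1, len, qfile);
}
```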

Processing spans:

===============

Traces need to be sent to a trace collector. We offload this logic to an external application which will:

- Collect the spans through a new pg_tracing_spans() function output

- Build spans in the format specific to the targeted trace collector (OpenTelemetry, Datadog...)

- Send them to the targeted trace collector using the appropriate protocol (gRPC, HTTP...)

I have an example of a span forwarder that I've been using in my tests available here: https://gist.github.com/bonnefoa/6ed24520bdac026d6a6a6992d308bd50.
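
For reference, an equivalent pull loop in C with libpq could look like this (the gist above is in Go; the column list below is an assumption based on the span fields described earlier):

```c
#include <stdio.h>
#include <unistd.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");

	if (PQstatus(conn) != CONNECTION_OK)
		return 1;
	for (;;)
	{
		/* consume=true empties the shared span buffer */
		PGresult   *res = PQexec(conn,
								 "SELECT trace_id, parent_id, span_id, span_start, duration "
								 "FROM pg_tracing_spans(true)");

		if (PQresultStatus(res) == PGRES_TUPLES_OK)
			for (int i = 0; i < PQntuples(res); i++)
				/* Build a collector-specific span from the row and send it
				 * using the appropriate protocol (gRPC, HTTP...) */
				printf("span %s, duration %s\n",
					   PQgetvalue(res, i, 2), PQgetvalue(res, i, 4));
		PQclear(res);
		sleep(1);				/* pull interval; tune to the buffer size */
	}
}
```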

This setup permits a lot of flexibility:

- The postgres extension doesn't need any vendor specific knowledge

- The trace forwarder can reuse existing libraries and any languages to send spans

Buffer management:

===============

I've tried to keep the memory management simple. Creating spans will add new elements to the shared_spans buffer.

Once a span is forwarded, there's no need to keep it in the shared memory.

Thus, pg_tracing_spans has a "consume" parameter which completely empties the shared buffer when called by the forwarder.

If the buffer regularly fills up, we can:

- increase forwarder's frequency

- increase the shared buffer size

- decrease the amount of generated spans

Statistics are available through pg_tracing_info to keep track of the number of generated spans and dropped events.
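
Internally, the bookkeeping stays simple; assuming a fixed-size array protected by an LWLock and the Span sketch above, adding a span is roughly (names illustrative, not the patch's code):

```c
#include "postgres.h"
#include "storage/lwlock.h"

/* Illustrative shared state; the patch's actual layout differs */
typedef struct pgTracingSharedState
{
	LWLock	   *lock;			/* protects the fields below */
	int			span_count;		/* spans currently stored */
	int			max_span;		/* capacity of the buffer */
	uint64		dropped_spans;	/* stat surfaced by pg_tracing_info */
	Span		spans[FLEXIBLE_ARRAY_MEMBER];
} pgTracingSharedState;

/* Sketch: append a span to the fixed-size shared buffer, counting drops */
static void
add_span(pgTracingSharedState *state, const Span *span)
{
	LWLockAcquire(state->lock, LW_EXCLUSIVE);
	if (state->span_count < state->max_span)
		state->spans[state->span_count++] = *span;
	else
		state->dropped_spans++;
	LWLockRelease(state->lock);
}
```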

Overhead:

===============

I haven't run benchmarks yet; however, we will have multiple ways to control the overhead.

Traces rely heavily on sampling to keep the overhead low: we want extensive information on representative samples of our queries.

For now, we leave the sampling decision to the caller (through the sampled flag), but I will add a parameter to apply additional sampling if the rate of traced queries is too high.

Stopping tracing when the buffer is full is also a good safeguard: if an application is misconfigured and starts sampling every query, this will stop all tracing and won't impact query performance.
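
The sampling decision itself can stay cheap; with the additional rate parameter, it could look something like this (a sketch; pg_prng_double is PostgreSQL's uniform PRNG helper):

```c
#include "postgres.h"
#include "common/pg_prng.h"

/* Sketch: decide whether to trace this query. The caller's sampled flag
 * comes from traceparent; the rate applies additional sampling on top. */
static bool
should_sample(bool caller_sampled, double sample_rate)
{
	if (!caller_sampled)
		return false;
	/* pg_prng_double returns a uniform double in [0, 1) */
	return pg_prng_double(&pg_global_prng_state) < sample_rate;
}
```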

Query Instrumentation:

===============

I have been able to rely on most of the existing query instrumentation except for one missing piece of information: I needed the start time of a node, as we currently only have the first-tuple time and the duration. I added a firsttime field to Instrumentation. This is the only change made outside the extension's code.
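
To illustrate: with the new field, placing a node span on the trace timeline becomes a direct read (Span and its fields are illustrative; firsttime is the field added to Instrumentation):

```c
#include "executor/instrument.h"

/* Sketch: fill a node span's interval from Instrumentation */
static void
fill_node_span(Span *span, const Instrumentation *instr)
{
	/* Start of the node, available thanks to the new firsttime field */
	span->start_ns = INSTR_TIME_GET_NANOSEC(instr->firsttime);
	/* instr->total is the accumulated runtime in seconds */
	span->duration_ns = (uint64) (instr->total * 1e9);
}
```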

Current status:

===============

The current code is able to generate spans and, once they were forwarded to the trace collector, I was able to get traces whose execution details carry a similar amount of information to what EXPLAIN (ANALYZE, BUFFERS) would provide. I have attached screenshots of the result.

There's a lot of work left in cleaning, commenting, handling edge cases and adding tests.

Thoughts and feedback are welcome.

Regards,

Anthonin Bonnefoy

Attachments

Re: POC: Extension for adding distributed tracing - pg_tracing

From:
Aleksander Alekseev
Date:
Hi Anthonin,

> I have a working prototype of a pg_tracing extension and wanted some feedback on the design and architecture.

The patch looks very interesting, thanks for working on it and for
sharing. The facts that the patch doesn't change the core except for
two lines in instrument.{c,h} and that it uses a pull-based model:

> - Collect the spans through a new pg_tracing_spans() function output

... IMO were the right design decisions. The patch lacks
documentation, but this is OK for a PoC.

I added the patch to the nearest CF [1]. Let's see what the rest of
the community thinks.

[1] https://commitfest.postgresql.org/44/4456/

-- 
Best regards,
Aleksander Alekseev



Re: POC: Extension for adding distributed tracing - pg_tracing

From:
Nikita Malakhov
Date:
Hi!

This patch looks very interesting; I'm working on the same subject too, but I've used
another approach: a C wrapper around the C++ API library from OpenTelemetry, handing
span storage and output over to that library. There are some nuances, but it
is possible. Have you tried using the OpenTelemetry APIs instead of implementing all
the functionality around spans yourself?

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

Re: POC: Extension for adding distributed tracing - pg_tracing

From:
Aleksander Alekseev
Date:
Nikita,

> This patch looks very interesting; I'm working on the same subject too, but I've used
> another approach: a C wrapper around the C++ API library from OpenTelemetry, handing
> span storage and output over to that library. There are some nuances, but it
> is possible. Have you tried using the OpenTelemetry APIs instead of implementing all
> the functionality around spans yourself?

I don't think that PostgreSQL accepts this kind of C++ code, not to
mention that the PostgreSQL license is not necessarily
compatible with Apache 2.0 (I'm not a lawyer; this is not legal
advice). Such a design decision would probably require separate
compile flags, since the user doesn't necessarily have the corresponding
dependency installed, similar to what we do with LLVM, OpenSSL, etc.

So -1 to the OpenTelemetry C++ library and +1 to a properly licensed
C implementation without third-party dependencies from me, especially
considering that the implementation seems to be rather simple.

-- 
Best regards,
Aleksander Alekseev



Re: POC: Extension for adding distributed tracing - pg_tracing

From:
Anthonin Bonnefoy
Date:

I initially thought of sending the spans from PostgreSQL, since this is the usual behavior of tracing libraries.

However, this creates a lot of potential issues:

- Protocol support and differences between trace collectors. OpenTelemetry seems to use gRPC while others use HTTP; those would require additional libraries (plus gRPC support in C doesn't look good), and any change in the protobuf definitions would require updating the extension.

- Do we send the spans within the query hooks? This means that we could block the process if the trace collector is slow to answer or we can’t connect. Sending spans from a background process sounded rather complex and resource heavy.

Moving to a pull model fixed those issues and felt more natural as this is the way PostgreSQL exposes its metrics.



Re: POC: Extension for adding distributed tracing - pg_tracing

From:
Nikita Malakhov
Date:
Hi!

I've tried to test the extension, but I got errors when calling make check and I cannot install the extension.
I've applied the patch onto current master and configured it as:
./configure --enable-debug --enable-cassert --enable-depend --enable-tap-tests
Could you please advise if I'm doing something wrong?
For make check please see attached log.
When installing the extension after setting the shared preload library, I got an error in the Postgres logfile:
2023-07-27 13:41:52.324 MSK [12738] LOG:  database system is shut down
2023-07-27 13:41:52.404 MSK [14126] FATAL:  could not load library "/usr/local/pgsql/lib/pg_tracing.so": /usr/local/pgsql/lib/pg_tracing.so: undefined symbol: get_operation_name
Thank you!

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
Attachments

Re: POC: Extension for adding distributed tracing - pg_tracing

From:
Nikita Malakhov
Date:
Hi,

Also FYI, there are build warnings because the functions
const char * get_span_name(const Span * span, const char *qbuffer)
and
const char * get_operation_name(const Span * span, const char *qbuffer)
have no default case inside the switch and no return outside of it.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

Re: POC: Extension for adding distributed tracing - pg_tracing

From:
Aleksander Alekseev
Date:
Hi,

> Also FYI, there are build warnings because functions
> const char * get_span_name(const Span * span, const char *qbuffer)
> and
> const char * get_operation_name(const Span * span, const char *qbuffer)
> have no default case inside the switch and no return outside of it.

You are right, there are a few warnings:

```
[1566/1887] Compiling C object contrib/pg_tracing/pg_tracing.so.p/span.c.o
../contrib/pg_tracing/span.c: In function ‘get_span_name’:
../contrib/pg_tracing/span.c:210:1: warning: control reaches end of
non-void function [-Wreturn-type]
  210 | }
      | ^
../contrib/pg_tracing/span.c: In function ‘get_operation_name’:
../contrib/pg_tracing/span.c:249:1: warning: control reaches end of
non-void function [-Wreturn-type]
  249 | }
      | ^
```

Here is the patch v2 with a quick fix.
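
For the archives, the usual shape of such a fix is either a terminal pg_unreachable() or a fallback return after the switch, e.g. (a generic sketch with made-up span types, not the exact v2 change):

```c
const char *
get_span_name(const Span *span, const char *qbuffer)
{
	switch (span->type)
	{
		case SPAN_PLANNER:
			return "Planner";
		case SPAN_EXECUTOR_RUN:
			return "ExecutorRun";
			/* ... remaining span types ... */
	}

	/* Quiet -Wreturn-type: every enum value is handled in the switch */
	return "Unknown";
}
```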

> but got errors calling make check and cannot install the extension

Agreed, something goes wrong when using Autotools (but not Meson) on
both Linux and macOS. I didn't investigate the issue though.

--
Best regards,
Aleksander Alekseev

Attachments

Re: POC: Extension for adding distributed tracing - pg_tracing

From:
Nikita Malakhov
Date:
Hi,

I've fixed the Autotools build; please check the attached patch (v2).

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
Attachments

Re: POC: Extension for adding distributed tracing - pg_tracing

From:
Anthonin Bonnefoy
Date:
> Agree, something goes wrong when using Autotools (but not Meson) on
> both Linux and MacOS. I didn't investigate the issue though.
I was only using Meson and forgot to keep the Automake files up to date when I split pg_tracing.c into multiple files (span.c, explain.c...).


Re: POC: Extension for adding distributed tracing - pg_tracing

From:
Nikita Malakhov
Date:
Hi,

I'd keep the Autotools build up to date, because Meson support is very limited on older
OS versions. Anyway, it's OK now. I'm currently playing with the patch and
reviewing the sources; if you need any cooperation, please let us know. I'm +1 for committing
this patch after review.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

Re: POC: Extension for adding distributed tracing - pg_tracing

From:
Anthonin Bonnefoy
Date:
Hi, 

> I'd keep the Autotools build up to date
Definitely. The patch is still in a rough state and updating Autotools fell through the cracks.

> I'm currently playing with the patch and
> reviewing the sources; if you need any cooperation, please let us know.
The goal for me was to get validation of the design and to check whether anything would be a blocker before committing more time to the project.

There are already several things I was planning on improving:
- The parse span is something I've added recently. I realised that I could rely on stmtStartTimestamp in the post parse hook to get the time spent in parsing, so the code around this is still rough.
- There are probably race conditions on the reads and writes of the query file that need to be fixed
- I'm using the same Span structure for everything while executor spans only need a small subset, which is a bit wasteful. I've tried using two different shared buffers (base spans and spans with counters) but it made things more confusing.
- Finish commenting code and start writing the doc

So any comments, missing features, and reviews of the current state of the project are welcome.

Regards,
Anthonin


Re: POC: Extension for adding distributed tracing - pg_tracing

From:
Nikita Malakhov
Date:
Hi!

What do you think about using INSTR_TIME_SET_CURRENT, INSTR_TIME_SUBTRACT and INSTR_TIME_GET_MILLISEC
macros for timing calculations?

Also, have you thought about a way to trace existing (running) queries without directly instrumenting them? It would
be much more usable for maintenance and support personnel, because in production environments you can rarely
change the query text directly. In the current state, the simplest solution is to switch tracing on and off by calling an SQL
function, and possibly to switch tracing on for a prepared statement the same way, but not for an arbitrary query.

I'll check the patch for the race conditions.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

Re: POC: Extension for adding distributed tracing - pg_tracing

From:
Anthonin Bonnefoy
Date:
> What do you think about using INSTR_TIME_SET_CURRENT, INSTR_TIME_SUBTRACT and INSTR_TIME_GET_MILLISEC
> macros for timing calculations?
If you're talking about the two instances where I'm modifying the instr_time's ticks, it's because I can't use the macros there.
The first case is the parse span. I only have the start timestamp from GetCurrentStatementStartTimestamp and don't
have access to a start instr_time, so I need to build the duration from two timestamps.
The second case is when building node spans from the planstate, where I directly get the duration from Instrumentation.

I guess one fix would be to use an int64 for the span duration to directly store nanoseconds instead of an instr_time,
but I do use the instrumentation macros outside of those two cases to get the duration of other spans.
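
For the parse span case, the computation is essentially the following (a sketch; TimestampTz is microseconds since the PostgreSQL epoch, hence the conversion):

```c
#include "postgres.h"
#include "access/xact.h"		/* GetCurrentStatementStartTimestamp */

/* Sketch: build the parse span duration from two timestamps, since no
 * start instr_time is available for parsing. */
static int64
parse_span_duration_ns(TimestampTz parse_end)
{
	TimestampTz parse_start = GetCurrentStatementStartTimestamp();

	return (parse_end - parse_start) * 1000;	/* microseconds to nanoseconds */
}
```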

> Also, have you thought about a way to trace existing (running) queries without directly instrumenting them?
That's a good point. I was focusing on leaving the sampling decision to the caller through the sampled flag and
only recently added the pg_tracing_sample_rate parameter to give more control. It should be straightforward to
add an option to create standalone traces based on the sample rate alone. This way, setting the sample rate to 1
would force all queries running in the session to be traced.



Re: POC: Extension for adding distributed tracing - pg_tracing

From:
Anthonin Bonnefoy
Date:
Hi, 

Here's a new patch with changes from the previous discussion: 
- I'm now directly storing the duration in nanoseconds in the span instead of an instr_time. Using the instr_time macros was a bit awkward as the durations I generate don't necessarily have a starting and an ending instr_time.
Moving to straight nanoseconds made things clearer from my point of view.
- I've added an additional sample rate, pg_tracing.sample_rate (on top of pg_tracing.caller_sample_rate). This one allows queries to be sampled even without trace information propagated from the caller.
Setting this sample rate to 1 will basically trace everything. For now, this only works when going through the post parse hook. I will add support for prepared statements and cached plans in the next patch.
- I've improved how parse spans are created. It's a bit challenging to get the start of a parse as there's no pre-parse hook or instrumentation around parsing, so it's only an estimation.
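
For reference, such a GUC would be registered in _PG_init along these lines (a sketch; only the pg_tracing.sample_rate name comes from the patch):

```c
#include "postgres.h"
#include "utils/guc.h"

PG_MODULE_MAGIC;

static double pg_tracing_sample_rate = 0.0;

void
_PG_init(void)
{
	DefineCustomRealVariable("pg_tracing.sample_rate",
							 "Fraction of queries sampled without caller-provided trace context.",
							 NULL,
							 &pg_tracing_sample_rate,
							 0.0,		/* boot value: no standalone sampling */
							 0.0,		/* min */
							 1.0,		/* max: trace everything */
							 PGC_USERSET,
							 0,
							 NULL, NULL, NULL);
}
```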

Regards,
Anthonin

Attachments

Re: POC: Extension for adding distributed tracing - pg_tracing

From:
Nikita Malakhov
Date:
Hi!

Thanks for the improvements!
> - I'm now directly storing the duration in nanoseconds in the span instead of an instr_time. Using the instr_time macros was a bit awkward as the durations I generate don't necessarily have a starting and an ending instr_time.
> Moving to straight nanoseconds made things clearer from my point of view.
Cool. It can easily be cast to ms for the user.

> - I've added an additional sample rate, pg_tracing.sample_rate (on top of pg_tracing.caller_sample_rate). This one allows queries to be sampled even without trace information propagated from the caller.
> Setting this sample rate to 1 will basically trace everything. For now, this only works when going through the post parse hook. I will add support for prepared statements and cached plans in the next patch.
Cool, I've just made the same improvement and wanted to send a patch a bit later, after tests.

> - I've improved how parse spans are created. It's a bit challenging to get the start of a parse as there's no pre-parse hook or instrumentation around parsing, so it's only an estimation.
I've also added a query id field to the span, plus a table and an SQL function that flushes spans to this table instead of returning a set of records; it is more convenient for maintenance to query the table.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

Re: POC: Extension for adding distributed tracing - pg_tracing

From:
Nikita Malakhov
Date:
Hi!

Please check some suggested improvements:
1) a query_id field added to the span so it can be joined with pg_stat_activity and pg_stat_statements;
2) a table for storing spans, with an SQL function that flushes the span buffer into the table instead of
returning a recordset, to keep track of spans for maintenance reasons;
3) a setter function for the sampling_rate GUC to tweak it on the fly without a restart.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
Attachments

Re: POC: Extension for adding distributed tracing - pg_tracing

From:
Anthonin Bonnefoy
Date:
Hi!

> 1) a query_id field added to the span so it can be joined with pg_stat_activity and pg_stat_statements;
Sounds good, I've merged your changes into my code.

> 2) a table for storing spans, with an SQL function that flushes the span buffer into the table
I'm not sure about this. It means the table would only be available on the primary, as replicas won't be able
to write data to it. It will also make version updates and migrations much more complex, and I haven't seen a similar
pattern in other extensions.

> 3) a setter function for the sampling_rate GUC to tweak it on the fly without a restart
OK, I've added this in my branch.

On my side, I've made the following changes:
1) All spans are now kept in palloc'd buffers and only added to shared memory during end_tracing. This way, we limit time spent holding the shared_spans lock.
2) I've added a pg_tracing.drop_on_full_buffer parameter to drop all spans when the buffer is full. This can be useful to always keep
the latest spans when the consuming app is not fast enough. It is also useful for testing.
3) I'm testing more complex queries. Most of my previous tests used the simple query protocol, but the extended protocol introduces
differences that break some assumptions I made. For example, with a multi-statement transaction like
BEGIN;
SELECT 1;
SELECT 2;
the parse of SELECT 2 will happen before the ExecutorEnd (and the end_tracing) of SELECT 1. For now, I'm skipping the post parse
hook if we still have ongoing tracing (see the sketch below).
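
The guard is essentially the following (a sketch; names other than the hook type are illustrative):

```c
#include "postgres.h"
#include "parser/analyze.h"		/* post_parse_analyze_hook */

static post_parse_analyze_hook_type prev_post_parse_analyze_hook = NULL;
static bool tracing_ongoing = false;	/* set while a trace is in flight */

static void
pg_tracing_post_parse_analyze(ParseState *pstate, Query *query,
							  JumbleState *jstate)
{
	if (prev_post_parse_analyze_hook)
		prev_post_parse_analyze_hook(pstate, query, jstate);

	/* With the extended protocol, the previous statement's ExecutorEnd may
	 * not have fired yet when the next parse arrives: skip overlapping
	 * statements. */
	if (tracing_ongoing)
		return;
	/* ... normal trace start logic ... */
}
```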
I've also started running https://github.com/anse1/sqlsmith on a db with full sampling; it's currently failing some assertions and I'm
working to fix those.


Attachments

Re: POC: Extension for adding distributed tracing - pg_tracing

From:
Nikita Malakhov
Date:
Hi!

Storing spans is a tricky question. Monitoring systems often use a pull model, with probes or an SQL
API for pulling data, so from this point of view it is much more convenient to keep spans in a separate
table. But then we come to another issue: how do we flush this data into the table? Automatic
flushing on a full buffer would significantly affect query performance at times that look random to the
user. I'd rather add a GUC, off by default, to store spans in the table instead of the buffer.

> 3) I'm testing more complex queries. Most of my previous tests used the simple query protocol, but the extended protocol introduces
> differences that break some assumptions I made. For example, with a multi-statement transaction like
> BEGIN;
> SELECT 1;
> SELECT 2;
> the parse of SELECT 2 will happen before the ExecutorEnd (and the end_tracing) of SELECT 1. For now, I'm skipping the post parse
> hook if we still have ongoing tracing.

I've checked this behavior before and haven't noticed the case you mentioned, but for
loops like
        for i in 1..2 loop
                StartTime := clock_timestamp();
        insert into t2 values (i, a_data);
        EndTime := clock_timestamp();
        Delta := 1000 * ( extract(epoch from EndTime) - extract(epoch from StartTime) );
        end loop;

I've got the following call sequence:
psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_post_parse_analyze hook <1>
psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_planner_hook <1>
psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_post_parse_analyze hook <2>
psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_planner_hook <2>
psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_post_parse_analyze hook <StartTime := clock_timestamp()>
psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_planner_hook <StartTime := clock_timestamp()>
psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_post_parse_analyze hook <insert into t2 values (i, a_data)>
psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_planner_hook <insert into t2 values (i, a_data)>
psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_ExecutorStart <insert into t2 values (i, a_data)>
psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_ExecutorRun  <insert into t2 values (i, a_data)>
psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_ExecutorFinish <insert into t2 values (i, a_data)>
psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_ExecutorEnd <insert into t2 values (i, a_data)>
psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_post_parse_analyze hook <EndTime := clock_timestamp()>
psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_planner_hook <EndTime := clock_timestamp()>
psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_post_parse_analyze hook <Delta := 1000 * ( extract(epoch from EndTime) - extract(epoch from StartTime) )>
psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_planner_hook <Delta := 1000 * ( extract(epoch from EndTime) - extract(epoch from StartTime) )>
psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_planner_hook <insert into t2 values (i, a_data)>
psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_ExecutorStart <insert into t2 values (i, a_data)>
psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_ExecutorRun  <insert into t2 values (i, a_data)>
psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_ExecutorFinish <insert into t2 values (i, a_data)>
psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_ExecutorEnd <insert into t2 values (i, a_data)>

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

Re: POC: Extension for adding distributed tracing - pg_tracing

From:
Anthonin Bonnefoy
Date:
Hi!

> Storing spans is a tricky question. Monitoring systems often use a pull model, with probes or an SQL
> API for pulling data, so from this point of view it is much more convenient to keep spans in a separate
> table. But then we come to another issue: how do we flush this data into the table? Automatic
> flushing on a full buffer would significantly affect query performance at times that look random to
> the user. I'd rather add a GUC, off by default, to store spans in the table instead of the buffer.

My main issue with having the extension flush spans to a table is that
it would only work on primaries.
Replicas can't write data to tables, so they won't be able to do this,
and the flush function is likely to introduce more confusion if it
only works on primaries.

From my point of view, processing the spans should be done by an
external application. Similar to my initial example, which forwards spans
to a trace collector
(https://gist.github.com/bonnefoa/6ed24520bdac026d6a6a6992d308bd50#file-main-go),
this application could instead store spans in a dedicated table. This
way, the creation of the table is outside pg_tracing's scope and the
span-store application will be able to store spans on both replicas
and primaries.
How frequently this application should pull data to avoid a full buffer
and dropped spans is the tricky part. We have stats, so we can monitor
whether drops are happening and adjust the span buffer size or increase
the application's pull frequency. Another possibility (though I'm not
familiar with it) could be to use notifications: the application
would listen on a channel and pg_tracing would notify when a
configurable threshold is reached (i.e. if the buffer reaches 50%,
send a notification).
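
If the notification route turned out to be viable, the server side could be as small as this (a sketch using the shared state sketched earlier; Async_Notify is the in-core function behind NOTIFY, while the channel name and threshold are made up):

```c
#include "postgres.h"
#include "commands/async.h"

/* Sketch: wake up the forwarder once the span buffer passes a threshold.
 * Note that the notification is queued and only delivered at commit. */
static void
maybe_notify_forwarder(pgTracingSharedState *state)
{
	if (state->span_count * 2 >= state->max_span)	/* buffer >= 50% full */
		Async_Notify("pg_tracing_buffer", NULL);
}
```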

> I've checked this behavior before and haven't noticed the case you mentioned, but for
> loops like
>         for i in 1..2 loop
>                 StartTime := clock_timestamp();
>         insert into t2 values (i, a_data);
>         EndTime := clock_timestamp();
>         Delta := 1000 * ( extract(epoch from EndTime) - extract(epoch from StartTime) );
>         end loop;

Was this run through psql? In that case, you would be using the simple
protocol, which always drops the portal at the end of the statement and
does not trigger the issue.

With the extended protocol and a multi-statement transaction, I get the
following backtrace:
* frame #0: 0x0000000104581318
postgres`ExecutorEnd(queryDesc=0x00000001420083b0) at execMain.c:471:6
  frame #1: 0x00000001044e1644
postgres`PortalCleanup(portal=0x0000000152031d00) at
portalcmds.c:299:4
  frame #2: 0x0000000104b0e77c
postgres`PortalDrop(portal=0x0000000152031d00, isTopCommit=false) at
portalmem.c:503:3
  frame #3: 0x0000000104b0e3e8 postgres`CreatePortal(name="",
allowDup=true, dupSilent=true) at portalmem.c:194:3
  frame #4: 0x000000010487a308
postgres`exec_bind_message(input_message=0x000000016bb4b398) at
postgres.c:1745:12
  frame #5: 0x000000010487846c
postgres`PostgresMain(dbname="postgres", username="postgres") at
postgres.c:4685:5
  frame #6: 0x0000000104773144
postgres`BackendRun(port=0x0000000141704080) at postmaster.c:4433:2
  frame #7: 0x000000010477044c
postgres`BackendStartup(port=0x0000000141704080) at
postmaster.c:4161:3
  frame #8: 0x000000010476d6fc postgres`ServerLoop at postmaster.c:1778:6
  frame #9: 0x000000010476c260 postgres`PostmasterMain(argc=3,
argv=0x0000600001cf72c0) at postmaster.c:1462:11
  frame #10: 0x0000000104625ca8 postgres`main(argc=3,
argv=0x0000600001cf72c0) at main.c:198:3
  frame #11: 0x00000001a61dbf28 dyld`start + 2236

At this point, the new statement is already parsed, and it's only
during the bind that the previous statement's portal is dropped and
ExecutorEnd is called. Thus, we have overlapping statements, which are
tricky to handle.

On my side, I've made the following changes:
- Skip nodes that are not executed.
- Fix query_id propagation; it should now be associated with all the
spans of a query when available.
- I had forgotten to register the span buffer in the shmem hook, so it
would crash when pg_tracing.max_span was too high. This is now fixed
and we can request more than 2000 spans. Currently, the size of a Span
is 320 bytes, so 50000 spans take ~15MB of memory.
- I've added subplan/init plan spans and started experimenting with
deparsing the plan to add more details to the nodes. Instead of
'NestedLoop', we get something like 'NestedLoop|Join Filter: (oid =
relnamespace)'. One consequence is that this creates a high number
of different operations, which is something we want to avoid with trace
collectors. I'm probably going to move this outside the operation name
or find a way to make it more stable.

I've started initial benchmarks and profiling. On a default
installation, running a default `pgbench -T15`, I get the
following:
Without pg_tracing: tps = 953.118759 (without initial connection time)
With pg_tracing: tps = 652.953858 (without initial connection time)
Most of the time is spent in process_query_desc. I haven't tried to
optimize performance yet, so there's probably a lot of room for
improvement.



Attachments

Re: POC: Extension for adding distributed tracing - pg_tracing

From:
Anthonin Bonnefoy
Date:
And here is the latest version of the patch, which I forgot to attach to the previous message.

Attachments

Re: POC: Extension for adding distributed tracing - pg_tracing

From:
Anthonin Bonnefoy
Date:
Hi!

I've been working on performance improvements and features.
Performance improvements:
- I was calling select_rtable_names_for_explain on every
ExplainTargetRel, which was rather costly and unnecessary. It's now
only done once.
- File writes are now batched and only a single file write is done at
the end of tracing.

With those improvements, pgbench now yields the following results:
0% tracing:
pgbench -T60 "options=-cpg_tracing.sample_rate=0"
number of transactions actually processed: 57293
tps = 954.915403 (without initial connection time)

100% tracing:
pgbench -T60 "options=-cpg_tracing.sample_rate=1"
number of transactions actually processed: 45963
tps = 766.068564 (without initial connection time)

I've also added tracing of parallel workers. If a query needs parallel
workers, the trace context is pushed into a specific shared buffer so
that workers can pull it and generate spans with the correct trace_id
and parent_id. We can now see the start, run and finish of workers
(see the attached image).
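
Conceptually, the hand-off is a small per-query slot in shared memory written by the leader and read by its workers (a sketch with illustrative names, reusing the Span sketch from earlier in the thread):

```c
#include "postgres.h"

/* Sketch: trace context pushed by the leader, pulled by its workers */
typedef struct pgTracingParallelContext
{
	uint64		trace_id[2];	/* trace id of the sampled query */
	uint64		parent_id;		/* leader's executor span, parent of worker spans */
} pgTracingParallelContext;

/* Leader side, before the workers are launched */
static void
push_trace_context(pgTracingParallelContext *slot, const Span *executor_span)
{
	slot->trace_id[0] = executor_span->trace_id[0];
	slot->trace_id[1] = executor_span->trace_id[1];
	slot->parent_id = executor_span->span_id;
}
```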
I'm currently focusing on adding more tests, comments, fixing TODOs
and known bugs.

Regards,
Anthonin

On Wed, Aug 30, 2023 at 11:15 AM Anthonin Bonnefoy
<anthonin.bonnefoy@datadoghq.com> wrote:
>
> And the latest version of the patch that I've forgotten in the previous message.
>
> On Wed, Aug 30, 2023 at 11:09 AM Anthonin Bonnefoy
> <anthonin.bonnefoy@datadoghq.com> wrote:
> >
> > Hi!
> >
> > > Storing spans is a tricky question. Monitoring systems often use pull model and use probes or SQL
> > > API for pulling data, so from this point of view it is much more convenient to keep spans in separate
> > > table. But in this case we come to another issue - how to flush this data into the table? Automatic
> > > flushing on a full buffer would randomly (to the user) significantly affect query performance. I'd rather
> > > make a GUC turned off by default to store spans into the table instead of buffer.
> >
> > My main issue with having the extension flush spans in a table is that
> > it will only work on primaries.
> > Replicas won't be able to do this as they can't write data on tables
> > and having this flush function is likely going to introduce more
> > confusion if it only works on primaries.
> >
> > From my point of view, processing the spans should be done by an
> > external application.Similar to my initial example which forward spans
> > to a trace collector
> > (https://gist.github.com/bonnefoa/6ed24520bdac026d6a6a6992d308bd50#file-main-go),
> > this application could instead store spans in a dedicated table. This
> > way, the creation of the table is outside pg_tracing's scope and the
> > span_store application will be able to store spans on replicas and
> > primaries.
> > How frequently this application should pull data to avoid full buffer
> > and dropping spans is a tricky part. We have stats so we can monitor
> > if drops are happening and adjust the spans buffer size or increase
> > the application's pull frequency. Another possibility (though I'm not
> > familiar with that) could be to use notifications. The application
> > will listen to a channel and pg_tracing will notify when a
> > configurable threshold is reached (i.e.: if the buffer reaches 50%,
> > send a notification).
> >
> > > I've checked this behavior before and haven't noticed the case you mentioned, but for
> > > loops like
> > >         for i in 1..2 loop
> > >                 StartTime := clock_timestamp();
> > >         insert into t2 values (i, a_data);
> > >         EndTime := clock_timestamp();
> > >         Delta := 1000 * ( extract(epoch from EndTime) - extract(epoch from StartTime) );
> > >         end loop;
> >
> > Was this run through psql? In this case, you would be using simple
> > protocol which always drops the portal at the end of the statement and
> > is not triggering the issue.
> >
> > With extended protocol and a multi statement transactions, I have the
> > following backtrace:
> > * frame #0: 0x0000000104581318
> > postgres`ExecutorEnd(queryDesc=0x00000001420083b0) at execMain.c:471:6
> >   frame #1: 0x00000001044e1644
> > postgres`PortalCleanup(portal=0x0000000152031d00) at
> > portalcmds.c:299:4
> >   frame #2: 0x0000000104b0e77c
> > postgres`PortalDrop(portal=0x0000000152031d00, isTopCommit=false) at
> > portalmem.c:503:3
> >   frame #3: 0x0000000104b0e3e8 postgres`CreatePortal(name="",
> > allowDup=true, dupSilent=true) at portalmem.c:194:3
> >   frame #4: 0x000000010487a308
> > postgres`exec_bind_message(input_message=0x000000016bb4b398) at
> > postgres.c:1745:12
> >   frame #5: 0x000000010487846c
> > postgres`PostgresMain(dbname="postgres", username="postgres") at
> > postgres.c:4685:5
> >   frame #6: 0x0000000104773144
> > postgres`BackendRun(port=0x0000000141704080) at postmaster.c:4433:2
> >   frame #7: 0x000000010477044c
> > postgres`BackendStartup(port=0x0000000141704080) at
> > postmaster.c:4161:3
> >   frame #8: 0x000000010476d6fc postgres`ServerLoop at postmaster.c:1778:6
> >   frame #9: 0x000000010476c260 postgres`PostmasterMain(argc=3,
> > argv=0x0000600001cf72c0) at postmaster.c:1462:11
> >   frame #10: 0x0000000104625ca8 postgres`main(argc=3,
> > argv=0x0000600001cf72c0) at main.c:198:3
> >   frame #11: 0x00000001a61dbf28 dyld`start + 2236
> >
> > At this point, the new statement is already parsed and it's only
> > during the bind that the previous statement's portal is dropped and
> > ExecutorEnd is called. Thus, we have overlapping statements, which are
> > tricky to handle.
> >
> > On my side, I've made the following changes:
> > - Skip nodes that are not executed.
> > - Fix query_id propagation; it should now be associated with all the
> > spans of a query when available.
> > - I had forgotten to add the span buffer in the shmem hook, so it would
> > crash when pg_tracing.max_span was too high. Now it's fixed and we can
> > request more than 2000 spans. Currently, the size of Span is 320 bytes,
> > so 50,000 spans will take ~15MB of memory.
> > - I've added the subplan/init plan spans and started experimenting with
> > deparsing the plan to add more details on the nodes. Instead of
> > 'NestedLoop', we will have something like 'NestedLoop|Join Filter: (oid =
> > relnamespace)'. One consequence is that it creates a high number of
> > different operations, which is something we want to avoid with trace
> > collectors. I will probably move this outside the operation name or
> > find a way to make it more stable.
> >
> > I've started initial benchmarks and profiling. On a default
> > installation running a default `pgbench -T15`, I get the
> > following:
> > Without pg_tracing: tps = 953.118759 (without initial connection time)
> > With pg_tracing: tps = 652.953858 (without initial connection time)
> > Most of the time is spent in process_query_desc. I haven't tried to
> > optimize performance yet, so there's probably a lot of room for
> > improvement.
> >
> >
> > On Mon, Aug 14, 2023 at 11:17 AM Nikita Malakhov <hukutoc@gmail.com> wrote:
> > >
> > > Hi!
> > >
> > > Storing spans is a tricky question. Monitoring systems often use a pull model, with probes or an SQL
> > > API for pulling data, so from this point of view it is much more convenient to keep spans in a separate
> > > table. But in this case we come to another issue - how to flush this data into the table? Automatic
> > > flushing on a full buffer would randomly (from the user's point of view) significantly affect query
> > > performance. I'd rather make a GUC, turned off by default, to store spans into the table instead of the buffer.
> > >
> > > >3) I'm testing more complex queries. Most of my previous tests were using the simple query protocol but the extended protocol introduces
> > > >differences that break some assumptions I made. For example, with a multi-statement transaction like
> > > >BEGIN;
> > > >SELECT 1;
> > > >SELECT 2;
> > > >The parse of SELECT 2 will happen before the ExecutorEnd (and the end_tracing) of SELECT 1. For now, I'm skipping the post parse
> > > >hook if we still have an ongoing tracing.
> > >
> > > I've checked this behavior before and haven't noticed the case you mentioned, but for
> > > loops like
> > >         for i in 1..2 loop
> > >                 StartTime := clock_timestamp();
> > >         insert into t2 values (i, a_data);
> > >         EndTime := clock_timestamp();
> > >         Delta := 1000 * ( extract(epoch from EndTime) - extract(epoch from StartTime) );
> > >         end loop;
> > >
> > > I've got the following call sequence:
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_post_parse_analyze hook <1>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_planner_hook <1>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_post_parse_analyze hook <2>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_planner_hook <2>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_post_parse_analyze hook <StartTime := clock_timestamp()>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_planner_hook <StartTime := clock_timestamp()>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_post_parse_analyze hook <insert into t2 values (i, a_data)>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_planner_hook <insert into t2 values (i, a_data)>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_ExecutorStart <insert into t2 values (i, a_data)>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_ExecutorRun  <insert into t2 values (i, a_data)>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_ExecutorFinish <insert into t2 values (i, a_data)>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_ExecutorEnd <insert into t2 values (i, a_data)>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_post_parse_analyze hook <EndTime := clock_timestamp()>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_planner_hook <EndTime := clock_timestamp()>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_post_parse_analyze hook <Delta := 1000 * ( extract(epoch from EndTime) - extract(epoch from StartTime) )>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_planner_hook <Delta := 1000 * ( extract(epoch from EndTime) - extract(epoch from StartTime) )>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_planner_hook <insert into t2 values (i, a_data)>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_ExecutorStart <insert into t2 values (i, a_data)>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_ExecutorRun  <insert into t2 values (i, a_data)>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_ExecutorFinish <insert into t2 values (i, a_data)>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE:  pg_tracing_ExecutorEnd <insert into t2 values (i, a_data)>
> > >
> > > --
> > > Regards,
> > > Nikita Malakhov
> > > Postgres Professional
> > > The Russian Postgres Company
> > > https://postgrespro.ru/

Attachments

Re: POC: Extension for adding distributed tracing - pg_tracing

From
Anthonin Bonnefoy
Date:
Hi!

Here's a new version with a good batch of changes.

Renaming/Refactoring:
- All spans are now tracked in the palloc'd current_trace_spans buffer,
instead of top_span and parse_span being kept in static variables as
before.
- I've renamed query_spans to top_span. A top_span serves as the
parent for all spans in a specific nested level.
- More debugging information and assertions. Spans now track their
nested level, whether they've been ended and whether they are a top_span.

Changes:
- I've added the subxact_count to the span's metadata. This can help
identify the moment a subtransaction was started.
- I've reworked nested queries and utility statements. Previously, I
made the assumption that we could only have one top_span per nested
level, which is not the case: some utility statements can execute
multiple queries in the same nested level. Tracing utility statements
now works better (see image of tracing a create extension).

Regards,
Anthonin

Attachments

Re: POC: Extension for adding distributed tracing - pg_tracing

From
Aleksander Alekseev
Date:
Hi,

> Renaming/Refactoring:
> - All spans are now tracked in the palloc'd current_trace_spans buffer,
> instead of top_span and parse_span being kept in static variables as
> before.
> - I've renamed query_spans to top_span. A top_span serves as the
> parent for all spans in a specific nested level.
> - More debugging information and assertions. Spans now track their
> nested level, whether they've been ended and whether they are a top_span.
>
> Changes:
> - I've added the subxact_count to the span's metadata. This can help
> identify the moment a subtransaction was started.
> - I've reworked nested queries and utility statements. Previously, I
> made the assumption that we could only have one top_span per nested
> level, which is not the case: some utility statements can execute
> multiple queries in the same nested level. Tracing utility statements
> now works better (see image of tracing a create extension).

Many thanks for the updated patch.

If it's not too much trouble, please use `git format-patch`. This
makes applying and testing the patch much easier. Also you can provide
a commit message which 1. will simplify the work for the committer and
2. can be reviewed as well.

The tests fail on CI [1]. I tried to run them locally and got the same
results. I'm using full-build.sh from this repository [2] for
Autotools and the following one-liner for Meson:

```
time CPPFLAGS="-DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS" sh -c 'git
clean -dfx && meson setup --buildtype debug -Dcassert=true
-DPG_TEST_EXTRA="kerberos ldap ssl load_balance" -Dldap=disabled
-Dssl=openssl -Dtap_tests=enabled
-Dprefix=/home/eax/projects/pginstall build && ninja -C build && ninja
-C build docs && PG_TEST_EXTRA=1 meson test -C build'
```

The comments for pg_tracing.c don't seem to be formatted properly.
Please make sure to run pgindent. See src/tools/pgindent/README.

The interface pg_tracing_spans(true | false) doesn't strike me as
particularly readable. Maybe this function should be private and the
interface be like pg_tracing_spans() and pg_tracing_consume_spans().
Alternatively you could use named arguments in the tests, but I don't
think this is the preferred solution.

Speaking of the tests, I suggest adding a bit more comments before
every (or most) of the queries. Figuring out what they test might not
be particularly straightforward for somebody who makes changes
after the patch is accepted.

[1]: http://cfbot.cputube.org/
[2]: https://github.com/afiskon/pgscripts/

-- 
Best regards,
Aleksander Alekseev



Re: POC: Extension for adding distributed tracing - pg_tracing

From
Nikita Malakhov
Date:
Hi!
Great you continue to work on this patch!
I'm checking out the newest changes.



--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

Re: POC: Extension for adding distributed tracing - pg_tracing

From
Anthonin Bonnefoy
Date:
Hi!

> If it's not too much trouble, please use `git format-patch`. This
> makes applying and testing the patch much easier. Also you can provide
> a commit message which 1. will simplify the work for the committer and
> 2. can be reviewed as well.
No problem, I will do that from now on.

> The tests fail on CI [1]. I tried to run them locally and got the same
> results.
I've fixed the compilation issue; my working laptop is a Mac, so I had
missed the Linux errors. I've set up the CI on my side to avoid those
issues in the future. I've also fixed most of the test issues.
One test is still flaky: the parallel worker test can have one less
worker reporting from time to time. I will try to find a way to make
it more stable.

> The comments for pg_tracing.c don't seem to be formatted properly.
> Please make sure to run pgindent. See src/tools/pgindent/README.
Done.

> The interface pg_tracing_spans(true | false) doesn't strike me as
> particularly readable. Maybe this function should be private and the
> interface be like pg_tracing_spans() and pg_tracing_consume_spans().
Agreed. I wasn't super happy about the format. I've created
pg_tracing_peek_spans() and pg_tracing_consume_spans() to make it
more explicit.

> Speaking of the tests, I suggest adding a bit more comments before
> every (or most) of the queries. Figuring out what they test might not
> be particularly straightforward for somebody who makes changes
> after the patch is accepted.
I've added more comments and a representation of the expected span
hierarchy for the tests involving nested queries. It helps me a lot to keep
track of the spans to check.

I've done some additional changes:
- I've added a start_ns to spans, which is the nanosecond remainder of the
time since the start of the trace. Since some spans can have a very short
duration, several spans would have the same start if we used microsecond
precision. Having a start with nanosecond precision allowed me to remove
some hacks I was using and makes the tests more stable.
- I've also added a start_ns_time to spans, which is the value of the
monotonic clock at the start of the span. I used to use duration_ns as
temporary storage for this value, but it made things more confusing and
hard to follow.
- Fix error handling with nested queries created with ProcessUtility (like
calling "create extension pg_tracing" twice).

Regards,
Anthonin

Attachments

Re: POC: Extension for adding distributed tracing - pg_tracing

From
Anthonin Bonnefoy
Date:
Hi!

I've made a new batch of changes and improvements.
New features:
- Triggers are now correctly supported. They were not correctly
attached to the ExecutorFinish span before.
- Additional configuration: exporting parameter values in span
metadata can be disabled. Deparsing can also be disabled now.
- Dbid and userid are now exported in the span's metadata.
- New Notify channel and threshold parameters. This channel will
receive a notification when the span buffer usage crosses the
threshold.
- Explicit transactions with the extended protocol are now correctly
handled. This is done by keeping two different trace contexts: one for
the parser/planner at the root level, the other for nested queries and
executors. The spans now correctly show the parse of the next statement
happening before the ExecutorEnd of the previous statement (see
screenshot, and the client-side sketch below).
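
For reference, here is a minimal client-side sketch (not part of the patch;
the connection string is an example) that exhibits this interleaving.
Drivers such as pgx go through Parse/Bind/Execute, and inside an explicit
transaction the Bind of the next statement recreates the unnamed portal,
which drops the previous statement's portal and only then fires its
ExecutorEnd, matching the backtrace posted earlier in the thread:

```
package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()
	conn, err := pgx.Connect(ctx, "postgres://postgres@localhost/postgres")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// Explicit transaction: the unnamed portal of SELECT 1 survives until
	// the Bind message of SELECT 2 recreates it.
	tx, err := conn.Begin(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer tx.Rollback(ctx)

	var n int
	if err := tx.QueryRow(ctx, "SELECT 1").Scan(&n); err != nil {
		log.Fatal(err)
	}
	// The parse/bind of this statement happens before the previous
	// statement's ExecutorEnd, producing the overlapping spans.
	if err := tx.QueryRow(ctx, "SELECT 2").Scan(&n); err != nil {
		log.Fatal(err)
	}
	if err := tx.Commit(ctx); err != nil {
		log.Fatal(err)
	}
}
```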

Changes:
- The Parse span is now outside of the top span. When multiple statements
are processed, they are parsed together, so it didn't make sense to
attach Parse to a specific statement.
- Deparse info is now exported in its own column. This leaves the trace
forwarder the possibility to either use it as metadata or put it in the
span name.
- Renaming: spans now have a span_type (Select, Executor, Parser...)
and a span_operation (ExecutorRun, Select $1...).
- For spans created without a propagated trace context, the same trace id
will be used for statements within the same transaction.
- pg_tracing_consume_spans and pg_tracing_peek_spans are now
restricted to users with the pg_read_all_stats role.
- In instrument.h, instr->firsttime is only set if the timer was requested.

The code should be more stable from now on. Most of the features I had
planned are implemented.
I've also started writing the module's documentation. It's still a bit
bare, but I will be working on completing it.

Regards,
Anthonin


Attachments

Re: POC: Extension for adding distributed tracing - pg_tracing

From
Shlok Kyal
Date:
Hi,

In CFbot, I see that the build is failing for this patch (link:
https://cirrus-ci.com/task/5378223450619904?logs=build#L1532) with the
following error:
[07:58:05.037] In file included from ../src/include/executor/instrument.h:16,
[07:58:05.037] from ../src/include/jit/jit.h:14,
[07:58:05.037] from ../contrib/pg_tracing/span.h:16,
[07:58:05.037] from ../contrib/pg_tracing/pg_tracing_explain.h:4,
[07:58:05.037] from ../contrib/pg_tracing/pg_tracing.c:43:
[07:58:05.037] ../contrib/pg_tracing/pg_tracing.c: In function
‘add_node_counters’:
[07:58:05.037] ../contrib/pg_tracing/pg_tracing.c:2330:70: error:
‘BufferUsage’ has no member named ‘blk_read_time’; did you mean
‘temp_blk_read_time’?
[07:58:05.037] 2330 | blk_read_time =
INSTR_TIME_GET_MILLISEC(node_counters->buffer_usage.blk_read_time);
[07:58:05.037] | ^~~~~~~~~~~~~
[07:58:05.037] ../src/include/portability/instr_time.h:126:12: note:
in definition of macro ‘INSTR_TIME_GET_NANOSEC’
[07:58:05.037] 126 | ((int64) (t).ticks)
[07:58:05.037] | ^
[07:58:05.037] ../contrib/pg_tracing/pg_tracing.c:2330:18: note: in
expansion of macro ‘INSTR_TIME_GET_MILLISEC’
[07:58:05.037] 2330 | blk_read_time =
INSTR_TIME_GET_MILLISEC(node_counters->buffer_usage.blk_read_time);
[07:58:05.037] | ^~~~~~~~~~~~~~~~~~~~~~~
[07:58:05.037] ../contrib/pg_tracing/pg_tracing.c:2331:71: error:
‘BufferUsage’ has no member named ‘blk_write_time’; did you mean
‘temp_blk_write_time’?
[07:58:05.037] 2331 | blk_write_time =
INSTR_TIME_GET_MILLISEC(node_counters->buffer_usage.blk_write_time);
[07:58:05.037] | ^~~~~~~~~~~~~~
[07:58:05.037] ../src/include/portability/instr_time.h:126:12: note:
in definition of macro ‘INSTR_TIME_GET_NANOSEC’
[07:58:05.037] 126 | ((int64) (t).ticks)
[07:58:05.037] | ^
[07:58:05.037] ../contrib/pg_tracing/pg_tracing.c:2331:19: note: in
expansion of macro ‘INSTR_TIME_GET_MILLISEC’
[07:58:05.037] 2331 | blk_write_time =
INSTR_TIME_GET_MILLISEC(node_counters->buffer_usage.blk_write_time);
[07:58:05.037] | ^~~~~~~~~~~~~~~~~~~~~~~

I also tried to build this patch locally and the build is failing with
the same error.

Thanks
Shlok Kumar Kyal



Re: POC: Extension for adding distributed tracing - pg_tracing

From
Anthonin Bonnefoy
Date:
Hi!

The commit 13d00729d422c84b1764c24251abcc785ea4adb1 renamed some
fields in the BufferUsage struct. I've updated those, which should fix
the compilation errors.

I've also made some changes, mainly around how lazy functions are managed.
Some queries like 'select character_maximum_length from
information_schema.element_types limit 1 offset 3;' could generate 50K spans
due to lazy calls of generate_series. Tracing of lazy functions is now disabled.

I've also written an otel version of the trace forwarder, available at
https://github.com/bonnefoa/pg-tracing-otel-forwarder. The repository
contains a docker compose file to start a local otel collector and a
Jaeger interface, and the forwarder can consume spans from a test
PostgreSQL instance running pg_tracing.

Regards,
Anthonin


Attachments

Re: POC: Extension for adding distributed tracing - pg_tracing

From
Anthonin Bonnefoy
Date:
Hi,

Some small changes, mostly around making tests less flaky:
- Removed the top_span and nested_level from the span output; those were
mostly used for debugging.
- More tests around the Parse span in nested queries.
- Removed the explicit queryId in the tests.

Regards,
Anthonin


Attachments

Re: POC: Extension for adding distributed tracing - pg_tracing

From
Aleksander Alekseev
Date:
Hi,

Thanks for the updated patch!

> Some small changes, mostly around making tests less flaky:
> - Removed the top_span and nested_level in the span output, those were
> mostly used for debugging
> - More tests around Parse span in nested queries
> - Remove explicit queryId in the tests

```
+-- Worker can take some additional time to end and report their spans
+SELECT pg_sleep(0.2);
+ pg_sleep
+----------
+
+(1 row)
```

Pretty sure this will fail on the buildfarm from time to time. Maybe
this is a case where it is worth using TAP tests? Especially considering
the support of pg_tracing.notify_channel.

```
+      Only superusers can change this setting.
+      This parameter can only be set at server start.
```

I find this confusing. If the parameter can only be set at server
start doesn't it mean that whoever has access to postgresql.conf can
change it?

```
+      <varname>pg_tracing.track_utility</varname> controls whether spans
+      should be generated for utility statements. Utility commands are
+      all those other than <command>SELECT</command>, <command>INSERT</command>,
+      <command>UPDATE</command>, <command>DELETE</command>, and <command>MERGE</command>.
+      The default value is <literal>on</literal>.
```

Perhaps the TABLE command should be added to the list, since it's
basically syntax sugar for SELECT:

```
=# create table t(x int);
=# insert into t values (1), (2), (3);
=# table t;
 x
---
 1
 2
 3
(3 rows)
```

This part worries me a bit:

```
+    <listitem>
+     <para>
+      <varname>pg_tracing.drop_on_full_buffer</varname> controls whether
+      span buffer should be emptied when <varname>pg_tracing.max_span</varname>
+      spans is reached. If <literal>off</literal>, new spans are dropped.
+      The default value is <literal>on</literal>.
+     </para>
+    </listitem>
+   </varlistentry>
```

Obviously pg_tracing.drop_on_full_buffer = on will not work properly
with pg_tracing.notify_threshold because a race condition is possible:
by the time one receives and processes the notification, the buffer can
already be cleared. This makes me think that 1)
pg_tracing.drop_on_full_buffer = on may not be the best default value
and 2) as a user I would like support for a ring buffer mode. Even if
implementing this mode is out of the scope of this particular patch, I
would argue that instead of a drop_on_full_buffer boolean parameter we
need something like:

pg_tracing.buffer_mode = keep_on_full | drop_on_full | ring_buffer |
perhaps_other_future_modes

```
+  <para>
+   The module requires additional shared memory proportional to
+   <varname>pg_tracing.max_span</varname>.  Note that this
+   memory is consumed whenever the module is loaded, even if
+   no spans are generated.
+  </para>
```

Do you think it's possible to document how much shared memory the
extension consumes depending on the parameters? Alternatively, maybe
it's worth providing a SQL function that returns this value. As a user
I would like to know whether 10 MB, 500 MB or 1.5 GB of shared memory
are consumed.

```
+   The start time of a span has two parts: <literal>span_start</literal> and
+   <literal>span_start_ns</literal>. <literal>span_start</literal> is a Timestamp
+   with time zone providing up to microsecond precision. <literal>span_start_ns</literal>
+   provides the nanosecond part of the span start. <literal>span_start + span_start_ns</literal>
+   can be used to get the start with nanosecond precision.
```

I would argue that ns ( 1 / 1_000_000_000 sec! ) precision is
redundant and only complicates the interface. This is way below the
noise floor formed by network delays, process switching by the OS,
noisy neighbours under the same hypervisor, time slowdown by ntpd,
etc. I don't recall ever needing spans with better than 1 us
precision.

On top of that if you have `span_start_ns` you should probably rename
`duration` to `duration_ns` for consistency. I believe it would be
more convenient to have `duration` in microseconds. Nothing prevents
us from using fractions of microseconds (e.g. 12.345 us) if necessary.

```
+=# select trace_id, parent_id, span_id, span_start, duration, span_type, ...
+       trace_id       |      parent_id       |       span_id        |
        ...
+----------------------+----------------------+----------------------+----
+   479062278978111363 |   479062278978111363 |   479062278978111363 ...
+   479062278978111363 |   479062278978111363 |   479062278978111363 ...
+   479062278978111363 |   479062278978111363 | -7732425595282414585 ...
```

Negative span ids are, for sure, possible, but I don't recall seeing
many negative ids in my life. I suggest not surprising the user unless
it is really necessary.

-- 
Best regards,
Aleksander Alekseev



Re: POC: Extension for adding distributed tracing - pg_tracing

From
Anthonin Bonnefoy
Date:
Hi,

Thanks for the review!

> ```
> +-- Worker can take some additional time to end and report their spans
> +SELECT pg_sleep(0.2);
> + pg_sleep
> +----------
> +
> +(1 row)
> ```
>
> Pretty sure this will fail on the buildfarm from time to time. Maybe
> this is a case where it is worth using TAP tests? Especially considering
> the support of pg_tracing.notify_channel.

The sleep was unnecessary, as the leader waits for all workers to finish
and report. All workers should have added their spans to the shared buffer
before the leader, so I removed the sleep from the test. I haven't seen a
failure on this test since then.

I did add a TAP test for the notify part since it seems to be the only way
to test it.

> ```
> +      Only superusers can change this setting.
> +      This parameter can only be set at server start.
> ```
>
> I find this confusing. If the parameter can only be set at server
> start doesn't it mean that whoever has access to postgresql.conf can
> change it?

The 'only superusers...' part was incorrect. This setting was needed to
compute the size of the shmem request, so modifying it would require a
server restart. For this parameter, though, I've realised that I could
use the global max_parallel_workers instead, removing the need for a
max_traced_parallel_workers parameter.

> Perhaps the TABLE command should be added to the list, since it's
> basically syntax sugar for SELECT:

Added.

> Obviously pg_tracing.drop_on_full_buffer = on will not work properly
> with pg_tracing.notify_threshold because a race condition is possible:
> by the time one receives and processes the notification, the buffer
> can already be cleared.

That's definitely a possibility. Lowering the notification threshold can
lower the chance of the buffer being cleared while handling the
notification, but the race can still happen.

For my use cases, spans should be sent as soon as possible and old spans
are less valuable than new ones, but other usages may want the opposite,
hence the parameters to control the behavior.
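
To make the notification flow concrete, here is a minimal sketch of a
consumer; it is not part of the patch, and the channel name, connection
string, and selected columns are assumptions for illustration, with pgx
used as the driver:

```
package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()
	conn, err := pgx.Connect(ctx, "postgres://postgres@localhost/postgres")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// Subscribe to the channel pg_tracing notifies on when the span buffer
	// usage crosses the configured threshold (channel name assumed here).
	if _, err := conn.Exec(ctx, "LISTEN pg_tracing"); err != nil {
		log.Fatal(err)
	}

	for {
		// Block until the extension signals that the buffer is filling up.
		if _, err := conn.WaitForNotification(ctx); err != nil {
			log.Fatal(err)
		}
		// Consuming (rather than peeking) frees the buffer, lowering the
		// chance of dropped spans.
		rows, err := conn.Query(ctx,
			"SELECT trace_id, span_id, span_operation FROM pg_tracing_consume_spans()")
		if err != nil {
			log.Fatal(err)
		}
		for rows.Next() {
			var traceID, spanID int64
			var operation string
			if err := rows.Scan(&traceID, &spanID, &operation); err != nil {
				log.Fatal(err)
			}
			// Build and send the span to the trace collector here.
			log.Printf("trace=%d span=%d op=%s", traceID, spanID, operation)
		}
		rows.Close()
	}
}
```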

> This makes me think that 1) pg_tracing.drop_on_full_buffer = on may
> not be the best default value

I don't have a strong opinion on the default behaviour; I will switch to
keeping the buffer on full.

> 2) As a user I would like support for a ring buffer mode. Even if
> implementing this mode is out of the scope of this particular patch, I
> would argue that instead of a drop_on_full_buffer boolean parameter we
> need something like:
>
> pg_tracing.buffer_mode = keep_on_full | drop_on_full | ring_buffer |
> perhaps_other_future_modes

Having a ring buffer was also my first approach, but it adds some
complexity, so I switched to the full drop + query file truncation for
the first iteration. A ring buffer would definitely be a better
experience; I've switched the parameter to an enum with only
keep_on_full and drop_on_full for the time being.

> Do you think it's possible to document how much shared memory the
> extension consumes depending on the parameters? Alternatively, maybe
> it's worth providing a SQL function that returns this value. As a user
> I would like to know whether 10 MB, 500 MB or 1.5 GB of shared memory
> are consumed.

Good point. I've updated the doc to give some approximations with a query
to detail the memory usage.
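
For a rough idea of the arithmetic behind those approximations (the
320-byte span size comes from earlier in this thread; the query in the
doc is the authoritative source):

```
package main

import "fmt"

func main() {
	// pg_tracing.max_span and the ~320-byte Span size mentioned earlier.
	const maxSpan = 50000
	const spanSize = 320

	// 50000 * 320 bytes is roughly 15 MiB of shared memory.
	fmt.Printf("~%.1f MiB\n", float64(maxSpan*spanSize)/(1024*1024))
}
```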

> I would argue that ns ( 1 / 1_000_000_000 sec! ) precision is
> redundant and only complicates the interface. This is way below the
> noise floor formed by network delays, process switching by the OS,
> noisy neighbours under the same hypervisor, time slowdown by ntpd,
> etc. I don't recall ever needing spans with better than 1 us
> precision.
>
> On top of that if you have `span_start_ns` you should probably rename
> `duration` to `duration_ns` for consistency. I believe it would be
> more convenient to have `duration` in microseconds. Nothing prevents
> us from using fractions of microseconds (e.g. 12.345 us) if necessary.

I was definitely on the fence about this. I initially tried to match the
precision provided by query instrumentation, but that introduced a lot
of complexity. Also, while the trace standard specifies that start and
end should be in nanoseconds, it turns out that the nanosecond precision
is generally ignored.

I've switched to a start+end in TimestampTz, dropping the ns precision.
Tests had to be rewritten to have a stable order with spans that can
have the same start/end.

I've also removed some spans: ExecutorStart, ExecutorEnd, PostParse.
Those spans generally had a sub-microsecond duration and didn't bring
much value except for debugging pg_tracing itself. To avoid the
confusion of having spans with a 0us duration, removing them altogether
seemed better. ExecutorFinish also has a sub-microsecond duration unless
it runs a nested query (triggers), so I'm only emitting it in those cases.

> Negative span ids are, for sure, possible, but I don't recall seeing
> many negative ids in my life. I suggest not surprising the user
> unless it is really necessary.

That one is going to be difficult. Trace ids and span ids are 8 bytes,
so I'm exporting them as bigint. PostgreSQL doesn't support unsigned
bigint, hence the negative span ids. An alternative would be to use
bytea, but bigints are easier to read and interact with.
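
For consumers that need the usual unsigned representation, reinterpreting
the signed bigint client-side is enough. A small sketch, using the span id
from the example output above:

```
package main

import "fmt"

func main() {
	// A span id as exported by the bigint column; the sign is just the
	// two's-complement view of the original unsigned 64-bit id.
	spanID := int64(-7732425595282414585)

	// Reinterpret as unsigned and format as 16 hex digits, the form trace
	// collectors and the traceparent header expect.
	fmt.Printf("%016x\n", uint64(spanID))
}
```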

The latest patch contains changes around the nanosecond precision drop,
test updates to make them more stable, and simplifications around span
management.

Regards,
Anthonin


Attachments

Re: POC: Extension for adding distributed tracing - pg_tracing

From
Craig Ringer
Date:
Hi all

Just saw this thread.

I cooked up PoC distributed tracing support in Pg years ago as part of the 2ndQuadrant BDR project.

I used GUCs to set the `opentelemetry_trace_id` and `opentelemetry_span_id`. These can be sent as protocol-level messages by the client driver if the client driver has native trace integration enabled, in which case the user doesn't need to see or care. For other drivers, the application can use `SET` or `SET LOCAL` to assign them.

This approach avoids the need to rewrite SQL or give special meaning to SQL comments.
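
As an illustration only, this is roughly what the client-side assignment could look like for a driver without native integration. It assumes an extension has defined these GUCs, as in the PoC described here; the connection string and ids are example values:

```
package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()
	conn, err := pgx.Connect(ctx, "postgres://postgres@localhost/postgres")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	tx, err := conn.Begin(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer tx.Rollback(ctx)

	// SET LOCAL scopes the identifiers to this transaction, so nothing has
	// to be cleared afterwards. The GUC names come from the PoC described
	// above; they are not part of pg_tracing.
	if _, err := tx.Exec(ctx,
		"SET LOCAL opentelemetry_trace_id = '00000000000000000000000000000009'"); err != nil {
		log.Fatal(err)
	}
	if _, err := tx.Exec(ctx,
		"SET LOCAL opentelemetry_span_id = '0000000000000005'"); err != nil {
		log.Fatal(err)
	}

	var n int
	if err := tx.QueryRow(ctx, "SELECT 1").Scan(&n); err != nil {
		log.Fatal(err)
	}
	if err := tx.Commit(ctx); err != nil {
		log.Fatal(err)
	}
}
```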

Within the server IIRC I used the existing postgres resource owner and memory context stacks to maintain some internal nested traces.

My implementation used the C++ opentelemetry client library to emit traces, so it was never going to be able to land in core. But IIRC the current BDR/PGD folks are now using an OpenTelemetry sidecar process instead. I think they send it UDP packets to emit metrics and events. Petr or Markus might be able to tell you more about how they're doing that.

I'd love to see OpenTelemetry integration in Pg.

Re: POC: Extension for adding distributed tracing - pg_tracing

From
Nikita Malakhov
Date:
Hi!

The overall solution looks good to me except for SQLCommenter: query instrumentation
with SQL comments is often not possible on production systems. Instead,
a very often requested feature is to enable tracing for a given single query ID,
or several (a very limited number of) queries by ID. It could be done by adding
an SQL function to add and remove query IDs in a list (even an array would do)
stored in the top tracing context.

Great job, thank you!

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

Re: POC: Extension for adding distributed tracing - pg_tracing

From
Aleksander Alekseev
Date:
Hi,

> The overall solution looks good to me except for SQLCommenter: query instrumentation
> with SQL comments is often not possible on production systems. Instead,
> a very often requested feature is to enable tracing for a given single query ID,
> or several (a very limited number of) queries by ID. It could be done by adding
> an SQL function to add and remove query IDs in a list (even an array would do)
> stored in the top tracing context.

Not 100% sure if I follow.

By queries you mean particular queries, not transactions? And since
they have an assigned ID it means that the query is already executing
and we want to enable the tracking in another session, right? If this
is the case I would argue that enabling/disabling tracing for an
_already_ running query (or transaction) would be way too complicated.

I wouldn't mind enabling/disabling tracing for the current session
and/or a given session ID. In the latter case it can take effect
only for new transactions. This, however, could be implemented
separately from the proposed patchset. I suggest keeping the scope
small.

-- 
Best regards,
Aleksander Alekseev



Re: POC: Extension for adding distributed tracing - pg_tracing

From
Anthonin Bonnefoy
Date:
Hi,

> This approach avoids the need to rewrite SQL or give special meaning to SQL comments.

SQLCommenter already has a good amount of support and is referenced in
opentelemetry (https://github.com/open-telemetry/opentelemetry-sqlcommenter),
so the goal was to leverage the existing trace propagation systems as
much as possible.

> I used GUCs to set the `opentelemetry_trace_id` and `opentelemetry_span_id`.
> These can be sent as protocol-level messages by the client driver if the
> client driver has native trace integration enabled, in which case the user
> doesn't need to see or care. For other drivers, the application can use
> `SET` or `SET LOCAL` to assign them.

Emitting `SET` and `SET LOCAL` for every traced statement sounds more
disruptive and expensive than relying on SQLCommenter. When not using
`SET LOCAL`, the variables also need to be cleared.
This would also introduce a lot of headaches, as the `SET` statements
themselves could be traced and generate spans, and tracing utility
statements is already tricky enough.
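
For contrast, a sketch of what SQLCommenter-style propagation looks like
from the client: a single string prepend, with no extra statements or
round trips. The helper and the ids are illustrative, not from an
existing library:

```
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/jackc/pgx/v5"
)

// withTraceparent prepends a sqlcommenter-style comment carrying the W3C
// trace context: 16-byte trace id and 8-byte parent id in hex, with the
// sampled flag set to 01.
func withTraceparent(sql, traceID, parentID string) string {
	return fmt.Sprintf("/*traceparent='00-%s-%s-01'*/ %s", traceID, parentID, sql)
}

func main() {
	ctx := context.Background()
	conn, err := pgx.Connect(ctx, "postgres://postgres@localhost/postgres")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	var one int
	q := withTraceparent("SELECT 1",
		"00000000000000000000000000000009", "0000000000000005")
	if err := conn.QueryRow(ctx, q).Scan(&one); err != nil {
		log.Fatal(err)
	}
}
```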

> But IIRC the current BDR/PGD folks are now using an OpenTelemetry
> sidecar process instead. I think they send it UDP packets to emit
> metrics and events.

I would be curious to hear more about this. I didn't want to be tied
to a specific protocol (I've only seen gRPC and http support on the
latest opentelemetry collectors) and I fear that relying on Postgres
to send traces would lower the chance of having this supported on
managed Postgres solutions (like RDS and CloudSQL).

> By queries you mean particular queries, not transactions? And since
> they have an assigned ID it means that the query is already executing
> and we want to enable the tracking in another session, right?

I think that was the idea. The query identifier generated for a
specific statement is stable, so we could have a GUC variable with a
list of query ids, and only queries matching the provided query ids
would be sampled. This would make it easier to generate traces for a
specific query within a session.
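
A purely hypothetical sketch of that idea; the pg_tracing.sample_query_ids
parameter does not exist in the patch and the query id value is made up,
this only illustrates the proposed interface:

```
package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()
	conn, err := pgx.Connect(ctx, "postgres://postgres@localhost/postgres")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// Hypothetical GUC: a list of query ids (as computed for
	// pg_stat_statements) that should always be sampled, without any
	// traceparent comment on the queries themselves.
	if _, err := conn.Exec(ctx,
		"SET pg_tracing.sample_query_ids = '4509697367411628707'"); err != nil {
		log.Fatal(err)
	}

	// From here on, statements whose computed query id matches the list
	// would generate spans.
	var n int
	if err := conn.QueryRow(ctx, "SELECT count(*) FROM pg_class").Scan(&n); err != nil {
		log.Fatal(err)
	}
	log.Println(n)
}
```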





Re: POC: Extension for adding distributed tracing - pg_tracing

От
Nikita Malakhov
Дата:
Hi!

I've meant exactly the thing you mentioned -

> By queries you mean particular queries, not transactions? And since
> they have an assigned ID it means that the query is already executing
> and we want to enable the tracking in another session, right?

I think that was the idea. The query identifier generated for a
specific statement is stable so we could have a GUC variable with a
list of query id and only queries matching the provided query ids
would be sampled. This would make it easier to generate traces for a
specific query within a session.

This one. When maintaining production systems with high load rate
we often encountered necessity to trace a number of queries with
known IDs (that was the case for Oracle environments, but Postgres
is not very different from this point of view), because enabling tracing
for the whole session would immediately result in thousands of trace
spans in the system with throughput like hundreds or even thousands
of tps, when we need, say, to trace a single problematic query.

Thank you!

--

Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

Re: POC: Extension for adding distributed tracing - pg_tracing

От
vignesh C
Дата:
On Thu, 7 Dec 2023 at 20:06, Anthonin Bonnefoy
<anthonin.bonnefoy@datadoghq.com> wrote:
>
> Hi,
>
> Thanks for the review!
>
> > ```
> > +-- Worker can take some additional time to end and report their spans
> > +SELECT pg_sleep(0.2);
> > + pg_sleep
> > +----------
> > +
> > +(1 row)
> > ```
> >
> > Pretty sure this will fail on buildfarm from time to time. Maybe this
> > is a case when it is worth using TAP tests? Especially considering the
> > support of pg_tracing.notify_channel.
>
> The sleep was unnecessary as the leader waits for all workers to finish
> and report. All workers should have added their spans in the shared buffer
> before the leader and I removed the sleep from the test. I haven't seen a recent
> failure on this test since then.
>
> I did add a TAP test for the notify part since it seems to be the only way
> to test it.
>
> > ```
> > +      Only superusers can change this setting.
> > +      This parameter can only be set at server start.
> > ```
> >
> > I find this confusing. If the parameter can only be set at server
> > start doesn't it mean that whoever has access to postgresql.conf can
> > change it?
>
> The 'only superusers...' part was incorrect. This setting was needed to compute
> the size of the shmem request so modifying it would require a server start.
> Though for this parameter, I've realised that I could use the global's
> max_parallel_workers
> instead, removing the need to have a max_traced_parallel_workers parameter.
>
> > Perhaps TABLE command should be added to the list, since it's
> > basically a syntax sugar for SELECT:
>
> Added.
>
> > Obviously pg_tracing.drop_on_full_buffer = on will not work properly
> > with pg_tracing.notify_threshold because there is a race condition
> > possible. By the time one receives and processes the notification the
> > buffer can be already cleaned.
>
> That's definitely a possibility. Lowering the notification threshold can
> lower the chance of the buffer being cleared when handling the notification
> but the race can still happen.
>
> For my use cases, spans should be sent as soon as possible and old spans
> are less valuable than new spans but other usage would want the opposite,
> thus the parameters to control the behavior.
>
> > This makes me think that 1) pg_tracing.drop_on_full_buffer = on could
> > be not the best default value
>
> I don't have a strong opinion on the default behaviour, I will switch to
> keep buffer on full.
>
> > 2) As a user I would like a support of ring buffer mode. Even if
> > implementing this mode is out of scope of this particular patch, I
> > would argue that instead of drop_on_full_buffer boolean parameter we
> > need something like:
> >
> > pg_tracing.buffer_mode = keep_on_full | drop_on_full | ring_buffer |
> > perhaps_other_future_modes
>
> Having a buffer ring was also my first approach but it adds some complexity
> and I've switched to the full drop + truncate query file for the first
> iteration.
> Ring buffer would definitely be a better experience, I've switched the
> parameter to an enum with only keep_on_full and drop_on_full for the time being.
>
> > Do you think it possible to document how much shared memory does the
> > extension consume depending on the parameters? Alternatively maybe
> > it's worth providing a SQL function that returns this value? As a user
> > I would like to know whether 10 MB, 500 MB or 1.5 GB of shared memory
> > are consumed.
>
> Good point. I've updated the doc to give some approximations with a query
> to detail the memory usage.
>
> > I would argue that ns ( 1 / 1_000_000_000 sec! ) precision is
> > redundant and only complicates the interface. This is way below the
> > noise floor formed from network delays, process switching by the OS,
> > noisy neighbours under the same supervisor, time slowdown by ntpd,
> > etc. I don't recall ever needing spans with better than 1 us
> > precision.
> >
> > On top of that if you have `span_start_ns` you should probably rename
> > `duration` to `duration_ns` for consistency. I believe it would be
> > more convenient to have `duration` in microseconds. Nothing prevents
> > us from using fractions of microseconds (e.g. 12.345 us) if necessary.
>
> I was definitely on the fence on this. I've initially tried to match the
> precision provided by query instrumentation but that definitely introduced
> a lot of complexity. Also, while trace standard specify that start and end
> should be in nanoseconds, it turns out that the nanosecond precision is
> generally ignored.
>
> I've switched to a start+end in TimestampTz, dropping the ns precision.
> Tests had to be rewritten to have a stable order with spans having that can
> have the same start/end.
>
> I've also removed some spans: ExecutorStart, ExecutorEnd, PostParse.
> Those spans generally had a sub microsecond duration and didn't bring much
> value except for debugging pg_tracing itself. To avoid the confusion of
> having spans with a 0us duration, removing them altogether seemed better.
> ExecutorFinish also has a sub microsecond duration unless it has a nested
> query (triggers) so I'm only emitting them in those cases.
>
> > Negative span ids, for sure, are possible, but I don't recall seeing
> > many negative ids in my life. I suggest not to surprise the user
> > unless this is really necessary.
>
> That one is gonna be difficult. Traceid and spanid are 8 bytes so I'm exporting
> them using bigint. PostgreSQL doesn't support unsigned bigint thus the negative
> span ids. An alternative would be to use a bytea but bigint are easier to read
> and interact with.
>
> The latest patch contains changes around the nanosecond precision drop, test
> updates to make them more stable and simplification around span management.

One of the test has failed in CFBOT [1] with:
-select count(span_id) from pg_tracing_consume_spans group by span_id;
- count
--------
-     1
-     1
-     1
-     1
-     1
-(5 rows)
-
--- Cleanup
-SET pg_tracing.sample_rate = 0.0;
+server closed the connection unexpectedly
+ This probably means the server terminated abnormally
+ before or while processing the request.
+connection to server was lost

More details are available at [2]

[1] - https://cirrus-ci.com/task/5876369167482880
[2] -
https://api.cirrus-ci.com/v1/artifact/task/5876369167482880/testrun/build/testrun/pg_tracing/regress/regression.diffs



Re: POC: Extension for adding distributed tracing - pg_tracing

От
Jelte Fennema-Nio
Дата:
On Thu, Jan 4, 2024, 16:36 Anthonin Bonnefoy
<anthonin.bonnefoy@datadoghq.com> wrote:
> > I used GUCs to set the `opentelemtery_trace_id` and `opentelemetry_span_id`.
> > These can be sent as protocol-level messages by the client driver if the
> > client driver has native trace integration enabled, in which case the user
> > doesn't need to see or care. And for other drivers, the application can use
> > `SET` or `SET LOCAL` to assign them.
>
> Emitting `SET` and `SET LOCAL` for every traced statement sounds more
> disruptive and expensive than relying on SQLCommenter. When not using
> `SET LOCAL`, the variables also need to be cleared.
> This will also introduce a lot of headaches as the `SET` themselves
> could be traced and generate spans when tracing utility statements is
> already tricky enough.

I think one hugely important benefit of using GUCs over sql comments
is, that comments and named protocol-level prepared statements cannot
be combined afaict. Since with named protocol level prepared
statements no query is sent, only arguments. So there's no place to
attach a comment too.



Re: POC: Extension for adding distributed tracing - pg_tracing

От
Anthonin Bonnefoy
Дата:
Hi,

> I think one hugely important benefit of using GUCs over sql comments
> is, that comments and named protocol-level prepared statements cannot
> be combined afaict. Since with named protocol level prepared
> statements no query is sent, only arguments. So there's no place to
> attach a comment too.

Handling named prepared statements has definitely been a pain point.
I've been experimenting and it is possible to combine comments with
prepared statements by passing the comment's content as a parameter.

PREPARE test_prepared (text, integer) AS /*$1*/ SELECT $2;
EXECUTE test_prepared('dddbs=''postgres.db'',traceparent=''00-00000000000000000000000000000009-0000000000000009-01''',
1);

Though adding a parameter is definitely more intrusive than just
adding a comment and this is not defined and supported anywhere. I
will try to see how much work and complexity it would involve to add
the trace context propagation through GUC variables.
On one hand, having more options would allow more flexibility and ease
to implement the trace propagation. On the other hand, I would like to
keep things simple for a first iteration.

> One of the test has failed in CFBOT [1] with

I've identified and fixed the issue. An assertion checking that spans
generated from planstate didn't end before the parent's end could be
triggered.

The new patch also contains the following changes:
- Support for 128 bits trace ids. While w3c defines exceptions to
allow compatibility with systems that only support 64 bits trace ids,
supporting 128 bits trace ids early should remove the possible pain of
migrating or interoperating issues.
- Trace ids, span ids and parent ids outputs are now hexadecimal
characters instead of bigints. This removes the negative ids issue due
to not having unsigned bigint and was necessary for the 128 bits trace
id support. It also matches the format that's propagated through the
traceparent value. They are still stored as uint64 to keep memory
footprint low.
- Added support for before trigger.
- Refactored all headers to a single header file.
- Added Query id filtering. Adding this turned out to be rather straightforward.
- Additional edge cases found and fixed (planner error could trigger
an assertion failure).

Regards,
Anthonin

On Sun, Jan 7, 2024 at 12:48 AM Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
>
> On Thu, Jan 4, 2024, 16:36 Anthonin Bonnefoy
> <anthonin.bonnefoy@datadoghq.com> wrote:
> > > I used GUCs to set the `opentelemtery_trace_id` and `opentelemetry_span_id`.
> > > These can be sent as protocol-level messages by the client driver if the
> > > client driver has native trace integration enabled, in which case the user
> > > doesn't need to see or care. And for other drivers, the application can use
> > > `SET` or `SET LOCAL` to assign them.
> >
> > Emitting `SET` and `SET LOCAL` for every traced statement sounds more
> > disruptive and expensive than relying on SQLCommenter. When not using
> > `SET LOCAL`, the variables also need to be cleared.
> > This will also introduce a lot of headaches as the `SET` themselves
> > could be traced and generate spans when tracing utility statements is
> > already tricky enough.
>
> I think one hugely important benefit of using GUCs over sql comments
> is, that comments and named protocol-level prepared statements cannot
> be combined afaict. Since with named protocol level prepared
> statements no query is sent, only arguments. So there's no place to
> attach a comment too.

Вложения

Re: POC: Extension for adding distributed tracing - pg_tracing

От
Jelte Fennema-Nio
Дата:
On Wed, 10 Jan 2024 at 15:55, Anthonin Bonnefoy
<anthonin.bonnefoy@datadoghq.com> wrote:
> PREPARE test_prepared (text, integer) AS /*$1*/ SELECT $2;
> EXECUTE
test_prepared('dddbs=''postgres.db'',traceparent=''00-00000000000000000000000000000009-0000000000000009-01''',
> 1);

Wow, I had no idea that parameters could be expanded in comments. I'm
honestly wondering if that is supported on purpose.



Re: POC: Extension for adding distributed tracing - pg_tracing

От
Peter Smith
Дата:
2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.

======
[1] https://commitfest.postgresql.org/46/4456/
[2] https://cirrus-ci.com/task/5581154296791040

Kind Regards,
Peter Smith.



Re: POC: Extension for adding distributed tracing - pg_tracing

От
Anthonin Bonnefoy
Дата:
Hi,

The last CFbot failure was on pg_upgrade/002_pg_upgrade and doesn't
seem related to the patch.

#   Failed test 'regression tests pass'
#   at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 212.
#          got: '256'
#     expected: '0'
# Looks like you failed 1 test of 18.

Given that the patch grew a bit in complexity and features, I've tried
to reduce it to a minimum to make reviewing easier. The goal is to
keep the scope focused for a first version while additional features
and changes could be handled afterwards.

For this version:
- Trace context can be propagated through SQLCommenter
- Sampling decision uses either a global sampling rate or on
SQLCommenter's sampled flag
- We generate spans for Planner, ExecutorRun and ExecutorFinish

What was put on hold for now:
- Generate spans from planstate. This means we don't need the change
in instrument.h to track the start of a node for the first version.
- Generate spans for parallel processes
- Using a GUC variable to propagate trace context
- Filtering sampling on queryId

With this, the current patch provides the core to:
- Generate, store and export spans with basic buffer management (drop
old spans when full or drop new spans when full)
- Keep variable strings in a separate file (a la pg_stat_statements)
- Track tracing state and top spans across nested execution levels

Thoughts on this approach?

On Mon, Jan 22, 2024 at 7:45 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> 2024-01 Commitfest.
>
> Hi, This patch has a CF status of "Needs Review" [1], but it seems
> there were CFbot test failures last time it was run [2]. Please have a
> look and post an updated version if necessary.
>
> ======
> [1] https://commitfest.postgresql.org/46/4456/
> [2] https://cirrus-ci.com/task/5581154296791040
>
> Kind Regards,
> Peter Smith.

Вложения

Re: POC: Extension for adding distributed tracing - pg_tracing

От
Nikita Malakhov
Дата:
Hi!

It's a good idea to split a big patch into several smaller ones.
But you've already implemented these features - you could provide them
as sequential small patches (i.e. v13-0002-guc-context-propagation.patch and so on)

Great job! I'm both hands for committing your patch set.

On Fri, Jan 26, 2024 at 1:17 PM Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> wrote:
Hi,

The last CFbot failure was on pg_upgrade/002_pg_upgrade and doesn't
seem related to the patch.

#   Failed test 'regression tests pass'
#   at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 212.
#          got: '256'
#     expected: '0'
# Looks like you failed 1 test of 18.

Given that the patch grew a bit in complexity and features, I've tried
to reduce it to a minimum to make reviewing easier. The goal is to
keep the scope focused for a first version while additional features
and changes could be handled afterwards.

For this version:
- Trace context can be propagated through SQLCommenter
- Sampling decision uses either a global sampling rate or on
SQLCommenter's sampled flag
- We generate spans for Planner, ExecutorRun and ExecutorFinish

What was put on hold for now:
- Generate spans from planstate. This means we don't need the change
in instrument.h to track the start of a node for the first version.
- Generate spans for parallel processes
- Using a GUC variable to propagate trace context
- Filtering sampling on queryId

With this, the current patch provides the core to:
- Generate, store and export spans with basic buffer management (drop
old spans when full or drop new spans when full)
- Keep variable strings in a separate file (a la pg_stat_statements)
- Track tracing state and top spans across nested execution levels

Thoughts on this approach?

On Mon, Jan 22, 2024 at 7:45 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> 2024-01 Commitfest.
>
> Hi, This patch has a CF status of "Needs Review" [1], but it seems
> there were CFbot test failures last time it was run [2]. Please have a
> look and post an updated version if necessary.
>
> ======
> [1] https://commitfest.postgresql.org/46/4456/
> [2] https://cirrus-ci.com/task/5581154296791040
>
> Kind Regards,
> Peter Smith.


--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

Re: POC: Extension for adding distributed tracing - pg_tracing

От
Heikki Linnakangas
Дата:
Hi Anthonin,

I'm only now catching up on this thread. Very exciting feature!

My first observation is that you were able to implement this purely as 
an extension, without any core changes. That's very cool, because:

- You don't need buy-in or blessing from anyone else. You can just put 
this on github and people can immediately start using it.
- As an extension, it's not tied to the PostgreSQL release cycle. Users 
can immediately start using it with existing PostgreSQL versions.

There are benefits to being in contrib/ vs. an extension outside the 
PostgreSQL source tree, but there are significant downsides too. 
Out-of-tree, you get more flexibility, and you can develop much faster. 
I think this should live out-of-tree, at least until it's relatively 
stable. By stable, I mean "not changing much", not that it's bug-free.

# Core changes

That said, there are a lot of things that would make sense to integrate 
in PostgreSQL itself. For example:

- You're relying on various existing hooks, but it might be better to 
have the 'begin_span'/'end_span' calls directly in PostgreSQL source 
code in the right places. There are more places where spans might be 
nice, like when performing a large sort in tuplesort.c to name one 
example. Or even across every I/O; not sure how much overhead the spans 
incur and how much we'd be willing to accept. 'begin_span'/'end_span' 
could be new hooks that normally do nothing, but your extension would 
implement them.

- Other extensions could include begin_span/end_span calls too. (It's 
complicated for an extension to call functions in another extension.)

- Extensions like postgres_fdw could get the tracing context and 
propagate it to the remote system too.

- How to pass the tracing context from the client? You went with 
SQLCommenter and others proposed a GUC. A protocol extension would make 
sense too. I can see merits to all of those. It probably would be good 
to support multiple mechanisms, but some might need core changes. It 
might be good to implement the mechanism for accepting trace context in 
core. Without the extension, we could still include the trace ID in the 
logs, for example.

# Further ideas

Some ideas on cool extra features on top of this:

- SQL functions to begin/end a span. Could be handy if you have 
complicated PL/pgSQL functions, for example.
- Have spans over subtransactions.
- Include elog() messages in the traces. You might want to have a lower 
elevel for what's included in traces; something like the 
client_min_messages and log_min_messages settings.
- Include EXPLAIN plans in the traces.

# Comments on the implementation

There was discussion already on push vs pull model. Currently, you 
collect the traces in memory / files, and require a client to poll them. 
A push model is more common in applications that support tracing. If 
Postgres could connect directly to the OTEL collector, you'd need one 
fewer running component. You could keep the traces in backend-private 
memory, no need to deal with shared memory and spilling to files.

Admittedly supporting the OTEL wire protocol is complicated unless you 
use an existing library. Nevertheless, it might be a better tradeoff. 
OTEL/HTTP with JSON format seems just about feasible to implement from 
scratch. Or bite the bullet and use some external library. If this lives 
as an extension, you have more freedom to rely on 3rd party libraries.

(If you publish this as an out-of-tree extension, then this is up to 
you, of course, and doesn't need consensus here on pgsql-hackers)

# Suggested next steps with this

Here's how I'd suggest to proceed with this:

1. Publish this as an extension on github, as it is. I think you'll get 
a lot of immediate adoption.

2. Start a new pgsql-hackers thread on in-core changes that would 
benefit the extension. Write patches for them. I'm thinking of the 
things I listed in the Core changes section above, but maybe there are 
others.


PS. Does any other DBMS implement this? Any lessons to be learned from them?

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: POC: Extension for adding distributed tracing - pg_tracing

От
Jan Katins
Дата:
Hi!

On Fri, 2 Feb 2024 at 13:41, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
PS. Does any other DBMS implement this? Any lessons to be learned from them?

Mysql 8.3 has open telemetrie component, so very fresh: 


Mysql ppl are already waiting for PG to catch up (in German): https://chaos.social/@isotopp/111421160719915296 :-)

Best regards,

Jan

Re: POC: Extension for adding distributed tracing - pg_tracing

От
Anthonin Bonnefoy
Дата:
Hi!

On Fri, Jan 26, 2024 at 1:43 PM Nikita Malakhov <hukutoc@gmail.com> wrote:
> It's a good idea to split a big patch into several smaller ones.
> But you've already implemented these features - you could provide them
> as sequential small patches (i.e. v13-0002-guc-context-propagation.patch and so on)

Keeping a consistent set of patches is going to be hard as any change
in the first patch would require rebasing that are likely going to
conflict. I haven't discarded the features but keeping them separated
could help with the discussion.

On Fri, Feb 2, 2024 at 1:41 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Hi Anthonin,
>
> I'm only now catching up on this thread. Very exciting feature!
Thanks! And thanks for the comments.

> - You're relying on various existing hooks, but it might be better to
> have the 'begin_span'/'end_span' calls directly in PostgreSQL source
> code in the right places. There are more places where spans might be
> nice, like when performing a large sort in tuplesort.c to name one
> example. Or even across every I/O; not sure how much overhead the spans
> incur and how much we'd be willing to accept. 'begin_span'/'end_span'
> could be new hooks that normally do nothing, but your extension would
> implement them.

Having information that would only be available to tracing doesn't
sound great. It probably should be added to the core instrumentation
which could be then leveraged by tracing and other systems.
One of the features I initially implemented was to track parse start
and end. However, since only the post parse hook is available, I could
only infer when the parse started. Changing the core to track parse
start would make this more easier and it could be used by EXPLAIN and
pg_stat_statements.
In the end, I've tried to match what's done in pg_stat_statements
since the core idea is the same, except pg_tracing outputs spans
instead of statistics.

> - How to pass the tracing context from the client? You went with
> SQLCommenter and others proposed a GUC. A protocol extension would make
> sense too. I can see merits to all of those. It probably would be good
> to support multiple mechanisms, but some might need core changes. It
> might be good to implement the mechanism for accepting trace context in
> core. Without the extension, we could still include the trace ID in the
> logs, for example.

I went with SQLCommenter for the initial implementation since it has
the benefit of already being implemented and working out of the box.
Jelte has an ongoing patch to add the possibility to modify GUC values
through a protocol message [1]. I've tested it and it worked well
without adding too much complexity in pg_tracing. I imagine having a
dedicated traceparent GUC variable could be also used in the core to
propagate in the logs and other places.

> - Include EXPLAIN plans in the traces.
This was already implemented in the first versions (or rather, the
planstate was translated to spans) of the patch. I've removed it in
the latest patch as it was contributing to 50% of the code and also
required a core change, namely a change in instrument.c to track the
first start of a node. Though if you were thinking of the raw EXPLAIN
ANALYZE text dumped as a span metadata, that could be doable.

> # Comments on the implementation
>
> There was discussion already on push vs pull model. Currently, you
> collect the traces in memory / files, and require a client to poll them.
> A push model is more common in applications that support tracing. If
> Postgres could connect directly to the OTEL collector, you'd need one
> fewer running component. You could keep the traces in backend-private
> memory, no need to deal with shared memory and spilling to files.

PostgreSQL metrics are currently exposed through tables like pg_stat_*
and users likely have collectors pulling those metrics (like OTEL
PostgreSQL Receiver [2]). Having this collector also pulling traces
would keep the logic in a single place and would avoid a situation
where both pull and push is used.

> 2. Start a new pgsql-hackers thread on in-core changes that would
> benefit the extension. Write patches for them. I'm thinking of the
> things I listed in the Core changes section above, but maybe there are
> others.
I already have an ongoing patch (more support for extended protocol in
psql [3] needed for tests) and plan to do others (the parse changes
I've mentioned). Though I also don't want to spread myself too thin.


[1]: https://commitfest.postgresql.org/46/4736/
[2]: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/postgresqlreceiver/README.md
[3]: https://commitfest.postgresql.org/46/4650/

On Fri, Feb 2, 2024 at 5:42 PM Jan Katins <jasc@gmx.net> wrote:
>
> Hi!
>
> On Fri, 2 Feb 2024 at 13:41, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> PS. Does any other DBMS implement this? Any lessons to be learned from them?
>
>
> Mysql 8.3 has open telemetrie component, so very fresh:
>
> https://dev.mysql.com/doc/refman/8.3/en/telemetry-trace-install.html
> https://blogs.oracle.com/mysql/post/mysql-telemetry-tracing-with-oci-apm
>
> Mysql ppl are already waiting for PG to catch up (in German): https://chaos.social/@isotopp/111421160719915296 :-)
>
> Best regards,
>
> Jan



Re: POC: Extension for adding distributed tracing - pg_tracing

От
Nikita Malakhov
Дата:
Hi!

I agree with Heikki on most topics and especially the one he recommended
to publish your extension on GitHub, this tool is very promising for highly loaded
systems, you will get a lot of feedback very soon.

I'm curious about SpinLock - it is recommended for very short operations,
taking up to several instructions, and docs say that for longer ones it will be
too expensive, and recommends using LWLock. Why have you chosen SpinLock?
Does it have some benefits here?

Thank you!

PS: process_query_desc function has trace_context argument that is neither
used nor passed anywhere.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

Re: POC: Extension for adding distributed tracing - pg_tracing

От
Andres Freund
Дата:
Hi,

On 2024-02-09 19:46:58 +0300, Nikita Malakhov wrote:
> I agree with Heikki on most topics and especially the one he recommended
> to publish your extension on GitHub, this tool is very promising for highly
> loaded
> systems, you will get a lot of feedback very soon.
> 
> I'm curious about SpinLock - it is recommended for very short operations,
> taking up to several instructions, and docs say that for longer ones it
> will be
> too expensive, and recommends using LWLock. Why have you chosen SpinLock?
> Does it have some benefits here?

Indeed - e.g. end_tracing() looks to hold the spinlock for far too long for
spinlocks to be appropriate. Use an lwlock.

Random stuff I noticed while skimming:
- pg_tracing.c should include postgres.h as the first thing

- afaict none of the use of volatile is required, spinlocks have been barriers
  for a long time now

- you acquire the spinlock for single increments of n_writers, perhaps that
  could become an atomic, to reduce contention?

- I don't think it's a good idea to do memory allocations in the middle of a
  PG_CATCH. If the error was due to out-of-memory, you'll throw another error.


Greetings,

Andres Freund



Re: POC: Extension for adding distributed tracing - pg_tracing

От
Anthonin Bonnefoy
Дата:
Hi all!

Thanks for the reviews and comments.

> - pg_tracing.c should include postgres.h as the first thing
Will do.

> afaict none of the use of volatile is required, spinlocks have been barriers
> for a long time now
Got it, I will remove them. I've been mimicking what was done in
pg_stat_statements and all spinlocks are done on volatile variables
[1].

> - I don't think it's a good idea to do memory allocations in the middle of a
> PG_CATCH. If the error was due to out-of-memory, you'll throw another error.
Good point. I was wondering what were the risks of generating spans
for errors. I will try to find a better way to handle this.

To give a quick update: following Heikki's suggestion, I'm currently
working on getting pg_tracing as a separate extension on github. I
will send an update when it's ready.

[1]: https://github.com/postgres/postgres/blob/master/contrib/pg_stat_statements/pg_stat_statements.c#L2000

On Fri, Feb 9, 2024 at 7:50 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2024-02-09 19:46:58 +0300, Nikita Malakhov wrote:
> > I agree with Heikki on most topics and especially the one he recommended
> > to publish your extension on GitHub, this tool is very promising for highly
> > loaded
> > systems, you will get a lot of feedback very soon.
> >
> > I'm curious about SpinLock - it is recommended for very short operations,
> > taking up to several instructions, and docs say that for longer ones it
> > will be
> > too expensive, and recommends using LWLock. Why have you chosen SpinLock?
> > Does it have some benefits here?
>
> Indeed - e.g. end_tracing() looks to hold the spinlock for far too long for
> spinlocks to be appropriate. Use an lwlock.
>
> Random stuff I noticed while skimming:
> - pg_tracing.c should include postgres.h as the first thing
>
> - afaict none of the use of volatile is required, spinlocks have been barriers
>   for a long time now
>
> - you acquire the spinlock for single increments of n_writers, perhaps that
>   could become an atomic, to reduce contention?
>
> - I don't think it's a good idea to do memory allocations in the middle of a
>   PG_CATCH. If the error was due to out-of-memory, you'll throw another error.
>
>
> Greetings,
>
> Andres Freund



Re: POC: Extension for adding distributed tracing - pg_tracing

От
Craig Ringer
Дата:


On Tue, 12 Mar 2024, 23:45 Anthonin Bonnefoy, <anthonin.bonnefoy@datadoghq.com> wrote:

> - I don't think it's a good idea to do memory allocations in the middle of a
> PG_CATCH. If the error was due to out-of-memory, you'll throw another error.
Good point. I was wondering what were the risks of generating spans
for errors. I will try to find a better way to handle this.

The usual approach is to have pre-allocated memory. This must actually be written (zeroed usually) or it might be lazily allocated only on page fault. And it can't be copy on write memory allocated once in postmaster startup then inherited by fork.

That imposes an overhead for every single postgres backend. So maybe there's a better solution.

Re: POC: Extension for adding distributed tracing - pg_tracing

От
Aleksander Alekseev
Дата:
Hi Anthonin,

> [...]
>
> The usual approach is to have pre-allocated memory. This must actually be written (zeroed usually) or it might be
lazilyallocated only on page fault. And it can't be copy on write memory allocated once in postmaster startup then
inheritedby fork.
 
>
> That imposes an overhead for every single postgres backend. So maybe there's a better solution.

It looks like the patch rotted a bit, see cfbot. Could you please
submit an updated version?

Best regards,
Aleksander Alekseev (wearing co-CFM hat)



Re: POC: Extension for adding distributed tracing - pg_tracing

От
Anthonin Bonnefoy
Дата:
Hi,

I've rebased and updated the patch with some fixes:
- Removed volatile
- Switched from spinlock to LWLock and removed n_writer
- Moved postgres.h as first header on all files
- Removed possible allocations in PG_CATCH. Though I would probably need to run some more thorough tests on this

I should also have the GitHub repository version of the extension ready this week.

Regards,
Anthonin

On Fri, Mar 15, 2024 at 3:10 PM Aleksander Alekseev <aleksander@timescale.com> wrote:
Hi Anthonin,

> [...]
>
> The usual approach is to have pre-allocated memory. This must actually be written (zeroed usually) or it might be lazily allocated only on page fault. And it can't be copy on write memory allocated once in postmaster startup then inherited by fork.
>
> That imposes an overhead for every single postgres backend. So maybe there's a better solution.

It looks like the patch rotted a bit, see cfbot. Could you please
submit an updated version?

Best regards,
Aleksander Alekseev (wearing co-CFM hat)
Вложения