Обсуждение: Make query ID more portable

Поиск
Список
Период
Сортировка

Make query ID more portable

От
"Andrey V. Lepikhov"
Дата:
Hi,

QueryID is good tool for query analysis. I want to improve core jumbling 
machinery in two ways:
1. QueryID value should survive dump/restore of a database (use fully 
qualified name of table instead of relid).
2. QueryID could represent more general class of queries: for example, 
it can be independent from permutation of tables in a FROM clause.

See the patch in attachment as an POC. Main idea here is to break 
JumbleState down to a 'clocations' part that can be really interested in
a post parse hook and a 'context data', that needed to build query or 
subquery signature (hash) and, I guess, isn't really needed in any 
extensions.

I think, it adds not much complexity and overhead. It still not 
guaranteed equality of queryid on two instances with an equal schema, 
but survives across an instance upgrade and allows to do some query 
analysis on a replica node.

-- 
regards,
Andrey Lepikhov
Postgres Professional

Вложения

Re: Make query ID more portable

От
Julien Rouhaud
Дата:
Hi,

On Tue, Oct 12, 2021 at 4:12 PM Andrey V. Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
>
> QueryID is good tool for query analysis. I want to improve core jumbling
> machinery in two ways:
> 1. QueryID value should survive dump/restore of a database (use fully
> qualified name of table instead of relid).
> 2. QueryID could represent more general class of queries: for example,
> it can be independent from permutation of tables in a FROM clause.
>
> See the patch in attachment as an POC. Main idea here is to break
> JumbleState down to a 'clocations' part that can be really interested in
> a post parse hook and a 'context data', that needed to build query or
> subquery signature (hash) and, I guess, isn't really needed in any
> extensions.

There have been quite a lot of threads about that in the past, and
almost every time people wanted to change how the hash was computed.
So it seems to me that extensions would actually be quite interested
in that.  This is even more the case now that an extension can be used
to replace the queryid calculation only and keep the rest of the
extension relying on it as is.

> I think, it adds not much complexity and overhead.

I think the biggest change in your patch is:

  case RTE_RELATION:
- APP_JUMB(rte->relid);
- JumbleExpr(jstate, (Node *) rte->tablesample);
+ {
+ char *relname = regclassout_ext(rte->relid, true);
+
+ APP_JUMB_STRING(relname);
+ JumbleExpr(jstate, (Node *) rte->tablesample, ctx);
  APP_JUMB(rte->inh);
  break;

Have you done any benchmark on OLTP workload?  Adding catalog access
there is likely to add significant overhead.

Also, why only using the fully qualified relation name for stable
hashes?  At least operators and functions should also be treated the
same way.  If you do that you will probably have way too much overhead
to be usable in a busy production environment.  Why not using the new
possibility of 3rd party extension for the queryid calculation that
exactly suits your need?



Re: Make query ID more portable

От
Andrey Lepikhov
Дата:
On 12/10/21 13:35, Julien Rouhaud wrote:
> Hi,
> 
> On Tue, Oct 12, 2021 at 4:12 PM Andrey V. Lepikhov
> <a.lepikhov@postgrespro.ru> wrote:
>> See the patch in attachment as an POC. Main idea here is to break
>> JumbleState down to a 'clocations' part that can be really interested in
>> a post parse hook and a 'context data', that needed to build query or
>> subquery signature (hash) and, I guess, isn't really needed in any
>> extensions.
> 
> There have been quite a lot of threads about that in the past, and
> almost every time people wanted to change how the hash was computed.
> So it seems to me that extensions would actually be quite interested
> in that.  This is even more the case now that an extension can be used
> to replace the queryid calculation only and keep the rest of the
> extension relying on it as is.
Yes, I know. I have been using such self-made queryID for four years. 
And I will use it further.
But core jumbling code is good, fast and much easier in support. The 
purpose of this work is extending of jumbling to use in more flexible 
way to avoid meaningless copying of this code to an extension.
>> I think, it adds not much complexity and overhead.
> 
> I think the biggest change in your patch is:
> 
>    case RTE_RELATION:
> - APP_JUMB(rte->relid);
> - JumbleExpr(jstate, (Node *) rte->tablesample);
> + {
> + char *relname = regclassout_ext(rte->relid, true);
> +
> + APP_JUMB_STRING(relname);
> + JumbleExpr(jstate, (Node *) rte->tablesample, ctx);
>    APP_JUMB(rte->inh);
>    break;
> 
> Have you done any benchmark on OLTP workload?  Adding catalog access
> there is likely to add significant overhead.
Yes, I should do benchmarking. But I guess, main goal of Query ID is 
monitoring, that can be switched off, if necessary.
This part made for a demo. It can be replaced by a hook, for example.
> 
> Also, why only using the fully qualified relation name for stable
> hashes?  At least operators and functions should also be treated the
> same way.  If you do that you will probably have way too much overhead
> to be usable in a busy production environment.  Why not using the new
> possibility of 3rd party extension for the queryid calculation that
> exactly suits your need?
> 
I fully agree with these arguments. This code is POC. Main part here is 
breaking down JumbleState, using a local context for subqueries and 
sorting of a range table entries hashes.
I think, we can call one routine (APP_JUMB_OBJECT(), as an example) for 
all oids in this code. It would allow an extension to intercept this 
call and replace oid with an arbitrary value.

-- 
regards,
Andrey Lepikhov
Postgres Professional



Re: Make query ID more portable

От
Tom Lane
Дата:
Andrey Lepikhov <a.lepikhov@postgrespro.ru> writes:
> But core jumbling code is good, fast and much easier in support.

It won't be fast once you stick a bunch of catalog lookups into it.
I think this is fine as an extension, but it has no chance of being
accepted in core, just on performance grounds.

(I'm also not sure that the query ID calculation code is always/only
invoked in contexts where it's safe to do catalog accesses.)

A bigger issue is that query ID stability isn't something we are going
to promise on a large scale --- for example, what if a new release adds
some new fields to struct Query?  So I'm not sure that "query IDs should
survive dump/reload" is a useful goal to consider.  It's certainly not
something that could be reached by anything even remotely like the
existing code.

            regards, tom lane



Re: Make query ID more portable

От
Bruce Momjian
Дата:
On Tue, Oct 12, 2021 at 09:40:47AM -0400, Tom Lane wrote:
> Andrey Lepikhov <a.lepikhov@postgrespro.ru> writes:
> > But core jumbling code is good, fast and much easier in support.
> 
> It won't be fast once you stick a bunch of catalog lookups into it.
> I think this is fine as an extension, but it has no chance of being
> accepted in core, just on performance grounds.
> 
> (I'm also not sure that the query ID calculation code is always/only
> invoked in contexts where it's safe to do catalog accesses.)
> 
> A bigger issue is that query ID stability isn't something we are going
> to promise on a large scale --- for example, what if a new release adds
> some new fields to struct Query?  So I'm not sure that "query IDs should
> survive dump/reload" is a useful goal to consider.  It's certainly not
> something that could be reached by anything even remotely like the
> existing code.

Also, the current code handles renames of schemas and objects, but this
would not.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: Make query ID more portable

От
Andrey Lepikhov
Дата:
On 12/10/21 18:45, Bruce Momjian wrote:
> On Tue, Oct 12, 2021 at 09:40:47AM -0400, Tom Lane wrote:
>> Andrey Lepikhov <a.lepikhov@postgrespro.ru> writes:
>>> But core jumbling code is good, fast and much easier in support.
> Also, the current code handles renames of schemas and objects, but this
> would not.
Yes, It is good option if an extension works only in the context of one 
node. But my efforts are directed to the cross-instance usage of a 
monitoring data. As an example, it may be useful for sharding.
Also, I guess for an user is essential to think that if he changed a 
name of any object he also would change queries and reset monitoring 
data, related on this object.

-- 
regards,
Andrey Lepikhov
Postgres Professional



Re: Make query ID more portable

От
Julien Rouhaud
Дата:
On Thu, Oct 14, 2021 at 12:37 PM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
>
> On 12/10/21 18:45, Bruce Momjian wrote:
> > On Tue, Oct 12, 2021 at 09:40:47AM -0400, Tom Lane wrote:
> >> Andrey Lepikhov <a.lepikhov@postgrespro.ru> writes:
> >>> But core jumbling code is good, fast and much easier in support.
> > Also, the current code handles renames of schemas and objects, but this
> > would not.
> Yes, It is good option if an extension works only in the context of one
> node. But my efforts are directed to the cross-instance usage of a
> monitoring data. As an example, it may be useful for sharding.
> Also, I guess for an user is essential to think that if he changed a
> name of any object he also would change queries and reset monitoring
> data, related on this object.

What if someone wants to allow any form of partitioning without
changing to the ID, or ignore the schema because it's a multi tenant
db with dedicated roles?

I think that there are just too many arbitrary decisions that could be
made on what exactly should be a query identifier to have a single
in-core implementation.  If you do sharding, you already have to
properly configure each node, so configuring your custom query id
extension shouldn't be a big problem.



Re: Make query ID more portable

От
Andrey Lepikhov
Дата:
On 12/10/21 18:40, Tom Lane wrote:
> Andrey Lepikhov <a.lepikhov@postgrespro.ru> writes:
>> But core jumbling code is good, fast and much easier in support.
> A bigger issue is that query ID stability isn't something we are going
> to promise on a large scale --- for example, what if a new release adds
> some new fields to struct Query?  So I'm not sure that "query IDs should
> survive dump/reload" is a useful goal to consider.  It's certainly not
> something that could be reached by anything even remotely like the
> existing code.
Thank you for the explanation.
I think the problem of queryId is that is encapsulates two different 
meanings:
1. It allows an extension to match an query on post parse and execution 
stages. In this sense, queryId should be as unique as possible for each 
query.
2. For pg_stat_statements purposes (and my project too) it represents an 
query class and should be stable against permutations of range table 
entries, clauses, e.t.c. For example:
"SELECT * FROM a,b;" and "SELECT * FROM b,a;" should have the same queryId.

This issue may be solved in an extension with next approach:
1. Force as unique value for queryId as extension wants in a post parse hook
2. Generalize the JumbleQuery routine code to generate a kind of query 
class signature.

The attached patch is a first sketch for such change.

-- 
regards,
Andrey Lepikhov
Postgres Professional
Вложения

Re: Make query ID more portable

От
Andrey Lepikhov
Дата:
On 14/10/21 10:40, Julien Rouhaud wrote:
> On Thu, Oct 14, 2021 at 12:37 PM Andrey Lepikhov
> <a.lepikhov@postgrespro.ru> wrote:
>>
>> On 12/10/21 18:45, Bruce Momjian wrote:
>>> On Tue, Oct 12, 2021 at 09:40:47AM -0400, Tom Lane wrote:
>>>> Andrey Lepikhov <a.lepikhov@postgrespro.ru> writes:
> I think that there are just too many arbitrary decisions that could be
> made on what exactly should be a query identifier to have a single
> in-core implementation.
Yes, and I use such custom decision too. But core jumbling code 
implements good idea and can be generalized for reuse. Patch from 
previous letter and breaking down of JumbleState can allow coders to 
implement their codes based on queryjumble.c module with smaller changes.

>  If you do sharding, you already have to
> properly configure each node, so configuring your custom query id
> extension shouldn't be a big problem.
My project is about adaptive query optimization techniques. It is not 
obvious how to match (without a field in Query struct) a post parse and 
an execution phases because of nested queries.
Also, if we use queryId in an extension, we interfere with 
pg_stat_statements.

-- 
regards,
Andrey Lepikhov
Postgres Professional