Обсуждение: Cluster table based on grand parent?

Поиск
Список
Период
Сортировка

Cluster table based on grand parent?

От
Dominique Devienne
Дата:
Hi again,

I just sent a question regarding parent/child and cascading FKs.
But in reality, our schema has not 2 but 3 "layers",
with an additional grandchild "leaf" table (see below).

Given that many acces patterns are parent-based, i.e. get all
child of given parent, or get all grandchild of given child, I can
use [CLUSTER][1] leveraging the natural-key (parent, name) constraint/index.

But for grandchild rows, doesn't mean the rows for a given (grand)parent
won't be fully clustered? Yes, our software often accesses rows in child and grandchild
for a given parent row.

So can grandchild table(s) be "fully" clustered per-(grand)parent?
Would that require denormalizing, and adding an extra grandparent column FK in grandchild, to achieve that?
And if that's the case, then there are two "paths" to CASCADE a delete from parent; Would that be a problem?  (w.r.t. performance or otherwise?)

Finally, does cluster affect associated toast tables too? (the doc doesn't say)

Thanks for any insights. --DD

PS: At this point, I don't even know how much cluster affects performance.
    But because it can affect the schema structure (by denormalizing), i'd rather know early.


```
dd=> create table parent (id int generated always as identity primary key, name text not null unique);
CREATE TABLE

dd=> create table child (id int generated always as identity primary key, parent int not null references parent(id) on delete cascade, name text not null, unique(parent, name));
CREATE TABLE

dd=> create table grandchild (id int generated always as identity primary key, parent int not null references child(id) on delete cascade, name text not null, unique(parent, name));
CREATE TABLE
```

Re: Cluster table based on grand parent?

От
Adrian Klaver
Дата:
On 3/28/23 06:17, Dominique Devienne wrote:
> Hi again,

> 
> Thanks for any insights. --DD
> 
> PS: At this point, I don't even know how much cluster affects performance.
>      But because it can affect the schema structure (by denormalizing), 
> i'd rather know early.

You will need to explain to me how it denormalizes? It reorders rows by 
index definition and does not maintain that order over updates and inserts.

> 
> [1]: https://www.postgresql.org/docs/current/sql-cluster.html 
> <https://www.postgresql.org/docs/current/sql-cluster.html>
> 
> ```
> dd=> create table parent (id int generated always as identity primary 
> key, name text not null unique);
> CREATE TABLE
> 
> dd=> create table child (id int generated always as identity primary 
> key, parent int not null references parent(id) on delete cascade, name 
> text not null, unique(parent, name));
> CREATE TABLE
> 
> dd=> create table grandchild (id int generated always as identity 
> primary key, parent int not null references child(id) on delete cascade, 
> name text not null, unique(parent, name));
> CREATE TABLE
> ```

-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: Cluster table based on grand parent?

От
"Peter J. Holzer"
Дата:
On 2023-03-28 07:26:47 -0700, Adrian Klaver wrote:
> On 3/28/23 06:17, Dominique Devienne wrote:
> > PS: At this point, I don't even know how much cluster affects performance.

I think that this depends a lot on your access patterns (especially on
how much you update the grandchild table and whether those updates can
be HOT), so you will probably have to measure it yourself with a
realistic work load.

(Personally I doubt the impact is large, but I don't know your data or
your access patterns.)

> >      But because it can affect the schema structure (by denormalizing),
> > i'd rather know early.
>
> You will need to explain to me how it denormalizes? It reorders rows by
> index definition and does not maintain that order over updates and inserts.

I think he means that in order to cluster the grandchild table by the parent.id
he would have to denormalize the table. I.e. instead of like this:

> > ```
> > dd=> create table parent (id int generated always as identity primary
> > key, name text not null unique);
> > CREATE TABLE
> >
> > dd=> create table child (id int generated always as identity primary
> > key, parent int not null references parent(id) on delete cascade, name
> > text not null, unique(parent, name));
> > CREATE TABLE
> >
> > dd=> create table grandchild (id int generated always as identity
> > primary key, parent int not null references child(id) on delete cascade,
> > name text not null, unique(parent, name));
> > CREATE TABLE
> > ```

The last create statement would have to be like this:

create table grandchild (
    id int generated always as identity primary key,
    parent int not null references child(id) on delete cascade,
    grandparent int not null references parent(id) on delete cascade,
    name text not null,
    unique(grandparent, parent, name)
);

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Вложения

Re: Cluster table based on grand parent?

От
Dominique Devienne
Дата:
On Tue, Mar 28, 2023 at 5:08 PM Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
On 2023-03-28 07:26:47 -0700, Adrian Klaver wrote:
> On 3/28/23 06:17, Dominique Devienne wrote:
> > PS: At this point, I don't even know how much cluster affects performance.

I think that this depends a lot on your access patterns

As I wrote, per-parent access to child and grandchild rows is typical.
So w/o parent-based clustering of grandchild table(s), access those rows
could potential seek to several (~50, see below) smaller clusters with arbitrary gaps.

Cardinality is a few to ~20K on parent, x10-x50 on child, x20 - x100 on grandchild.
So total row count rarely exceeds the 1M - 10M range. But there are LOBs/BYTEa...
 
(especially on
how much you update the grandchild table and whether those updates can
be HOT), so you will probably have to measure it yourself with a
realistic work load.

In this particular case, there aren't much UPDATEs, because of a deficiency
of the client applications, which mostly do DELETE+INSERT instead of UPDATEs.
Although we have to cascade modified dates up the parent hierarchy,
so some UPDATEs do occur, but mostly on the less numerous child and parent tables.
 
(Personally I doubt the impact is large, but I don't know your data or
your access patterns.)

OK.
 
> >      But because it can affect the schema structure (by denormalizing),
> > i'd rather know early.
>
> You will need to explain to me how it denormalizes? It reorders rows by
> index definition and does not maintain that order over updates and inserts.

I think he means that in order to cluster the grandchild table by the parent.id
he would have to denormalize the table.

exactly.

Thanks for your input.

Re: Cluster table based on grand parent?

От
Ron
Дата:
On 3/28/23 08:17, Dominique Devienne wrote:
Hi again,

I just sent a question regarding parent/child and cascading FKs.
But in reality, our schema has not 2 but 3 "layers",
with an additional grandchild "leaf" table (see below).

Given that many acces patterns are parent-based, i.e. get all
child of given parent, or get all grandchild of given child, I can
use [CLUSTER][1] leveraging the natural-key (parent, name) constraint/index.

But for grandchild rows, doesn't mean the rows for a given (grand)parent
won't be fully clustered? Yes, our software often accesses rows in child and grandchild
for a given parent row.

So can grandchild table(s) be "fully" clustered per-(grand)parent?
Would that require denormalizing, and adding an extra grandparent column FK in grandchild, to achieve that?
And if that's the case, then there are two "paths" to CASCADE a delete from parent; Would that be a problem?  (w.r.t. performance or otherwise?)

Finally, does cluster affect associated toast tables too? (the doc doesn't say)

Thanks for any insights. --DD

PS: At this point, I don't even know how much cluster affects performance.
    But because it can affect the schema structure (by denormalizing), i'd rather know early.


```
dd=> create table parent (id int generated always as identity primary key, name text not null unique);
CREATE TABLE

dd=> create table child (id int generated always as identity primary key, parent int not null references parent(id) on delete cascade, name text not null, unique(parent, name));
CREATE TABLE

dd=> create table grandchild (id int generated always as identity primary key, parent int not null references child(id) on delete cascade, name text not null, unique(parent, name));
CREATE TABLE
```

You can only get from parent to grandchild via child.id to grandchild.parent, so why not cluster grandchild on grandchild.parent?

--
Born in Arizona, moved to Babylonia.

Re: Cluster table based on grand parent?

От
Dominique Devienne
Дата:
On Tue, Mar 28, 2023 at 6:06 PM Ron <ronljohnsonjr@gmail.com> wrote:
You can only get from parent to grandchild via child.id to grandchild.parent, so why not cluster grandchild on grandchild.parent?

Hi. I don't understand your question. Yes, of course, if I want all grand-children of a given parent, I'll do for example:

select p.id, c.id, c.name, gc.*
  from  grandchild gc
   join child c on gc.parent = c.id
   join parent p on c.parent = p.id
where p.name = $1

But w/o clustering on a denormalized grandchild.grandparent FK column, as Peter showed,
and cluster only on grandchild.parent, that's not going to access a mostly continuous range
of pages to fetch those all grandchild rows for that one parent. But probably 10 to 50 "row-clusters",
given the fan-out I mentioned earlier at the child-table level. Or am I missing something?

Re: Cluster table based on grand parent?

От
Rob Sargent
Дата:
On 3/28/23 10:28, Dominique Devienne wrote:
On Tue, Mar 28, 2023 at 6:06 PM Ron <ronljohnsonjr@gmail.com> wrote:
You can only get from parent to grandchild via child.id to grandchild.parent, so why not cluster grandchild on grandchild.parent?

Hi. I don't understand your question. Yes, of course, if I want all grand-children of a given parent, I'll do for example:

select p.id, c.id, c.name, gc.*
  from  grandchild gc
   join child c on gc.parent = c.id
   join parent p on c.parent = p.id
where p.name = $1

But w/o clustering on a denormalized grandchild.grandparent FK column, as Peter showed,
and cluster only on grandchild.parent, that's not going to access a mostly continuous range
of pages to fetch those all grandchild rows for that one parent. But probably 10 to 50 "row-clusters",
given the fan-out I mentioned earlier at the child-table level. Or am I missing something?
Are you using HDD (spinning) or SSD discs?
Is you world strictly three levels: grand,parent,child?
What tests have you done so far to compare clustered to non-clustered?

Re: Cluster table based on grand parent?

От
Ron
Дата:
On 3/28/23 11:28, Dominique Devienne wrote:
On Tue, Mar 28, 2023 at 6:06 PM Ron <ronljohnsonjr@gmail.com> wrote:
You can only get from parent to grandchild via child.id to grandchild.parent, so why not cluster grandchild on grandchild.parent?

Hi. I don't understand your question. Yes, of course, if I want all grand-children of a given parent, I'll do for example:

select p.id, c.id, c.name, gc.*
  from  grandchild gc
   join child c on gc.parent = c.id
   join parent p on c.parent = p.id
where p.name = $1

But w/o clustering on a denormalized grandchild.grandparent FK column, as Peter showed,
and cluster only on grandchild.parent, that's not going to access a mostly continuous range
of pages to fetch those all grandchild rows for that one parent. But probably 10 to 50 "row-clusters",
given the fan-out I mentioned earlier at the child-table level. Or am I missing something?

No, you're not missing something.  If you want to go directly from grandparent to grandchild, then you need to put grandparent_id in the grandchild table.

Rob Sargent is right, too, though: practically it might not make a difference.  You've got to test.

--
Born in Arizona, moved to Babylonia.