Обсуждение: When to store data that could be derived

Поиск
Список
Период
Сортировка

When to store data that could be derived

От
Frank
Дата:
Hi all

As I understand it, a  general rule of thumb is that you should never 
create a physical column if the data could be derived from existing 
columns. A possible reason for breaking this rule is for performance 
reasons.

I have a situation where I am considering breaking the rule, but I am 
not experienced enough in SQL to know if my reason is valid. I would 
appreciate it if someone could glance at my 'before' and 'after' 
scenarios and see if, from a 'gut-feel' point of view, I should proceed.

I have a VIEW constructed as follows -

CREATE VIEW view_name AS
[select statement 1]
UNION ALL
[select statement 2]
etc.

This is one of the select statements. I will give the 'after' scenario 
first -

SELECT
     'arec' AS tran_type, a.row_id AS tran_row_id,
     a.tran_number AS tran_number, a.cust_row_id AS cust_row_id,
     CASE
         WHEN a.tran_type = 'ar_rec' THEN y.tran_date
         WHEN a.tran_type = 'cb_rec' THEN w.tran_date
     END AS tran_date,
     CASE
         WHEN a.tran_type = 'ar_rec' THEN y.text
         WHEN a.tran_type = 'cb_rec' THEN w.text
     END AS text,
     0 - a.arec_cust AS amount_cust,
     0 - a.arec_local AS amount_local
FROM ar_rec_subtran a
LEFT JOIN ar_tran_rec_det z ON z.row_id = a.tran_det_row_id
LEFT JOIN ar_tran_rec y ON y.row_id = z.tran_row_id
LEFT JOIN cb_tran_rec_det x ON x.row_id = a.tran_det_row_id
LEFT JOIN cb_tran_rec w ON w.row_id = x.tran_row_id
WHERE
     CASE
         WHEN a.tran_type = 'ar_rec' THEN y.posted
         WHEN a.tran_type = 'cb_rec' THEN w.posted
     END = '1'

The two columns a.arec_cust and a.arec_local *can* be derived from other 
columns, and in fact that is how it is working at the moment, so here is 
the 'before' scenario -

SELECT
     'arec' AS tran_type, a.row_id AS tran_row_id,
     a.tran_number AS tran_number, a.cust_row_id AS cust_row_id,
     CASE
         WHEN a.tran_type = 'ar_rec' THEN y.tran_date
         WHEN a.tran_type = 'cb_rec' THEN w.tran_date
     END AS tran_date,
     CASE
         WHEN a.tran_type = 'ar_rec' THEN y.text
         WHEN a.tran_type = 'cb_rec' THEN w.text
     END AS text,
     ROUND(0 - (ROUND(a.arec_amount / CASE
         WHEN a.tran_type = 'ar_rec' THEN y.tran_exch_rate
         WHEN a.tran_type = 'cb_rec' THEN w.tran_exch_rate
     END * a.cust_exch_rate, u.scale)), 2) AS amount_cust,
     ROUND(0 - (ROUND(a.arec_amount / CASE
         WHEN a.tran_type = 'ar_rec' THEN y.tran_exch_rate
         WHEN a.tran_type = 'cb_rec' THEN w.tran_exch_rate
     END, s.scale)), 2) AS amount_local
FROM ar_rec_subtran a
LEFT JOIN ar_tran_rec_det z ON z.row_id = a.tran_det_row_id
LEFT JOIN ar_tran_rec y ON y.row_id = z.tran_row_id
LEFT JOIN cb_tran_rec_det x ON x.row_id = a.tran_det_row_id
LEFT JOIN cb_tran_rec w ON w.row_id = x.tran_row_id
LEFT JOIN ar_customers v ON v.row_id = a.cust_row_id
LEFT JOIN adm_currencies u ON u.row_id = v.currency_id
LEFT JOIN adm_params t ON t.row_id = 1
LEFT JOIN adm_currencies s ON s.row_id = t.local_curr_id
WHERE
     CASE
         WHEN a.tran_type = 'ar_rec' THEN y.posted
         WHEN a.tran_type = 'cb_rec' THEN w.posted
     END = '1'

As you can see, complexity has increased and there are four additional 
JOINs.

I am expecting the VIEW to be used extensively for query purposes, and 
my gut-feel says that the second one is likely to lead to performance 
problems in a system with a lot of data and a lot of users.

I am not looking for an answer - I know that I should create dummy data 
and run some timing tests. I was just wondering if someone more 
experienced would wince when they look at the second SELECT, or if they 
would shrug and think that it looks fine.

Any input will be appreciated.

Frank Millman




Re: When to store data that could be derived

От
Ron
Дата:
On 3/24/19 1:42 AM, Frank wrote:
> Hi all
>
> As I understand it, a  general rule of thumb is that you should never 
> create a physical column if the data could be derived from existing 
> columns. A possible reason for breaking this rule is for performance reasons.
>
> I have a situation where I am considering breaking the rule, but I am not 
> experienced enough in SQL to know if my reason is valid. I would 
> appreciate it if someone could glance at my 'before' and 'after' scenarios 
> and see if, from a 'gut-feel' point of view, I should proceed.
>
> I have a VIEW constructed as follows -
>
> CREATE VIEW view_name AS
> [select statement 1]
> UNION ALL
> [select statement 2]
> etc.
>
> This is one of the select statements. I will give the 'after' scenario 
> first -
>
> SELECT
>     'arec' AS tran_type, a.row_id AS tran_row_id,
>     a.tran_number AS tran_number, a.cust_row_id AS cust_row_id,
>     CASE
>         WHEN a.tran_type = 'ar_rec' THEN y.tran_date
>         WHEN a.tran_type = 'cb_rec' THEN w.tran_date
>     END AS tran_date,
>     CASE
>         WHEN a.tran_type = 'ar_rec' THEN y.text
>         WHEN a.tran_type = 'cb_rec' THEN w.text
>     END AS text,
>     0 - a.arec_cust AS amount_cust,
>     0 - a.arec_local AS amount_local
> FROM ar_rec_subtran a
> LEFT JOIN ar_tran_rec_det z ON z.row_id = a.tran_det_row_id
> LEFT JOIN ar_tran_rec y ON y.row_id = z.tran_row_id
> LEFT JOIN cb_tran_rec_det x ON x.row_id = a.tran_det_row_id
> LEFT JOIN cb_tran_rec w ON w.row_id = x.tran_row_id
> WHERE
>     CASE
>         WHEN a.tran_type = 'ar_rec' THEN y.posted
>         WHEN a.tran_type = 'cb_rec' THEN w.posted
>     END = '1'
>
> The two columns a.arec_cust and a.arec_local *can* be derived from other 
> columns, and in fact that is how it is working at the moment, so here is 
> the 'before' scenario -
>
> SELECT
>     'arec' AS tran_type, a.row_id AS tran_row_id,
>     a.tran_number AS tran_number, a.cust_row_id AS cust_row_id,
>     CASE
>         WHEN a.tran_type = 'ar_rec' THEN y.tran_date
>         WHEN a.tran_type = 'cb_rec' THEN w.tran_date
>     END AS tran_date,
>     CASE
>         WHEN a.tran_type = 'ar_rec' THEN y.text
>         WHEN a.tran_type = 'cb_rec' THEN w.text
>     END AS text,
>     ROUND(0 - (ROUND(a.arec_amount / CASE
>         WHEN a.tran_type = 'ar_rec' THEN y.tran_exch_rate
>         WHEN a.tran_type = 'cb_rec' THEN w.tran_exch_rate
>     END * a.cust_exch_rate, u.scale)), 2) AS amount_cust,
>     ROUND(0 - (ROUND(a.arec_amount / CASE
>         WHEN a.tran_type = 'ar_rec' THEN y.tran_exch_rate
>         WHEN a.tran_type = 'cb_rec' THEN w.tran_exch_rate
>     END, s.scale)), 2) AS amount_local
> FROM ar_rec_subtran a
> LEFT JOIN ar_tran_rec_det z ON z.row_id = a.tran_det_row_id
> LEFT JOIN ar_tran_rec y ON y.row_id = z.tran_row_id
> LEFT JOIN cb_tran_rec_det x ON x.row_id = a.tran_det_row_id
> LEFT JOIN cb_tran_rec w ON w.row_id = x.tran_row_id
> LEFT JOIN ar_customers v ON v.row_id = a.cust_row_id
> LEFT JOIN adm_currencies u ON u.row_id = v.currency_id
> LEFT JOIN adm_params t ON t.row_id = 1
> LEFT JOIN adm_currencies s ON s.row_id = t.local_curr_id
> WHERE
>     CASE
>         WHEN a.tran_type = 'ar_rec' THEN y.posted
>         WHEN a.tran_type = 'cb_rec' THEN w.posted
>     END = '1'
>
> As you can see, complexity has increased and there are four additional JOINs.
>
> I am expecting the VIEW to be used extensively for query purposes, and my 
> gut-feel says that the second one is likely to lead to performance 
> problems in a system with a lot of data and a lot of users.

Generate an artificial load and test it?

> I am not looking for an answer - I know that I should create dummy data 
> and run some timing tests. I was just wondering if someone more 
> experienced would wince when they look at the second SELECT, or if they 
> would shrug and think that it looks fine.

Sure the second query joins a lot of tables, but is pretty straightforward.

What REALLY worries me is whether or not the query optimiser would look at 
the WHERE CASE, run away screaming and then make it use sequential scans.  
Thus, even query #1 would be slow.

>
> Any input will be appreciated.


Is this a historical data set that's never updated, or current data that's 
constantly added to?

>
> Frank Millman




-- 
Angular momentum makes the world go 'round.



Re: When to store data that could be derived

От
Frank
Дата:

On 2019-03-24 9:25 AM, Ron wrote:
> On 3/24/19 1:42 AM, Frank wrote:
>> Hi all
>>
>> As I understand it, a  general rule of thumb is that you should never 
>> create a physical column if the data could be derived from existing 
>> columns. A possible reason for breaking this rule is for performance 
>> reasons.
>>
>> I have a situation where I am considering breaking the rule, but I am 
>> not experienced enough in SQL to know if my reason is valid. I would 
>> appreciate it if someone could glance at my 'before' and 'after' 
>> scenarios and see if, from a 'gut-feel' point of view, I should proceed.
>>

[snip]

> 
> Sure the second query joins a lot of tables, but is pretty straightforward.
> 
> What REALLY worries me is whether or not the query optimiser would look 
> at the WHERE CASE, run away screaming and then make it use sequential 
> scans. Thus, even query #1 would be slow.
> 

I had not realised that. I hope someone else chimes in on this.

> 
> Is this a historical data set that's never updated, or current data 
> that's constantly added to?
> 

It is the latter - current data constantly added to.

Frank


Re: When to store data that could be derived

От
Ron
Дата:
On 3/24/19 3:05 AM, Frank wrote:
>
>
> On 2019-03-24 9:25 AM, Ron wrote:
>> On 3/24/19 1:42 AM, Frank wrote:
>>> Hi all
>>>
>>> As I understand it, a  general rule of thumb is that you should never 
>>> create a physical column if the data could be derived from existing 
>>> columns. A possible reason for breaking this rule is for performance 
>>> reasons.
>>>
>>> I have a situation where I am considering breaking the rule, but I am 
>>> not experienced enough in SQL to know if my reason is valid. I would 
>>> appreciate it if someone could glance at my 'before' and 'after' 
>>> scenarios and see if, from a 'gut-feel' point of view, I should proceed.
>>>
>
> [snip]
>
>>
>> Sure the second query joins a lot of tables, but is pretty straightforward.
>>
>> What REALLY worries me is whether or not the query optimiser would look 
>> at the WHERE CASE, run away screaming and then make it use sequential 
>> scans. Thus, even query #1 would be slow.
>>
>
> I had not realised that. I hope someone else chimes in on this.

In every DBMS that I've used, the lside (left side) needs to be static (not 
"a" static) instead of variable (like a function).

For example, this always leads to a sequential scan:
    WHERE EXTRACT(DAY FROM DATE_FIELD) = 5

>
>>
>> Is this a historical data set that's never updated, or current data 
>> that's constantly added to?
>>
>
> It is the latter - current data constantly added to.
>
> Frank
>

-- 
Angular momentum makes the world go 'round.


Re: When to store data that could be derived

От
Chris Travers
Дата:


On Sun, Mar 24, 2019 at 9:05 AM Frank <frank@chagford.com> wrote:


On 2019-03-24 9:25 AM, Ron wrote:
> On 3/24/19 1:42 AM, Frank wrote:
>> Hi all
>>
>> As I understand it, a  general rule of thumb is that you should never
>> create a physical column if the data could be derived from existing
>> columns. A possible reason for breaking this rule is for performance
>> reasons.
>>
>> I have a situation where I am considering breaking the rule, but I am
>> not experienced enough in SQL to know if my reason is valid. I would
>> appreciate it if someone could glance at my 'before' and 'after'
>> scenarios and see if, from a 'gut-feel' point of view, I should proceed.
>>

[snip]

>
> Sure the second query joins a lot of tables, but is pretty straightforward.
>
> What REALLY worries me is whether or not the query optimiser would look
> at the WHERE CASE, run away screaming and then make it use sequential
> scans. Thus, even query #1 would be slow.
>

I had not realised that. I hope someone else chimes in on this.

There are a few other things to note here.

1  If the data is frequently used in its derivative form, putting it in a function helps, and
2.  You can index the output of the function which means you run it on insert/update time and can often avoid running it on selection time if it is just a part of the where clause.

In my experience usually  we have used trigger-updated functions when the field values are very large or expensive, and may need to be part of the column list, and functions with functional indexes when we don't need to ever put them in the select column list or where the columns are small and easy to calculate.

--
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor lock-in.

Re: When to store data that could be derived

От
Tony Shelver
Дата:
Not the answer you are looking for, but...

I'd suggest trying to create a non-trivial set of dummy data to test your assumptions before deciding on a route.
It's saved my (professional) life a few times over the years when dealing with untested designs and new (to us) technology.

Some years ago we were implementing an identity management system for a large US bank, with SQL Server as the store, with a planned integration to the id / access / permissions of some 300+ systems, targeting 30k plus users.

Requirements kept changing as we added new systems to the initial mix, which the Id management package couldn't handle out the box, so we had to implement a custom design.  We had to choose between 2 database designs, one being fully 'normalized' (considering that everything was an object') and one where we made some assumptions and fixed some table structures in the interest of performance.

Eventually we spent a few days adding non-trivial amounts of test data to the proposed designs and it became quickly became very apparent that option 1 was unworkable once we got beyond 10 systems or so.


Re: When to store data that could be derived

От
Frank
Дата:

On 2019-03-24 11:11 AM, Tony Shelver wrote:
> Not the answer you are looking for, but...
> 
> I'd suggest trying to create a non-trivial set of dummy data to test your
> assumptions before deciding on a route.
> It's saved my (professional) life a few times over the years when dealing
> with untested designs and new (to us) technology.
> 
> Some years ago we were implementing an identity management system for a
> large US bank, with SQL Server as the store, with a planned integration to
> the id / access / permissions of some 300+ systems, targeting 30k plus
> users.
> 
> Requirements kept changing as we added new systems to the initial mix,
> which the Id management package couldn't handle out the box, so we had to
> implement a custom design.  We had to choose between 2 database designs,
> one being fully 'normalized' (considering that everything was an object')
> and one where we made some assumptions and fixed some table structures in
> the interest of performance.
> 
> Eventually we spent a few days adding non-trivial amounts of test data to
> the proposed designs and it became quickly became very apparent that option
> 1 was unworkable once we got beyond 10 systems or so.
> 

Good advice - much appreciated.

Frank


Re: When to store data that could be derived

От
Ron
Дата:
On 3/24/19 3:45 AM, Ron wrote:
[snip]
> In every DBMS that I've used, the lside (left side) needs to be static 
> (not "a" static) instead of variable (like a function).

Thanks to Chris Travers for reminding me that the word is "immutable", not 
"static".

-- 
Angular momentum makes the world go 'round.


Re: When to store data that could be derived

От
"Peter J. Holzer"
Дата:
On 2019-03-24 10:05:02 +0200, Frank wrote:
>
>
> On 2019-03-24 9:25 AM, Ron wrote:
> > On 3/24/19 1:42 AM, Frank wrote:
> > > As I understand it, a  general rule of thumb is that you should
> > > never create a physical column if the data could be derived from
> > > existing columns.

The main reason for this rule (which leads to the 3rd normal form) is to
avoid inconsistencies when data is changed.

> > Is this a historical data set that's never updated, or current data
> > that's constantly added to?
> >
>
> It is the latter - current data constantly added to.

So the important part here is not whether data is added, but whether
data is changed. Sure, new transactions are added all the time. But is
it expected that the data used to derive amount_cust and amount_local
(e.g. the exchange rate) is changed retroactively, and if it is should
the computed amount change? (I'm a bit worried about the join with the
customers table here - what happens when a customer moves their
headquarters to a country with a different currency?)

> > Sure the second query joins a lot of tables, but is pretty straightforward.
> >
> > What REALLY worries me is whether or not the query optimiser would look
> > at the WHERE CASE, run away screaming and then make it use sequential
> > scans. Thus, even query #1 would be slow.
> >
>
> I had not realised that. I hope someone else chimes in on this.

Your condition is:

    CASE
        WHEN a.tran_type = 'ar_rec' THEN y.posted
        WHEN a.tran_type = 'cb_rec' THEN w.posted
    END = '1'

Think about how you would check that. The straightforward way is to
compute the value of the case/end clause and compare that to 1. But to
compute that value you first need the value of a.tran_type. There are
two possible values here, so maybe an index scan on a.tran_type might be
possible, but I'm not sure whether the optimizer is even smart enought
to figure that out and if it is, whether those to values are selective
enough (Maybe all or most records are either ar_rec or cb_rec). After
that you can retrieve the posted value from the correct table.

As a human I see that the condition can only ever be true for records
from y and w with posted = 1. So it might be better to use index scans
on those columns. But this is not that easy to see, and I don't know
whether the optimizer can do it.

Rewriting the condition as

    (a.tran_type = 'ar_rec' and y.posted = 1) or
    (a.tran_type = 'cb_rec' and w.posted = 1)

might make it easier for the optimizer to find a good plan.

(Please note that I haven't tested any of this. Use EXPLAIN to check
what the optimizer really does.

        hp

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>

Вложения

Re: When to store data that could be derived

От
Chris Travers
Дата:
Meant to send this to the list but hit the wrong button.

On Sun, Mar 24, 2019 at 9:45 AM Ron <ronljohnsonjr@gmail.com> wrote:
On 3/24/19 3:05 AM, Frank wrote:
>
>
> On 2019-03-24 9:25 AM, Ron wrote:
>> On 3/24/19 1:42 AM, Frank wrote:
>>> Hi all
>>>
>>> As I understand it, a  general rule of thumb is that you should never
>>> create a physical column if the data could be derived from existing
>>> columns. A possible reason for breaking this rule is for performance
>>> reasons.
>>>
>>> I have a situation where I am considering breaking the rule, but I am
>>> not experienced enough in SQL to know if my reason is valid. I would
>>> appreciate it if someone could glance at my 'before' and 'after'
>>> scenarios and see if, from a 'gut-feel' point of view, I should proceed.
>>>
>
> [snip]
>
>>
>> Sure the second query joins a lot of tables, but is pretty straightforward.
>>
>> What REALLY worries me is whether or not the query optimiser would look
>> at the WHERE CASE, run away screaming and then make it use sequential
>> scans. Thus, even query #1 would be slow.
>>
>
> I had not realised that. I hope someone else chimes in on this.

In every DBMS that I've used, the lside (left side) needs to be static (not
"a" static) instead of variable (like a function).

For example, this always leads to a sequential scan:
    WHERE EXTRACT(DAY FROM DATE_FIELD) = 5


PostgreSQL allows expression indexes

So you can:

create index foo on bar ((id % 1000));

And then use the index on:

select * from bar where id % 1000 = 45; 

You could similarly

create index foo on bar (extract(day from date_field));

The left side needs to be indexed (and an immutable expression) but beyond that..... 

>
>>
>> Is this a historical data set that's never updated, or current data
>> that's constantly added to?
>>
>
> It is the latter - current data constantly added to.
>
> Frank
>

--
Angular momentum makes the world go 'round.



--
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor lock-in.

Re: When to store data that could be derived

От
Frank
Дата:
On 2019-03-24 2:41 PM, Peter J. Holzer wrote:
> On 2019-03-24 10:05:02 +0200, Frank wrote:

Many thanks to Peter et al for their valuable insights. I have learned a 
lot.

 > So the important part here is not whether data is added, but whether
 > data is changed. Sure, new transactions are added all the time. But is
 > it expected that the data used to derive amount_cust and amount_local
 > (e.g. the exchange rate) is changed retroactively, and if it is should
 > the computed amount change? (I'm a bit worried about the join with the
 > customers table here - what happens when a customer moves their
 > headquarters to a country with a different currency?)

I think I have got both of those covered. I store the exchange rates in 
physical columns on the transaction, so the compute expressions will 
always return the same values. I have separate tables for 
'organisations' (O) and for 'customers' (C). C has a foreign key 
reference to O, and most static data such as addresses and contact 
details are stored on O. So if a customer moved, I would create a new C 
record with the new currency, and flag the old C record as inactive. 
They can happily co-exist, so receipts can be processed against the old 
C record until it is paid up.

I have started doing some volume tests, and at this stage, for the kind 
of volumes I am concerned about, it looks as if performance is a non-issue.

I generated about 22000 invoices and 22000 receipts, over 12 customers 
and 6 months. Invoices and receipts are stored in separate tables, and a 
VIEW presents them as a single table.

Using the VIEW, I selected all transactions for a given customer for a 
given month. It returned 620 rows and (on my slow desktop computer) it 
took 20ms. I can live with that.

I will generate some higher volumes overnight, and see if it makes a big 
difference. If you do not hear from me, you can consider it 'problem 
solved' :-)

Again, thanks to all.

Frank


Re: When to store data that could be derived

От
Ron
Дата:
On 3/25/19 8:15 AM, Frank wrote:
> On 2019-03-24 2:41 PM, Peter J. Holzer wrote:
>> On 2019-03-24 10:05:02 +0200, Frank wrote:
>
> Many thanks to Peter et al for their valuable insights. I have learned a lot.
>
> > So the important part here is not whether data is added, but whether
> > data is changed. Sure, new transactions are added all the time. But is
> > it expected that the data used to derive amount_cust and amount_local
> > (e.g. the exchange rate) is changed retroactively, and if it is should
> > the computed amount change? (I'm a bit worried about the join with the
> > customers table here - what happens when a customer moves their
> > headquarters to a country with a different currency?)
>
> I think I have got both of those covered. I store the exchange rates in 
> physical columns on the transaction, so the compute expressions will 
> always return the same values. I have separate tables for 'organisations' 
> (O) and for 'customers' (C). C has a foreign key reference to O, and most 
> static data such as addresses and contact details are stored on O. So if a 
> customer moved, I would create a new C record with the new currency, and 
> flag the old C record as inactive. They can happily co-exist, so receipts 
> can be processed against the old C record until it is paid up.
>
> I have started doing some volume tests, and at this stage, for the kind of 
> volumes I am concerned about, it looks as if performance is a non-issue.
>
> I generated about 22000 invoices and 22000 receipts, over 12 customers and 
> 6 months. Invoices and receipts are stored in separate tables, and a VIEW 
> presents them as a single table.
>
> Using the VIEW, I selected all transactions for a given customer for a 
> given month. It returned 620 rows and (on my slow desktop computer) it 
> took 20ms. I can live with that.
>
> I will generate some higher volumes overnight, and see if it makes a big 
> difference. If you do not hear from me, you can consider it 'problem 
> solved' :-)

It would be interesting to see what the query planner tries to do with this:

WHERE
     CASE
         WHEN a.tran_type = 'ar_rec' THEN y.posted
         WHEN a.tran_type = 'cb_rec' THEN w.posted
     END = '1'

-- 
Angular momentum makes the world go 'round.


Re: When to store data that could be derived

От
Frank
Дата:

On 2019-03-25 4:06 PM, Ron wrote:
> On 3/25/19 8:15 AM, Frank wrote:
> 
> It would be interesting to see what the query planner tries to do with 
> this:
> 
> WHERE
>      CASE
>          WHEN a.tran_type = 'ar_rec' THEN y.posted
>          WHEN a.tran_type = 'cb_rec' THEN w.posted
>      END = '1'
> 

I have attached the schema showing the full VIEW definition, and the 
result of the following EXPLAIN -

EXPLAIN SELECT * FROM ccc.ar_trans WHERE cust_row_id = 4 AND tran_date 
BETWEEN '2015-06-01' AND '2015-06-30'.

Because I have used 'WHERE tran_date' in the query, and tran_date is 
also derived from a CASE expression, I imagine that will also add some 
complication.

I am running PostgreSQL 11.1 on Fedora 29.

Frank

Вложения

Re: When to store data that could be derived

От
Frank
Дата:
On 2019-03-25 5:11 PM, Frank wrote:
> 
> 
> On 2019-03-25 4:06 PM, Ron wrote:
>> On 3/25/19 8:15 AM, Frank wrote:
>>
>> It would be interesting to see what the query planner tries to do with 
>> this:
>>
>> WHERE
>>      CASE
>>          WHEN a.tran_type = 'ar_rec' THEN y.posted
>>          WHEN a.tran_type = 'cb_rec' THEN w.posted
>>      END = '1'
>>
> 
> I have attached the schema showing the full VIEW definition, and the 
> result of the following EXPLAIN -
> 
> EXPLAIN SELECT * FROM ccc.ar_trans WHERE cust_row_id = 4 AND tran_date 
> BETWEEN '2015-06-01' AND '2015-06-30'.
> 
> Because I have used 'WHERE tran_date' in the query, and tran_date is 
> also derived from a CASE expression, I imagine that will also add some 
> complication.
> 
> I am running PostgreSQL 11.1 on Fedora 29.
> 
> Frank

On reflection, I have not been consistent with my use of indexes, and I 
think that will affect the query plan.

There are at least two issues -

1. The first table defined in the VIEW is ccc.ar_tran_inv. It has the 
following index -
     "arinv_cust_date" btree (cust_row_id NULLS FIRST, tran_date NULLS 
FIRST) WHERE deleted_id = 0

I have not used "WHERE deleted_id = 0" when constructing the VIEW, but I 
have used "WHERE posted = '1'". I don't think the index can be used with 
this setup.

2. The complicated table in the VIEW is ccc.ar_rec_subtran. Various 
columns such as tran_date and posted are retrieved via CASE expressions 
from two underlying tables. Those tables have certain indexes defined, 
but I cannot see how they can be utilised from my current setup.

I think I should spend some time tidying this up before you try to make 
sense of the query plan. Any tips on how to improve it will be appreciated.

Frank