Обсуждение: When to store data that could be derived
Hi all As I understand it, a general rule of thumb is that you should never create a physical column if the data could be derived from existing columns. A possible reason for breaking this rule is for performance reasons. I have a situation where I am considering breaking the rule, but I am not experienced enough in SQL to know if my reason is valid. I would appreciate it if someone could glance at my 'before' and 'after' scenarios and see if, from a 'gut-feel' point of view, I should proceed. I have a VIEW constructed as follows - CREATE VIEW view_name AS [select statement 1] UNION ALL [select statement 2] etc. This is one of the select statements. I will give the 'after' scenario first - SELECT 'arec' AS tran_type, a.row_id AS tran_row_id, a.tran_number AS tran_number, a.cust_row_id AS cust_row_id, CASE WHEN a.tran_type = 'ar_rec' THEN y.tran_date WHEN a.tran_type = 'cb_rec' THEN w.tran_date END AS tran_date, CASE WHEN a.tran_type = 'ar_rec' THEN y.text WHEN a.tran_type = 'cb_rec' THEN w.text END AS text, 0 - a.arec_cust AS amount_cust, 0 - a.arec_local AS amount_local FROM ar_rec_subtran a LEFT JOIN ar_tran_rec_det z ON z.row_id = a.tran_det_row_id LEFT JOIN ar_tran_rec y ON y.row_id = z.tran_row_id LEFT JOIN cb_tran_rec_det x ON x.row_id = a.tran_det_row_id LEFT JOIN cb_tran_rec w ON w.row_id = x.tran_row_id WHERE CASE WHEN a.tran_type = 'ar_rec' THEN y.posted WHEN a.tran_type = 'cb_rec' THEN w.posted END = '1' The two columns a.arec_cust and a.arec_local *can* be derived from other columns, and in fact that is how it is working at the moment, so here is the 'before' scenario - SELECT 'arec' AS tran_type, a.row_id AS tran_row_id, a.tran_number AS tran_number, a.cust_row_id AS cust_row_id, CASE WHEN a.tran_type = 'ar_rec' THEN y.tran_date WHEN a.tran_type = 'cb_rec' THEN w.tran_date END AS tran_date, CASE WHEN a.tran_type = 'ar_rec' THEN y.text WHEN a.tran_type = 'cb_rec' THEN w.text END AS text, ROUND(0 - (ROUND(a.arec_amount / CASE WHEN a.tran_type = 'ar_rec' THEN y.tran_exch_rate WHEN a.tran_type = 'cb_rec' THEN w.tran_exch_rate END * a.cust_exch_rate, u.scale)), 2) AS amount_cust, ROUND(0 - (ROUND(a.arec_amount / CASE WHEN a.tran_type = 'ar_rec' THEN y.tran_exch_rate WHEN a.tran_type = 'cb_rec' THEN w.tran_exch_rate END, s.scale)), 2) AS amount_local FROM ar_rec_subtran a LEFT JOIN ar_tran_rec_det z ON z.row_id = a.tran_det_row_id LEFT JOIN ar_tran_rec y ON y.row_id = z.tran_row_id LEFT JOIN cb_tran_rec_det x ON x.row_id = a.tran_det_row_id LEFT JOIN cb_tran_rec w ON w.row_id = x.tran_row_id LEFT JOIN ar_customers v ON v.row_id = a.cust_row_id LEFT JOIN adm_currencies u ON u.row_id = v.currency_id LEFT JOIN adm_params t ON t.row_id = 1 LEFT JOIN adm_currencies s ON s.row_id = t.local_curr_id WHERE CASE WHEN a.tran_type = 'ar_rec' THEN y.posted WHEN a.tran_type = 'cb_rec' THEN w.posted END = '1' As you can see, complexity has increased and there are four additional JOINs. I am expecting the VIEW to be used extensively for query purposes, and my gut-feel says that the second one is likely to lead to performance problems in a system with a lot of data and a lot of users. I am not looking for an answer - I know that I should create dummy data and run some timing tests. I was just wondering if someone more experienced would wince when they look at the second SELECT, or if they would shrug and think that it looks fine. Any input will be appreciated. Frank Millman
On 3/24/19 1:42 AM, Frank wrote: > Hi all > > As I understand it, a general rule of thumb is that you should never > create a physical column if the data could be derived from existing > columns. A possible reason for breaking this rule is for performance reasons. > > I have a situation where I am considering breaking the rule, but I am not > experienced enough in SQL to know if my reason is valid. I would > appreciate it if someone could glance at my 'before' and 'after' scenarios > and see if, from a 'gut-feel' point of view, I should proceed. > > I have a VIEW constructed as follows - > > CREATE VIEW view_name AS > [select statement 1] > UNION ALL > [select statement 2] > etc. > > This is one of the select statements. I will give the 'after' scenario > first - > > SELECT > 'arec' AS tran_type, a.row_id AS tran_row_id, > a.tran_number AS tran_number, a.cust_row_id AS cust_row_id, > CASE > WHEN a.tran_type = 'ar_rec' THEN y.tran_date > WHEN a.tran_type = 'cb_rec' THEN w.tran_date > END AS tran_date, > CASE > WHEN a.tran_type = 'ar_rec' THEN y.text > WHEN a.tran_type = 'cb_rec' THEN w.text > END AS text, > 0 - a.arec_cust AS amount_cust, > 0 - a.arec_local AS amount_local > FROM ar_rec_subtran a > LEFT JOIN ar_tran_rec_det z ON z.row_id = a.tran_det_row_id > LEFT JOIN ar_tran_rec y ON y.row_id = z.tran_row_id > LEFT JOIN cb_tran_rec_det x ON x.row_id = a.tran_det_row_id > LEFT JOIN cb_tran_rec w ON w.row_id = x.tran_row_id > WHERE > CASE > WHEN a.tran_type = 'ar_rec' THEN y.posted > WHEN a.tran_type = 'cb_rec' THEN w.posted > END = '1' > > The two columns a.arec_cust and a.arec_local *can* be derived from other > columns, and in fact that is how it is working at the moment, so here is > the 'before' scenario - > > SELECT > 'arec' AS tran_type, a.row_id AS tran_row_id, > a.tran_number AS tran_number, a.cust_row_id AS cust_row_id, > CASE > WHEN a.tran_type = 'ar_rec' THEN y.tran_date > WHEN a.tran_type = 'cb_rec' THEN w.tran_date > END AS tran_date, > CASE > WHEN a.tran_type = 'ar_rec' THEN y.text > WHEN a.tran_type = 'cb_rec' THEN w.text > END AS text, > ROUND(0 - (ROUND(a.arec_amount / CASE > WHEN a.tran_type = 'ar_rec' THEN y.tran_exch_rate > WHEN a.tran_type = 'cb_rec' THEN w.tran_exch_rate > END * a.cust_exch_rate, u.scale)), 2) AS amount_cust, > ROUND(0 - (ROUND(a.arec_amount / CASE > WHEN a.tran_type = 'ar_rec' THEN y.tran_exch_rate > WHEN a.tran_type = 'cb_rec' THEN w.tran_exch_rate > END, s.scale)), 2) AS amount_local > FROM ar_rec_subtran a > LEFT JOIN ar_tran_rec_det z ON z.row_id = a.tran_det_row_id > LEFT JOIN ar_tran_rec y ON y.row_id = z.tran_row_id > LEFT JOIN cb_tran_rec_det x ON x.row_id = a.tran_det_row_id > LEFT JOIN cb_tran_rec w ON w.row_id = x.tran_row_id > LEFT JOIN ar_customers v ON v.row_id = a.cust_row_id > LEFT JOIN adm_currencies u ON u.row_id = v.currency_id > LEFT JOIN adm_params t ON t.row_id = 1 > LEFT JOIN adm_currencies s ON s.row_id = t.local_curr_id > WHERE > CASE > WHEN a.tran_type = 'ar_rec' THEN y.posted > WHEN a.tran_type = 'cb_rec' THEN w.posted > END = '1' > > As you can see, complexity has increased and there are four additional JOINs. > > I am expecting the VIEW to be used extensively for query purposes, and my > gut-feel says that the second one is likely to lead to performance > problems in a system with a lot of data and a lot of users. Generate an artificial load and test it? > I am not looking for an answer - I know that I should create dummy data > and run some timing tests. I was just wondering if someone more > experienced would wince when they look at the second SELECT, or if they > would shrug and think that it looks fine. Sure the second query joins a lot of tables, but is pretty straightforward. What REALLY worries me is whether or not the query optimiser would look at the WHERE CASE, run away screaming and then make it use sequential scans. Thus, even query #1 would be slow. > > Any input will be appreciated. Is this a historical data set that's never updated, or current data that's constantly added to? > > Frank Millman -- Angular momentum makes the world go 'round.
On 2019-03-24 9:25 AM, Ron wrote: > On 3/24/19 1:42 AM, Frank wrote: >> Hi all >> >> As I understand it, a general rule of thumb is that you should never >> create a physical column if the data could be derived from existing >> columns. A possible reason for breaking this rule is for performance >> reasons. >> >> I have a situation where I am considering breaking the rule, but I am >> not experienced enough in SQL to know if my reason is valid. I would >> appreciate it if someone could glance at my 'before' and 'after' >> scenarios and see if, from a 'gut-feel' point of view, I should proceed. >> [snip] > > Sure the second query joins a lot of tables, but is pretty straightforward. > > What REALLY worries me is whether or not the query optimiser would look > at the WHERE CASE, run away screaming and then make it use sequential > scans. Thus, even query #1 would be slow. > I had not realised that. I hope someone else chimes in on this. > > Is this a historical data set that's never updated, or current data > that's constantly added to? > It is the latter - current data constantly added to. Frank
On 3/24/19 3:05 AM, Frank wrote: > > > On 2019-03-24 9:25 AM, Ron wrote: >> On 3/24/19 1:42 AM, Frank wrote: >>> Hi all >>> >>> As I understand it, a general rule of thumb is that you should never >>> create a physical column if the data could be derived from existing >>> columns. A possible reason for breaking this rule is for performance >>> reasons. >>> >>> I have a situation where I am considering breaking the rule, but I am >>> not experienced enough in SQL to know if my reason is valid. I would >>> appreciate it if someone could glance at my 'before' and 'after' >>> scenarios and see if, from a 'gut-feel' point of view, I should proceed. >>> > > [snip] > >> >> Sure the second query joins a lot of tables, but is pretty straightforward. >> >> What REALLY worries me is whether or not the query optimiser would look >> at the WHERE CASE, run away screaming and then make it use sequential >> scans. Thus, even query #1 would be slow. >> > > I had not realised that. I hope someone else chimes in on this. In every DBMS that I've used, the lside (left side) needs to be static (not "a" static) instead of variable (like a function). For example, this always leads to a sequential scan: WHERE EXTRACT(DAY FROM DATE_FIELD) = 5 > >> >> Is this a historical data set that's never updated, or current data >> that's constantly added to? >> > > It is the latter - current data constantly added to. > > Frank > -- Angular momentum makes the world go 'round.
On Sun, Mar 24, 2019 at 9:05 AM Frank <frank@chagford.com> wrote:
On 2019-03-24 9:25 AM, Ron wrote:
> On 3/24/19 1:42 AM, Frank wrote:
>> Hi all
>>
>> As I understand it, a general rule of thumb is that you should never
>> create a physical column if the data could be derived from existing
>> columns. A possible reason for breaking this rule is for performance
>> reasons.
>>
>> I have a situation where I am considering breaking the rule, but I am
>> not experienced enough in SQL to know if my reason is valid. I would
>> appreciate it if someone could glance at my 'before' and 'after'
>> scenarios and see if, from a 'gut-feel' point of view, I should proceed.
>>
[snip]
>
> Sure the second query joins a lot of tables, but is pretty straightforward.
>
> What REALLY worries me is whether or not the query optimiser would look
> at the WHERE CASE, run away screaming and then make it use sequential
> scans. Thus, even query #1 would be slow.
>
I had not realised that. I hope someone else chimes in on this.
There are a few other things to note here.
1 If the data is frequently used in its derivative form, putting it in a function helps, and
2. You can index the output of the function which means you run it on insert/update time and can often avoid running it on selection time if it is just a part of the where clause.
In my experience usually we have used trigger-updated functions when the field values are very large or expensive, and may need to be part of the column list, and functions with functional indexes when we don't need to ever put them in the select column list or where the columns are small and easy to calculate.
Best Wishes,
Chris Travers
Efficito: Hosted Accounting and ERP. Robust and Flexible. No vendor lock-in.
Not the answer you are looking for, but...
I'd suggest trying to create a non-trivial set of dummy data to test your assumptions before deciding on a route.
It's saved my (professional) life a few times over the years when dealing with untested designs and new (to us) technology.
Some years ago we were implementing an identity management system for a large US bank, with SQL Server as the store, with a planned integration to the id / access / permissions of some 300+ systems, targeting 30k plus users.
Requirements kept changing as we added new systems to the initial mix, which the Id management package couldn't handle out the box, so we had to implement a custom design. We had to choose between 2 database designs, one being fully 'normalized' (considering that everything was an object') and one where we made some assumptions and fixed some table structures in the interest of performance.
Eventually we spent a few days adding non-trivial amounts of test data to the proposed designs and it became quickly became very apparent that option 1 was unworkable once we got beyond 10 systems or so.
On 2019-03-24 11:11 AM, Tony Shelver wrote: > Not the answer you are looking for, but... > > I'd suggest trying to create a non-trivial set of dummy data to test your > assumptions before deciding on a route. > It's saved my (professional) life a few times over the years when dealing > with untested designs and new (to us) technology. > > Some years ago we were implementing an identity management system for a > large US bank, with SQL Server as the store, with a planned integration to > the id / access / permissions of some 300+ systems, targeting 30k plus > users. > > Requirements kept changing as we added new systems to the initial mix, > which the Id management package couldn't handle out the box, so we had to > implement a custom design. We had to choose between 2 database designs, > one being fully 'normalized' (considering that everything was an object') > and one where we made some assumptions and fixed some table structures in > the interest of performance. > > Eventually we spent a few days adding non-trivial amounts of test data to > the proposed designs and it became quickly became very apparent that option > 1 was unworkable once we got beyond 10 systems or so. > Good advice - much appreciated. Frank
On 3/24/19 3:45 AM, Ron wrote: [snip] > In every DBMS that I've used, the lside (left side) needs to be static > (not "a" static) instead of variable (like a function). Thanks to Chris Travers for reminding me that the word is "immutable", not "static". -- Angular momentum makes the world go 'round.
On 2019-03-24 10:05:02 +0200, Frank wrote: > > > On 2019-03-24 9:25 AM, Ron wrote: > > On 3/24/19 1:42 AM, Frank wrote: > > > As I understand it, a general rule of thumb is that you should > > > never create a physical column if the data could be derived from > > > existing columns. The main reason for this rule (which leads to the 3rd normal form) is to avoid inconsistencies when data is changed. > > Is this a historical data set that's never updated, or current data > > that's constantly added to? > > > > It is the latter - current data constantly added to. So the important part here is not whether data is added, but whether data is changed. Sure, new transactions are added all the time. But is it expected that the data used to derive amount_cust and amount_local (e.g. the exchange rate) is changed retroactively, and if it is should the computed amount change? (I'm a bit worried about the join with the customers table here - what happens when a customer moves their headquarters to a country with a different currency?) > > Sure the second query joins a lot of tables, but is pretty straightforward. > > > > What REALLY worries me is whether or not the query optimiser would look > > at the WHERE CASE, run away screaming and then make it use sequential > > scans. Thus, even query #1 would be slow. > > > > I had not realised that. I hope someone else chimes in on this. Your condition is: CASE WHEN a.tran_type = 'ar_rec' THEN y.posted WHEN a.tran_type = 'cb_rec' THEN w.posted END = '1' Think about how you would check that. The straightforward way is to compute the value of the case/end clause and compare that to 1. But to compute that value you first need the value of a.tran_type. There are two possible values here, so maybe an index scan on a.tran_type might be possible, but I'm not sure whether the optimizer is even smart enought to figure that out and if it is, whether those to values are selective enough (Maybe all or most records are either ar_rec or cb_rec). After that you can retrieve the posted value from the correct table. As a human I see that the condition can only ever be true for records from y and w with posted = 1. So it might be better to use index scans on those columns. But this is not that easy to see, and I don't know whether the optimizer can do it. Rewriting the condition as (a.tran_type = 'ar_rec' and y.posted = 1) or (a.tran_type = 'cb_rec' and w.posted = 1) might make it easier for the optimizer to find a good plan. (Please note that I haven't tested any of this. Use EXPLAIN to check what the optimizer really does. hp -- _ | Peter J. Holzer | we build much bigger, better disasters now |_|_) | | because we have much more sophisticated | | | hjp@hjp.at | management tools. __/ | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>
Вложения
Meant to send this to the list but hit the wrong button.
On Sun, Mar 24, 2019 at 9:45 AM Ron <ronljohnsonjr@gmail.com> wrote:
On 3/24/19 3:05 AM, Frank wrote:
>
>
> On 2019-03-24 9:25 AM, Ron wrote:
>> On 3/24/19 1:42 AM, Frank wrote:
>>> Hi all
>>>
>>> As I understand it, a general rule of thumb is that you should never
>>> create a physical column if the data could be derived from existing
>>> columns. A possible reason for breaking this rule is for performance
>>> reasons.
>>>
>>> I have a situation where I am considering breaking the rule, but I am
>>> not experienced enough in SQL to know if my reason is valid. I would
>>> appreciate it if someone could glance at my 'before' and 'after'
>>> scenarios and see if, from a 'gut-feel' point of view, I should proceed.
>>>
>
> [snip]
>
>>
>> Sure the second query joins a lot of tables, but is pretty straightforward.
>>
>> What REALLY worries me is whether or not the query optimiser would look
>> at the WHERE CASE, run away screaming and then make it use sequential
>> scans. Thus, even query #1 would be slow.
>>
>
> I had not realised that. I hope someone else chimes in on this.
In every DBMS that I've used, the lside (left side) needs to be static (not
"a" static) instead of variable (like a function).
For example, this always leads to a sequential scan:
WHERE EXTRACT(DAY FROM DATE_FIELD) = 5
PostgreSQL allows expression indexes
So you can:
create index foo on bar ((id % 1000));
And then use the index on:
select * from bar where id % 1000 = 45;
You could similarly
create index foo on bar (extract(day from date_field));
The left side needs to be indexed (and an immutable expression) but beyond that.....
>
>>
>> Is this a historical data set that's never updated, or current data
>> that's constantly added to?
>>
>
> It is the latter - current data constantly added to.
>
> Frank
>
--
Angular momentum makes the world go 'round.
Best Wishes,
Chris Travers
Efficito: Hosted Accounting and ERP. Robust and Flexible. No vendor lock-in.
On 2019-03-24 2:41 PM, Peter J. Holzer wrote: > On 2019-03-24 10:05:02 +0200, Frank wrote: Many thanks to Peter et al for their valuable insights. I have learned a lot. > So the important part here is not whether data is added, but whether > data is changed. Sure, new transactions are added all the time. But is > it expected that the data used to derive amount_cust and amount_local > (e.g. the exchange rate) is changed retroactively, and if it is should > the computed amount change? (I'm a bit worried about the join with the > customers table here - what happens when a customer moves their > headquarters to a country with a different currency?) I think I have got both of those covered. I store the exchange rates in physical columns on the transaction, so the compute expressions will always return the same values. I have separate tables for 'organisations' (O) and for 'customers' (C). C has a foreign key reference to O, and most static data such as addresses and contact details are stored on O. So if a customer moved, I would create a new C record with the new currency, and flag the old C record as inactive. They can happily co-exist, so receipts can be processed against the old C record until it is paid up. I have started doing some volume tests, and at this stage, for the kind of volumes I am concerned about, it looks as if performance is a non-issue. I generated about 22000 invoices and 22000 receipts, over 12 customers and 6 months. Invoices and receipts are stored in separate tables, and a VIEW presents them as a single table. Using the VIEW, I selected all transactions for a given customer for a given month. It returned 620 rows and (on my slow desktop computer) it took 20ms. I can live with that. I will generate some higher volumes overnight, and see if it makes a big difference. If you do not hear from me, you can consider it 'problem solved' :-) Again, thanks to all. Frank
On 3/25/19 8:15 AM, Frank wrote: > On 2019-03-24 2:41 PM, Peter J. Holzer wrote: >> On 2019-03-24 10:05:02 +0200, Frank wrote: > > Many thanks to Peter et al for their valuable insights. I have learned a lot. > > > So the important part here is not whether data is added, but whether > > data is changed. Sure, new transactions are added all the time. But is > > it expected that the data used to derive amount_cust and amount_local > > (e.g. the exchange rate) is changed retroactively, and if it is should > > the computed amount change? (I'm a bit worried about the join with the > > customers table here - what happens when a customer moves their > > headquarters to a country with a different currency?) > > I think I have got both of those covered. I store the exchange rates in > physical columns on the transaction, so the compute expressions will > always return the same values. I have separate tables for 'organisations' > (O) and for 'customers' (C). C has a foreign key reference to O, and most > static data such as addresses and contact details are stored on O. So if a > customer moved, I would create a new C record with the new currency, and > flag the old C record as inactive. They can happily co-exist, so receipts > can be processed against the old C record until it is paid up. > > I have started doing some volume tests, and at this stage, for the kind of > volumes I am concerned about, it looks as if performance is a non-issue. > > I generated about 22000 invoices and 22000 receipts, over 12 customers and > 6 months. Invoices and receipts are stored in separate tables, and a VIEW > presents them as a single table. > > Using the VIEW, I selected all transactions for a given customer for a > given month. It returned 620 rows and (on my slow desktop computer) it > took 20ms. I can live with that. > > I will generate some higher volumes overnight, and see if it makes a big > difference. If you do not hear from me, you can consider it 'problem > solved' :-) It would be interesting to see what the query planner tries to do with this: WHERE CASE WHEN a.tran_type = 'ar_rec' THEN y.posted WHEN a.tran_type = 'cb_rec' THEN w.posted END = '1' -- Angular momentum makes the world go 'round.
On 2019-03-25 4:06 PM, Ron wrote: > On 3/25/19 8:15 AM, Frank wrote: > > It would be interesting to see what the query planner tries to do with > this: > > WHERE > CASE > WHEN a.tran_type = 'ar_rec' THEN y.posted > WHEN a.tran_type = 'cb_rec' THEN w.posted > END = '1' > I have attached the schema showing the full VIEW definition, and the result of the following EXPLAIN - EXPLAIN SELECT * FROM ccc.ar_trans WHERE cust_row_id = 4 AND tran_date BETWEEN '2015-06-01' AND '2015-06-30'. Because I have used 'WHERE tran_date' in the query, and tran_date is also derived from a CASE expression, I imagine that will also add some complication. I am running PostgreSQL 11.1 on Fedora 29. Frank
Вложения
On 2019-03-25 5:11 PM, Frank wrote: > > > On 2019-03-25 4:06 PM, Ron wrote: >> On 3/25/19 8:15 AM, Frank wrote: >> >> It would be interesting to see what the query planner tries to do with >> this: >> >> WHERE >> CASE >> WHEN a.tran_type = 'ar_rec' THEN y.posted >> WHEN a.tran_type = 'cb_rec' THEN w.posted >> END = '1' >> > > I have attached the schema showing the full VIEW definition, and the > result of the following EXPLAIN - > > EXPLAIN SELECT * FROM ccc.ar_trans WHERE cust_row_id = 4 AND tran_date > BETWEEN '2015-06-01' AND '2015-06-30'. > > Because I have used 'WHERE tran_date' in the query, and tran_date is > also derived from a CASE expression, I imagine that will also add some > complication. > > I am running PostgreSQL 11.1 on Fedora 29. > > Frank On reflection, I have not been consistent with my use of indexes, and I think that will affect the query plan. There are at least two issues - 1. The first table defined in the VIEW is ccc.ar_tran_inv. It has the following index - "arinv_cust_date" btree (cust_row_id NULLS FIRST, tran_date NULLS FIRST) WHERE deleted_id = 0 I have not used "WHERE deleted_id = 0" when constructing the VIEW, but I have used "WHERE posted = '1'". I don't think the index can be used with this setup. 2. The complicated table in the VIEW is ccc.ar_rec_subtran. Various columns such as tran_date and posted are retrieved via CASE expressions from two underlying tables. Those tables have certain indexes defined, but I cannot see how they can be utilised from my current setup. I think I should spend some time tidying this up before you try to make sense of the query plan. Any tips on how to improve it will be appreciated. Frank