Discussion: extract(year from date) doesn't use index but maybe could?
Given the table:

CREATE TABLE dates (id SERIAL, d DATE NOT NULL, t TEXT NOT NULL)

with an *index* on field d, the following two queries are functionally
equivalent:

1. SELECT * FROM dates WHERE d >= '1900-01-01'
2. SELECT * FROM dates WHERE EXTRACT(year from d) >= 1900

By functionally equivalent, I mean they will return the same result set.

Query 2 does not use the index, adding a performance cost. It seems
there is an opportunity for optimization to handle these two queries
equivalently to take advantage of the index.

Some database abstraction layers have attempted to work around this
limitation by rewriting EXTRACT(year ...) queries into a query more
like query 1; Django's ORM does exactly this. Rather than all
abstraction layers trying to optimize this case, maybe it could be
pushed down to the database layer.

I have written a test script that demonstrates that these functionally
equivalent queries have different performance characteristics. The
script and results are provided below:

RESULTS:
----
EXPLAIN SELECT * FROM dates WHERE d >= '1900-01-01'
                                 QUERY PLAN
----------------------------------------------------------------------------
 Bitmap Heap Scan on dates  (cost=9819.23..26390.15 rows=524233 width=40)
   Recheck Cond: (d >= '1900-01-01'::date)
   ->  Bitmap Index Scan on d_idx  (cost=0.00..9688.17 rows=524233 width=0)
         Index Cond: (d >= '1900-01-01'::date)
(4 rows)

EXPLAIN SELECT * FROM dates WHERE EXTRACT(year from d) >= 1900
                                          QUERY PLAN
-------------------------------------------------------------------------------------------------
 Seq Scan on dates  (cost=0.00..37540.25 rows=524233 width=40)
   Filter: (date_part('year'::text, (d)::timestamp without time zone) >= 1900::double precision)
(2 rows)

Timing
select_without_extract: 284.233350s
select_with_extract: 323.106491s
----

SCRIPT:
----
#!/usr/bin/python3
import datetime
import subprocess
import random
import timeit
import sys

subprocess.check_call(['psql', 'postgres', '-c',
                       'DROP DATABASE IF EXISTS datetest'],
                      stdout=subprocess.DEVNULL)
subprocess.check_call(['psql', 'postgres', '-c', 'CREATE DATABASE datetest'],
                      stdout=subprocess.DEVNULL)
subprocess.check_call(['psql', 'datetest', '-c',
                       'CREATE TABLE dates (id SERIAL, d DATE NOT NULL, t TEXT NOT NULL)'],
                      stdout=subprocess.DEVNULL)

def chunks(n, l):
    i = 0
    while i < len(l):
        yield l[i:i+n]
        i += n

d = datetime.date(1800, 1, 1)
today = datetime.date.today()
values = []
while d < today:
    values.extend('(\'%s\', \'%s\')' % (d, d) for i in range(20))
    d += datetime.timedelta(days=1)
random.shuffle(values)

for chunk in chunks(1000, values):
    s = ','.join(chunk)
    subprocess.check_call(['psql', 'datetest', '-c',
                           'INSERT INTO dates (d, t) VALUES %s' % s],
                          stdout=subprocess.DEVNULL)

subprocess.check_call(['psql', 'datetest', '-c',
                       'CREATE INDEX d_idx ON dates (d)'],
                      stdout=subprocess.DEVNULL)

print('EXPLAIN SELECT * FROM dates WHERE d >= \'1900-01-01\'')
sys.stdout.flush()
subprocess.check_call(['psql', 'datetest', '-c',
                       'EXPLAIN SELECT * FROM dates WHERE d >= \'1900-01-01\''])

print('EXPLAIN SELECT * FROM dates WHERE EXTRACT(year from d) >= 1900')
sys.stdout.flush()
subprocess.check_call(['psql', 'datetest', '-c',
                       'EXPLAIN SELECT * FROM dates WHERE EXTRACT(year from d) >= 1900'])

def select_without_extract():
    subprocess.check_call(['psql', 'datetest', '-c',
                           'SELECT * FROM dates WHERE d >= \'1900-01-01\''],
                          stdout=subprocess.DEVNULL)

def select_with_extract():
    subprocess.check_call(['psql', 'datetest', '-c',
                           'SELECT * FROM dates WHERE EXTRACT(year from d) >= 1900'],
                          stdout=subprocess.DEVNULL)

print('Timing')
sys.stdout.flush()
v = timeit.timeit('select_without_extract()',
                  setup='from __main__ import select_without_extract',
                  number=100)
print('select_without_extract: %fs' % v)
sys.stdout.flush()
v = timeit.timeit('select_with_extract()',
                  setup='from __main__ import select_with_extract',
                  number=100)
print('select_with_extract: %fs' % v)
sys.stdout.flush()
----
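The Django-style rewrite mentioned above can be sketched in a few lines
of Python (a hypothetical helper, not Django's actual code): it turns a
bound on the extracted year into an equivalent bound on the raw DATE
column, which the planner can match against the index on d.

```python
import datetime

def rewrite_year_predicate(op, year):
    """Rewrite "EXTRACT(year FROM d) <op> <year>" into an equivalent,
    index-friendly predicate on the raw DATE column d.

    This is only a sketch of the kind of transformation an ORM can
    apply; it is only valid for DATE columns, where year boundaries
    are exact (no time-of-day or timezone effects)."""
    jan1 = datetime.date(year, 1, 1)           # first day of <year>
    next_jan1 = datetime.date(year + 1, 1, 1)  # first day of <year>+1
    if op == '>=':
        return "d >= '%s'" % jan1
    if op == '<':
        return "d < '%s'" % jan1
    if op == '=':
        # equality becomes a half-open range covering the whole year
        return "d >= '%s' AND d < '%s'" % (jan1, next_jan1)
    raise ValueError('unsupported operator: %s' % op)

print(rewrite_year_predicate('>=', 1900))
# d >= '1900-01-01'
```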
On 04/19/15 19:16, Jon Dufresne wrote:
> Given the table:
>
> CREATE TABLE dates (id SERIAL, d DATE NOT NULL, t TEXT NOT NULL)
>
> with an *index* on field d, the following two queries are functionally
> equivalent:
>
> 1. SELECT * FROM dates WHERE d >= '1900-01-01'
> 2. SELECT * FROM dates WHERE EXTRACT(year from d) >= 1900
>
> By functionally equivalent, I mean they will return the same result set.
>
> Query 2 does not use the index, adding a performance cost. It seems
> there is an opportunity for optimization to handle these two queries
> equivalently to take advantage of the index.

Or you might try creating an expression index ...

CREATE INDEX date_year_idx ON dates((extract(year from d)));

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Apr 19, 2015 at 10:42 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 04/19/15 19:16, Jon Dufresne wrote:
>> Given the table:
>>
>> CREATE TABLE dates (id SERIAL, d DATE NOT NULL, t TEXT NOT NULL)
>>
>> with an *index* on field d, the following two queries are functionally
>> equivalent:
>>
>> 1. SELECT * FROM dates WHERE d >= '1900-01-01'
>> 2. SELECT * FROM dates WHERE EXTRACT(year from d) >= 1900
>>
>> By functionally equivalent, I mean they will return the same result set.
>>
>> Query 2 does not use the index, adding a performance cost. It seems
>> there is an opportunity for optimization to handle these two queries
>> equivalently to take advantage of the index.
>
> Or you might try creating an expression index ...
>
> CREATE INDEX date_year_idx ON dates((extract(year from d)));

Certainly, but won't this add additional overhead in the form of two
indexes: one for the column and one for the expression?

My point is, why force the user to take these extra steps or add
overhead when the two queries (or two indexes) are functionally
equivalent? Shouldn't this be an optimization handled by the database
so the user doesn't need to hand-optimize these differences?
On Sun, 2015-04-19 at 13:10 -0700, Jon Dufresne wrote:
> On Sun, Apr 19, 2015 at 10:42 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> On 04/19/15 19:16, Jon Dufresne wrote:
>>> Given the table:
>>> CREATE TABLE dates (id SERIAL, d DATE NOT NULL, t TEXT NOT NULL)
>>> with an *index* on field d, the following two queries are functionally
>>> equivalent:
>>> 1. SELECT * FROM dates WHERE d >= '1900-01-01'
>>> 2. SELECT * FROM dates WHERE EXTRACT(year from d) >= 1900
>>> By functionally equivalent, I mean they will return the same result set.
>>> Query 2 does not use the index, adding a performance cost. It seems
>>> there is an opportunity for optimization to handle these two queries
>>> equivalently to take advantage of the index.
>> Or you might try creating an expression index ...
>> CREATE INDEX date_year_idx ON dates((extract(year from d)));
> Certainly, but won't this add additional overhead in the form of two
> indexes: one for the column and one for the expression?
> My point is, why force the user to take these extra steps or add
> overhead when the two queries (or two indexes) are functionally
> equivalent?

But they aren't functionally equivalent. One is an index on a
datetime/date, the other is an index just on the year [a DOUBLE].
Dates/datetimes potentially have time zones; integer values do not. In
general, time values are an order of magnitude more complicated than
people expect.

> Shouldn't this be an optimization handled by the database
> so the user doesn't need to hand-optimize these differences?

Sometimes "d >= '1900-01-01'" and "EXTRACT(year from d) >= 1900" may be
equivalent; but not always.

--
Adam Tauno Williams <mailto:awilliam@whitemice.org> GPG D95ED383
Systems Administrator, Python Developer, LPI / NCLA
On 04/19/15 22:10, Jon Dufresne wrote:
> On Sun, Apr 19, 2015 at 10:42 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>
>> Or you might try creating an expression index ...
>>
>> CREATE INDEX date_year_idx ON dates((extract(year from d)));
>
> Certainly, but won't this add additional overhead in the form of two
> indexes: one for the column and one for the expression?

It will, but it probably will still be more efficient than poorly
performing queries. Another option is to use the first type of query,
with explicit date ranges, thus making it possible to use a single
index.

> My point is, why force the user to take these extra steps or add
> overhead when the two queries (or two indexes) are functionally
> equivalent? Shouldn't this be an optimization handled by the
> database so the user doesn't need to hand-optimize these differences?

Theoretically yes.

But currently the "extract" function call is pretty much a black box
for the planner, just like any other function - it has no idea what
happens inside, which fields are extracted, and so on. It certainly is
unable to infer the date range as you propose.

It's possible that in the future someone will implement an optimization
like this, but I'm not aware of anyone working on that and I wouldn't
hold my breath. Until then you either have to create an expression
index, or use queries with explicit date ranges (without "extract"
calls).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On 04/19/15 22:10, Jon Dufresne wrote:
>> My point is, why force the user to take these extra steps or add
>> overhead when the two queries (or two indexes) are functionally
>> equivalent? Shouldn't this be an optimization handled by the
>> database so the user doesn't need to hand-optimize these differences?

> Theoretically yes.

> But currently the "extract" function call is pretty much a black box for
> the planner, just like any other function - it has no idea what happens
> inside, which fields are extracted, and so on. It certainly is unable to
> infer the date range as you propose.

> It's possible that in the future someone will implement an optimization
> like this, but I'm not aware of anyone working on that and I wouldn't
> hold my breath.

Yeah. In principle you could make the planner do this. As Adam Williams
notes nearby, there's a problem with lack of exact consistency between
extract() semantics and straight timestamp comparisons; but you could
handle that by extracting indexable expressions that are considered
lossy, much as we do with anchored LIKE and regexp patterns.

The problem is that this would add significant overhead to checking for
indexable clauses. With "x LIKE 'foo%'" you just need to make a direct
check whether x is an indexed column; this is exactly parallel to noting
whether x is indexed in "x >= 'foo'", and it doesn't require much
additional machinery or cycles to reject the common case that there's
no match to x. But if you want to notice whether d is indexed in
"extract(year from d) = 2015", that requires digging down another level
in the expression, so it's going to add overhead that's not there now,
even in cases that have nothing to do with extract(), let alone have
any chance of benefiting.

We might still be willing to do it if there were a sufficiently wide
range of examples that could be handled by the same extra machinery,
but this doesn't look too promising from that angle: AFAICS only the
"year" case could yield a useful index restriction.

So the short answer is that whether it's worth saving users from
hand-optimizing such cases depends a lot on what it's going to cost in
added planning time for queries that don't get any benefit. This
example doesn't look like a case that's going to win that cost/benefit
tradeoff comparison.

			regards, tom lane
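The anchored-LIKE analogy can be made concrete. Below is a sketch (with
hypothetical helper names) of how "x LIKE 'foo%'" is turned into a
lossy but indexable range condition - the same shape a lossy extract()
transform would take. The real planner's logic (make_greater_string)
is considerably more careful about collations and multibyte encodings;
this assumes single-byte characters and C collation.

```python
def like_prefix_range(prefix):
    """Derive a lossy, indexable range from an anchored LIKE pattern.

    "x LIKE '<prefix>%'" implies  prefix <= x < upper_bound,  where
    upper_bound is the prefix with its last character incremented.
    The range is lossy in general (the pattern may contain further
    wildcards), so the original LIKE is still rechecked per row."""
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, upper

lo, hi = like_prefix_range('foo')        # ('foo', 'fop')
assert lo <= 'foobar' < hi               # matching string is in range
assert not (lo <= 'fox' < hi)            # non-matching string excluded
```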
On 2015-04-19 15:33, Tom Lane wrote:
>> It's possible that in the future someone will implement an optimization
>> like this, but I'm not aware of anyone working on that and I wouldn't
>> hold my breath.
>
> Yeah. In principle you could make the planner do this. As Adam Williams
> notes nearby, there's a problem with lack of exact consistency between
> extract() semantics and straight timestamp comparisons; but you could
> handle that by extracting indexable expressions that are considered lossy,

What about functions that are simpler, such as upper()/lower()?

On 9.3, this:

  select email from users where lower(first_name) = 'yves';

is not using the index on first_name (Seq Scan on users). This should
be easy to implement?

--
http://yves.zioup.com
gpg: 4096R/32B0F416
Yves Dorfsman <yves@zioup.com> writes:
> What about functions that are simpler, such as upper()/lower()?

If you think those are simpler, you're much mistaken :-(. For instance,
"lower(first_name) = 'yves'" would have to be translated to something
like "first_name IN ('yves', 'yveS', 'yvEs', 'yvES', ..., 'YVES')"
-- 16 possibilities altogether, or 2^N for an N-character string.
(And that's just assuming ASCII up/down-casing, never mind the
interesting rules in some non-English languages.) In a case-sensitive
index, those various strings aren't going to sort consecutively, so
we'd end up needing a separate index probe for each possibility.

extract(year from date) agrees with timestamp comparison up to boundary
cases, that is a few hours either way at a year boundary depending on
the timezone situation. So you could translate it to a
lossy-but-indexable timestamp comparison condition and not expect to
scan too many index items that don't satisfy the original extract()
condition. But I don't see how to make something like that work for
mapping case-insensitive searches onto case-sensitive indexes.

			regards, tom lane
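The 2^N blow-up is easy to verify (a sketch assuming plain ASCII
casing, as in Tom's example):

```python
from itertools import product

def case_variants(s):
    """All case-sensitive strings that lower() maps onto s (ASCII only).

    Each character independently contributes its lower- and upper-case
    form, so an N-letter string yields 2^N combinations - one index
    probe each in a case-sensitive index."""
    choices = [(c.lower(), c.upper()) for c in s]
    return {''.join(combo) for combo in product(*choices)}

variants = case_variants('yves')
assert len(variants) == 2 ** 4      # 16 probes for a 4-letter name
assert {'yves', 'Yves', 'YVES'} <= variants
```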
On 20/04/15 10:29, Tom Lane wrote:
> Yves Dorfsman <yves@zioup.com> writes:
>> What about functions that are simpler, such as upper()/lower()?
> If you think those are simpler, you're much mistaken :-(. For instance,
> "lower(first_name) = 'yves'" would have to be translated to something
> like "first_name IN ('yves', 'yveS', 'yvEs', 'yvES', ..., 'YVES')"
> -- 16 possibilities altogether, or 2^N for an N-character string.
> (And that's just assuming ASCII up/down-casing, never mind the
> interesting rules in some non-English languages.) In a case-sensitive
> index, those various strings aren't going to sort consecutively, so
> we'd end up needing a separate index probe for each possibility.
>
> extract(year from date) agrees with timestamp comparison up to boundary
> cases, that is a few hours either way at a year boundary depending on
> the timezone situation. So you could translate it to a
> lossy-but-indexable timestamp comparison condition and not expect to
> scan too many index items that don't satisfy the original extract()
> condition. But I don't see how to make something like that work for
> mapping case-insensitive searches onto case-sensitive indexes.
>
> 			regards, tom lane

Yeah, an event that happened at 2 am on Thursday, January 1st 2015 in
New Zealand will still be in the year 2014 for people in London,
England!

Cheers,
Gavin
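That boundary case can be checked with fixed UTC offsets (New Zealand
is on NZDT, UTC+13, in January; London is on UTC in winter):

```python
from datetime import datetime, timedelta, timezone

nzdt = timezone(timedelta(hours=13))               # New Zealand daylight time
event = datetime(2015, 1, 1, 2, 0, tzinfo=nzdt)    # 2 am, Jan 1st 2015 in NZ

as_utc = event.astimezone(timezone.utc)            # same instant on a London clock
assert event.year == 2015
assert as_utc.year == 2014                         # still New Year's Eve in London
```

This is exactly why extract(year from a timestamptz) can disagree with
a straight timestamp comparison by a few hours at each year boundary.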
On 4/19/15 4:33 PM, Tom Lane wrote:
> We might still be willing to do it if there were a sufficiently wide range
> of examples that could be handled by the same extra machinery, but this
> doesn't look too promising from that angle: AFAICS only the "year" case
> could yield a useful index restriction.

"date_trunc() op val" is where we'd want to do this. Midnight, month
and quarter boundaries are the cases I've commonly seen. The first is
especially bad, because people got in the habit of doing
timestamptz::date = '2015-1-1' and then got very irritated when that
cast became volatile.

Perhaps this could be better handled by having a way to translate
date/time specifications into a range? So something like
"date_trunc('...', timestamptz) = val" would become
"date_trunc('...', timestamptz) <@ [val, val + <correct interval based
on truncation>)". <, <=, >= and > wouldn't even need a range.

To cover the broadest base, we'd want to recognize that extract(year)
could transform to date_trunc('year', ...), and
timestamp/timestamptz::date transforms to date_trunc('day', ...). I
think these transforms would be lossless, so they could always be
made, though we'd not want to do the transformation if there was
already an index on something like date_trunc(...).

I don't readily see any other data types where this would be useful,
but this is so common with timestamps that I think it's worth handling
just that case.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
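The range translation described here can be sketched for DATE values
(a hypothetical helper; a real planner transform would need timestamp
and interval arithmetic plus timezone handling):

```python
import datetime

def trunc_range(unit, val):
    """Translate "date_trunc(unit, d) = val" into the half-open range
    [lo, hi) such that the predicate becomes "lo <= d AND d < hi".

    Sketch only: handles day/month/year truncation of DATE values,
    and assumes val is already truncated to the given unit."""
    if unit == 'day':
        hi = val + datetime.timedelta(days=1)
    elif unit == 'month':
        # jump past the end of the month, then snap back to the 1st
        hi = (val.replace(day=1) + datetime.timedelta(days=32)).replace(day=1)
    elif unit == 'year':
        hi = val.replace(year=val.year + 1, month=1, day=1)
    else:
        raise ValueError('unsupported unit: %s' % unit)
    return val, hi

lo, hi = trunc_range('month', datetime.date(2015, 1, 1))
assert (lo, hi) == (datetime.date(2015, 1, 1), datetime.date(2015, 2, 1))
```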