Thread: have: seq scan - want: index scan


have: seq scan - want: index scan

From
Chris Ruprecht
Date:
Hi guys,

PG = 9.1.5
OS = winDOS 2008R8

I have a table that currently has 207 million rows.
there is a timestamp field that contains data.
more data gets copied from another database into this database.
How do I make this do an index scan instead?
I did an "analyze audittrailclinical" to no avail.
I tested different indexes - same behavior every time.

The query does this:

SELECT
audittrailclinical.pgid,
audittrailclinical.timestamp,
mmuser.logon,
audittrailclinical.entityname,
audittrailclinical.clinicalactivity,
audittrailclinical.audittraileventcode,
account.accountnumber,
patient.dnsortpersonnumber
FROM
public.account,
public.audittrailclinical,
public.encounter,
public.entity,
public.mmuser,
public.patient,
public.patientaccount
WHERE
      audittrailclinical.encountersid = encounter.encountersid
and   audittrailclinical.timestamp >= '2008-01-01'::timestamp without time zone
and   audittrailclinical.timestamp <= '2012-10-05'::timestamp without time zone
AND  encounter.practiceid = patient.practiceid
AND  encounter.patientid = patient.patientid
AND  encounter.staffid = patient.staffid
AND  entity.entitysid = audittrailclinical.entitysid
AND  mmuser.mmusersid = audittrailclinical.mmusersid
AND  patient.practiceid = patientaccount.practiceid
AND  patient.patientid = patientaccount.patientid
AND  patientaccount.accountsid = account.accountsid
AND  patientaccount.defaultaccount = 'Y'
AND  patient.dnsortpersonnumber = '347450' ;

The query plan says:

"              ->  Seq Scan on audittrailclinical  (cost=0.00..8637598.76 rows=203856829 width=62)"
"                    Filter: (("timestamp" >= '2008-01-01 00:00:00'::timestamp without time zone) AND ("timestamp" <= '2012-10-05 00:00:00'::timestamp without time zone))"

which takes forever.

How do I make this do an index scan instead?
I did an "analyze audittrailclinical" to no avail.

the table definitions are (the createstamp field is empty - I know, bad data):

CREATE TABLE audittrailclinical
(
  audittrailid text,
  audittraileventcode text,
  clinicalactivity text,
  eventsuccessful text,
  externalunique text,
  recordstamp timestamp without time zone,
  recorddescription text,
  encountersid integer,
  eventuserlogon text,
  computername text,
  applicationcode text,
  practiceid integer,
  mmusersid integer,
  entitysid integer,
  entityname text,
  "timestamp" timestamp without time zone,
  lastuser integer,
  createstamp timestamp without time zone,
  pgid bigint DEFAULT nextval(('"bravepoint_seq"'::text)::regclass)
)
WITH (
  OIDS=FALSE
);
ALTER TABLE audittrailclinical
  OWNER TO intergy;
GRANT ALL ON TABLE audittrailclinical TO intergy;
GRANT SELECT ON TABLE audittrailclinical TO rb;

-- Index: atc_en_time

CREATE INDEX atc_en_time
  ON audittrailclinical
  USING btree
  (entitysid , "timestamp" );

-- Index: atc_id

-- DROP INDEX atc_id;

CREATE INDEX atc_id
  ON audittrailclinical
  USING btree
  (audittrailid COLLATE pg_catalog."default" );






Re: have: seq scan - want: index scan

From
Samuel Gendler
Date:


On Tue, Oct 16, 2012 at 4:45 PM, Chris Ruprecht <chris@cdrbill.com> wrote:
Hi guys,

PG = 9.1.5
OS = winDOS 2008R8

I have a table that currently has 207 million rows.
there is a timestamp field that contains data.
more data gets copied from another database into this database.
How do I make this do an index scan instead?
I did an "analyze audittrailclinical" to no avail.
I tested different indexes - same behavior every time.


The query plan says:

"              ->  Seq Scan on audittrailclinical  (cost=0.00..8637598.76 rows=203856829 width=62)"
"                    Filter: (("timestamp" >= '2008-01-01 00:00:00'::timestamp without time zone) AND ("timestamp" <= '2012-10-05 00:00:00'::timestamp without time zone))"

which takes forever.

How do I make this do an index scan instead?
I did an "analyze audittrailclinical" to no avail.

The planner estimates that 203 million of your 207 million rows match the timestamp filter, so it is definitely going to favour a sequential scan. An index scan matching that many rows will inevitably be slower than simply scanning the table: it has to do the index lookups and still load essentially every record from the table to determine its visibility to you, so the index scan just turns sequential access of the table pages into random access, with the index lookups on top. You can verify this by setting enable_seqscan to false and re-running your EXPLAIN ANALYZE to see how the plan changes. That setting doesn't forbid sequential scans outright, it just makes them far less attractive, so you may see that nothing changes. If the estimate for the number of matching rows is incorrect, you'll want to increase the statistics gathering for that table, or just for that column:

ALTER TABLE <table> ALTER COLUMN <column> SET STATISTICS <number>

where number is between 0 and 10000 (the default is 100). Then re-analyze the table and see if the query plan shows better estimates. 9.2 also supports "index-only scans", which eliminate the need to load the matched records in certain circumstances; however, all of the columns used by the query would need to be in the index, and you are using an awful lot of columns between the select clause and the table joins.

Are you lacking indexes on the columns used for joins - indexes that would allow more selective index scans on those columns, which could then just filter by timestamp? I'm not much of an expert on the query planner, so I'm not sure exactly what causes that behaviour, but good statistics and useful indexes should let the rest of the where clause select the rows from audittrailclinical, unless patientaccount.defaultaccount = 'Y' and patient.dnsortpersonnumber = '347450' are similarly non-selective. patient.dnsortpersonnumber looks like the strong filter, so make sure you've got indexes and accurate stats on all of the foreign keys that connect the patient table to the audittrailclinical table.

It'd be useful to see the rest of the EXPLAIN ANALYZE output, so we could see how it is handling the joins and why. Note that because several tables in your query are joined via composite foreign keys, the planner almost certainly won't use those composite keys unless each one is covered by a single multi-column index. If you have indexes on those columns but they are single-column indexes, that may be what is causing the planner to filter the atc table on the timestamp rather than via the joins.

I'm sure someone more knowledgeable than I will be along eventually to correct any misinformation I may have passed along. Without knowing anything about your schema or the rest of the EXPLAIN ANALYZE output, I'm mostly just guessing. There is an entire page devoted to formulating useful mailing list questions, incidentally; yours really isn't one.

And if the atc table definition you posted is complete, you are definitely missing potentially useful indexes: you are joining to that table via encountersid, and you don't show an index on that column - yet that is the column that eventually joins out to the patient and patientaccount tables, which have the stronger filters on them.

Incidentally, why the join to the entity table via entitysid?  No columns from that table appear to be used anywhere else in the query.

--sam