WIP: Collecting statistics on CSV file data
From | Etsuro Fujita |
---|---|
Subject | WIP: Collecting statistics on CSV file data |
Date | |
Msg-id | 4E6DE18B.4010505@lab.ntt.co.jp |
Replies |
Re: WIP: Collecting statistics on CSV file data
Re: WIP: Collecting statistics on CSV file data |
List | pgsql-hackers |
Hi there,

To enable file_fdw to estimate the cost of scanning a CSV file more accurately, I would like to propose a new FDW callback routine, AnalyzeForeignTable, which allows the ANALYZE command to collect statistics on a foreign table, and a corresponding file_fdw function, fileAnalyzeForeignTable. Attached is my WIP patch. Here's a summary of the implementation:

void AnalyzeForeignTable(Relation relation, VacuumStmt *vacstmt, int elevel);

This is a new FDW callback routine that collects statistics on a foreign table and stores the results in the pg_class and pg_statistic system catalogs. It is called when the ANALYZE command is executed. (ANALYZE must be run explicitly because autovacuum does not analyze foreign tables.)

static void fileAnalyzeForeignTable(Relation relation, VacuumStmt *vacstmt, int elevel);

This new file_fdw function collects and stores the same statistics on CSV file data as are collected on a local table, except for index-related statistics, by executing a sequential scan on the CSV file and acquiring sample rows using Vitter's algorithm. (This is time-consuming for a large file.)

estimate_costs() (more precisely, clauselist_selectivity() called from estimate_costs()) estimates baserel->rows using the statistics stored in the pg_statistic system catalog. If there are no statistics, estimate_costs() falls back to the default estimates, as in PostgreSQL 9.1.

I am able to demonstrate the effectiveness of this patch. The following run was performed on a single core of a 3.00GHz Intel Xeon CPU with 8GB of RAM. Configuration settings are the defaults except for work_mem = 256MB. We can see from this result that the optimizer selects a good plan when the foreign tables have been analyzed.

I appreciate your comments and suggestions.
[sample csv file data]

postgres=# COPY (SELECT s.a, repeat('a', 100) FROM generate_series(1, 5000000) AS s(a)) TO '/home/pgsql/sample_csv_data1.csv' (FORMAT csv, DELIMITER ',');
COPY 5000000
postgres=# COPY (SELECT (random()*10000)::int, repeat('b', 100) FROM generate_series(1, 5000000)) TO '/home/pgsql/sample_csv_data2.csv' (FORMAT csv, DELIMITER ',');
COPY 5000000

[Unpatched]

postgres=# CREATE FOREIGN TABLE tab1 (aid INTEGER, msg text) SERVER file_fs OPTIONS (filename '/home/pgsql/sample_csv_data1.csv', format 'csv', delimiter ',');
CREATE FOREIGN TABLE
postgres=# CREATE FOREIGN TABLE tab2 (aid INTEGER, msg text) SERVER file_fs OPTIONS (filename '/home/pgsql/sample_csv_data2.csv', format 'csv', delimiter ',');
CREATE FOREIGN TABLE
postgres=# SELECT count(*) FROM tab1;
  count
---------
 5000000
(1 row)

postgres=# SELECT count(*) FROM tab2;
  count
---------
 5000000
(1 row)

postgres=# EXPLAIN ANALYZE SELECT count(*) FROM tab1, tab2 WHERE tab1.aid >= 0 AND tab1.aid <= 10000 AND tab1.aid = tab2.aid;
                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=128859182.29..128859182.30 rows=1 width=0) (actual time=27321.304..27321.304 rows=1 loops=1)
   ->  Merge Join  (cost=5787102.68..111283426.33 rows=7030302383 width=0) (actual time=22181.428..26736.194 rows=4999745 loops=1)
         Merge Cond: (tab1.aid = tab2.aid)
         ->  Sort  (cost=1857986.37..1858198.83 rows=84983 width=4) (actual time=5964.282..5965.958 rows=10000 loops=1)
               Sort Key: tab1.aid
               Sort Method: quicksort  Memory: 853kB
               ->  Foreign Scan on tab1  (cost=0.00..1851028.44 rows=84983 width=4) (actual time=0.071..5962.382 rows=10000 loops=1)
                     Filter: ((aid >= 0) AND (aid <= 10000))
                     Foreign File: /home/pgsql/sample_csv_data1.csv
                     Foreign File Size: 543888896
         ->  Materialize  (cost=3929116.30..4011842.29 rows=16545197 width=4) (actual time=16216.953..19550.846 rows=5000000 loops=1)
               ->  Sort  (cost=3929116.30..3970479.30 rows=16545197 width=4) (actual time=16216.947..18418.684 rows=5000000 loops=1)
                     Sort Key: tab2.aid
                     Sort Method: external merge  Disk: 68424kB
                     ->  Foreign Scan on tab2  (cost=0.00..1719149.70 rows=16545197 width=4) (actual time=0.081..6059.630 rows=5000000 loops=1)
                           Foreign File: /home/pgsql/sample_csv_data2.csv
                           Foreign File Size: 529446313
 Total runtime: 27350.673 ms
(18 rows)

[Patched]

postgres=# CREATE FOREIGN TABLE tab1 (aid INTEGER, msg text) SERVER file_fs OPTIONS (filename '/home/pgsql/sample_csv_data1.csv', format 'csv', delimiter ',');
CREATE FOREIGN TABLE
postgres=# CREATE FOREIGN TABLE tab2 (aid INTEGER, msg text) SERVER file_fs OPTIONS (filename '/home/pgsql/sample_csv_data2.csv', format 'csv', delimiter ',');
CREATE FOREIGN TABLE
postgres=# ANALYZE VERBOSE tab1;
INFO:  analyzing "public.tab1"
INFO:  "tab1": scanned, containing 5000000 rows; 30000 rows in sample
ANALYZE
postgres=# ANALYZE VERBOSE tab2;
INFO:  analyzing "public.tab2"
INFO:  "tab2": scanned, containing 5000000 rows; 30000 rows in sample
ANALYZE
postgres=# EXPLAIN ANALYZE SELECT count(*) FROM tab1, tab2 WHERE tab1.aid >= 0 AND tab1.aid <= 10000 AND tab1.aid = tab2.aid;
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1282725.25..1282725.26 rows=1 width=0) (actual time=15114.325..15114.325 rows=1 loops=1)
   ->  Hash Join  (cost=591508.50..1271157.90 rows=4626940 width=0) (actual time=5964.449..14526.822 rows=4999745 loops=1)
         Hash Cond: (tab2.aid = tab1.aid)
         ->  Foreign Scan on tab2  (cost=0.00..564630.00 rows=5000000 width=4) (actual time=0.070..6253.257 rows=5000000 loops=1)
               Foreign File: /home/pgsql/sample_csv_data2.csv
               Foreign File Size: 529446313
         ->  Hash  (cost=591393.00..591393.00 rows=9240 width=4) (actual time=5964.346..5964.346 rows=10000 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 352kB
               ->  Foreign Scan on tab1  (cost=0.00..591393.00 rows=9240 width=4) (actual time=0.066..5962.222 rows=10000 loops=1)
                     Filter: ((aid >= 0) AND (aid <= 10000))
                     Foreign File: /home/pgsql/sample_csv_data1.csv
                     Foreign File Size: 543888896
 Total runtime: 15114.480 ms
(13 rows)

Best regards,
Etsuro Fujita
Attachments