Re: sum of all values
От | Madison Kelly |
---|---|
Тема | Re: sum of all values |
Дата | |
Msg-id | 41E7E742.4060607@alteeve.com обсуждение исходный текст |
Ответ на | Re: sum of all values (Richard Huxton <dev@archonet.com>) |
Ответы |
Re: sum of all values
(Richard Huxton <dev@archonet.com>)
|
Список | pgsql-performance |
Richard Huxton wrote: > Madison Kelly wrote: > >> Hi all, >> >> Is there a fast(er) way to get the sum of all integer values for a >> certain condition over many thousands of rows? What I am currently >> doing is this (which takes ~5-10sec.): > > > OK, I'm assuming you've configured PG to your satisfaction and this is > the only query giving you problems. This is a program for general consumption (hopefully... eventually...) so I want to leave the psql config alone. Once I am happier with the program I will try different tuning options and write a faq though I expect 9 out of 10 users won't read it. >> SELECT SUM (a.file_size) FROM file_info_1 a, file_set_1 b WHERE >> a.file_name=b.fs_name AND a.file_parent_dir=b.fs_parent_dir AND >> a.file_type=b.fs_type AND b.fs_backup='t'; > > > You'll want to run EXPLAIN ANALYSE SELECT SUM... and post the output of > that, although the query looks straightforward enough. tle-bu=> EXPLAIN ANALYZE SELECT SUM (a.file_size) FROM file_info_1 a, file_set_1 b WHERE a.file_name=b.fs_name AND a.file_parent_dir=b.fs_parent_dir AND a.file_type=b.fs_type AND b.fs_backup='t'; QUERY PLAN ---------------------------------------------------------------- Aggregate (cost=2202.54..2202.54 rows=1 width=8) (actual time=5078.744..5078.748 rows=1 loops=1) -> Merge Join (cost=724.94..2202.51 rows=11 width=8) (actual time=3281.677..4969.719 rows=12828 loops=1) Merge Cond: (("outer".file_parent_dir = "inner".fs_parent_dir) AND ("outer".file_name = "inner".fs_name) AND ("outer".file_type = "inner".fs_type)) -> Index Scan using file_info_1_search_idx on file_info_1 a (cost=0.00..1317.11 rows=12828 width=104) (actual time=0.042..116.825 rows=12828 loops=1) -> Sort (cost=724.94..740.97 rows=6414 width=96) (actual time=3281.516..3350.640 rows=12828 loops=1) Sort Key: b.fs_parent_dir, b.fs_name, b.fs_type -> Seq Scan on file_set_1 b (cost=0.00..319.35 rows=6414 width=96) (actual time=0.029..129.129 rows=12828 loops=1) Filter: (fs_backup = true) Total runtime: 5080.729 ms (9 rows) >> Here are the schemas, in case they help: >> >> tle-bu=> \d file_info_1 Table "public.file_info_1" >> Column | Type | Modifiers >> -----------------+---------+---------------------------- >> file_acc_time | bigint | not null >> file_group_name | text | not null >> file_group_uid | integer | not null >> file_mod_time | bigint | not null >> file_name | text | not null >> file_parent_dir | text | not null >> file_perm | text | not null >> file_size | bigint | not null >> file_type | text | not null default 'f'::text >> file_user_name | text | not null >> file_user_uid | integer | not null >> Indexes: >> "file_info_1_display_idx" btree (file_parent_dir, file_name, >> file_type) >> "file_info_1_search_idx" btree (file_parent_dir, file_name, >> file_type) >> >> tle-bu=> \d file_set_1 Table "public.file_set_1" >> Column | Type | Modifiers >> ---------------+---------+---------------------------- >> fs_backup | boolean | not null default true >> fs_display | boolean | not null default false >> fs_name | text | not null >> fs_parent_dir | text | not null >> fs_restore | boolean | not null default false >> fs_type | text | not null default 'f'::text >> Indexes: >> "file_set_1_sync_idx" btree (fs_parent_dir, fs_name, fs_type) > > > 1. WHERE ARE YOUR PRIMARY KEYS??? > 2. Why do you have two identical indexes on file_info_1 > 3. WHERE ARE YOUR PRIMARY KEYS??? > 4. Am I right in thinking that always, file_name==fs_name (i.e. they > represent the same piece of information) and if so, why are you storing > it twice? Same for _parent_dir too > 5. file_type/fs_type are being held as unbounded text? Not an index into > some lookup table or a varchar(N)? > > Can you explain what you're trying to do here - it might be you want to > alter your database design. > -- > Richard Huxton > Archonet Ltd This is where I have to admit my novice level of knowledge. Until now I have been more concerned with "making it work". It is only now that I have finished (more or less) the program that I have started going back and trying to find ways to speed it up. I have not used postgres (or perl or anything) before this program. I hope my questions aren't too basic. ^.^; I keep hearing about Primary Keys but I can't say that I know what they are or how they are used. If I do understand, it is a way to reference another table's entry (using a foreign key)? The two matching indexes is a typo in my program that I hadn't noticed, I'll fix that asap. Here is what the database is used for: This is a backup program and I use the DB to store extended information on all selected files and directories on a partition. Each partition has it's own 'file_info_#' and 'file_set_#' tables where '#' matches the ID stored for that partition in the DB in another table. The 'file_info_#' table stored the data that can change such as file size, last modified/accessed, owing user and group and so forth. The 'file_set_#' table stores the flags that say to include or exclude it from a backup/restore job and whether it has been selected for display in the file browser. In the first iteration I -used- to have the data in a single table and I identified the partition with a column called 'file_in_id' (or something similar). As I looked at each file on the system I would do a db call to see if the entry existed and if so, update it and if not, insert it. This was horribly slow though so I decided to break out into the schema above. With the schema above what I do now is just drop the 'file_info_#' table, recreate the table and matching indexes and then do a mass 'COPY' of all the file info on the partition. After this is done I read in the new data from the reloaded 'file_info_#' table and sync the data in 'file_set_#' which removes entries no longer in 'file_info_#', adds new ones matching the parent's values and leaves the existing entries alone. I found droping the table and re-creating it a lot faster than a 'DELETE FROM' call and it also seems to have made 'VACUUM FULL' a lot faster. Thank you very much for your feedback! I hope I haven't done something -too- foolish. :p If I have, I will change it. Madison
В списке pgsql-performance по дате отправления:
Следующее
От: Alex TurnerДата:
Сообщение: Re: Best filesystem for PostgreSQL Database Cluster under Linux