Performance HOWTO - pseudo counter example
От | Jean-Michel POURE |
---|---|
Тема | Performance HOWTO - pseudo counter example |
Дата | |
Msg-id | 200204291745.08315.jm.poure@freesurf.fr обсуждение исходный текст |
Ответ на | Re: Performance Issues with count() ("S Grannis" <sjg@email.com>) |
Список | pgsql-general |
Le Jeudi 25 Avril 2002 20:54, S Grannis a écrit : > I think our work-around for now will be to SELECT the column we wish to > analyze into a flat file and then run a Perl script to do the actual > counting. Dear all, I wrote a small howto to solve S Grannis performance questions on Count() function. The idea is to create and maintain a pseudo-count table using triggers and PLpgSQL. Unfortunately, I could not test the PLpgSQL scripts in 16 Million records for lack of space reason (only on 1 Million records). Code is included to generate fake test data. Could someone help me test the howto on 16 million records? Thank you for your feedback, Cheers, Jean-Michel ************************************************************************* Performance HOWTO - pseudo counter example This document is released under PostgreSQL license ************************************************************************* This tutorial demonstrates how to create fast pseudo-counters in PostgreSQL using PLpgSQL and triggers. 1) Performance background This small howto is insprired in reply to an email on pgsql-general@postgresql.org complaining about PostgreSQL speed. The user needs to run COUNT statements on a large database of 65.000.000 records. The table structure is basically as follows: CREATE TABLE "data" ( "data_oid" serial8, "data_yd" int4 ); In our example, data_yd is a year value between 1950 and 2050. The user needs to run the following query: SELECT COUNT (data_yd) FROM data WHERE data_yd = foo_year. where foo_year is a date between 1950 and 2050. The query takes more than two hours to execute on a double-processor computer running PostgreSQL and GNU/Linux. The proposed solution creates a pseudo-counter on PostgreSQL using PLpgSQL and triggers. The aim is to return a result in 0.005 second. Initilisation itself of the pseudo-counter table should take less than 30 minutes. 1) INSTALLATION a) Database creation Open a terminal windows, connect as 'postgres' user: root@localhost>su postgres Create an empty database: postgresql@localhost>psql template1; template1=\CREATE DATABASE pseudo_counter; template1=\q b) PLpgSQL declaration PLpgSQL is compiled by default in PostgreSQL. But you should enable PLpgSQL on the database itself: postgresql@localhost>CREATELANG plpgsql pseudo_counter c) Data table We first need to create the table stucture: CREATE TABLE "data" ( "data_oid" serial8, "data_yd" int4, "data_counterenabled" bool DEFAULT 'f' ) WITH OIDS; CREATE INDEX data_yd_idx ON data USING btree (data_yd); CREATE INDEX data_counter_idx ON data USING btree (data_counterenabled); And create a PLpgSQL function to add fake records: CREATE FUNCTION "init_fakedata"("int8", "int4", "int4") RETURNS "bool" AS ' DECLARE iLoop int4; tStart timestamp ; BEGIN tStart = now (); IF ($1>0) AND ($2 >0) AND ($3 >0) AND ($3>$2) THEN FOR iLoop in 1 .. $1 LOOP INSERT INTO data (data_yd) VALUES ( int8 (random () * ($3-$2) +$2) ); END LOOP; RETURN ''t''; ELSE RETURN ''f''; END IF; END; ' LANGUAGE 'plpgsql'; To insert 16 million records with a year range between 1950 and 2050, enter: SELECT init_fakedata(16000000, 1950, 2050); which should take a while... and fill mor than 3.2 Gb on disc. If you cannot wait that long : For testing, insert 126.953 records : SELECT init_fakedata(126953, 1950, 2050); This takes 40s on my server. Then, repeat 8 times: INSERT INTO data (data_yd) SELECT data_yd FROM data; This should produce 64999936 fake records more quickly as no random function is used. Enter: CHECKPOINT; VACUUM ANALYSE; to clear data cache and update statistics. d) Pseudo-count table Now, let's create a pseudo-count table. CREATE TABLE "pseudo_count" ( "count_oid" serial, "count_year" int4, "count_value" int4 DEFAULT 0 ) WITH OIDS; CREATE INDEX count_value_idx ON pseudo_count USING btree (count_value); CREATE INDEX count_year_idx ON pseudo_count USING btree (count_year); ... and initialize it with the required data (values in the 1950 - 2050 range) : CREATE FUNCTION "init_pseudocount"("int4", "int4") RETURNS "bool" AS ' DECLARE iLoop int4; BEGIN IF (($1>0) AND ($2>0) AND ($2>=$1)) THEN FOR iLoop in $1 .. $2 LOOP INSERT INTO pseudo_count (count_year) VALUES (iLoop); END LOOP; RETURN ''t''; ELSE RETURN ''f''; END IF; END; ' LANGUAGE 'plpgsql'; Example : SELECT init_pseudocount(1950, 2050) ; will create the required records for years 1900 to 2100. e) PLpgSQL function and trigger Pseudo count is handled using a single trigger 'tg_data' running 'tg_data()' function. TG_OP is used to catch the trigger context ('insert', 'update' or 'delete'). CREATE FUNCTION "tg_data"() RETURNS "opaque" AS 'DECLARE rec record; BEGIN IF (TG_OP=''UPDATE'') THEN IF (new.data_counterenabled = ''t'') AND (old.data_counterenabled = ''f'') THEN UPDATE pseudo_count SET count_value = count_value +1 WHERE count_year = new.data_yd AND count_value >= 0; END IF; IF (new.data_counterenabled = ''f'') AND (old.data_counterenabled = ''t'') THEN UPDATE pseudo_count SET count_value = count_value -1 WHERE count_year = new.data_yd AND count_value > 0; END IF; IF (old.data_yd <> new.data_yd) THEN UPDATE pseudo_count SET count_value = count_value -1 WHERE count_year = old.data_yd AND count_value > 0; UPDATE pseudo_count SET count_value = count_value + 1 WHERE count_year = new.data_yd AND count_value >= 0 ; END IF; END IF; IF (TG_OP=''DELETE'') THEN UPDATE pseudo_count SET count_value = count_value - 1 WHERE count_year = old.data_yd AND count_value >= 0 ; END IF; IF (TG_OP=''UPDATE'') THEN RETURN new; ELSE RETURN old; END IF; END; ' LANGUAGE 'plpgsql'; CREATE TRIGGER "tg_data" BEFORE DELETE OR UPDATE ON "data" FOR EACH ROW EXECUTE PROCEDURE tg_data(); 2) USAGE a) Initialisation Set "pseudo_countenabled" flag on: UPDATE data SET data_counterenabled = 't' WHERE data_counterenabled = 'f' b) Run pseudo-count queries Instead of : SELECT COUNT (data_yd) FROM data WHERE data_yd = foo_year. you now can run: SELECT count_value FROM pseudo_count WHERE pseudo_date = foo_year The anwer comes in 0.005 second. c) Limits Before loading large amount of data, triggers on table 'data' should be dropped and recreated afterwards.
В списке pgsql-general по дате отправления:
Следующее
От: Lincoln YeohДата:
Сообщение: Re: What popular, large commercial websites run PostgreSQL?