Обсуждение:

Поиск
Список
Период
Сортировка

От
Humair Mohammed
Дата:
I have 2 tables with a 200,000 rows of data 3 character/string columns ID, Question and Response. The query below compares the data between the 2 tables based on ID and Question and if the Response does not match between the left table and the right table it identifies the ID's where there is a mismatch. Running the query in SQL Server 2008 using the ISNULL function take a few milliseconds. Running the same query in Postgresql takes over 70 seconds. The 2 queries are below:

SQL Server 2008 R2 Query
select t1.id from table1 t1 inner join table2 t2 on t1.id = t2.id and t1.question = t2.question and isnull(t1.response,'ISNULL') <> isnull(t2.response,'ISNULL')

Postgres 9.1 Query
select t1.id from table1 t1 inner join table2 t2 on t1.id = t2.id and t1.question = t2.question and coalesce(t1.response,'ISNULL') <> coalesce(t2.response,'ISNULL')

What gives?

Re:

От
Jayadevan M
Дата:
Hi
> SQL Server 2008 R2 Query
> select t1.id from table1 t1 inner join table2 t2 on t1.id = t2.id
> and t1.question = t2.question and isnull(t1.response,'ISNULL') <>
> isnull(t2.response,'ISNULL')
>
> Postgres 9.1 Query
> select t1.id from table1 t1 inner join table2 t2 on t1.id = t2.id
> and t1.question = t2.question and coalesce(t1.response,'ISNULL') <>
> coalesce(t2.response,'ISNULL')
>
> What gives?
They have same indexes/PK etc?
Regards,
Jayadevan





DISCLAIMER:

"The information in this e-mail and any attachment is intended only for
the person to whom it is addressed and may contain confidential and/or
privileged material. If you have received this e-mail in error, kindly
contact the sender and destroy all copies of the original communication.
IBS makes no warranty, express or implied, nor guarantees the accuracy,
adequacy or completeness of the information contained in this email or any
attachment and is not liable for any errors, defects, omissions, viruses
or for resultant loss or damage, if any, direct or indirect."






Re:

От
Mark Kirkwood
Дата:
On 16/11/10 09:14, Humair Mohammed wrote:
I have 2 tables with a 200,000 rows of data 3 character/string columns ID, Question and Response. The query below compares the data between the 2 tables based on ID and Question and if the Response does not match between the left table and the right table it identifies the ID's where there is a mismatch. Running the query in SQL Server 2008 using the ISNULL function take a few milliseconds. Running the same query in Postgresql takes over 70 seconds. The 2 queries are below:
SQL Server 2008 R2 Queryselect t1.id from table1 t1 inner join table2 t2 on t1.id = t2.id and t1.question = t2.question and isnull(t1.response,'ISNULL') <> isnull(t2.response,'ISNULL')
Postgres 9.1 Queryselect t1.id from table1 t1 inner join table2 t2 on t1.id = t2.id and t1.question = t2.question and coalesce(t1.response,'ISNULL') <> coalesce(t2.response,'ISNULL')
What gives? 		 	   		   

Can we see the execution plans: (EXPLAIN <the query text here>) for Postgres and (however you get text based query plan from Sql Server), so we can see if there is any obvious differences in how things are done.

Also probably worthwhile is telling us the table definitions of the tables concerned.

For Postgres - did you run ANALYZE on the database concerned before running the queries? (optimizer stats are usually updated automatically, but if you were quick to run the queries after loading the data they might not have been).

regards

Mark

Re:

От
Pavel Stehule
Дата:
2010/11/15 Humair Mohammed <humairm@hotmail.com>:
> I have 2 tables with a 200,000 rows of data 3 character/string columns ID,
> Question and Response. The query below compares the data between the 2
> tables based on ID and Question and if the Response does not match between
> the left table and the right table it identifies the ID's where there is a
> mismatch. Running the query in SQL Server 2008 using the ISNULL function
> take a few milliseconds. Running the same query in Postgresql takes over 70
> seconds. The 2 queries are below:
> SQL Server 2008 R2 Query
> select t1.id from table1 t1 inner join table2 t2 on t1.id = t2.id and
> t1.question = t2.question and isnull(t1.response,'ISNULL') <>
> isnull(t2.response,'ISNULL')

> Postgres 9.1 Query
> select t1.id from table1 t1 inner join table2 t2 on t1.id = t2.id and
> t1.question = t2.question and coalesce(t1.response,'ISNULL') <>
> coalesce(t2.response,'ISNULL')
> What gives?

I think, so must problem can be in ugly predicate
coalesce(t1.response,'ISNULL') <>
> coalesce(t2.response,'ISNULL')

try use a IS DISTINCT OF operator

... AND t1.response IS DISTINCT t2.response

Regards

Pavel Stehule

p.s. don't use a coalesce in WHERE clause if it is possible.

Re: Query Performance SQL Server vs. Postgresql

От
Pavel Stehule
Дата:
2010/11/17 Humair Mohammed <humairm@hotmail.com>:
>
> There are no indexes on the tables either in SQL Server or Postgresql - I am
> comparing apples to apples here. I ran ANALYZE on the postgresql tables,
> after that query performance times are still high 42 seconds with COALESCE
> and 35 seconds with IS DISTINCT FROM.
> Here is the execution plan from Postgresql for qurey - select pb.id from
> pivotbad pb inner join pivotgood pg on pb.id = pg.id and pb.question =
> pg.question and coalesce(pb.response,'MISSING') <>
> coalesce(pg.response,'MISSING')
> Execution Time: 42 seconds
> "Hash Join  (cost=16212.30..48854.24 rows=93477 width=17)"
> "  Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text =
> (pg.question)::text))"
> "  Join Filter: ((COALESCE(pb.response, 'MISSING'::character varying))::text
> <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
> "  ->  Seq Scan on pivotbad pb  (cost=0.00..2804.96 rows=93496 width=134)"
> "  ->  Hash  (cost=7537.12..7537.12 rows=251212 width=134)"
> "        ->  Seq Scan on pivotgood pg  (cost=0.00..7537.12 rows=251212
> width=134)"

this is little bit strange - did you ANALYZE and VACUUM?

please send result of EXPLAIN ANALYZE

Pavel

>
> And here is the execution plan from SQL Server for query - select pb.id from
> pivotbad pb inner join pivotgood pg on pb.id = pg.id and pb.question =
> pg.question and isnull(pb.response,'ISNULL')<>  isnull(pg.response,'ISNULL')
> Execution Time: < 1 second
> Cost: 1%  |--Parallelism(Gather Streams)
> Cost: 31%       |--Hash Match(Inner Join, HASH:([pb].[ID],
> [pb].[Question])=([pg].[ID], [pg].[Question]),
> RESIDUAL:([master].[dbo].[pivotbad].[ID] as
> [pb].[ID]=[master].[dbo].[pivotgood].[ID] as [pg].[ID] AND
> [master].[dbo].[pivotbad].[Question] as
> [pb].[Question]=[master].[dbo].[pivotgood].[Question] as [pg].[Question] AND
> [Expr1006]<>[Expr1007]))
>     Cost: 0%  |--Bitmap(HASH:([pb].[ID], [pb].[Question]),
> DEFINE:([Bitmap1008]))
>             Cost: 0%    |--Compute
> Scalar(DEFINE:([Expr1006]=isnull([master].[dbo].[pivotbad].[Response] as
> [pb].[Response],'ISNULL')))
>             Cost:  6%   |--Parallelism(Repartition Streams, Hash
> Partitioning, PARTITION COLUMNS:([pb].[ID], [pb].[Question]))
>             Cost: 12%  |--Table Scan(OBJECT:([master].[dbo].[pivotbad] AS
> [pb]))
>             Cost: 0% |--Compute
> Scalar(DEFINE:([Expr1007]=isnull([master].[dbo].[pivotgood].[Response] as
> [pg].[Response],'ISNULL')))
>                 Cost: 17% |--Parallelism(Repartition Streams, Hash
> Partitioning, PARTITION COLUMNS:([pg].[ID], [pg].[Question]))
>                     Cost: 33% |--Table
> Scan(OBJECT:([master].[dbo].[pivotgood] AS [pg]),
> WHERE:(PROBE([Bitmap1008],[master].[dbo].[pivotgood].[ID] as
> [pg].[ID],[master].[dbo].[pivotgood].[Question] as [pg].[Question])))
>
>
>
>> From: pavel.stehule@gmail.com
>> Date: Tue, 16 Nov 2010 08:12:03 +0100
>> Subject: Re: [PERFORM]
>> To: humairm@hotmail.com
>> CC: pgsql-performance@postgresql.org
>>
>> 2010/11/15 Humair Mohammed <humairm@hotmail.com>:
>> > I have 2 tables with a 200,000 rows of data 3 character/string columns
>> > ID,
>> > Question and Response. The query below compares the data between the 2
>> > tables based on ID and Question and if the Response does not match
>> > between
>> > the left table and the right table it identifies the ID's where there is
>> > a
>> > mismatch. Running the query in SQL Server 2008 using the ISNULL function
>> > take a few milliseconds. Running the same query in Postgresql takes over
>> > 70
>> > seconds. The 2 queries are below:
>> > SQL Server 2008 R2 Query
>> > select t1.id from table1 t1 inner join table2 t2 on t1.id = t2.id and
>> > t1.question = t2.question and isnull(t1.response,'ISNULL') <>
>> > isnull(t2.response,'ISNULL')
>>
>> > Postgres 9.1 Query
>> > select t1.id from table1 t1 inner join table2 t2 on t1.id = t2.id and
>> > t1.question = t2.question and coalesce(t1.response,'ISNULL') <>
>> > coalesce(t2.response,'ISNULL')
>> > What gives?
>>
>> I think, so must problem can be in ugly predicate
>> coalesce(t1.response,'ISNULL') <>
>> > coalesce(t2.response,'ISNULL')
>>
>> try use a IS DISTINCT OF operator
>>
>> ... AND t1.response IS DISTINCT t2.response
>>
>> Regards
>>
>> Pavel Stehule
>>
>> p.s. don't use a coalesce in WHERE clause if it is possible.
>

Re: Query Performance SQL Server vs. Postgresql

От
Tomas Vondra
Дата:
Dne 17.11.2010 05:47, Pavel Stehule napsal(a):
> 2010/11/17 Humair Mohammed <humairm@hotmail.com>:
>>
>> There are no indexes on the tables either in SQL Server or Postgresql - I am
>> comparing apples to apples here. I ran ANALYZE on the postgresql tables,

Actually no, you're not comparing apples to apples. You've provided so
little information that you may be comparing apples to cucumbers or
maybe some strange animals.

1) info about the install

What OS is this running on? I guess it's Windows in both cases, right?

How nuch memory is there? What is the size of shared_buffers? The
default PostgreSQL settings is very very very limited, you have to bump
it to a much larger value.

What are the other inportant settings (e.g. the work_mem)?

2) info about the dataset

How large are the tables? I don't mean number of rows, I mean number of
blocks / occupied disk space. Run this query

SELECT relname, relpages, reltuples, pg_size_pretty(pg_table_size(oid))
FROM pg_class WHERE relname IN ('table1', 'table2');

3) info about the plan

Please, provide EXPLAIN ANALYZE output, maybe with info about buffers,
e.g. something like

EXPLAIN (ANALYZE ON, BUFFERS ON) SELECT ...

4) no indexes ?

Why have you decided not to use any indexes? If you want a decent
performance, you will have to use indexes. Obviously there is some
overhead associated with them, but it's premature optimization unless
you prove the opposite.

BTW I'm not a MSSQL expert, but it seems like it's building a bitmap
index on the fly, to synchronize parallelized query - PostgreSQL does
not support that.

regards
Tomas

Re: Query Performance SQL Server vs. Postgresql

От
Rich
Дата:
I have to concur.  Sql is written specifially and only for Windows. It is optimized for windows.  Postgreal is writeen for just about everything trying to use common code so there isn't much optimization because it has to be optimized based on the OS that is running it.  Check out your config and send it to us.  That would include the OS and hardware configs for both machines.

On Wed, Nov 17, 2010 at 3:47 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
Dne 17.11.2010 05:47, Pavel Stehule napsal(a):
> 2010/11/17 Humair Mohammed <humairm@hotmail.com>:
>>
>> There are no indexes on the tables either in SQL Server or Postgresql - I am
>> comparing apples to apples here. I ran ANALYZE on the postgresql tables,

Actually no, you're not comparing apples to apples. You've provided so
little information that you may be comparing apples to cucumbers or
maybe some strange animals.

1) info about the install

What OS is this running on? I guess it's Windows in both cases, right?

How nuch memory is there? What is the size of shared_buffers? The
default PostgreSQL settings is very very very limited, you have to bump
it to a much larger value.

What are the other inportant settings (e.g. the work_mem)?

2) info about the dataset

How large are the tables? I don't mean number of rows, I mean number of
blocks / occupied disk space. Run this query

SELECT relname, relpages, reltuples, pg_size_pretty(pg_table_size(oid))
FROM pg_class WHERE relname IN ('table1', 'table2');

3) info about the plan

Please, provide EXPLAIN ANALYZE output, maybe with info about buffers,
e.g. something like

EXPLAIN (ANALYZE ON, BUFFERS ON) SELECT ...

4) no indexes ?

Why have you decided not to use any indexes? If you want a decent
performance, you will have to use indexes. Obviously there is some
overhead associated with them, but it's premature optimization unless
you prove the opposite.

BTW I'm not a MSSQL expert, but it seems like it's building a bitmap
index on the fly, to synchronize parallelized query - PostgreSQL does
not support that.

regards
Tomas

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: Query Performance SQL Server vs. Postgresql

От
Humair Mohammed
Дата:
Yes strange indeed, I did rerun ANALYZE and VACCUM. Took 70 seconds to rerun the query. Results from EXPLAIN ANALYZE below:

"Hash Join  (cost=16212.30..52586.43 rows=92869 width=17) (actual time=43200.223..49502.874 rows=3163 loops=1)"
"  Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text = (pg.question)::text))"
"  Join Filter: ((COALESCE(pb.response, 'MISSING'::character varying))::text <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
"  ->  Seq Scan on pivotbad pb  (cost=0.00..2804.96 rows=93496 width=134) (actual time=0.009..48.200 rows=93496 loops=1)"
"  ->  Hash  (cost=7537.12..7537.12 rows=251212 width=134) (actual time=42919.453..42919.453 rows=251212 loops=1)"
"        Buckets: 1024  Batches: 64  Memory Usage: 650kB"
"        ->  Seq Scan on pivotgood pg  (cost=0.00..7537.12 rows=251212 width=134) (actual time=0.119..173.019 rows=251212 loops=1)"
"Total runtime: 49503.450 ms"


> From: pavel.stehule@gmail.com
> Date: Wed, 17 Nov 2010 05:47:51 +0100
> Subject: Re: Query Performance SQL Server vs. Postgresql
> To: humairm@hotmail.com
> CC: pgsql-performance@postgresql.org
>
> 2010/11/17 Humair Mohammed <humairm@hotmail.com>:
> >
> > There are no indexes on the tables either in SQL Server or Postgresql - I am
> > comparing apples to apples here. I ran ANALYZE on the postgresql tables,
> > after that query performance times are still high 42 seconds with COALESCE
> > and 35 seconds with IS DISTINCT FROM.
> > Here is the execution plan from Postgresql for qurey - select pb.id from
> > pivotbad pb inner join pivotgood pg on pb.id = pg.id and pb.question =
> > pg.question and coalesce(pb.response,'MISSING') <>
> > coalesce(pg.response,'MISSING')
> > Execution Time: 42 seconds
> > "Hash Join  (cost=16212.30..48854.24 rows=93477 width=17)"
> > "  Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text =
> > (pg.question)::text))"
> > "  Join Filter: ((COALESCE(pb.response, 'MISSING'::character varying))::text
> > <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
> > "  ->  Seq Scan on pivotbad pb  (cost=0.00..2804.96 rows=93496 width=134)"
> > "  ->  Hash  (cost=7537.12..7537.12 rows=251212 width=134)"
> > "        ->  Seq Scan on pivotgood pg  (cost=0.00..7537.12 rows=251212
> > width=134)"
>
> this is little bit strange - did you ANALYZE and VACUUM?
>
> please send result of EXPLAIN ANALYZE
>
> Pavel
>
> >
> > And here is the execution plan from SQL Server for query - select pb.id from
> > pivotbad pb inner join pivotgood pg on pb.id = pg.id and pb.question =
> > pg.question and isnull(pb.response,'ISNULL')<>  isnull(pg.response,'ISNULL')
> > Execution Time: < 1 second
> > Cost: 1%  |--Parallelism(Gather Streams)
> > Cost: 31%       |--Hash Match(Inner Join, HASH:([pb].[ID],
> > [pb].[Question])=([pg].[ID], [pg].[Question]),
> > RESIDUAL:([master].[dbo].[pivotbad].[ID] as
> > [pb].[ID]=[master].[dbo].[pivotgood].[ID] as [pg].[ID] AND
> > [master].[dbo].[pivotbad].[Question] as
> > [pb].[Question]=[master].[dbo].[pivotgood].[Question] as [pg].[Question] AND
> > [Expr1006]<>[Expr1007]))
> >     Cost: 0%  |--Bitmap(HASH:([pb].[ID], [pb].[Question]),
> > DEFINE:([Bitmap1008]))
> >             Cost: 0%    |--Compute
> > Scalar(DEFINE:([Expr1006]=isnull([master].[dbo].[pivotbad].[Response] as
> > [pb].[Response],'ISNULL')))
> >             Cost:  6%   |--Parallelism(Repartition Streams, Hash
> > Partitioning, PARTITION COLUMNS:([pb].[ID], [pb].[Question]))
> >             Cost: 12%  |--Table Scan(OBJECT:([master].[dbo].[pivotbad] AS
> > [pb]))
> >             Cost: 0% |--Compute
> > Scalar(DEFINE:([Expr1007]=isnull([master].[dbo].[pivotgood].[Response] as
> > [pg].[Response],'ISNULL')))
> >                 Cost: 17% |--Parallelism(Repartition Streams, Hash
> > Partitioning, PARTITION COLUMNS:([pg].[ID], [pg].[Question]))
> >                     Cost: 33% |--Table
> > Scan(OBJECT:([master].[dbo].[pivotgood] AS [pg]),
> > WHERE:(PROBE([Bitmap1008],[master].[dbo].[pivotgood].[ID] as
> > [pg].[ID],[master].[dbo].[pivotgood].[Question] as [pg].[Question])))
> >
> >
> >
> >> From: pavel.stehule@gmail.com
> >> Date: Tue, 16 Nov 2010 08:12:03 +0100
> >> Subject: Re: [PERFORM]
> >> To: humairm@hotmail.com
> >> CC: pgsql-performance@postgresql.org
> >>
> >> 2010/11/15 Humair Mohammed <humairm@hotmail.com>:
> >> > I have 2 tables with a 200,000 rows of data 3 character/string columns
> >> > ID,
> >> > Question and Response. The query below compares the data between the 2
> >> > tables based on ID and Question and if the Response does not match
> >> > between
> >> > the left table and the right table it identifies the ID's where there is
> >> > a
> >> > mismatch. Running the query in SQL Server 2008 using the ISNULL function
> >> > take a few milliseconds. Running the same query in Postgresql takes over
> >> > 70
> >> > seconds. The 2 queries are below:
> >> > SQL Server 2008 R2 Query
> >> > select t1.id from table1 t1 inner join table2 t2 on t1.id = t2.id and
> >> > t1.question = t2.question and isnull(t1.response,'ISNULL') <>
> >> > isnull(t2.response,'ISNULL')
> >>
> >> > Postgres 9.1 Query
> >> > select t1.id from table1 t1 inner join table2 t2 on t1.id = t2.id and
> >> > t1.question = t2.question and coalesce(t1.response,'ISNULL') <>
> >> > coalesce(t2.response,'ISNULL')
> >> > What gives?
> >>
> >> I think, so must problem can be in ugly predicate
> >> coalesce(t1.response,'ISNULL') <>
> >> > coalesce(t2.response,'ISNULL')
> >>
> >> try use a IS DISTINCT OF operator
> >>
> >> ... AND t1.response IS DISTINCT t2.response
> >>
> >> Regards
> >>
> >> Pavel Stehule
> >>
> >> p.s. don't use a coalesce in WHERE clause if it is possible.
> >

Re: Query Performance SQL Server vs. Postgresql

От
Humair Mohammed
Дата:

There are no indexes on the tables either in SQL Server or Postgresql - I am comparing apples to apples here. I ran ANALYZE on the postgresql tables, after that query performance times are still high 42 seconds with COALESCE and 35 seconds with IS DISTINCT FROM.

Here is the execution plan from Postgresql for qurey - select pb.id from pivotbad pb inner join pivotgood pg on pb.id = pg.id and pb.question = pg.question and coalesce(pb.response,'MISSING') <> coalesce(pg.response,'MISSING')

Execution Time: 42 seconds

"Hash Join  (cost=16212.30..48854.24 rows=93477 width=17)"
"  Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text = (pg.question)::text))"
"  Join Filter: ((COALESCE(pb.response, 'MISSING'::character varying))::text <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
"  ->  Seq Scan on pivotbad pb  (cost=0.00..2804.96 rows=93496 width=134)"
"  ->  Hash  (cost=7537.12..7537.12 rows=251212 width=134)"
"        ->  Seq Scan on pivotgood pg  (cost=0.00..7537.12 rows=251212 width=134)"


And here is the execution plan from SQL Server for query - select pb.id from pivotbad pb inner join pivotgood pg on pb.id = pg.id and pb.question = pg.question and isnull(pb.response,'ISNULL')<>  isnull(pg.response,'ISNULL')

Execution Time: < 1 second

Cost: 1%  |--Parallelism(Gather Streams)
Cost: 31%       |--Hash Match(Inner Join, HASH:([pb].[ID], [pb].[Question])=([pg].[ID], [pg].[Question]), RESIDUAL:([master].[dbo].[pivotbad].[ID] as [pb].[ID]=[master].[dbo].[pivotgood].[ID] as [pg].[ID] AND [master].[dbo].[pivotbad].[Question] as [pb].[Question]=[master].[dbo].[pivotgood].[Question] as [pg].[Question] AND [Expr1006]<>[Expr1007]))
    Cost: 0%  |--Bitmap(HASH:([pb].[ID], [pb].[Question]), DEFINE:([Bitmap1008]))
            Cost: 0%    |--Compute Scalar(DEFINE:([Expr1006]=isnull([master].[dbo].[pivotbad].[Response] as [pb].[Response],'ISNULL')))
            Cost:  6%   |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([pb].[ID], [pb].[Question]))
            Cost: 12%  |--Table Scan(OBJECT:([master].[dbo].[pivotbad] AS [pb]))
            Cost: 0% |--Compute Scalar(DEFINE:([Expr1007]=isnull([master].[dbo].[pivotgood].[Response] as [pg].[Response],'ISNULL')))
                Cost: 17% |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([pg].[ID], [pg].[Question]))
                    Cost: 33% |--Table Scan(OBJECT:([master].[dbo].[pivotgood] AS [pg]), WHERE:(PROBE([Bitmap1008],[master].[dbo].[pivotgood].[ID] as [pg].[ID],[master].[dbo].[pivotgood].[Question] as [pg].[Question])))




> From: pavel.stehule@gmail.com
> Date: Tue, 16 Nov 2010 08:12:03 +0100
> Subject: Re: [PERFORM]
> To: humairm@hotmail.com
> CC: pgsql-performance@postgresql.org
>
> 2010/11/15 Humair Mohammed <humairm@hotmail.com>:
> > I have 2 tables with a 200,000 rows of data 3 character/string columns ID,
> > Question and Response. The query below compares the data between the 2
> > tables based on ID and Question and if the Response does not match between
> > the left table and the right table it identifies the ID's where there is a
> > mismatch. Running the query in SQL Server 2008 using the ISNULL function
> > take a few milliseconds. Running the same query in Postgresql takes over 70
> > seconds. The 2 queries are below:
> > SQL Server 2008 R2 Query
> > select t1.id from table1 t1 inner join table2 t2 on t1.id = t2.id and
> > t1.question = t2.question and isnull(t1.response,'ISNULL') <>
> > isnull(t2.response,'ISNULL')
>
> > Postgres 9.1 Query
> > select t1.id from table1 t1 inner join table2 t2 on t1.id = t2.id and
> > t1.question = t2.question and coalesce(t1.response,'ISNULL') <>
> > coalesce(t2.response,'ISNULL')
> > What gives?
>
> I think, so must problem can be in ugly predicate
> coalesce(t1.response,'ISNULL') <>
> > coalesce(t2.response,'ISNULL')
>
> try use a IS DISTINCT OF operator
>
> ... AND t1.response IS DISTINCT t2.response
>
> Regards
>
> Pavel Stehule
>
> p.s. don't use a coalesce in WHERE clause if it is possible.

Re: Query Performance SQL Server vs. Postgresql

От
Tom Lane
Дата:
Humair Mohammed <humairm@hotmail.com> writes:
> Yes strange indeed, I did rerun ANALYZE and VACCUM. Took 70 seconds to rerun the query. Results from EXPLAIN ANALYZE
below:
> "Hash Join  (cost=16212.30..52586.43 rows=92869 width=17) (actual time=43200.223..49502.874 rows=3163 loops=1)""
HashCond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text = (pg.question)::text))""  Join Filter:
((COALESCE(pb.response,'MISSING'::character varying))::text <> (COALESCE(pg.response, 'MISSING'::character
varying))::text)"" ->  Seq Scan on pivotbad pb  (cost=0.00..2804.96 rows=93496 width=134) (actual time=0.009..48.200
rows=93496loops=1)""  ->  Hash  (cost=7537.12..7537.12 rows=251212 width=134) (actual time=42919.453..42919.453
rows=251212loops=1)""        Buckets: 1024  Batches: 64  Memory Usage: 650kB""        ->  Seq Scan on pivotgood pg
(cost=0.00..7537.12rows=251212 width=134) (actual time=0.119..173.019 rows=251212 loops=1)""Total runtime: 49503.450
ms"

I have no idea how much memory SQL Server thinks it can use, but
Postgres is limiting itself to work_mem which you've apparently left at
the default 1MB.  You might get a fairer comparison by bumping that up
some --- try 32MB or so.  You want it high enough so that the Hash
output doesn't say there are multiple batches.

            regards, tom lane

Re: Query Performance SQL Server vs. Postgresql

От
Pavel Stehule
Дата:
Hello,

there should be a problem in a statistic, they are out of reality.
Please, try to use a DISTINCT OF operator now - maybe a statistic will
be better. Next -  try to increase a work_mem. Hash join is
untypically slow in your comp.

Regards

Pavel Stehule

2010/11/17 Humair Mohammed <humairm@hotmail.com>:
> Yes strange indeed, I did rerun ANALYZE and VACCUM. Took 70 seconds to rerun
> the query. Results from EXPLAIN ANALYZE below:
> "Hash Join  (cost=16212.30..52586.43 rows=92869 width=17) (actual
> time=43200.223..49502.874 rows=3163 loops=1)"
> "  Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text =
> (pg.question)::text))"
> "  Join Filter: ((COALESCE(pb.response, 'MISSING'::character varying))::text
> <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
> "  ->  Seq Scan on pivotbad pb  (cost=0.00..2804.96 rows=93496 width=134)
> (actual time=0.009..48.200 rows=93496 loops=1)"
> "  ->  Hash  (cost=7537.12..7537.12 rows=251212 width=134) (actual
> time=42919.453..42919.453 rows=251212 loops=1)"
> "        Buckets: 1024  Batches: 64  Memory Usage: 650kB"
> "        ->  Seq Scan on pivotgood pg  (cost=0.00..7537.12 rows=251212
> width=134) (actual time=0.119..173.019 rows=251212 loops=1)"
> "Total runtime: 49503.450 ms"
>
>> From: pavel.stehule@gmail.com
>> Date: Wed, 17 Nov 2010 05:47:51 +0100
>> Subject: Re: Query Performance SQL Server vs. Postgresql
>> To: humairm@hotmail.com
>> CC: pgsql-performance@postgresql.org
>>
>> 2010/11/17 Humair Mohammed <humairm@hotmail.com>:
>> >
>> > There are no indexes on the tables either in SQL Server or Postgresql -
>> > I am
>> > comparing apples to apples here. I ran ANALYZE on the postgresql tables,
>> > after that query performance times are still high 42 seconds with
>> > COALESCE
>> > and 35 seconds with IS DISTINCT FROM.
>> > Here is the execution plan from Postgresql for qurey - select pb.id from
>> > pivotbad pb inner join pivotgood pg on pb.id = pg.id and pb.question =
>> > pg.question and coalesce(pb.response,'MISSING') <>
>> > coalesce(pg.response,'MISSING')
>> > Execution Time: 42 seconds
>> > "Hash Join  (cost=16212.30..48854.24 rows=93477 width=17)"
>> > "  Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text
>> > =
>> > (pg.question)::text))"
>> > "  Join Filter: ((COALESCE(pb.response, 'MISSING'::character
>> > varying))::text
>> > <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
>> > "  ->  Seq Scan on pivotbad pb  (cost=0.00..2804.96 rows=93496
>> > width=134)"
>> > "  ->  Hash  (cost=7537.12..7537.12 rows=251212 width=134)"
>> > "        ->  Seq Scan on pivotgood pg  (cost=0.00..7537.12 rows=251212
>> > width=134)"
>>
>> this is little bit strange - did you ANALYZE and VACUUM?
>>
>> please send result of EXPLAIN ANALYZE
>>
>> Pavel
>>
>> >
>> > And here is the execution plan from SQL Server for query - select pb.id
>> > from
>> > pivotbad pb inner join pivotgood pg on pb.id = pg.id and pb.question =
>> > pg.question and isnull(pb.response,'ISNULL')<>
>> >  isnull(pg.response,'ISNULL')
>> > Execution Time: < 1 second
>> > Cost: 1%  |--Parallelism(Gather Streams)
>> > Cost: 31%       |--Hash Match(Inner Join, HASH:([pb].[ID],
>> > [pb].[Question])=([pg].[ID], [pg].[Question]),
>> > RESIDUAL:([master].[dbo].[pivotbad].[ID] as
>> > [pb].[ID]=[master].[dbo].[pivotgood].[ID] as [pg].[ID] AND
>> > [master].[dbo].[pivotbad].[Question] as
>> > [pb].[Question]=[master].[dbo].[pivotgood].[Question] as [pg].[Question]
>> > AND
>> > [Expr1006]<>[Expr1007]))
>> >     Cost: 0%  |--Bitmap(HASH:([pb].[ID], [pb].[Question]),
>> > DEFINE:([Bitmap1008]))
>> >             Cost: 0%    |--Compute
>> > Scalar(DEFINE:([Expr1006]=isnull([master].[dbo].[pivotbad].[Response] as
>> > [pb].[Response],'ISNULL')))
>> >             Cost:  6%   |--Parallelism(Repartition Streams, Hash
>> > Partitioning, PARTITION COLUMNS:([pb].[ID], [pb].[Question]))
>> >             Cost: 12%  |--Table Scan(OBJECT:([master].[dbo].[pivotbad]
>> > AS
>> > [pb]))
>> >             Cost: 0% |--Compute
>> > Scalar(DEFINE:([Expr1007]=isnull([master].[dbo].[pivotgood].[Response]
>> > as
>> > [pg].[Response],'ISNULL')))
>> >                 Cost: 17% |--Parallelism(Repartition Streams, Hash
>> > Partitioning, PARTITION COLUMNS:([pg].[ID], [pg].[Question]))
>> >                     Cost: 33% |--Table
>> > Scan(OBJECT:([master].[dbo].[pivotgood] AS [pg]),
>> > WHERE:(PROBE([Bitmap1008],[master].[dbo].[pivotgood].[ID] as
>> > [pg].[ID],[master].[dbo].[pivotgood].[Question] as [pg].[Question])))
>> >
>> >
>> >
>> >> From: pavel.stehule@gmail.com
>> >> Date: Tue, 16 Nov 2010 08:12:03 +0100
>> >> Subject: Re: [PERFORM]
>> >> To: humairm@hotmail.com
>> >> CC: pgsql-performance@postgresql.org
>> >>
>> >> 2010/11/15 Humair Mohammed <humairm@hotmail.com>:
>> >> > I have 2 tables with a 200,000 rows of data 3 character/string
>> >> > columns
>> >> > ID,
>> >> > Question and Response. The query below compares the data between the
>> >> > 2
>> >> > tables based on ID and Question and if the Response does not match
>> >> > between
>> >> > the left table and the right table it identifies the ID's where there
>> >> > is
>> >> > a
>> >> > mismatch. Running the query in SQL Server 2008 using the ISNULL
>> >> > function
>> >> > take a few milliseconds. Running the same query in Postgresql takes
>> >> > over
>> >> > 70
>> >> > seconds. The 2 queries are below:
>> >> > SQL Server 2008 R2 Query
>> >> > select t1.id from table1 t1 inner join table2 t2 on t1.id = t2.id and
>> >> > t1.question = t2.question and isnull(t1.response,'ISNULL') <>
>> >> > isnull(t2.response,'ISNULL')
>> >>
>> >> > Postgres 9.1 Query
>> >> > select t1.id from table1 t1 inner join table2 t2 on t1.id = t2.id and
>> >> > t1.question = t2.question and coalesce(t1.response,'ISNULL') <>
>> >> > coalesce(t2.response,'ISNULL')
>> >> > What gives?
>> >>
>> >> I think, so must problem can be in ugly predicate
>> >> coalesce(t1.response,'ISNULL') <>
>> >> > coalesce(t2.response,'ISNULL')
>> >>
>> >> try use a IS DISTINCT OF operator
>> >>
>> >> ... AND t1.response IS DISTINCT t2.response
>> >>
>> >> Regards
>> >>
>> >> Pavel Stehule
>> >>
>> >> p.s. don't use a coalesce in WHERE clause if it is possible.
>> >
>

Re: Query Performance SQL Server vs. Postgresql

От
"Kevin Grittner"
Дата:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I have no idea how much memory SQL Server thinks it can use

Hmmm...  That triggered an old memory -- when we were running SQL
Server on Windows there was some registry setting which we tweaked
to prevent the OS from trying to cache disk I/O.  (Sorry I don't
remember the name of it.)  That helped SQL Server perform better,
but would cripple PostgreSQL -- it counts on OS caching.

Of course, once we found that PostgreSQL was 70% faster on identical
hardware with identical load, and switching the OS to Linux brought
it to twice as fast, I haven't had to worry about SQL Server or
Windows configurations.  ;-)  Don't panic if PostgreSQL seems slower
at first, it's probably a configuration or maintenance schedule
issue that can be sorted out.

Besides the specific advice Tom gave you, you might want to browse
this page for configuration in general:

http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

And if you continue to experience performance issues, this page can
help you get to a resolution quickly:

http://wiki.postgresql.org/wiki/SlowQueryQuestions

We've been very happy with the switch to PostgreSQL.  We've had
better performance, better reliability, less staff time needed to
babysit backups, and we've been gradually using more of the advance
features not available in other products.  It's well worth the
effort to get over those initial bumps resulting from product
differences.

-Kevin

Re: Query Performance SQL Server vs. Postgresql

От
Pavel Stehule
Дата:
Hello

I don't know. I checked similar query on similar dataset (size), and I
have a times about 3 sec on much more worse notebook. So problem can
be in disk IO operation speed - maybe in access to TOASTed value.

2010/11/21 Humair Mohammed <humairm@hotmail.com>:
> I am running 64-bit Windows 7 Enterprise with 8G RAM and a Dual Core 2.67
> Ghz Intel CPU. Both the SQL 2008 R2 and Postgresql are installed on the same
> machine. The DISTINCT FROM instead of the COALESCE does not help much. I ran
> 2 further tests with work_mem modifications (please note memory usage is
> quite low 650kb, so I am not sure if the work_mem is a factor):

it's has a little bit different meaning. work_mem is just limit, so
"memory usage" must not be great than work_mem ever. if then pg
increase "butches" number - store data to blocks on disk. Higher
work_mem ~ less butches. So ideal is 1 butches.

Regards

Pavel Stehule

> First, I modified the work_mem setting to 1GB (reloaded config) from the
> default 1MB and I see a response time of 33 seconds. Results below from
> EXPLAIN ANALYZE:
> "Hash Join  (cost=16212.30..52586.43 rows=92869 width=17) (actual
> time=26742.343..33274.317 rows=3163 loops=1)"
> "  Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text =
> (pg.question)::text))"
> "  Join Filter: ((COALESCE(pb.response, 'MISSING'::character varying))::text
> <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
> "  ->  Seq Scan on pivotbad pb  (cost=0.00..2804.96 rows=93496 width=134)
> (actual time=0.055..40.710 rows=93496 loops=1)"
> "  ->  Hash  (cost=7537.12..7537.12 rows=251212 width=134) (actual
> time=25603.460..25603.460 rows=251212 loops=1)"
> "        Buckets: 1024  Batches: 64  Memory Usage: 650kB"
> "        ->  Seq Scan on pivotgood pg  (cost=0.00..7537.12 rows=251212
> width=134) (actual time=0.050..120.269 rows=251212 loops=1)"
> "Total runtime: 33275.028 ms"
>
> Second, I modified the work_mem setting to 2GB (reloaded config) and I see a
> response time of 38 seconds. Results below from EXPLAIN ANALYZE:
> "Hash Join  (cost=16212.30..52586.43 rows=92869 width=17) (actual
> time=26574.459..38406.422 rows=3163 loops=1)"
> "  Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text =
> (pg.question)::text))"
> "  Join Filter: ((COALESCE(pb.response, 'MISSING'::character varying))::text
> <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
> "  ->  Seq Scan on pivotbad pb  (cost=0.00..2804.96 rows=93496 width=134)
> (actual time=0.067..37.938 rows=93496 loops=1)"
> "  ->  Hash  (cost=7537.12..7537.12 rows=251212 width=134) (actual
> time=26426.127..26426.127 rows=251212 loops=1)"
> "        Buckets: 1024  Batches: 64  Memory Usage: 650kB"
> "        ->  Seq Scan on pivotgood pg  (cost=0.00..7537.12 rows=251212
> width=134) (actual time=0.038..115.319 rows=251212 loops=1)"
> "Total runtime: 38406.927 ms"
>
> By no means I am trying to compare the 2 products. When I noticed the slow
> behavior of COALESCE I tried it on SQL Server. And since they are running on
> the same machine my comment regarding apples to apples. It is possible that
> this is not an apples to apples comparison other than the fact that it is
> running on the same machine.
>
>> From: pavel.stehule@gmail.com
>> Date: Thu, 18 Nov 2010 07:14:24 +0100
>> Subject: Re: Query Performance SQL Server vs. Postgresql
>> To: humairm@hotmail.com
>> CC: pgsql-performance@postgresql.org
>>
>> Hello,
>>
>> there should be a problem in a statistic, they are out of reality.
>> Please, try to use a DISTINCT OF operator now - maybe a statistic will
>> be better. Next - try to increase a work_mem. Hash join is
>> untypically slow in your comp.
>>
>> Regards
>>
>> Pavel Stehule
>>
>> 2010/11/17 Humair Mohammed <humairm@hotmail.com>:
>> > Yes strange indeed, I did rerun ANALYZE and VACCUM. Took 70 seconds to
>> > rerun
>> > the query. Results from EXPLAIN ANALYZE below:
>> > "Hash Join  (cost=16212.30..52586.43 rows=92869 width=17) (actual
>> > time=43200.223..49502.874 rows=3163 loops=1)"
>> > "  Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text
>> > =
>> > (pg.question)::text))"
>> > "  Join Filter: ((COALESCE(pb.response, 'MISSING'::character
>> > varying))::text
>> > <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
>> > "  ->  Seq Scan on pivotbad pb  (cost=0.00..2804.96 rows=93496
>> > width=134)
>> > (actual time=0.009..48.200 rows=93496 loops=1)"
>> > "  ->  Hash  (cost=7537.12..7537.12 rows=251212 width=134) (actual
>> > time=42919.453..42919.453 rows=251212 loops=1)"
>> > "        Buckets: 1024  Batches: 64  Memory Usage: 650kB"
>> > "        ->  Seq Scan on pivotgood pg  (cost=0.00..7537.12 rows=251212
>> > width=134) (actual time=0.119..173.019 rows=251212 loops=1)"
>> > "Total runtime: 49503.450 ms"
>> >
>> >> From: pavel.stehule@gmail.com
>> >> Date: Wed, 17 Nov 2010 05:47:51 +0100
>> >> Subject: Re: Query Performance SQL Server vs. Postgresql
>> >> To: humairm@hotmail.com
>> >> CC: pgsql-performance@postgresql.org
>> >>
>> >> 2010/11/17 Humair Mohammed <humairm@hotmail.com>:
>> >> >
>> >> > There are no indexes on the tables either in SQL Server or Postgresql
>> >> > -
>> >> > I am
>> >> > comparing apples to apples here. I ran ANALYZE on the postgresql
>> >> > tables,
>> >> > after that query performance times are still high 42 seconds with
>> >> > COALESCE
>> >> > and 35 seconds with IS DISTINCT FROM.
>> >> > Here is the execution plan from Postgresql for qurey - select pb.id
>> >> > from
>> >> > pivotbad pb inner join pivotgood pg on pb.id = pg.id and pb.question
>> >> > =
>> >> > pg.question and coalesce(pb.response,'MISSING') <>
>> >> > coalesce(pg.response,'MISSING')
>> >> > Execution Time: 42 seconds
>> >> > "Hash Join  (cost=16212.30..48854.24 rows=93477 width=17)"
>> >> > "  Hash Cond: (((pb.id)::text = (pg.id)::text) AND
>> >> > ((pb.question)::text
>> >> > =
>> >> > (pg.question)::text))"
>> >> > "  Join Filter: ((COALESCE(pb.response, 'MISSING'::character
>> >> > varying))::text
>> >> > <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
>> >> > "  ->  Seq Scan on pivotbad pb  (cost=0.00..2804.96 rows=93496
>> >> > width=134)"
>> >> > "  ->  Hash  (cost=7537.12..7537.12 rows=251212 width=134)"
>> >> > "        ->  Seq Scan on pivotgood pg  (cost=0.00..7537.12
>> >> > rows=251212
>> >> > width=134)"
>> >>
>> >> this is little bit strange - did you ANALYZE and VACUUM?
>> >>
>> >> please send result of EXPLAIN ANALYZE
>> >>
>> >> Pavel
>> >>
>> >> >
>> >> > And here is the execution plan from SQL Server for query - select
>> >> > pb.id
>> >> > from
>> >> > pivotbad pb inner join pivotgood pg on pb.id = pg.id and pb.question
>> >> > =
>> >> > pg.question and isnull(pb.response,'ISNULL')<>
>> >> >  isnull(pg.response,'ISNULL')
>> >> > Execution Time: < 1 second
>> >> > Cost: 1%  |--Parallelism(Gather Streams)
>> >> > Cost: 31%       |--Hash Match(Inner Join, HASH:([pb].[ID],
>> >> > [pb].[Question])=([pg].[ID], [pg].[Question]),
>> >> > RESIDUAL:([master].[dbo].[pivotbad].[ID] as
>> >> > [pb].[ID]=[master].[dbo].[pivotgood].[ID] as [pg].[ID] AND
>> >> > [master].[dbo].[pivotbad].[Question] as
>> >> > [pb].[Question]=[master].[dbo].[pivotgood].[Question] as
>> >> > [pg].[Question]
>> >> > AND
>> >> > [Expr1006]<>[Expr1007]))
>> >> >     Cost: 0%  |--Bitmap(HASH:([pb].[ID], [pb].[Question]),
>> >> > DEFINE:([Bitmap1008]))
>> >> >             Cost: 0%    |--Compute
>> >> > Scalar(DEFINE:([Expr1006]=isnull([master].[dbo].[pivotbad].[Response]
>> >> > as
>> >> > [pb].[Response],'ISNULL')))
>> >> >             Cost:  6%   |--Parallelism(Repartition Streams, Hash
>> >> > Partitioning, PARTITION COLUMNS:([pb].[ID], [pb].[Question]))
>> >> >             Cost: 12%  |--Table
>> >> > Scan(OBJECT:([master].[dbo].[pivotbad]
>> >> > AS
>> >> > [pb]))
>> >> >             Cost: 0% |--Compute
>> >> >
>> >> > Scalar(DEFINE:([Expr1007]=isnull([master].[dbo].[pivotgood].[Response]
>> >> > as
>> >> > [pg].[Response],'ISNULL')))
>> >> >                 Cost: 17% |--Parallelism(Repartition Streams, Hash
>> >> > Partitioning, PARTITION COLUMNS:([pg].[ID], [pg].[Question]))
>> >> >                     Cost: 33% |--Table
>> >> > Scan(OBJECT:([master].[dbo].[pivotgood] AS [pg]),
>> >> > WHERE:(PROBE([Bitmap1008],[master].[dbo].[pivotgood].[ID] as
>> >> > [pg].[ID],[master].[dbo].[pivotgood].[Question] as [pg].[Question])))
>> >> >
>> >> >
>> >> >
>> >> >> From: pavel.stehule@gmail.com
>> >> >> Date: Tue, 16 Nov 2010 08:12:03 +0100
>> >> >> Subject: Re: [PERFORM]
>> >> >> To: humairm@hotmail.com
>> >> >> CC: pgsql-performance@postgresql.org
>> >> >>
>> >> >> 2010/11/15 Humair Mohammed <humairm@hotmail.com>:
>> >> >> > I have 2 tables with a 200,000 rows of data 3 character/string
>> >> >> > columns
>> >> >> > ID,
>> >> >> > Question and Response. The query below compares the data between
>> >> >> > the
>> >> >> > 2
>> >> >> > tables based on ID and Question and if the Response does not match
>> >> >> > between
>> >> >> > the left table and the right table it identifies the ID's where
>> >> >> > there
>> >> >> > is
>> >> >> > a
>> >> >> > mismatch. Running the query in SQL Server 2008 using the ISNULL
>> >> >> > function
>> >> >> > take a few milliseconds. Running the same query in Postgresql
>> >> >> > takes
>> >> >> > over
>> >> >> > 70
>> >> >> > seconds. The 2 queries are below:
>> >> >> > SQL Server 2008 R2 Query
>> >> >> > select t1.id from table1 t1 inner join table2 t2 on t1.id = t2.id
>> >> >> > and
>> >> >> > t1.question = t2.question and isnull(t1.response,'ISNULL') <>
>> >> >> > isnull(t2.response,'ISNULL')
>> >> >>
>> >> >> > Postgres 9.1 Query
>> >> >> > select t1.id from table1 t1 inner join table2 t2 on t1.id = t2.id
>> >> >> > and
>> >> >> > t1.question = t2.question and coalesce(t1.response,'ISNULL') <>
>> >> >> > coalesce(t2.response,'ISNULL')
>> >> >> > What gives?
>> >> >>
>> >> >> I think, so must problem can be in ugly predicate
>> >> >> coalesce(t1.response,'ISNULL') <>
>> >> >> > coalesce(t2.response,'ISNULL')
>> >> >>
>> >> >> try use a IS DISTINCT OF operator
>> >> >>
>> >> >> ... AND t1.response IS DISTINCT t2.response
>> >> >>
>> >> >> Regards
>> >> >>
>> >> >> Pavel Stehule
>> >> >>
>> >> >> p.s. don't use a coalesce in WHERE clause if it is possible.
>> >> >
>> >
>

Re: Query Performance SQL Server vs. Postgresql

От
Humair Mohammed
Дата:
I am running 64-bit Windows 7 Enterprise with 8G RAM and a Dual Core 2.67 Ghz Intel CPU. Both the SQL 2008 R2 and Postgresql are installed on the same machine. The DISTINCT FROM instead of the COALESCE does not help much. I ran 2 further tests with work_mem modifications (please note memory usage is quite low 650kb, so I am not sure if the work_mem is a factor):

First, I modified the work_mem setting to 1GB (reloaded config) from the default 1MB and I see a response time of 33 seconds. Results below from EXPLAIN ANALYZE:

"Hash Join  (cost=16212.30..52586.43 rows=92869 width=17) (actual time=26742.343..33274.317 rows=3163 loops=1)"
"  Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text = (pg.question)::text))"
"  Join Filter: ((COALESCE(pb.response, 'MISSING'::character varying))::text <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
"  ->  Seq Scan on pivotbad pb  (cost=0.00..2804.96 rows=93496 width=134) (actual time=0.055..40.710 rows=93496 loops=1)"
"  ->  Hash  (cost=7537.12..7537.12 rows=251212 width=134) (actual time=25603.460..25603.460 rows=251212 loops=1)"
"        Buckets: 1024  Batches: 64  Memory Usage: 650kB"
"        ->  Seq Scan on pivotgood pg  (cost=0.00..7537.12 rows=251212 width=134) (actual time=0.050..120.269 rows=251212 loops=1)"
"Total runtime: 33275.028 ms"


Second, I modified the work_mem setting to 2GB (reloaded config) and I see a response time of 38 seconds. Results below from EXPLAIN ANALYZE:

"Hash Join  (cost=16212.30..52586.43 rows=92869 width=17) (actual time=26574.459..38406.422 rows=3163 loops=1)"
"  Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text = (pg.question)::text))"
"  Join Filter: ((COALESCE(pb.response, 'MISSING'::character varying))::text <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
"  ->  Seq Scan on pivotbad pb  (cost=0.00..2804.96 rows=93496 width=134) (actual time=0.067..37.938 rows=93496 loops=1)"
"  ->  Hash  (cost=7537.12..7537.12 rows=251212 width=134) (actual time=26426.127..26426.127 rows=251212 loops=1)"
"        Buckets: 1024  Batches: 64  Memory Usage: 650kB"
"        ->  Seq Scan on pivotgood pg  (cost=0.00..7537.12 rows=251212 width=134) (actual time=0.038..115.319 rows=251212 loops=1)"
"Total runtime: 38406.927 ms"


By no means I am trying to compare the 2 products. When I noticed the slow behavior of COALESCE I tried it on SQL Server. And since they are running on the same machine my comment regarding apples to apples. It is possible that this is not an apples to apples comparison other than the fact that it is running on the same machine.


> From: pavel.stehule@gmail.com
> Date: Thu, 18 Nov 2010 07:14:24 +0100
> Subject: Re: Query Performance SQL Server vs. Postgresql
> To: humairm@hotmail.com
> CC: pgsql-performance@postgresql.org
>
> Hello,
>
> there should be a problem in a statistic, they are out of reality.
> Please, try to use a DISTINCT OF operator now - maybe a statistic will
> be better. Next - try to increase a work_mem. Hash join is
> untypically slow in your comp.
>
> Regards
>
> Pavel Stehule
>
> 2010/11/17 Humair Mohammed <humairm@hotmail.com>:
> > Yes strange indeed, I did rerun ANALYZE and VACCUM. Took 70 seconds to rerun
> > the query. Results from EXPLAIN ANALYZE below:
> > "Hash Join  (cost=16212.30..52586.43 rows=92869 width=17) (actual
> > time=43200.223..49502.874 rows=3163 loops=1)"
> > "  Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text =
> > (pg.question)::text))"
> > "  Join Filter: ((COALESCE(pb.response, 'MISSING'::character varying))::text
> > <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
> > "  ->  Seq Scan on pivotbad pb  (cost=0.00..2804.96 rows=93496 width=134)
> > (actual time=0.009..48.200 rows=93496 loops=1)"
> > "  ->  Hash  (cost=7537.12..7537.12 rows=251212 width=134) (actual
> > time=42919.453..42919.453 rows=251212 loops=1)"
> > "        Buckets: 1024  Batches: 64  Memory Usage: 650kB"
> > "        ->  Seq Scan on pivotgood pg  (cost=0.00..7537.12 rows=251212
> > width=134) (actual time=0.119..173.019 rows=251212 loops=1)"
> > "Total runtime: 49503.450 ms"
> >
> >> From: pavel.stehule@gmail.com
> >> Date: Wed, 17 Nov 2010 05:47:51 +0100
> >> Subject: Re: Query Performance SQL Server vs. Postgresql
> >> To: humairm@hotmail.com
> >> CC: pgsql-performance@postgresql.org
> >>
> >> 2010/11/17 Humair Mohammed <humairm@hotmail.com>:
> >> >
> >> > There are no indexes on the tables either in SQL Server or Postgresql -
> >> > I am
> >> > comparing apples to apples here. I ran ANALYZE on the postgresql tables,
> >> > after that query performance times are still high 42 seconds with
> >> > COALESCE
> >> > and 35 seconds with IS DISTINCT FROM.
> >> > Here is the execution plan from Postgresql for qurey - select pb.id from
> >> > pivotbad pb inner join pivotgood pg on pb.id = pg.id and pb.question =
> >> > pg.question and coalesce(pb.response,'MISSING') <>
> >> > coalesce(pg.response,'MISSING')
> >> > Execution Time: 42 seconds
> >> > "Hash Join  (cost=16212.30..48854.24 rows=93477 width=17)"
> >> > "  Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text
> >> > =
> >> > (pg.question)::text))"
> >> > "  Join Filter: ((COALESCE(pb.response, 'MISSING'::character
> >> > varying))::text
> >> > <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
> >> > "  ->  Seq Scan on pivotbad pb  (cost=0.00..2804.96 rows=93496
> >> > width=134)"
> >> > "  ->  Hash  (cost=7537.12..7537.12 rows=251212 width=134)"
> >> > "        ->  Seq Scan on pivotgood pg  (cost=0.00..7537.12 rows=251212
> >> > width=134)"
> >>
> >> this is little bit strange - did you ANALYZE and VACUUM?
> >>
> >> please send result of EXPLAIN ANALYZE
> >>
> >> Pavel
> >>
> >> >
> >> > And here is the execution plan from SQL Server for query - select pb.id
> >> > from
> >> > pivotbad pb inner join pivotgood pg on pb.id = pg.id and pb.question =
> >> > pg.question and isnull(pb.response,'ISNULL')<>
> >> >  isnull(pg.response,'ISNULL')
> >> > Execution Time: < 1 second
> >> > Cost: 1%  |--Parallelism(Gather Streams)
> >> > Cost: 31%       |--Hash Match(Inner Join, HASH:([pb].[ID],
> >> > [pb].[Question])=([pg].[ID], [pg].[Question]),
> >> > RESIDUAL:([master].[dbo].[pivotbad].[ID] as
> >> > [pb].[ID]=[master].[dbo].[pivotgood].[ID] as [pg].[ID] AND
> >> > [master].[dbo].[pivotbad].[Question] as
> >> > [pb].[Question]=[master].[dbo].[pivotgood].[Question] as [pg].[Question]
> >> > AND
> >> > [Expr1006]<>[Expr1007]))
> >> >     Cost: 0%  |--Bitmap(HASH:([pb].[ID], [pb].[Question]),
> >> > DEFINE:([Bitmap1008]))
> >> >             Cost: 0%    |--Compute
> >> > Scalar(DEFINE:([Expr1006]=isnull([master].[dbo].[pivotbad].[Response] as
> >> > [pb].[Response],'ISNULL')))
> >> >             Cost:  6%   |--Parallelism(Repartition Streams, Hash
> >> > Partitioning, PARTITION COLUMNS:([pb].[ID], [pb].[Question]))
> >> >             Cost: 12%  |--Table Scan(OBJECT:([master].[dbo].[pivotbad]
> >> > AS
> >> > [pb]))
> >> >             Cost: 0% |--Compute
> >> > Scalar(DEFINE:([Expr1007]=isnull([master].[dbo].[pivotgood].[Response]
> >> > as
> >> > [pg].[Response],'ISNULL')))
> >> >                 Cost: 17% |--Parallelism(Repartition Streams, Hash
> >> > Partitioning, PARTITION COLUMNS:([pg].[ID], [pg].[Question]))
> >> >                     Cost: 33% |--Table
> >> > Scan(OBJECT:([master].[dbo].[pivotgood] AS [pg]),
> >> > WHERE:(PROBE([Bitmap1008],[master].[dbo].[pivotgood].[ID] as
> >> > [pg].[ID],[master].[dbo].[pivotgood].[Question] as [pg].[Question])))
> >> >
> >> >
> >> >
> >> >> From: pavel.stehule@gmail.com
> >> >> Date: Tue, 16 Nov 2010 08:12:03 +0100
> >> >> Subject: Re: [PERFORM]
> >> >> To: humairm@hotmail.com
> >> >> CC: pgsql-performance@postgresql.org
> >> >>
> >> >> 2010/11/15 Humair Mohammed <humairm@hotmail.com>:
> >> >> > I have 2 tables with a 200,000 rows of data 3 character/string
> >> >> > columns
> >> >> > ID,
> >> >> > Question and Response. The query below compares the data between the
> >> >> > 2
> >> >> > tables based on ID and Question and if the Response does not match
> >> >> > between
> >> >> > the left table and the right table it identifies the ID's where there
> >> >> > is
> >> >> > a
> >> >> > mismatch. Running the query in SQL Server 2008 using the ISNULL
> >> >> > function
> >> >> > take a few milliseconds. Running the same query in Postgresql takes
> >> >> > over
> >> >> > 70
> >> >> > seconds. The 2 queries are below:
> >> >> > SQL Server 2008 R2 Query
> >> >> > select t1.id from table1 t1 inner join table2 t2 on t1.id = t2.id and
> >> >> > t1.question = t2.question and isnull(t1.response,'ISNULL') <>
> >> >> > isnull(t2.response,'ISNULL')
> >> >>
> >> >> > Postgres 9.1 Query
> >> >> > select t1.id from table1 t1 inner join table2 t2 on t1.id = t2.id and
> >> >> > t1.question = t2.question and coalesce(t1.response,'ISNULL') <>
> >> >> > coalesce(t2.response,'ISNULL')
> >> >> > What gives?
> >> >>
> >> >> I think, so must problem can be in ugly predicate
> >> >> coalesce(t1.response,'ISNULL') <>
> >> >> > coalesce(t2.response,'ISNULL')
> >> >>
> >> >> try use a IS DISTINCT OF operator
> >> >>
> >> >> ... AND t1.response IS DISTINCT t2.response
> >> >>
> >> >> Regards
> >> >>
> >> >> Pavel Stehule
> >> >>
> >> >> p.s. don't use a coalesce in WHERE clause if it is possible.
> >> >
> >

Re: Query Performance SQL Server vs. Postgresql

От
Humair Mohammed
Дата:

1) OS/Configuration
64-bit Windows 7 Enterprise with 8G RAM and a Dual Core 2.67 Ghz Intel CPU
postgresql-x64-9.0 (PostgreSQL 9.0.1, compiled by Visual C++ build 1500, 64-bit)
work_mem  2GB
shared_buffers = 2

2) Dataset
name,pages,tuples,pg_size_pretty
"pivotbad";1870;93496;"15 MB"
"pivotgood";5025;251212;"39 MB"

3) EXPLAIN (ANALYZE ON, BUFFERS ON)
"Hash Join  (cost=16212.30..52586.43 rows=92869 width=17) (actual time=25814.222..32296.765 rows=3163 loops=1)"
"  Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text = (pg.question)::text))"
"  Join Filter: ((COALESCE(pb.response, 'MISSING'::character varying))::text <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
"  Buffers: shared hit=384 read=6511, temp read=6444 written=6318"
"  ->  Seq Scan on pivotbad pb  (cost=0.00..2804.96 rows=93496 width=134) (actual time=0.069..37.143 rows=93496 loops=1)"
"        Buffers: shared hit=192 read=1678"
"  ->  Hash  (cost=7537.12..7537.12 rows=251212 width=134) (actual time=24621.752..24621.752 rows=251212 loops=1)"
"        Buckets: 1024  Batches: 64  Memory Usage: 650kB"
"        Buffers: shared hit=192 read=4833, temp written=4524"
"        ->  Seq Scan on pivotgood pg  (cost=0.00..7537.12 rows=251212 width=134) (actual time=0.038..117.780 rows=251212 loops=1)"
"              Buffers: shared hit=192 read=4833"
"Total runtime: 32297.305 ms"

4) INDEXES
I can certainly add an index but given the table sizes I am not sure if that is a factor. This by no means is a large dataset less than 350,000 rows in total and 3 columns. Also this was just a quick dump of data for comparison purpose. When I saw the poor performance on the COALESCE, I pointed the data load to SQL Server and ran the same query except with the TSQL specific ISNULL function.

Re: Query Performance SQL Server vs. Postgresql

От
Pavel Stehule
Дата:
2010/11/21 Humair Mohammed <humairm@hotmail.com>:
>
> 1) OS/Configuration
> 64-bit Windows 7 Enterprise with 8G RAM and a Dual Core 2.67 Ghz Intel CPU
> postgresql-x64-9.0 (PostgreSQL 9.0.1, compiled by Visual C++ build 1500,
> 64-bit)
> work_mem  2GB
> shared_buffers = 2

shared_buffers = 2 ???

Regards

Pavel Stehule


> 2) Dataset
> name,pages,tuples,pg_size_pretty
> "pivotbad";1870;93496;"15 MB"
> "pivotgood";5025;251212;"39 MB"
> 3) EXPLAIN (ANALYZE ON, BUFFERS ON)
> "Hash Join  (cost=16212.30..52586.43 rows=92869 width=17) (actual
> time=25814.222..32296.765 rows=3163 loops=1)"
> "  Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text =
> (pg.question)::text))"
> "  Join Filter: ((COALESCE(pb.response, 'MISSING'::character varying))::text
> <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
> "  Buffers: shared hit=384 read=6511, temp read=6444 written=6318"
> "  ->  Seq Scan on pivotbad pb  (cost=0.00..2804.96 rows=93496 width=134)
> (actual time=0.069..37.143 rows=93496 loops=1)"
> "        Buffers: shared hit=192 read=1678"
> "  ->  Hash  (cost=7537.12..7537.12 rows=251212 width=134) (actual
> time=24621.752..24621.752 rows=251212 loops=1)"
> "        Buckets: 1024  Batches: 64  Memory Usage: 650kB"
> "        Buffers: shared hit=192 read=4833, temp written=4524"
> "        ->  Seq Scan on pivotgood pg  (cost=0.00..7537.12 rows=251212
> width=134) (actual time=0.038..117.780 rows=251212 loops=1)"
> "              Buffers: shared hit=192 read=4833"
> "Total runtime: 32297.305 ms"
> 4) INDEXES
> I can certainly add an index but given the table sizes I am not sure if that
> is a factor. This by no means is a large dataset less than 350,000 rows in
> total and 3 columns. Also this was just a quick dump of data for comparison
> purpose. When I saw the poor performance on the COALESCE, I pointed the data
> load to SQL Server and ran the same query except with the TSQL specific
> ISNULL function.
>

Re: Query Performance SQL Server vs. Postgresql

От
tv@fuzzy.cz
Дата:
> 4) INDEXESI can certainly add an index but given the table sizes I am not
> sure if that is a factor. This by no means is a large dataset less than
> 350,000 rows in total and 3 columns. Also this was just a quick dump of
> data for comparison purpose. When I saw the poor performance on the
> COALESCE, I pointed the data load to SQL Server and ran the same query
> except with the TSQL specific ISNULL function.

350000 rows definitely is a lot of rows, although with 3 INT column it's
just about 13MB of data (including overhead). But indexes can be quite
handy when doing joins, as in this case.

Tomas


Re: Query Performance SQL Server vs. Postgresql

От
Pavel Stehule
Дата:
Hello

2010/11/21 Humair Mohammed <humairm@hotmail.com>:
> That was a typo:
> work_mem = 2GB
> shared_buffers = 2GB

ok, then try to decrease a shared_buffers. Maybe a win7 has a some
problem - large a shared buffers are well just for UNIX like systems.
I am thinking so 500 MB is enough

Regards

Pavel Stehule

>> From: pavel.stehule@gmail.com
>> Date: Sun, 21 Nov 2010 12:38:43 +0100
>> Subject: Re: [PERFORM] Query Performance SQL Server vs. Postgresql
>> To: humairm@hotmail.com
>> CC: pgsql-performance@postgresql.org
>>
>> 2010/11/21 Humair Mohammed <humairm@hotmail.com>:
>> >
>> > 1) OS/Configuration
>> > 64-bit Windows 7 Enterprise with 8G RAM and a Dual Core 2.67 Ghz Intel
>> > CPU
>> > postgresql-x64-9.0 (PostgreSQL 9.0.1, compiled by Visual C++ build 1500,
>> > 64-bit)
>> > work_mem  2GB
>> > shared_buffers = 2
>>
>> shared_buffers = 2 ???
>>
>> Regards
>>
>> Pavel Stehule
>>
>>
>> > 2) Dataset
>> > name,pages,tuples,pg_size_pretty
>> > "pivotbad";1870;93496;"15 MB"
>> > "pivotgood";5025;251212;"39 MB"
>> > 3) EXPLAIN (ANALYZE ON, BUFFERS ON)
>> > "Hash Join  (cost=16212.30..52586.43 rows=92869 width=17) (actual
>> > time=25814.222..32296.765 rows=3163 loops=1)"
>> > "  Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text
>> > =
>> > (pg.question)::text))"
>> > "  Join Filter: ((COALESCE(pb.response, 'MISSING'::character
>> > varying))::text
>> > <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
>> > "  Buffers: shared hit=384 read=6511, temp read=6444 written=6318"
>> > "  ->  Seq Scan on pivotbad pb  (cost=0.00..2804.96 rows=93496
>> > width=134)
>> > (actual time=0.069..37.143 rows=93496 loops=1)"
>> > "        Buffers: shared hit=192 read=1678"
>> > "  ->  Hash  (cost=7537.12..7537.12 rows=251212 width=134) (actual
>> > time=24621.752..24621.752 rows=251212 loops=1)"
>> > "        Buckets: 1024  Batches: 64  Memory Usage: 650kB"
>> > "        Buffers: shared hit=192 read=4833, temp written=4524"
>> > "        ->  Seq Scan on pivotgood pg  (cost=0.00..7537.12 rows=251212
>> > width=134) (actual time=0.038..117.780 rows=251212 loops=1)"
>> > "              Buffers: shared hit=192 read=4833"
>> > "Total runtime: 32297.305 ms"
>> > 4) INDEXES
>> > I can certainly add an index but given the table sizes I am not sure if
>> > that
>> > is a factor. This by no means is a large dataset less than 350,000 rows
>> > in
>> > total and 3 columns. Also this was just a quick dump of data for
>> > comparison
>> > purpose. When I saw the poor performance on the COALESCE, I pointed the
>> > data
>> > load to SQL Server and ran the same query except with the TSQL specific
>> > ISNULL function.
>> >
>

Re: Query Performance SQL Server vs. Postgresql

От
tv@fuzzy.cz
Дата:
> First, I modified the work_mem setting to 1GB (reloaded config) from the
> default 1MB and I see a response time of 33 seconds. Results below from
> EXPLAIN ANALYZE:

...

> Second, I modified the work_mem setting to 2GB (reloaded config) and I see
> a response time of 38 seconds. Results below from EXPLAIN ANALYZE:

...

How did you reload the config? Using 'kill -HUP pid'? That should work
fine. Have you cheched 'work_mem' after the reload?

Because the explain plans are exactly the same (structure, estimated
costs). The really interesting bit is this and it did not change at all

   Buckets: 1024 Batches: 64  Memory Usage: 650kB

As Tom Lane already mentioned, splitting hash join into batches (due to
small memory) adds overhead, the optimal number of batches is 1. But I
guess 1GB of work_mem is an overkill - something like 64MB should be fine.

The suspicious thing is the query plans have not changed at all
(especially the number of batches). I think you're not telling us
something important (unintentionally of course).

> By no means I am trying to compare the 2 products. When I noticed the slow
> behavior of COALESCE I tried it on SQL Server. And since they are running
> on the same machine my comment regarding apples to apples. It is possible
> that this is not an apples to apples comparison other than the fact that
> it is running on the same machine.

OK. The point of my post was that you've provided very little info about
the settings etc. so it was difficult to identify why PostgreSQL is so
slow.

Tomas


Re: Query Performance SQL Server vs. Postgresql

От
tv@fuzzy.cz
Дата:
>> 4) INDEXESI can certainly add an index but given the table sizes I am
>> not
>> sure if that is a factor. This by no means is a large dataset less than
>> 350,000 rows in total and 3 columns. Also this was just a quick dump of
>> data for comparison purpose. When I saw the poor performance on the
>> COALESCE, I pointed the data load to SQL Server and ran the same query
>> except with the TSQL specific ISNULL function.
>
> 350000 rows definitely is a lot of rows, although with 3 INT column it's
> just about 13MB of data (including overhead). But indexes can be quite
> handy when doing joins, as in this case.

OK, I've just realized the tables have 3 character columns, not integers.
In that case the tables are probably much bigger (and there are things
like TOAST). In that case indexes may be even more important.

Tomas


Re: Query Performance SQL Server vs. Postgresql

От
Tom Lane
Дата:
tv@fuzzy.cz writes:
>> Second, I modified the work_mem setting to 2GB (reloaded config) and I see
>> a response time of 38 seconds. Results below from EXPLAIN ANALYZE:

> How did you reload the config? Using 'kill -HUP pid'? That should work
> fine. Have you cheched 'work_mem' after the reload?

> Because the explain plans are exactly the same (structure, estimated
> costs). The really interesting bit is this and it did not change at all

>    Buckets: 1024 Batches: 64  Memory Usage: 650kB

If that didn't change, I'm prepared to bet that the OP didn't actually
manage to change the active value of work_mem.

            regards, tom lane

Re: Query Performance SQL Server vs. Postgresql

От
Robert Haas
Дата:
On Nov 21, 2010, at 12:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> tv@fuzzy.cz writes:
>>> Second, I modified the work_mem setting to 2GB (reloaded config) and I see
>>> a response time of 38 seconds. Results below from EXPLAIN ANALYZE:
>
>> How did you reload the config? Using 'kill -HUP pid'? That should work
>> fine. Have you cheched 'work_mem' after the reload?
>
>> Because the explain plans are exactly the same (structure, estimated
>> costs). The really interesting bit is this and it did not change at all
>
>>   Buckets: 1024 Batches: 64  Memory Usage: 650kB
>
> If that didn't change, I'm prepared to bet that the OP didn't actually
> manage to change the active value of work_mem.

Yep.  All this speculation about slow disks and/or COALESCE strikes me as likely totally off-base. I think the original
posterneeds to run "show work_mem" right before the EXPLAIN ANALYZE to make sure the new value they set actually stuck.
There'sno reason for the planner to have used only 650kB if work_mem is set to anything >=2MB. 

...Robert

Re: Query Performance SQL Server vs. Postgresql

От
Humair Mohammed
Дата:
That was a typo:

work_mem = 2GB
shared_buffers = 2GB

> From: pavel.stehule@gmail.com
> Date: Sun, 21 Nov 2010 12:38:43 +0100
> Subject: Re: [PERFORM] Query Performance SQL Server vs. Postgresql
> To: humairm@hotmail.com
> CC: pgsql-performance@postgresql.org
>
> 2010/11/21 Humair Mohammed <humairm@hotmail.com>:
> >
> > 1) OS/Configuration
> > 64-bit Windows 7 Enterprise with 8G RAM and a Dual Core 2.67 Ghz Intel CPU
> > postgresql-x64-9.0 (PostgreSQL 9.0.1, compiled by Visual C++ build 1500,
> > 64-bit)
> > work_mem  2GB
> > shared_buffers = 2
>
> shared_buffers = 2 ???
>
> Regards
>
> Pavel Stehule
>
>
> > 2) Dataset
> > name,pages,tuples,pg_size_pretty
> > "pivotbad";1870;93496;"15 MB"
> > "pivotgood";5025;251212;"39 MB"
> > 3) EXPLAIN (ANALYZE ON, BUFFERS ON)
> > "Hash Join  (cost=16212.30..52586.43 rows=92869 width=17) (actual
> > time=25814.222..32296.765 rows=3163 loops=1)"
> > "  Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text =
> > (pg.question)::text))"
> > "  Join Filter: ((COALESCE(pb.response, 'MISSING'::character varying))::text
> > <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
> > "  Buffers: shared hit=384 read=6511, temp read=6444 written=6318"
> > "  ->  Seq Scan on pivotbad pb  (cost=0.00..2804.96 rows=93496 width=134)
> > (actual time=0.069..37.143 rows=93496 loops=1)"
> > "        Buffers: shared hit=192 read=1678"
> > "  ->  Hash  (cost=7537.12..7537.12 rows=251212 width=134) (actual
> > time=24621.752..24621.752 rows=251212 loops=1)"
> > "        Buckets: 1024  Batches: 64  Memory Usage: 650kB"
> > "        Buffers: shared hit=192 read=4833, temp written=4524"
> > "        ->  Seq Scan on pivotgood pg  (cost=0.00..7537.12 rows=251212
> > width=134) (actual time=0.038..117.780 rows=251212 loops=1)"
> > "              Buffers: shared hit=192 read=4833"
> > "Total runtime: 32297.305 ms"
> > 4) INDEXES
> > I can certainly add an index but given the table sizes I am not sure if that
> > is a factor. This by no means is a large dataset less than 350,000 rows in
> > total and 3 columns. Also this was just a quick dump of data for comparison
> > purpose. When I saw the poor performance on the COALESCE, I pointed the data
> > load to SQL Server and ran the same query except with the TSQL specific
> > ISNULL function.
> >

Re: Query Performance SQL Server vs. Postgresql

От
Humair Mohammed
Дата:

Correct, the optimizer did not take the settings with the pg_ctl reload command. I did a pg_ctl restart and work_mem now displays the updated value. I had to bump up all the way to 2047 MB to get the response below (with work_mem at 1024 MB I see 7 seconds response time) and with 2047 MB (which is the max value that can be set for work_mem - anything more than that results in a FATAL error because of the limit) the results are below. The batches and memory usage are reflecting the right behavior with these settings. Thanks for everyones input, the result is now matching what SQL Server was producing.

"Hash Join  (cost=11305.30..39118.43 rows=92869 width=17) (actual time=145.888..326.216 rows=3163 loops=1)"
"  Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text = (pg.question)::text))"
"  Join Filter: ((COALESCE(pb.response, 'MISSING'::character varying))::text <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
"  Buffers: shared hit=6895"
"  ->  Seq Scan on pivotbad pb  (cost=0.00..2804.96 rows=93496 width=134) (actual time=0.011..11.903 rows=93496 loops=1)"
"        Buffers: shared hit=1870"
"  ->  Hash  (cost=7537.12..7537.12 rows=251212 width=134) (actual time=145.673..145.673 rows=251212 loops=1)"
"        Buckets: 32768  Batches: 1  Memory Usage: 39939kB"
"        Buffers: shared hit=5025"
"        ->  Seq Scan on pivotgood pg  (cost=0.00..7537.12 rows=251212 width=134) (actual time=0.004..26.242 rows=251212 loops=1)"
"              Buffers: shared hit=5025"
"Total runtime: 331.168 ms"

Humair


> CC: tv@fuzzy.cz; humairm@hotmail.com; pavel.stehule@gmail.com; pgsql-performance@postgresql.org
> From: robertmhaas@gmail.com
> Subject: Re: [PERFORM] Query Performance SQL Server vs. Postgresql
> Date: Sun, 21 Nov 2010 13:55:54 -0500
> To: tgl@sss.pgh.pa.us
>
> On Nov 21, 2010, at 12:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > tv@fuzzy.cz writes:
> >>> Second, I modified the work_mem setting to 2GB (reloaded config) and I see
> >>> a response time of 38 seconds. Results below from EXPLAIN ANALYZE:
> >
> >> How did you reload the config? Using 'kill -HUP pid'? That should work
> >> fine. Have you cheched 'work_mem' after the reload?
> >
> >> Because the explain plans are exactly the same (structure, estimated
> >> costs). The really interesting bit is this and it did not change at all
> >
> >> Buckets: 1024 Batches: 64 Memory Usage: 650kB
> >
> > If that didn't change, I'm prepared to bet that the OP didn't actually
> > manage to change the active value of work_mem.
>
> Yep. All this speculation about slow disks and/or COALESCE strikes me as likely totally off-base. I think the original poster needs to run "show work_mem" right before the EXPLAIN ANALYZE to make sure the new value they set actually stuck. There's no reason for the planner to have used only 650kB if work_mem is set to anything >=2MB.
>
> ...Robert

Re: Query Performance SQL Server vs. Postgresql

От
tv@fuzzy.cz
Дата:
>
>
> Correct, the optimizer did not take the settings with the pg_ctl reload
> command. I did a pg_ctl restart and work_mem now displays the updated
> value. I had to bump up all the way to 2047 MB to get the response below
> (with work_mem at 1024 MB I see 7 seconds response time) and with 2047 MB
> (which is the max value that can be set for work_mem - anything more than
> that results in a FATAL error because of the limit) the results are below.

Hm, can you post explain plan for the case work_mem=1024MB. I guess the
difference is due to caching. According to the explain analyze, there are
just cache hits, no reads.

Anyway the hash join uses only about 40MB of memory, so 1024MB should be
perfectly fine and the explain plan should be exactly the same as with
work_mem=2047MB. And the row estimates seem quite precise, so I don't
think there's some severe overestimation.

Tomas


Re: Query Performance SQL Server vs. Postgresql

От
Samuel Gendler
Дата:


On Sun, Nov 21, 2010 at 10:21 PM, Humair Mohammed <humairm@hotmail.com> wrote:

Correct, the optimizer did not take the settings with the pg_ctl reload command. I did a pg_ctl restart and work_mem now displays the updated value. I had to bump up all the way to 2047 MB to get the response below (with work_mem at 1024 MB I see 7 seconds response time) and with 2047 MB (which is the max value that can be set for work_mem - anything more than that results in a FATAL error because of the limit) the results are below. The batches and memory usage are reflecting the right behavior with these settings. Thanks for everyones input, the result is now matching what SQL Server was producing.


I believe you can set work_mem to a different value just for the duration of a single query, so you needn't have work_mem set so high if for every query on the system.  A single query may well use a multiple of work_mem, so you really probably don't want it that high all the time unless all of your queries are structured similarly.  Just set work_mem='2047MB'; query; reset all;

But you should wait until someone more knowledgable than I confirm what I just wrote.

Re: Query Performance SQL Server vs. Postgresql

От
tv@fuzzy.cz
Дата:
> I believe you can set work_mem to a different value just for the duration
> of
> a single query, so you needn't have work_mem set so high if for every
> query
> on the system.  A single query may well use a multiple of work_mem, so you
> really probably don't want it that high all the time unless all of your
> queries are structured similarly.  Just set work_mem='2047MB'; query;
> reset
> all;

Yes, executing "set work_mem='64MB'" right before the query should be just
fine. Setting work_mem to 2GB is an overkill most of the time (99.99999%).

Tomas


Re: Query Performance SQL Server vs. Postgresql

От
Humair Mohammed
Дата:
I did some further analysis and here are the results:

work_mem;response_time
1MB;62 seconds
2MB;2 seconds
4MB;700 milliseconds
8MB;550 milliseconds

In all cases shared_buffers were set to the default value of 32MB. As you can see the 1 to 2 MB jump on the work_mem does wonders. I probably don't need this to be any higher than 8 or 16 MB. Thanks to all for help!

Humair

> Date: Mon, 22 Nov 2010 12:00:15 +0100
> Subject: Re: [PERFORM] Query Performance SQL Server vs. Postgresql
> From: tv@fuzzy.cz
> To: humairm@hotmail.com
> CC: pgsql-performance@postgresql.org
>
> >
> >
> > Correct, the optimizer did not take the settings with the pg_ctl reload
> > command. I did a pg_ctl restart and work_mem now displays the updated
> > value. I had to bump up all the way to 2047 MB to get the response below
> > (with work_mem at 1024 MB I see 7 seconds response time) and with 2047 MB
> > (which is the max value that can be set for work_mem - anything more than
> > that results in a FATAL error because of the limit) the results are below.
>
> Hm, can you post explain plan for the case work_mem=1024MB. I guess the
> difference is due to caching. According to the explain analyze, there are
> just cache hits, no reads.
>
> Anyway the hash join uses only about 40MB of memory, so 1024MB should be
> perfectly fine and the explain plan should be exactly the same as with
> work_mem=2047MB. And the row estimates seem quite precise, so I don't
> think there's some severe overestimation.
>
> Tomas
>

Re: Query Performance SQL Server vs. Postgresql

От
Merlin Moncure
Дата:
On Mon, Nov 22, 2010 at 7:12 PM, Humair Mohammed <humairm@hotmail.com> wrote:
> I did some further analysis and here are the results:
> work_mem;response_time
> 1MB;62 seconds
> 2MB;2 seconds
> 4MB;700 milliseconds
> 8MB;550 milliseconds
> In all cases shared_buffers were set to the default value of 32MB. As you
> can see the 1 to 2 MB jump on the work_mem does wonders. I probably don't
> need this to be any higher than 8 or 16 MB. Thanks to all for help!
> Humair

work_mem directly affects how the query is planned, because certain
types of plans (hash joins ans large sorts) require memory. raising or
lowering shared_buffers OTOH is very subtle and is not something you
tune to improve the execution of a single query...

merlin