Discussion: Poor performance on simple queries compared to sql server express

Poor performance on simple queries compared to sql server express

From: "Adam Ma'ruf"
Date:
Hi,

I wasn't sure whether to mail the novice mailing list or this one.  Since this is performance related I'm posting it here, but I am definitely a novice at PostgreSQL - I'm converting from MSSQL just now.

I have a ~2.5 GB table with ~5M rows of data.  A query that groups by two fields and sums a floating-point field takes approximately 122 seconds.  The equivalent query takes ~8 seconds in my previous SQL Server Express installation.

I've tried to vary the parameters in postgresql.conf (sketch below):
I've tried varying shared_buffers from 512 MB to 4000 MB
and work_mem from 64 MB to 4000 MB (I thought this might be the answer, since the execution plan referenced below shows the sort using an external merge on disk).
I've also increased default_statistics_target to 10000 and ran a full vacuum analyze.
I realize there are no indexes on this table.  My main concern is why I can't get this to run as fast as in SQL Server Express (which also has no indexes, and where the same query takes about 8 seconds).
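
In case it helps, this is roughly what the relevant lines in postgresql.conf looked like while I was testing (illustrative values only, not a tuned configuration):

shared_buffers = 512MB              # tried values between 512MB and 4000MB
work_mem = 64MB                     # tried values between 64MB and 4000MB
default_statistics_target = 10000

followed by:

VACUUM FULL ANALYZE;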

My system:  Windows Professional 64-bit
8 GB of RAM
Intel i5-220M CPU @ 2.5 GHz

Here is the link to the execution plan:  http://explain.depesz.com/s/Ytx3

Thanks a lot in advance and do let me know if you require any more information to make an informed opinion,
A

Re: Poor performance on simple queries compared to sql server express

From: Pavel Stehule
Date:
Hello


It is a little bit strange - can you send info about your PostgreSQL version, the query, and the table description?

In this case PostgreSQL should use a hash aggregate, but for some strange reason it didn't.

The second strange issue is the speed of the external sort - it is slower than I would expect.

What I know - the usual advice on MS Windows is to keep shared_buffers small; 512MB can already be too much there.
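
For example (just a sketch, the value is illustrative), in postgresql.conf:

shared_buffers = 256MB    # needs a server restart to take effect

and to verify what is actually in effect:

SHOW shared_buffers;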

Regards

Pavel Stehule



Re: Poor performance on simple queries compared to sql server express

From: "Adam Ma'ruf"
Date:
Sure

I just upgraded to 9.2.4.  The query is:
SELECT        quebec_four
            , sierra
            , SUM(dollaramount) as dollaramount
  FROM alpha_quebec_echo
  GROUP BY   quebec_four
             , sierra

alpha_quebec_echo has 5,409,743 rows and 39 columns.  quebec_four and sierra are both varchar; dollaramount is a double precision (floating-point) field.  The table has no indexes (but neither did the MSSQL Express table).  Any other details you need?
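
If a full listing helps, the columns can be pulled with something like this (just a sketch):

SELECT column_name, data_type, is_nullable
  FROM information_schema.columns
 WHERE table_name = 'alpha_quebec_echo'
 ORDER BY ordinal_position;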

Thanks,
A




Re: Poor performance on simple queries compared to sql server express

From: "Tomas Vondra"
Date:
On 26 August 2013, 15:02, Adam Ma'ruf wrote:
> Sure
>
> I just upgraded to 9.2.4.  The query is:
> SELECT        quebec_four
>             , sierra
>             , SUM(dollaramount) as dollaramount
>   FROM alpha_quebec_echo
>   GROUP BY   quebec_four
>              , sierra
>
> alpha_quebec_echo has 5,409,743 rows and 39 columns.  Quebec_four and
> sierra are both varchar, dollar amount is a floating point field.  It has
> no indexes (but neither did the mssql express table).  Any other details
> you need?
>
> Thanks,
> A

Hi,

It's quite clear why the query is so slow - the plan is using an on-disk sort
with ~5M rows, and that's consuming a lot of time (almost 120 seconds).

I'm wondering why it chose the sort in the first place. I'd have expected it
to choose a hash aggregate, which does not require sorted input.

Can you try running "SET enable_sort = false" and then EXPLAIN the query?

If that does not change the plan to "HashAggregate" instead of
"GroupAggregate", please check and post the values of the enable_* and cost_*
variables.
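
For example (a sketch using the query from your earlier mail):

SET enable_sort = off;
EXPLAIN
SELECT quebec_four, sierra, SUM(dollaramount) AS dollaramount
  FROM alpha_quebec_echo
 GROUP BY quebec_four, sierra;

-- and if the plan still shows GroupAggregate, the relevant settings:
SELECT name, setting
  FROM pg_settings
 WHERE name LIKE 'enable%' OR name LIKE '%cost%';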

Another question is why it's doing the sort on disk and not in memory. The
explain you've posted shows it requires ~430MB on disk, and in my experience
an in-memory sort usually needs about 3x that much memory.

I see you've set work_mem = 4GB, is that correct? Can you try with a lower
value - say, 1 or 2GB? I'm not sure how this works on Windows, though.
Maybe there's some other limit (and SQL Server is not hitting it, because
it's a native Windows application).
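
A per-session override (just a sketch) would be:

SET work_mem = '2GB';   -- or '1GB'; only affects the current session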

Can you prepare a testcase (table structure + data) and post it somewhere?
Or at least the structure, if it's not possible to share the data.

Also, output from "select * from pg_settings" would be helpful.

Tomas




Re: Poor performance on simple queries compared to sql server express

From: "Adam Ma'ruf"
Date:
Hi

Thanks for the response.  I reran the query, but first ran the statement you provided and set work_mem to 2GB.  It ended up taking 133s and a GroupAggregate was still used.

Here are the values you asked for:
# - Planner Method Configuration -

#enable_bitmapscan = on
#enable_hashagg = on
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
#enable_seqscan = on
#enable_sort = on
#enable_tidscan = on

# - Planner Cost Constants -

#seq_page_cost = 1.0 # measured on an arbitrary scale
#random_page_cost = 4.0 # same scale as above
#cpu_tuple_cost = 0.01 # same scale as above
#cpu_index_tuple_cost = 0.005 # same scale as above
#cpu_operator_cost = 0.0025 # same scale as above
#effective_cache_size = 6000MB


The output of select * from pg_statistics is large...should I attach it as a separate file (not sure if that's allowed on these mailing lists)

The data is ~2.5 GB and I can't think of anywhere I could upload it.  I can provide the columns and data types; it's a subset of public data from usaspending.gov

column_name, datatype, ordinal position, nullable?
idx, integer, 1, YES
obligatedamount, double precision, 2, YES
baseandexercisedoptionsvalue, double precision, 3, YES
baseandalloptionsvalue, double precision, 4, YES
maj_fund_agency_cat, character varying, 5, YES
contractingofficeagencyid, character varying, 6, YES
contractingofficeid, character varying, 7, YES
fundingrequestingagencyid, character varying, 8, YES
fundingrequestingofficeid, character varying, 9, YES
signeddate, date, 10, YES
effectivedate, date, 11, YES
currentcompletiondate, date, 12, YES
ultimatecompletiondate, date, 13, YES
lastdatetoorder, character varying, 14, YES
typeofcontractpricing, character varying, 15, YES
multiyearcontract, character varying, 16, YES
vendorname, character varying, 17, YES
dunsnumber, character varying, 18, YES
parentdunsnumber, character varying, 19, YES
psc_cat, character varying, 20, YES
productorservicecode, character varying, 21, YES
principalnaicscode, character varying, 22, YES
piid, character varying, 23, YES
modnumber, character varying, 24, YES
fiscal_year, character varying, 25, YES
idvpiid, character varying, 26, YES
extentcompeted, character varying, 27, YES
numberofoffersreceived, double precision, 28, YES
competitiveprocedures, character varying, 29, YES
solicitationprocedures, character varying, 30, YES
evaluatedpreference, character varying, 31, YES
firm8aflag, character varying, 32, YES
sdbflag, character varying, 33, YES
issbacertifiedsmalldisadvantagedbusiness, character varying, 34, YES
womenownedflag, character varying, 35, YES
veteranownedflag, character varying, 36, YES
minorityownedbusinessflag, character varying, 37, YES
data_source, text, 38, YES
psc_cd, character varying, 39, YES
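
If it's easier to work from, the structure roughly corresponds to the following (a sketch reconstructed from the column list above, using the table name from the query; constraints and defaults omitted):

CREATE TABLE alpha_quebec_echo (
    idx                                       integer,
    obligatedamount                           double precision,
    baseandexercisedoptionsvalue              double precision,
    baseandalloptionsvalue                    double precision,
    maj_fund_agency_cat                       character varying,
    contractingofficeagencyid                 character varying,
    contractingofficeid                       character varying,
    fundingrequestingagencyid                 character varying,
    fundingrequestingofficeid                 character varying,
    signeddate                                date,
    effectivedate                             date,
    currentcompletiondate                     date,
    ultimatecompletiondate                    date,
    lastdatetoorder                           character varying,
    typeofcontractpricing                     character varying,
    multiyearcontract                         character varying,
    vendorname                                character varying,
    dunsnumber                                character varying,
    parentdunsnumber                          character varying,
    psc_cat                                   character varying,
    productorservicecode                      character varying,
    principalnaicscode                        character varying,
    piid                                      character varying,
    modnumber                                 character varying,
    fiscal_year                               character varying,
    idvpiid                                   character varying,
    extentcompeted                            character varying,
    numberofoffersreceived                    double precision,
    competitiveprocedures                     character varying,
    solicitationprocedures                    character varying,
    evaluatedpreference                       character varying,
    firm8aflag                                character varying,
    sdbflag                                   character varying,
    issbacertifiedsmalldisadvantagedbusiness  character varying,
    womenownedflag                            character varying,
    veteranownedflag                          character varying,
    minorityownedbusinessflag                 character varying,
    data_source                               text,
    psc_cd                                    character varying
);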







Re: Poor performance on simple queries compared to sql server express

From: Tomas Vondra
Date:
Hi,

On 27.8.2013 06:06, Adam Ma'ruf wrote:
> Hi
>
> Thanks for the response.  I reran the query but first ran the statement
> you provided and set working mem to 2gb.  It ended up taking 133s and
> group aggregate was still used

OK.

>
> Here are the values you asked for:
> # - Planner Method Configuration -
> # - Planner Cost Constants -

All of these are set to the defaults, so that seems fine to me.

>
> #seq_page_cost = 1.0 # measured on an arbitrary scale
> #random_page_cost = 4.0 # same scale as above
> #cpu_tuple_cost = 0.01 # same scale as above
> #cpu_index_tuple_cost = 0.005 # same scale as above
> #cpu_operator_cost = 0.0025 # same scale as above
> #effective_cache_size = 6000MB

Well, if effective_cache_size is commented out, then it's still 128MB
(default). But I don't think that matters here.
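
A quick way to check what is actually in effect (just a check, not a fix):

SHOW effective_cache_size;
-- it can also be overridden per session:
SET effective_cache_size = '6000MB';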

> The output of select * from pg_statistics is large...should I attach it
> as a separate file (not sure if that's allowed on these mailing lists)

I didn't ask for a pg_statistics dump. I asked for pg_settings (but I
already got most of the important pieces above).


> The data is ~2.5gb, I can't think of any place I can upload it.  I can

There are a zillion such places - e.g. Dropbox, Box, Wuala, Google
Drive, mega.co.nz, or one of the many other alternatives. All of them
give you ~5GB of space for free.

Or I could give you access to my FTP server, if that's what you prefer.

> provide the columns and data type.  it's a subset of public data from
> usaspending.gov <http://usaspending.gov>

Is there a simple way to download / filter the public data to get the
same dataset as you have?
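
And if the export comes as CSV, loading it into a test table should be
roughly as simple as this (just a sketch - the path is a placeholder):

COPY alpha_quebec_echo FROM '/path/to/usaspending_subset.csv'
     WITH (FORMAT csv, HEADER);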

Tomas