Обсуждение: Select performance vs. mssql

От:
mark durrant
Дата:

Hi,

I have some experience with MSSQL and am examining
PostgreSQL. I'm running under Windows. I like what I
see so far, but I'm hoping for some performance
advice:

1. My test database has 7 million records.
2. There are two columns - an integer and a char
column called Day which has a random value of Mon or
Tues, etc. in it.
3. I made an index on Day.

My query is:

select count(*) from mtable where day='Mon'

Results:

1. P3 600 512MB RAM MSSQL. It takes about 4-5 secs to
run. If I run a few queries and everything is cached,
it is sometimes  just 1 second.

2. Athlon 1.3 Ghz 1GB RAM. PostgreSQL takes 7 seconds.
I have played with the buffers setting and currently
have it at 7500. At 20000 it took over 20 seconds to
run.

5 seconds vs 7 isn't that big of a deal, but 1 second
vs 7 seconds is. Also, the slower performance is with
much lesser hardware.

Any ideas to try?

Thanks much,
Mark

__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around
http://mail.yahoo.com

От:
Christopher Kings-Lynne
Дата:

> select count(*) from mtable where day='Mon'
>
> Results:
>
> 1. P3 600 512MB RAM MSSQL. It takes about 4-5 secs to
> run. If I run a few queries and everything is cached,
> it is sometimes  just 1 second.
>
> 2. Athlon 1.3 Ghz 1GB RAM. PostgreSQL takes 7 seconds.
> I have played with the buffers setting and currently
> have it at 7500. At 20000 it took over 20 seconds to
> run.
>
> 5 seconds vs 7 isn't that big of a deal, but 1 second
> vs 7 seconds is. Also, the slower performance is with
> much lesser hardware.

Post the result of this for us:

explain analyze select count(*) from mtable where day='Mon';

On both machines.

Chris

От:
mark durrant
Дата:

> Post the result of this for us:
>
> explain analyze select count(*) from mtable where
> day='Mon';
>
> On both machines.

Hi Chris --

PostgreSQL Machine:
"Aggregate  (cost=140122.56..140122.56 rows=1 width=0)
(actual time=24516.000..24516.000 rows=1 loops=1)"
"  ->  Index Scan using "day" on mtable
(cost=0.00..140035.06 rows=35000 width=0) (actual
time=47.000..21841.000 rows=1166025 loops=1)"
"        Index Cond: ("day" = 'Mon'::bpchar)"
"Total runtime: 24516.000 ms"
(Note this took 24 seconds after fresh reboot, next
execution was 11, and execution without explain
analyze was 6.7 seconds)

MSSQL Machine:
That "Explain Analyze" command doesn't work for MSSQL,
but I did view the Query plan. 97% of it was "Scanning
a particular range of rows from a nonclustered index"

Thanks for your help --Mark

__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around
http://mail.yahoo.com

От:
Neil Conway
Дата:

mark durrant wrote:
> PostgreSQL Machine:
> "Aggregate  (cost=140122.56..140122.56 rows=1 width=0)
> (actual time=24516.000..24516.000 rows=1 loops=1)"
> "  ->  Index Scan using "day" on mtable
> (cost=0.00..140035.06 rows=35000 width=0) (actual
> time=47.000..21841.000 rows=1166025 loops=1)"
> "        Index Cond: ("day" = 'Mon'::bpchar)"
> "Total runtime: 24516.000 ms"

Have you run ANALYZE?

Clustering the table on the "day" index (via the CLUSTER command) would
be worth trying.

-Neil

От:
"Harald Lau (Sector-X)"
Дата:

Mark,

> MSSQL Machine:
> That "Explain Analyze" command doesn't work for MSSQL,

try this:
set showplan_all on
go
select ...
go

Harald

От:
mark durrant
Дата:

First, thanks for all the helpful replies. I've
listened to the suggestions and done some more digging
and have results:

I did show_plan_all in MSSQL and found that it was
doing an Index Scan. I've read someplace that if the
data you need is all in the index, then MSSQL has a
feature/hack where it does not have to go to the
table, it can do my COUNT using the index alone. I
think this explains the 1 second query performance.

I changed the query to also include the other column
which is not indexed. The results were MSSQL now used
a TableScan and was MUCH slower than PostgreSQL.

I clustered the index on MSSQL and PostgreSQL and
increased buffers to 15000 on PGSQL. I saw a
noticeable performance increase on both. On the more
complicated query, PostgreSQL is now 3.5 seconds.
MSSQL is faster again doing an index scan and is at 2
seconds. Remember the MSSQL machine has a slower CPU
as well.

My interpretations:

--Given having to do a table scan, PostgreSQL seems to
be faster. The hardware on my PostrgreSQL machine is
nicer than the MSSQL one, so perhaps they are just
about the same speed with speed determined by the
disk.

--Tuning helps. Clustered index cut my query time
down. More buffers helped.

--As Chris pointed out, how real-world is this test?
His point is valid. The database we're planning will
have a lot of rows and require a lot of summarization
(hence my attempt at a "test"), but we shouldn't be
pulling a million rows at a time.

--MSSQL's ability to hit the index only and not having
to go to the table itself results in a _big_
performance/efficiency gain. If someone who's in
development wants to pass this along, it would be a
nice addition to PostgreSQL sometime in the future.
I'd suspect that as well as making one query faster,
it would make everything else faster/more scalable as
the server load is so much less.

Thanks again,

Mark



__________________________________
Do you Yahoo!?
Yahoo! Small Business - Try our new Resources site
http://smallbusiness.yahoo.com/resources/

От:
Bruno Wolff III
Дата:

On Tue, May 24, 2005 at 08:36:36 -0700,
  mark durrant <> wrote:
>
> --MSSQL's ability to hit the index only and not having
> to go to the table itself results in a _big_
> performance/efficiency gain. If someone who's in
> development wants to pass this along, it would be a
> nice addition to PostgreSQL sometime in the future.
> I'd suspect that as well as making one query faster,
> it would make everything else faster/more scalable as
> the server load is so much less.

This gets brought up a lot. The problem is that the index doesn't include
information about whether the current transaction can see the referenced
row. Putting this information in the index will add significant overhead
to every update and the opinion of the developers is that this would be
a net loss overall.

От:
mark durrant
Дата:

I'm far from an expert, so this may be off-base... but
perhaps a suggestion would be to allow a hint to be
sent to the optimizer if the user doesn't care that
the result is "approximate" maybe then this wouldn't
require adding more overhead to the indexes.

MSSQL has something like this with (nolock)
i.e. select count(*) from customers (nolock) where
name like 'Mark%'

Regardless, I'm very impressed with PostgreSQL and I
think we're moving ahead with it.

Mark

--- Bruno Wolff III <> wrote:
> On Tue, May 24, 2005 at 08:36:36 -0700,
>   mark durrant <> wrote:
> >
> > --MSSQL's ability to hit the index only and not
> having
> > to go to the table itself results in a _big_
> > performance/efficiency gain. If someone who's in
> > development wants to pass this along, it would be
> a
> > nice addition to PostgreSQL sometime in the
> future.
> > I'd suspect that as well as making one query
> faster,
> > it would make everything else faster/more scalable
> as
> > the server load is so much less.
>
> This gets brought up a lot. The problem is that the
> index doesn't include
> information about whether the current transaction
> can see the referenced
> row. Putting this information in the index will add
> significant overhead
> to every update and the opinion of the developers is
> that this would be
> a net loss overall.



__________________________________
Do you Yahoo!?
Yahoo! Small Business - Try our new Resources site
http://smallbusiness.yahoo.com/resources/

От:
Alex Turner
Дата:

Until you start worrying about MVC - we have had problems with the MSSQL implementation of read consistency because of this 'feature'.

Alex Turner
NetEconomist

On 5/24/05, Bruno Wolff III <> wrote:
On Tue, May 24, 2005 at 08:36:36 -0700,
  mark durrant <> wrote:
>
> --MSSQL's ability to hit the index only and not having
> to go to the table itself results in a _big_
> performance/efficiency gain. If someone who's in
> development wants to pass this along, it would be a
> nice addition to PostgreSQL sometime in the future.
> I'd suspect that as well as making one query faster,
> it would make everything else faster/more scalable as
> the server load is so much less.

This gets brought up a lot. The problem is that the index doesn't include
information about whether the current transaction can see the referenced
row. Putting this information in the index will add significant overhead
to every update and the opinion of the developers is that this would be
a net loss overall.

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

От:
Josh Berkus
Дата:

Folks,

> > This gets brought up a lot. The problem is that the
> > index doesn't include
> > information about whether the current transaction
> > can see the referenced
> > row. Putting this information in the index will add
> > significant overhead
> > to every update and the opinion of the developers is
> > that this would be
> > a net loss overall.

Pretty much.  There has been discussion about allowing index-only access to
"frozen" tables, i.e. archive partitions.  But it all sort of hinges on
someone implementing it and testing ....

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

От:
PFC
Дата:

> Pretty much.  There has been discussion about allowing index-only access
> to
> "frozen" tables, i.e. archive partitions.  But it all sort of hinges on
> someone implementing it and testing ....

    Would be interesting as a parameter to set at index creation (ie. if you
know this table will have a lot of reads and few writes)... like create an
index on columns X,Y keeping data on columns X,Y and Z...
    But in this case do you still need the table ?
    Or even create a table type where the table and the index are one, like
an auto-clustered table...
    I don't know if it would be used that often, though ;)


От:
Michael Stone
Дата:

On Tue, May 24, 2005 at 04:35:14PM -0700, Josh Berkus wrote:
>Pretty much.  There has been discussion about allowing index-only access to
>"frozen" tables, i.e. archive partitions.  But it all sort of hinges on
>someone implementing it and testing ....

Is there any way to expose the planner estimate? For some purposes it's
enough to just give a rough ballpark (e.g., a google-esque "results 1-10
of approximately 10000000") so a user knows whether its worth even
starting to page through.

Mike Stone

От:
John A Meinel
Дата:

Michael Stone wrote:

> On Tue, May 24, 2005 at 04:35:14PM -0700, Josh Berkus wrote:
>
>> Pretty much.  There has been discussion about allowing index-only
>> access to "frozen" tables, i.e. archive partitions.  But it all sort
>> of hinges on someone implementing it and testing ....
>
>
> Is there any way to expose the planner estimate? For some purposes it's
> enough to just give a rough ballpark (e.g., a google-esque "results 1-10
> of approximately 10000000") so a user knows whether its worth even
> starting to page through.
>
> Mike Stone
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster

Well, you could always do:

EXPLAIN SELECT ...

And then parse out the rows= in the first line.

John
=:->


От:
Christopher Kings-Lynne
Дата:

> --As Chris pointed out, how real-world is this test?
> His point is valid. The database we're planning will
> have a lot of rows and require a lot of summarization
> (hence my attempt at a "test"), but we shouldn't be
> pulling a million rows at a time.

If you want to do lots of aggregate analysis, I suggest you create a
sepearate summary table, and create triggers on the main table to
maintain your summaries in the other table...

> --MSSQL's ability to hit the index only and not having
> to go to the table itself results in a _big_
> performance/efficiency gain. If someone who's in
> development wants to pass this along, it would be a
> nice addition to PostgreSQL sometime in the future.
> I'd suspect that as well as making one query faster,
> it would make everything else faster/more scalable as
> the server load is so much less.

This is well-known and many databases do it.  However, due to MVCC
considerations in PostgreSQL, it's not feasible for us to implement it...

Chris

От:
"Jim C. Nasby"
Дата:

On Wed, May 25, 2005 at 09:29:36AM +0800, Christopher Kings-Lynne wrote:
> >--MSSQL's ability to hit the index only and not having
> >to go to the table itself results in a _big_
> >performance/efficiency gain. If someone who's in
> >development wants to pass this along, it would be a
> >nice addition to PostgreSQL sometime in the future.
> >I'd suspect that as well as making one query faster,
> >it would make everything else faster/more scalable as
> >the server load is so much less.
>
> This is well-known and many databases do it.  However, due to MVCC
> considerations in PostgreSQL, it's not feasible for us to implement it...

Wasn't there a plan to store some visibility info in indexes? IIRC the
idea was that a bit would be set in the index tuple indicating that all
transactions that wouldn't be able to see that index value were
complete, meaning that there was no reason to hit the heap for that
tuple.

I looked on the TODO but didn't see this, maybe it fell through the
cracks?
--
Jim C. Nasby, Database Consultant               
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

От:
Enrico Weigelt
Дата:

* Bruno Wolff III <> wrote:

<snip>

> This gets brought up a lot. The problem is that the index doesn't include
> information about whether the current transaction can see the referenced
> row. Putting this information in the index will add significant overhead
> to every update and the opinion of the developers is that this would be
> a net loss overall.

wouldn't it work well to make this feature optionally for each
index ? There could be some flag on the index (ie set at create
time) which tells postgres whether to store mvcc information.


cu
--
---------------------------------------------------------------------
 Enrico Weigelt    ==   metux IT service
  phone:     +49 36207 519931         www:       http://www.metux.de/
  fax:       +49 36207 519932         email:     
---------------------------------------------------------------------
  Realtime Forex/Stock Exchange trading powered by postgresSQL :))
                                            http://www.fxignal.net/
---------------------------------------------------------------------

От:
Jochem van Dieten
Дата:

Enrico Weigelt wrote:
> Bruno Wolff III wrote:
>>
>> This gets brought up a lot. The problem is that the index doesn't include
>> information about whether the current transaction can see the referenced
>> row. Putting this information in the index will add significant overhead
>> to every update and the opinion of the developers is that this would be
>> a net loss overall.
>
> wouldn't it work well to make this feature optionally for each
> index ? There could be some flag on the index (ie set at create
> time) which tells postgres whether to store mvcc information.

There is no reason to assume it can't work.

There is little reason to assume that it will be the best
solution in many circumstances.

There is a big reason why people are sceptical: there is no patch.


The issue has been debated and beaten to death. People have
formed their opinions and are unlikely to change their position.
If you want to convince people, your best bet is to submit a
patch and have OSDL measure the performance improvement.

Jochem