Thread: huge difference in performance between MS SQL and pg 8.3 on UPDATE with full text search

huge difference in performance between MS SQL and pg 8.3 on UPDATE with full text search

From
Ivan Sergio Borgonovo
Date:
I know I'm comparing apples and oranges, but still the difference in
performance was quite astonishing.

I have 2 tables that look like:

create table products(
  id bigint,
  price double precision, /* legacy, don't ask */
  sometextfield1 varchar(128),
  sometextfield2 varchar(128),
  ...
);

one on MS SQL 2005 and the other on pg 8.3.

MS SQL has full text search on the text fields (I don't know the
details).

The pg products table has a tsvector field with a gin index defined on
it, plus a trigger that updates the tsvector field when the text fields
change. The trigger is written so that it only recomputes the tsvector
when the text fields have actually changed.
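
For reference, a minimal sketch of such a trigger (the tsv column name,
the text field names and the english text search configuration are
assumptions, not the actual schema):

create or replace function products_tsv_update() returns trigger as $$
begin
  -- skip the (expensive) tsvector recomputation when an UPDATE
  -- did not touch the text fields
  if tg_op = 'UPDATE' then
    if new.sometextfield1 is not distinct from old.sometextfield1
       and new.sometextfield2 is not distinct from old.sometextfield2 then
      return new;
    end if;
  end if;
  new.tsv := to_tsvector('pg_catalog.english',
               coalesce(new.sometextfield1, '') || ' ' ||
               coalesce(new.sometextfield2, ''));
  return new;
end
$$ language plpgsql;

create trigger products_tsv_trigger
  before insert or update on products
  for each row execute procedure products_tsv_update();

create index products_tsv_idx on products using gin(tsv);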

The hardware on the 2 machines is a bit different.
MS SQL 2005 runs on a box that is 2 years old, 2 SATA drives in
hardware RAID 1, 2 Xeon dual core (I can't check details right now).
PG runs on a box that is more than 5 years old, 3 SCSI drives in
hardware RAID 5, 2 Xeon single core HT (Intel(R) Xeon(TM) CPU 3.20GHz,
cpu fam 14, model 4).

Both have 4GB of RAM.
shared_buffers is 240MB.
Both share a similar workload.
Both boxes were in the same "price class" when they were bought.

In both tables I have to update the price.
VACUUM FULL was performed just before updating the prices.

MS SQL receives a large SQL file that contains all the UPDATE
statements.
PG receives a CSV file that is loaded into a table with COPY, and then
the update is done as
update products set price = p.price from temp_price p
  where products.id = p.id and products.price <> p.price;
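
For completeness, the whole pg-side import looks roughly like this (a
minimal sketch; the temp_price layout and the CSV path are assumptions):

create temp table temp_price (id bigint primary key, price double precision);

copy temp_price from '/tmp/prices.csv' with csv;

analyze temp_price;  -- give the planner statistics on the freshly loaded table

update products set price = p.price
  from temp_price p
 where products.id = p.id
   and products.price <> p.price;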

MS SQL ingurgitates the whole SQL file in around 10 sec.
pg takes more than 5 min just to run the single update statement.

I'd like to know whether such a large difference can be justified just
by the HW difference, by a difference in how the data are loaded [1],
by a difference in how the 2 servers perform on this kind of workload,
or by some postgres config, before I decide how to spend my time
redesigning the import procedure.

If HW can justify such a huge difference I'll devote my time to other
problems.

I'd say that a potential culprit could be the gin index. No matter
whether the tsvector is updated or not, if the row is changed I think
the index is going to be updated anyway.
Somehow MS SQL circumvents this problem, possibly by building the
equivalent of a tsvector column in a "hidden" table that is
automatically joined to the "text" table.
This adds a join but reduces the cost of table modification, since
simpler (btree) indexes are faster to update.

Still, huge updates are rare and that table is mostly read and very
rarely written. During unusually huge updates I may consider dropping
the gin index.
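
That would look roughly like this (a sketch; products_tsv_idx and the
tsv column are hypothetical names):

drop index products_tsv_idx;

-- ... run the big UPDATE here ...

create index products_tsv_idx on products using gin(tsv);
analyze products;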

[1] I'd expect that, excluding the time it takes to load the CSV, a
single UPDATE statement should run faster than a huge list of
single-row UPDATE statements.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


Re: huge difference in performance between MS SQL and pg 8.3 on UPDATE with full text search

From
Allan Kamau
Date:
On Tue, Sep 28, 2010 at 1:49 PM, Ivan Sergio Borgonovo
<mail@webthatworks.it> wrote:
> [...]


Restarting the computer on which PG is running may help. I have access
to a server running PG 8.4 on Ubuntu and I have noticed that after a
day of intense use PG slows down significantly; "free -g" reports
almost no free memory available (something seems to leak memory on
this Ubuntu box). But when I restart the OS (Ubuntu), PG executes
my queries in good time. I don't seem to have similar problems on the
other servers running Fedora 12 and 13. But it could be my
configuration(s) on the Ubuntu box at fault; I am still investigating.

Allan.

Re: huge difference in performance between MS SQL and pg 8.3 on UPDATE with full text search

From
Alban Hertroys
Date:
On 28 Sep 2010, at 12:49, Ivan Sergio Borgonovo wrote:
> The hardware on the 2 machines is a bit different.
> MS SQL 2005 runs on a box that is 2 years old, 2 SATA drives on RAID
> 1 hw, 2 Xeon dual core (I can't check details right now)
> PG runs on a box that has more than 5 years, 3 SCSI drives on RAID 5
> hw, 2 Xeon single core HT (Intel(R) Xeon(TM) CPU 3.20GHz, cpu fam
> 14, model 4)

RAID-5 isn't ideal for databases; the RAID-1 in the other box is probably faster, especially since it's on more modern
hardware.

> In both tables I've to update price.
> VACUUM FULL was performed just before updating the prices.

VACUUM FULL? Was that really necessary? You did REINDEX after that, didn't you? If not, your indexes became bloated.
If the table wasn't empty before, you probably meant to do a VACUUM ANALYSE; but if it was, just ANALYSE would have
been sufficient.
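
In other words, something along these lines after the VACUUM FULL (a
sketch; the table name is taken from the original post):

reindex table products;  -- rebuild the bloated indexes
analyze products;        -- refresh the planner statistics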

> MS SQL receives a large sql file that contain all the UPDATE
> statements.
> PG receive a csv file that is loaded into a table with COPY and then
> does the update as
> update products set price=p.price from temp_price where id=p.id and
> price<>p.price;

Did you ANALYSE between loading the data and updating?

Also, executing the trigger on each copied line is likely to be a little slow, due to the overhead of calling a stored
procedure (especially if it's plpgsql). It's probably quite a bit faster to disable the trigger and create the gin index
after loading the data.
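
A sketch of that approach (products_tsv_trigger and products_tsv_idx are
hypothetical names):

alter table products disable trigger products_tsv_trigger;
drop index products_tsv_idx;

-- ... bulk load / update here ...

alter table products enable trigger products_tsv_trigger;
create index products_tsv_idx on products using gin(tsv);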

> MS SQL ingurgitate the whole sql file in around 10sec.
> pg takes more than 5 min to just run the single update statement.

An EXPLAIN ANALYSE of that statement would tell what it's doing and what's taking so long. A simple EXPLAIN would
probably be sufficient to see what query plan it thinks it needs though.
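
For example (the statement is the one from the original post):

explain
update products set price = p.price
  from temp_price p
 where products.id = p.id
   and products.price <> p.price;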

Since an update means PG has to insert and delete records, the fact that the database is on RAID-5 is probably a factor
here. How much? No idea.

> I'd like to know if such a large difference can be justified just by
> HW difference or by a difference in the process on how data are
> loaded [1] or by the difference in performance of the 2 servers on
> this kind of workload or by some postgres config before I decide how
> to manage my time to redesign the import procedure.

Did you tune that database? Several options (work_mem for example) could significantly improve your performance if you
can set them higher (or reduce them if you set them too high). You can do that per session too.
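
For instance, just for the import session (the values are illustrative
assumptions, not recommendations):

set work_mem = '64MB';              -- more room for the join in the big UPDATE
set maintenance_work_mem = '256MB'; -- helps if the gin index is rebuilt afterwards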

> If HW can justify such huge difference I'll devote my time to other
> problems.

Partially, yes, but not that much I think.

> I'd say that a potential culprit could be the gin index. No matter
> if the tsvector is updated or not, if the row is changed I think the
> index is going to be updated anyway.

gin indexes require relatively much RAM. If you didn't assign much in your settings then it's quite possible that the
database can't keep the index in memory or that things have to spill to disk. Leave enough room for the OS's disk cache
though, Postgres benefits from that as well.

Is there any particular reason you went with a gin index and not a gist one? Gin can be faster but consumes (much)
more memory; gist is also quite good with text searches and doesn't require quite as much memory.
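
Switching would just mean building the index with gist instead (a
sketch, using the same hypothetical names as above):

drop index products_tsv_idx;
create index products_tsv_idx on products using gist(tsv);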

> [1] I'd expect that excluding the time it takes to load the csv a
> single update should run faster than a huge list of single statement
> update


Correct.

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.





Re: huge difference in performance between MS SQL and pg 8.3 on UPDATE with full text search

From
Sam Mason
Date:
On Tue, Sep 28, 2010 at 02:35:09PM +0300, Allan Kamau wrote:
> I have access
> to a server running PG 8.4 on Ubuntu and I have noticed that after a
> day of intense use the PG slows down significantly, "free -g" reports
> almost no free memory available (something seems to leak memory on
> this Ubuntu box).

I'm not sure which values you're looking at in "free", but you generally
want there to be very little free memory--you want the memory to be used
for caching the disk.  As long as the cached data isn't "dirty" (i.e.
unwritten data) then it can be released very quickly and made available
for whatever is needed, or better serve as a useful cache.

Rebooting normally just hides other issues.

--
  Sam  http://samason.me.uk/

Re: huge difference in performance between MS SQL and pg 8.3 on UPDATE with full text search

From
Ivan Sergio Borgonovo
Date:
On Tue, 28 Sep 2010 20:19:10 +0200
Alban Hertroys <dalroi@solfertje.student.utwente.nl> wrote:

> On 28 Sep 2010, at 12:49, Ivan Sergio Borgonovo wrote:
> > The hardware on the 2 machines is a bit different.
> > MS SQL 2005 runs on a box that is 2 years old, 2 SATA drives on
> > RAID 1 hw, 2 Xeon dual core (I can't check details right now)
> > PG runs on a box that has more than 5 years, 3 SCSI drives on
> > RAID 5 hw, 2 Xeon single core HT (Intel(R) Xeon(TM) CPU 3.20GHz,
> > cpu fam 14, model 4)

> > In both tables I've to update price.
> > VACUUM FULL was performed just before updating the prices.

> VACUUM FULL? Was that really necessary? You did REINDEX after
> that, didn't you? If not, your indexes became bloated. If the
> table wasn't empty before you probably meant to do a VACUUM
> ANALYSE, but if it was, just ANALYSE would have been sufficient.

...

> Did you ANALYSE between loading the data and updating?

I thought VACUUM FULL was more "magical" and implied a REINDEX.
Am I wrong?
The index that would need reindexing is the one on the pk, a simple
btree, which isn't going to be as slow to rebuild as a gin... still,
I'd really thought that VACUUM FULL implied a lot of things
(including ANALYZE as well).

> Also, executing the trigger on each copied line is likely to be a
> little slow, due to the overhead of calling a stored procedure
> (especially if it's plpgsql). It's probably quite a bit faster to
> disable the trigger and create the gin-index after loading the
> data.

I'll try to drop the trigger. I'm not expecting it to be the biggest
factor, but adding something here and something there may add up to
the huge difference between the 2.
Anyway, MS SQL seems to overcome all these nuisances auto-magically.

> An EXPLAIN ANALYSE of that statement would tell what it's doing
> and what's taking so long. A simple EXPLAIN would probably be
> sufficient to see what query plan it thinks it needs though.

I'll post the EXPLAIN. Before I run EXPLAIN ANALYZE I have to take
some precautions so the DB doesn't explode.
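
Since EXPLAIN ANALYZE of an UPDATE actually executes it, one such
precaution is running it inside a transaction and rolling back (a
sketch, using the statement from the first post):

begin;
explain analyze
update products set price = p.price
  from temp_price p
 where products.id = p.id
   and products.price <> p.price;
rollback;  -- the update is not committed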

> Did you tune that database? Several options (work_mem for example)
> could significantly improve your performance if you can set them
> higher (or reduce it if you set them too high). You can do that
> per session too.

pg is generally faster than the other MS SQL box on what's normally
done on a daily basis. Just large updates to the products table seem
to be a pain. Other INSERT/UPDATE operations are seldom performed;
they involve smaller tables with no gin index.

> > If HW can justify such huge difference I'll devote my time to
> > other problems.

> Partially, yes, but not that much I think.

That's my worry... but still, in many circumstances pg performs
better than the MS SQL box... yeah... on a pretty different
workload... but while on other workloads pg is a bit faster (20%
to 100% faster) even though it is on an older box, on this one it is
very slow.

> > I'd say that a potential culprit could be the gin index. No
> > matter if the tsvector is updated or not, if the row is changed
> > I think the index is going to be updated anyway.

> gin indexes require relatively much RAM. If you didn't assign much
> in your settings then it's quite possible that the database can't
> keep the index in memory or that things have to spill to disk.
> Leave enough room for the OS's disk cache though, Postgres
> benefits from that as well.

> Is there any particular reason you went with a gin index and not a
> gist one? Gin can be faster, but consumes (much) more memory, but
> gist is also quite good with text searches and doesn't require
> quite as much memory.

The gin index is doing a very good job, and full text searches are
the typical workload of that box and the one where speed matters
most.

I'd say that if gin was occupying so much memory, performance
wouldn't be so good on a daily basis.

I'll post an excerpt of my postgresql.conf (what's important here?)
to see if anything can be improved for *this* workload, and
temporarily tune the DB for this exceptional update. Still, I'll
remain a bit surprised by such a huge difference even if it turns out
that it was actually a .conf not suited for this workload.
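
The settings that would typically matter for this kind of bulk update
are roughly these (the values below are illustrative assumptions, not
the actual configuration):

shared_buffers = 240MB          # as mentioned above
work_mem = 16MB                 # per sort/hash memory, can be raised per session
maintenance_work_mem = 256MB    # used by CREATE INDEX and VACUUM
checkpoint_segments = 16        # fewer checkpoints during the bulk update
effective_cache_size = 3GB      # how much OS cache the planner may assume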

BTW the box is running Apache and php.

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it