Обсуждение: SourceForge & Postgres

Поиск
Список
Период
Сортировка

SourceForge & Postgres

От
Tim Perdue
Дата:
I thought the hackers team would be interested in knowing that SourceForge, as
of Friday evening, is running on Postgres. Some 95,000 users and 12,500 Open
Source projects are depending on your stuff, so I hope it's going to be stable
for us. ;-)

Throughout the codebase we're making good use of transactions, subselects, and
foreign keys in all the places I've been wanting them for the past year, but
I'm running into some places where the query optimizer is not using the right
indexes, and sometimes does sequential scans on tables.

Here's a good example. If I remove the ORDER BY (which I didn't care to have),
postgres resorts to a sequential scan of the table, instead of using one of
3 or 4 appropriate indexes. I have an index on group_id, one on
(group_id,status_id) and one on (group_id,status_id,assigned_to)

SELECT
bug.group_id,bug.priority,bug.bug_id,bug.summary,bug.date,users.user_name AS
submitted_by,user2.user_name AS assigned_to_user
FROM bug,users,users user2
WHERE group_id='1'
AND bug.status_id <> '3'
AND users.user_id=bug.submitted_by
AND user2.user_id=bug.assigned_to
--
ORDER BY bug.group_id,bug.status_id
--
LIMIT 51 OFFSET 0;

Tim

--
Founder - PHPBuilder.com / Geocrawler.com
Lead Developer - SourceForge
VA Linux Systems

Re: SourceForge & Postgres

От
mlw
Дата:
Tim Perdue wrote:
> 
> I thought the hackers team would be interested in knowing that SourceForge, as
> of Friday evening, is running on Postgres. Some 95,000 users and 12,500 Open
> Source projects are depending on your stuff, so I hope it's going to be stable
> for us. ;-)
> 
> Throughout the codebase we're making good use of transactions, subselects, and
> foreign keys in all the places I've been wanting them for the past year, but
> I'm running into some places where the query optimizer is not using the right
> indexes, and sometimes does sequential scans on tables.
> 
> Here's a good example. If I remove the ORDER BY (which I didn't care to have),
> postgres resorts to a sequential scan of the table, instead of using one of
> 3 or 4 appropriate indexes. I have an index on group_id, one on
> (group_id,status_id) and one on (group_id,status_id,assigned_to)
> 
> SELECT
> bug.group_id,bug.priority,bug.bug_id,bug.summary,bug.date,users.user_name AS
> submitted_by,user2.user_name AS assigned_to_user
> FROM bug,users,users user2
> WHERE group_id='1'
> AND bug.status_id <> '3'
> AND users.user_id=bug.submitted_by
> AND user2.user_id=bug.assigned_to
> --
> ORDER BY bug.group_id,bug.status_id
> --
> LIMIT 51 OFFSET 0;

This is one of my long standing problems with Postgres, and I have
probably pissed of most of the Postgres guys with my views, but.....

Postgres is stubborn about index selection. I have a FAQ on my website.

http://www.mohawksoft.com/postgres/pgindex.html

In short, run vacuum analyze. If that doesn't fix it, it is because the
data being indexed has a lot of key fields that are probably duplicated.
Given a large table with a statistically significant number of records
assigned to a relatively few unique keys, Postgres will likely calculate
that doing a table scan is the best path.

I almost always start postmaster with the "-o -fs" switches because of
this problem.


Re: SourceForge & Postgres

От
Bruce Momjian
Дата:
-- Start of PGP signed section.
> I thought the hackers team would be interested in knowing that SourceForge, as
> of Friday evening, is running on Postgres. Some 95,000 users and 12,500 Open
> Source projects are depending on your stuff, so I hope it's going to be stable
> for us. ;-)

This is great news.  As far as the optimizer, any chance of testing 7.1
to see if it is improved.  I believe it has been over 7.0.3.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: SourceForge & Postgres

От
mlw
Дата:
Bruce Momjian wrote:
> This is great news.  As far as the optimizer, any chance of testing 7.1
> to see if it is improved.  I believe it has been over 7.0.3.

I just did a test of my database that exhibits this behavior, using 7.1
from CVS.

When postmaster is started with "-o -fs" I get this:

cdinfo=# explain select * from ztitles where artistid = 0 ;
NOTICE:  QUERY PLAN:

Index Scan using ztitles_artistid_ndx on ztitles  (cost=0.00..5915.01
rows=3163 width=296)

EXPLAIN

When postmaster is started without "-o -fs" I get this:

cdinfo=# explain select * from ztitles where artistid = 0 ;
NOTICE:  QUERY PLAN:
Seq Scan on ztitles  (cost=0.00..4740.75 rows=3163 width=296)
EXPLAIN 

-- 
http://www.mohawksoft.com


Re: SourceForge & Postgres

От
Tom Lane
Дата:
Tim Perdue <tim@perdue.net> writes:
> I thought the hackers team would be interested in knowing that SourceForge,
>  as of Friday evening, is running on Postgres.

Cool!

> I'm running into some places where the query optimizer is not using the right
> indexes, and sometimes does sequential scans on tables.

I assume you've done a VACUUM ANALYZE at least once since loading up
your data?

It'd be useful to see the results of an EXPLAIN for the problem query,
both with and without SET ENABLE_SEQSCAN TO OFF.  Also, it'd be helpful
to see VACUUM's stats for the relevant tables.  You can get those for
a table named 'FOO' with

select attname,attdisbursion,s.*
from pg_statistic s, pg_attribute a, pg_class c
where starelid = c.oid and attrelid = c.oid and staattnum = attnum
and relname = 'FOO';

        regards, tom lane


Re: SourceForge & Postgres

От
Tom Lane
Дата:
mlw <markw@mohawksoft.com> writes:
> cdinfo=# explain select * from ztitles where artistid = 0 ;
> NOTICE:  QUERY PLAN:

> Index Scan using ztitles_artistid_ndx on ztitles  (cost=0.00..5915.01
> rows=3163 width=296)

> When postmaster is started without "-o -fs" I get this:

> cdinfo=# explain select * from ztitles where artistid = 0 ;
> NOTICE:  QUERY PLAN:
> Seq Scan on ztitles  (cost=0.00..4740.75 rows=3163 width=296)

How many tuples are in the table?  How many are actually returned
by this query?  Also, what do you get from

select attname,attdisbursion,s.*
from pg_statistic s, pg_attribute a, pg_class c
where starelid = c.oid and attrelid = c.oid and staattnum = attnum
and relname = 'ztitles';
        regards, tom lane


Re: SourceForge & Postgres

От
The Hermit Hacker
Дата:
one thing I've found to get around this is for any query that doesn't
appear to use the index properly, just do:

SET ENABLE_SEQSCAN=OFF;
<query>
SET ENABLE_SEQSCAN=ON;

that way for those queries that do work right, ou haven't forced it a
different route ..

On Mon, 11 Dec 2000, mlw wrote:

> Bruce Momjian wrote:
> > This is great news.  As far as the optimizer, any chance of testing 7.1
> > to see if it is improved.  I believe it has been over 7.0.3.
> 
> I just did a test of my database that exhibits this behavior, using 7.1
> from CVS.
> 
> When postmaster is started with "-o -fs" I get this:
> 
> cdinfo=# explain select * from ztitles where artistid = 0 ;
> NOTICE:  QUERY PLAN:
> 
> Index Scan using ztitles_artistid_ndx on ztitles  (cost=0.00..5915.01
> rows=3163 width=296)
> 
> EXPLAIN
> 
> When postmaster is started without "-o -fs" I get this:
> 
> cdinfo=# explain select * from ztitles where artistid = 0 ;
> NOTICE:  QUERY PLAN:
>  
> Seq Scan on ztitles  (cost=0.00..4740.75 rows=3163 width=296)
>  
> EXPLAIN 
> 
> -- 
> http://www.mohawksoft.com
> 

Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org 



Re: SourceForge & Postgres

От
Tim Perdue
Дата:
On Tue, Dec 12, 2000 at 12:20:00AM -0400, The Hermit Hacker wrote:
>
> one thing I've found to get around this is for any query that doesn't
> appear to use the index properly, just do:
>
> SET ENABLE_SEQSCAN=OFF;
> <query>
> SET ENABLE_SEQSCAN=ON;
>
> that way for those queries that do work right, ou haven't forced it a
> different route ..

I've heard there are other ways to give clues to the
optimizer, but haven't seen anything in the docs on it. Anyway, I have gotten
virtually all of the queries optimized as much as possible. Some of the
queries are written in such a way that they key off of things in 2 or more
tables, so that's kinda hard to optimize in any circumstance.

Any plans to optimize:

-Views
-IN (1,2,3)
-SELECT count(*) FROM x WHERE indexed_field='z'

Tim

--
Founder - PHPBuilder.com / Geocrawler.com
Lead Developer - SourceForge
VA Linux Systems

Re: SourceForge & Postgres

От
mlw
Дата:
Tom Lane wrote:
>
> mlw <markw@mohawksoft.com> writes:
> > cdinfo=# explain select * from ztitles where artistid = 0 ;
> > NOTICE:  QUERY PLAN:
>
> > Index Scan using ztitles_artistid_ndx on ztitles  (cost=0.00..5915.01
> > rows=3163 width=296)
>
> > When postmaster is started without "-o -fs" I get this:
>
> > cdinfo=# explain select * from ztitles where artistid = 0 ;
> > NOTICE:  QUERY PLAN:
>
> > Seq Scan on ztitles  (cost=0.00..4740.75 rows=3163 width=296)
>
> How many tuples are in the table?  How many are actually returned
> by this query?  Also, what do you get from
>
> select attname,attdisbursion,s.*
> from pg_statistic s, pg_attribute a, pg_class c
> where starelid = c.oid and attrelid = c.oid and staattnum = attnum
> and relname = 'ztitles';

I have attached the output.

btw anyone trying this query should use: "attdispersion"

The explain I gave, there are no records that actually have an artistid
of 0. However, I will show the explain with a valid artistid number.

This is without "-o -fs"
cdinfo=# explain select * from ztitles where artistid = 100000220 ;
NOTICE:  QUERY PLAN:

Seq Scan on ztitles  (cost=0.00..4740.75 rows=3163 width=296)

EXPLAIN

And this is with "-o -fs"

cdinfo=# explain select * from ztitles where artistid = 100000220 ;
NOTICE:  QUERY PLAN:

Index Scan using ztitles_artistid_ndx on ztitles  (cost=0.00..5915.01
rows=3163 width=296)

EXPLAIN


select count(*) from ztitles where artistid = 100000220 ;
 count
-------
    16
(1 row)

--
http://www.mohawksoft.com  attname   | attdispersion | starelid | staattnum | staop | stanullfrac | stacommonfrac |
stacommonval      |         staloval         |     stahival       

------------+---------------+----------+-----------+-------+-------------+---------------+------------------------+--------------------------+-------------------
 muzenbr    |            -1 |    19274 |         1 |    97 |           0 |   4.72277e-06 | 397705                 |
58608                   | 399022 
 artistid   |     0.0477198 |    19274 |         2 |    97 |           0 |      0.149362 | 100050450              |
100000000               | 100055325 
 cat2       |      0.618418 |    19274 |         3 |  1066 |           0 |      0.763058 | Performer              |
BoxedSet                | Single 
 cat3       |     0.0459786 |    19274 |         4 |  1066 | 4.72277e-06 |      0.145367 | International          |
Blues                   | Sound Effects 
 cat4       |      0.308324 |    19274 |         5 |  1066 | 4.72277e-06 |       0.50827 |                        |
                    | Zydeco 
 performer  |     0.0477281 |    19274 |         6 |  1066 | 4.72277e-06 |      0.149381 | Various Artists        |
"Blue"Gene Tyranny      | underGRIND 
 performer2 |     0.0477198 |    19274 |         7 |  1066 | 4.72277e-06 |      0.149362 | Various Artists        |
"Chuscales",Jose Valle  | underGRIND 
 title      |   1.88982e-05 |    19274 |         8 |  1066 | 4.72277e-06 |   9.44555e-05 | Good Old Country       | !
                    | Zzzzzz... [EP] * 
 artist1    |      0.770286 |    19274 |         9 |  1066 | 4.72277e-06 |      0.864182 |                        |
                    | w 
 engineer   |      0.719466 |    19274 |        10 |  1066 | 4.72277e-06 |      0.831534 |                        |
                    | ob Bullock 
 producer   |      0.586756 |    19274 |        11 |  1066 | 4.72277e-06 |      0.740488 |                        |
                    | Zvika Nadaf 
 labelname  |   0.000490215 |    19274 |        12 |  1066 | 4.72277e-06 |    0.00242751 | Rykodisc USA           |
                    | `A`A`Li`I Records 
 catalog    |    0.00114041 |    19274 |        13 |  1066 | 4.72277e-06 |    0.00557759 | 1                      |
                    | sftri325 
 distribut  |     0.0342314 |    19274 |        14 |  1066 | 4.72277e-06 |       0.11669 | Universal Distribution |
(Independentlyby Label) | n/a 
 released   |     0.0331312 |    19274 |        15 |  1066 | 4.72277e-06 |      0.113828 | n/a                    |
01/01/1986              | n/a 
 origrel    |     0.0266312 |    19274 |        16 |  1066 | 4.72277e-06 |     0.0961651 | 2000                   |
1911                    | n/a 
 nbrdiscs   |      0.931311 |    19274 |        17 |  1066 | 4.72277e-06 |      0.961169 | 1                      | 01
                    | 9 
 spar       |       0.84807 |    19274 |        18 |  1066 | 4.72277e-06 |      0.912166 | n/a                    | *N*
                    | n/a 
 minutes    |      0.778454 |    19274 |        19 |  1066 | 4.72277e-06 |       0.86933 |                        |
                    | 99 
 seconds    |      0.778454 |    19274 |        20 |  1066 | 4.72277e-06 |       0.86933 |                        |
                    | 98 
 monostereo |      0.854336 |    19274 |        21 |  1066 | 4.72277e-06 |      0.915935 | Stereo                 |
Mono                    | n/a 
 studiolive |      0.878293 |    19274 |        22 |  1066 | 4.72277e-06 |      0.930221 | Studio                 |
Live                    | n/a 
 available  |      0.632032 |    19274 |        23 |  1058 | 4.72277e-06 |      0.756938 | Y                      | N
                    | Y 
 previews   |      0.798323 |    19274 |        24 |  1066 | 4.72277e-06 |      0.881742 |                        |
                    | 99954 
 pnotes     |      0.310707 |    19274 |        25 |  1066 | 4.72277e-06 |      0.510617 |                        |
                    | 99986 
 acd        |      0.998235 |    19274 |        26 |  1066 | 4.72277e-06 |      0.999117 | A                      | A
                    | D 
(26 rows)


Re: SourceForge & Postgres

От
Tom Lane
Дата:
mlw <markw@mohawksoft.com> writes:
> btw anyone trying this query should use: "attdispersion"

Sorry about that --- I just copied-and-pasted the query from some notes
that are obsolete as of 7.1...

> cdinfo=# explain select * from ztitles where artistid = 100000220 ;
> NOTICE:  QUERY PLAN:
> Seq Scan on ztitles  (cost=0.00..4740.75 rows=3163 width=296)
> And this is with "-o -fs"

> Index Scan using ztitles_artistid_ndx on ztitles  (cost=0.00..5915.01
> rows=3163 width=296)
>   attname   | attdispersion | starelid | staattnum | staop | stanullfrac | stacommonfrac |      stacommonval      |
     staloval         |     stahival      
 
>  artistid   |     0.0477198 |    19274 |         2 |    97 |           0 |      0.149362 | 100050450              |
100000000               | 100055325
 

The reason why the thing is going for a sequential scan is that
astonishingly high stacommonfrac statistic.  Does artistid 100050450
really account for 14.9% of all the rows in your table?  (Who is that
anyway? ;-))  If so, a search for artistid 100050450 definitely *should*
use a sequential scan.  The problem at hand is estimating the frequency
of entries for some other artistid, given that we only have this much
statistical info available.  Obviously the stats are insufficient, and
I hope to do something about that in a release or two, but it ain't
gonna happen for 7.1.  In the meantime, if you've got huge outliers
like that, you could try reducing the value of NOT_MOST_COMMON_RATIO
in src/backend/utils/adt/selfuncs.c.
        regards, tom lane


Re: SourceForge & Postgres

От
mlw
Дата:
Tom Lane wrote:
> 
> mlw <markw@mohawksoft.com> writes:
> > btw anyone trying this query should use: "attdispersion"
> 
> Sorry about that --- I just copied-and-pasted the query from some notes
> that are obsolete as of 7.1...
> 
> > cdinfo=# explain select * from ztitles where artistid = 100000220 ;
> > NOTICE:  QUERY PLAN:
> 
> > Seq Scan on ztitles  (cost=0.00..4740.75 rows=3163 width=296)
> 
> > And this is with "-o -fs"
> 
> > Index Scan using ztitles_artistid_ndx on ztitles  (cost=0.00..5915.01
> > rows=3163 width=296)
> 
> >   attname   | attdispersion | starelid | staattnum | staop | stanullfrac | stacommonfrac |      stacommonval      |
       staloval         |     stahival
 
> >  artistid   |     0.0477198 |    19274 |         2 |    97 |           0 |      0.149362 | 100050450              |
100000000               | 100055325
 
> 
> The reason why the thing is going for a sequential scan is that
> astonishingly high stacommonfrac statistic.  Does artistid 100050450
> really account for 14.9% of all the rows in your table?  (Who is that
> anyway? ;-))  If so, a search for artistid 100050450 definitely *should*
> use a sequential scan.

I tested this statement against the database and you are right, about 14
seconds with the index, 4 without.

BTW ID # 100050450 is "Various Artists"

This is sort of a point I was trying to make in previous emails. I think
this situation, and this sort of ratio is far more likely than the
attention it has been given.

In about every project I have used postgres I have run into this. It is
only recently that I have understood what the problem was and how to get
around it (sort of).

This one entry is destroying any intelligent performance we could hope
to attain. As I said, I always see this sort of behavior in some
implementation.


>  The problem at hand is estimating the frequency
> of entries for some other artistid, given that we only have this much
> statistical info available.  Obviously the stats are insufficient, and
> I hope to do something about that in a release or two, but it ain't
> gonna happen for 7.1.  In the meantime, if you've got huge outliers
> like that, you could try reducing the value of NOT_MOST_COMMON_RATIO
> in src/backend/utils/adt/selfuncs.c.

I did some playing with this value, and I can seem to have it
differentiate between 100050450 and anything else.


-- 
http://www.mohawksoft.com


Re: SourceForge & Postgres

От
Bruce Momjian
Дата:
Tim, how is PostgreSQL working for you?

-- Start of PGP signed section.
> I thought the hackers team would be interested in knowing that SourceForge, as
> of Friday evening, is running on Postgres. Some 95,000 users and 12,500 Open
> Source projects are depending on your stuff, so I hope it's going to be stable
> for us. ;-)
> 
> Throughout the codebase we're making good use of transactions, subselects, and
> foreign keys in all the places I've been wanting them for the past year, but
> I'm running into some places where the query optimizer is not using the right
> indexes, and sometimes does sequential scans on tables.
> 
> Here's a good example. If I remove the ORDER BY (which I didn't care to have),
> postgres resorts to a sequential scan of the table, instead of using one of
> 3 or 4 appropriate indexes. I have an index on group_id, one on
> (group_id,status_id) and one on (group_id,status_id,assigned_to) 
> 
> SELECT
> bug.group_id,bug.priority,bug.bug_id,bug.summary,bug.date,users.user_name AS
> submitted_by,user2.user_name AS assigned_to_user
> FROM bug,users,users user2
> WHERE group_id='1'
> AND bug.status_id <> '3'
> AND users.user_id=bug.submitted_by
> AND user2.user_id=bug.assigned_to
> -- 
> ORDER BY bug.group_id,bug.status_id
> --
> LIMIT 51 OFFSET 0;
> 
> Tim
> 
> -- 
> Founder - PHPBuilder.com / Geocrawler.com
> Lead Developer - SourceForge
> VA Linux Systems
-- End of PGP section, PGP failed!


--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: SourceForge & Postgres

От
Jan Wieck
Дата:
Tim Perdue wrote:
> I thought the hackers team would be interested in knowing that SourceForge, as
> of Friday evening, is running on Postgres. Some 95,000 users and 12,500 Open
> Source projects are depending on your stuff, so I hope it's going to be stable
> for us. ;-)

Tim,
   the  PG  core team is wondering if SourceForge might still be   running on a snapshot prior to  BETA3,  because
there is  a   major bug in it that could result in a complete corruption of   the system catalog.
 
   The bug is that the shared buffer cache might mix  up  blocks   between  different  databases.  As  long  as you
onlyuse one   database, you're fairly safe.  But  a  single  'createdb'  or   'createuser'  on  the  same  instance,
whichis connecting to   template1, could blow away your entire  installation.  It  is   fixed in BETA3.
 
   My personal recommendation should be clear.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #



_________________________________________________________
Do You Yahoo!?
Get your free @yahoo.com address at http://mail.yahoo.com



Re: SourceForge & Postgres

От
Oleg Bartunov
Дата:
Tim,

I've found your message in postgres hackers list and wondering if
sourceforge db part could be improved using our recent (7.1) GiST improvements.

In short, using RD-Tree + GiST we've added index support for arrays of
integers. For example, in our rather busy web site we have pool
of online news. Most complex query to construct main page is
select messages from given list of categories, because it requires
join from  message_section_map (message could belong to several
categories).
messages    message_section_map
--------    -------------------
msg_id      msg_id
title       sect_id
.....

WHERE clause (simplificated) looks like
......
message_section_map.sect_id in (1,13,103,10488,105,17,9,4,2,260000373,12,7,8,14,5,6,11,15,
10339,10338,10336,10335,260000404,260000405,260000403,206) and
message_section_map.msg_id = messages.msg_id order by publication_date
desc .....

This is really difficult query and takes a long time to execute.

now, we exclude message_section_map, just add array <sections> to
table messages which contains all sect_id given message belong to.
Using our index support for arrays of int4  our complex query
executes very fast !

I think sourceforge uses some kind of such queries.

Some info about GiST extension and our contribution could be find
at http://www.sai.msu.su/~megera/postgres/gist/

Regards,    Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83