Thread: Experiences with extensibility

Experiences with extensibility

From: Eric Davies
The existing server extensibilities in modern DBMSs have been critical in our company's development of software products that improve database performance for certain scientific computing applications.  We are planning to develop other products that will utilize an extensible database engine, and so we would be grateful for feedback on the following questions as they pertain to your experience in developing PostgreSQL applications:
 
1.      What types of extensibility (i.e., data types, functions/methods, indexing, etc.) are available and convenient to use in practice?
2.      What types of extensibility (possibly already available in other DBMSs) are currently missing in PostgreSQL?
3.      To what extent was your choice of PostgreSQL as a development platform based primarily on its extensibility features?

**********************************************
Eric Davies, M.Sc.
Barrodale Computing Services Ltd.
Tel: (250) 472-4372 Fax: (250) 472-4373
Web: http://www.barrodale.com
Email: eric@barrodale.com
**********************************************
Mailing Address:
P.O. Box 3075 STN CSC
Victoria BC Canada V8W 3W2

Shipping Address:
Hut R, McKenzie Avenue
University of Victoria
Victoria BC Canada V8W 3W2
**********************************************


Re: Experiences with extensibility

From: "Joshua D. Drake"

On Tue, 08 Jan 2008 16:28:11 -0800
Eric Davies <eric@barrodale.com> wrote:

> The existing server extensibilities in modern DBMSs have been 
> critical in our company's development of software products that 
> improve database performance for certain scientific computing 
> applications.  We are planning to develop other products that will 
> utilize an extensible database engine, and so we would be grateful 
> for feedback on the following questions as they pertain to your 
> experience in developing PostgreSQL applications:
> 
> 1.      What types of extensibility (i.e., data types, 
> functions/methods, indexing, etc.) are available and convenient to 
> use in practice?

Uhhh... all? You can create custom data types and even use them on the
fly via DOMAINs; we have full function support in just about every
language imaginable, and full support for hash, btree, GiST and GIN
indexes (plus contrib types such as ltree built on top of them).
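
For instance, a constrained type and a function over it take only a couple
of statements (a minimal sketch; the names below are made up for
illustration, not taken from this thread):

    -- a domain is an existing type plus constraints, usable anywhere a type is
    CREATE DOMAIN us_zip AS text
        CHECK (VALUE ~ '^[0-9]{5}(-[0-9]{4})?$');

    -- a SQL-language function over that domain
    CREATE FUNCTION zip_prefix(us_zip) RETURNS text
        AS 'SELECT substr($1, 1, 5)'
        LANGUAGE sql IMMUTABLE;

    -- the domain works like any other column type, including for indexing
    CREATE TABLE addresses (id serial PRIMARY KEY, zip us_zip);
    CREATE INDEX addresses_zip_idx ON addresses (zip);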

> 2.      What types of extensibility (possibly already available in 
> other DBMSs) are currently missing in PostgreSQL?

None that I am aware of.

> 3.      To what extent was your choice of PostgreSQL as a development 
> platform based primarily on its extensibility features?
> 

There is no other open source database that can compare with
PostgreSQL's extensibility, reliability and scalability.

Joshua D. Drake


> **********************************************
> Eric Davies, M.Sc.
> Barrodale Computing Services Ltd.
> Tel: (250) 472-4372 Fax: (250) 472-4373
> Web: http://www.barrodale.com
> Email: eric@barrodale.com
> **********************************************
> Mailing Address:
> P.O. Box 3075 STN CSC
> Victoria BC Canada V8W 3W2
> 
> Shipping Address:
> Hut R, McKenzie Avenue
> University of Victoria
> Victoria BC Canada V8W 3W2
> **********************************************
> 
> 


-- 
The PostgreSQL Company: Since 1997, http://www.commandprompt.com/ 
Sales/Support: +1.503.667.4564   24x7/Emergency: +1.800.492.2240
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
SELECT 'Training', 'Consulting' FROM vendor WHERE name = 'CMD'



Re: Experiences with extensibility

From: "Rodrigo E. De León Plicet"
On Jan 8, 2008 7:36 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> There is no other open source database that can compare with
> PostgreSQL's extensibility, reliability and scalability.

+1000

Re: Experiences with extensibility

From: Alvaro Herrera
Joshua D. Drake wrote:

> On Tue, 08 Jan 2008 16:28:11 -0800
> Eric Davies <eric@barrodale.com> wrote:

> > 3.      To what extent was your choice of PostgreSQL as a development
> > platform based primarily on its extensibility features?
>
> There is no other open source database that can compare with
> PostgreSQL's extensibility, reliability and scalability.

AFAIK there is no other database system _at all_ that can compete with
PostgreSQL's extensibility.

On Postgres, you can create your own:
- functions (common functions; "set-returning" functions; aggregate functions)
- types
- operators
- indexing access methods (in particular, indexing for your own types)
- PL languages (i.e. you can create a language handler for whatever suits
you).  Handlers are already available for Python, Tcl, Perl, PHP, sh, and
others.
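
As a small illustration of the function and operator items above (a sketch
only; textcat is the built-in function behind ||, while textcat_all,
approx_eq and the ~= operator are made-up names):

    -- an aggregate built from an existing function: concatenates all input values
    CREATE AGGREGATE textcat_all (text) (
        sfunc    = textcat,
        stype    = text,
        initcond = ''
    );

    -- a user-defined operator wrapping a user-defined function
    CREATE FUNCTION approx_eq(numeric, numeric) RETURNS boolean
        AS 'SELECT abs($1 - $2) < 0.01'
        LANGUAGE sql IMMUTABLE;

    CREATE OPERATOR ~= (
        leftarg   = numeric,
        rightarg  = numeric,
        procedure = approx_eq
    );

    SELECT 1.001 ~= 1.002;   -- returns true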

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Experiences with extensibility

From: Gregory Stark
"Joshua D. Drake" <jd@commandprompt.com> writes:

>> 2.      What types of extensibility (possibly already available in
>> other DBMSs) are currently missing in PostgreSQL?
>
> None that I am aware of.

I'm sure there are some options available in some databases which Postgres
doesn't have. Usually Postgres has more choices than any of the others but
that doesn't mean that it includes the union of all of their feature sets.

You'll probably get more useful answers if you ask a more specific question.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's On-Demand Production Tuning

Re: Experiences with extensibility

From: Robert Treat
On Tuesday 08 January 2008 21:31, Gregory Stark wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
> >> 2.      What types of extensibility (possibly already available in
> >> other DBMSs) are currently missing in PostgreSQL?
> >
> > None that I am aware of.
>
> I'm sure there are some options available in some databases which Postgres
> doesn't have. Usually Postgres has more choices than any of the others but
> that doesn't mean that it includes the union of all of their feature sets.
>

I guess one could include synonyms and packages as possible items we don't
have which would make us more extensible, but the set of uses for those tools
that can't be covered with the tools available in postgres is pretty narrow.

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

Re: Experiences with extensibility

From: Guido Neitzer
On 08.01.2008, at 17:36, Joshua D. Drake wrote:

>> 2.      What types of extensibility (possibly already available in
>> other DBMSs) are currently missing in PostgreSQL?
>
> None that I am aware of.

Easy multi-master clustering with just two machines.

cug

--
http://www.event-s.net


Re: Experiences with extensibility

From: "Joshua D. Drake"
Guido Neitzer wrote:
> On 08.01.2008, at 17:36, Joshua D. Drake wrote:
>
>>> 2.      What types of extensibility (possibly already available in
>>> other DBMSs) are currently missing in PostgreSQL?
>>
>> None that I am aware of.
>
> Easy multi-master clustering with just two machines.

That isn't really an extensibility argument. At least not in my mind.
Further I don't know of anyone that can "easily" do it. You either
suffer the possibility of catastrophic data loss (dolphins) or you
suffer guaranteed bank account drainage (Oracle), or you suffer the
willingness of Monopolies (MSSQL).

None of those equate to "easy".

Joshua D. Drake

>
> cug
>


Re: Experiences with extensibility

From: Guido Neitzer
On 08.01.2008, at 23:20, Joshua D. Drake wrote:

> That isn't really an extensibility argument.

I was thinking about that too - for me, it still is just an
outstanding issue with PostgreSQL. It is incredibly scalable on one
machine but it totally sucks when you want more, but not much more.

Like, I have a situation where I need multi-master just for
availability. Two small servers are good enough for that. But
unfortunately with PostgreSQL the whole setup is a major pain in the ...

> At least not in my mind. Further I don't know of anyone that can
> "easily" do it. You either suffer the possibility of catastrophic
> data loss (dolphins) or you suffer guaranteed bank account drainage
> (Oracle), or you suffer the willingness of Monopolies (MSSQL).

FrontBase. It has an incredibly easy to configure replication and
multi master clustering support, is very reliable and can also handle
really big databases. The only problem is that the query planner is
not as good as PostgreSQL's so you might end up with much worse
performance. Depends a bit on the complexity of the database and how
"special" your queries are.

But if you need something easy to setup, multi-master with just two
machines, easy fail-over (done in the JDBC driver) without your
application even noticing it - try it. It's free, but not open source.
And it's a good product. I use it for some stuff and PostgreSQL for
other projects. Just depends on the requirements.

cug

--
http://www.event-s.net


Re: Experiences with extensibility

From: "Joshua D. Drake"
Guido Neitzer wrote:
> On 08.01.2008, at 23:20, Joshua D. Drake wrote:
>
>> That isn't really an extensibility argument.
>
> I was thinking about that too - for me, it still is just an outstanding
> issue with PostgreSQL. It is incredibly scalable on one machine but it
> totally sucks when you want more, but not much more.

There are OS level things you can do here.


> But if you need something easy to setup, multi-master with just two
> machines, easy fail-over (done in the JDBC driver) without your
> application even noticing it - try it.

http://www.continuent.org/HomePage


> It's free, but not open source.
> And it's a good product. I use it for some stuff and PostgreSQL for
> other projects. Just depends on the requirements.
>

Sincerely,

Joshua D. Drake



> cug
>


Re: Experiences with extensibility

From: Sim Zacks
 > That isn't really an extensibility argument. At least not in my mind.
 > Further I don't know of anyone that can "easily" do it. You either
 > suffer the possibility of catastrophic data loss (dolphins) or you
 > suffer guaranteed bank account drainage (Oracle), or you suffer the
 > willingness of Monopolies (MSSQL).
 >
 > None of those equate to "easy".

That's a load of FUD. When looking at feature-sets that are available or not
available in an open source product, you can't throw out all the things that a
commercial, closed source project has because it isn't open source and it costs
money.

The reason companies go with the closed source, expensive solutions is because
they are better products.

When evaluating a database for your company, it is better to look at what the
closed source products offer that cause companies to shell out tons of money and
decide if it is worth locking yourself into an expensive and/or exclusive agreement.

Re: Experiences with extensibility

From: Guido Neitzer
On 08.01.2008, at 23:40, Joshua D. Drake wrote:

> There are OS level things you can do here.

They are normally not really easier and, more importantly, I don't have
them in my deployment environment.

> http://www.continuent.org/HomePage

When I'm talking about two cheap machines, you recommend a solution
where I need four machines (or can I use the uni/cluster machines also
as db nodes?) and licenses for a couple of thousand bucks? Sorry, not an
option.

And, I have my option ...

cug

--
http://www.event-s.net


Re: Experiences with extensibility

From: "Joshua D. Drake"
Sim Zacks wrote:
>
>  > That isn't really an extensibility argument. At least not in my mind.
>  > Further I don't know of anyone that can "easily" do it. You either
>  > suffer the possibility of catastrophic data loss (dolphins) or you
>  > suffer guaranteed bank account drainage (Oracle), or you suffer the
>  > willingness of Monopolies (MSSQL).
>  >
>  > None of those equate to "easy".
>
> That's a load of FUD. When looking at feature-sets that are available or
> not
> available in an open source product, you can't throw out all the things
> that a
> commercial, closed source project has because it isn't open source and
> it costs
> money.

You obviously didn't read my post.

>
> The reason companies go with the closed source, expensive solutions is
> because
> they are better products.

Sometimes, sometimes not. It depends on your needs.

>
> When evaluating a database for your company, it is better to look at
> what the
> closed source products offer that cause companies to shell out tons of
> money and
> decide if it is worth locking yourself into an expensive and/or
> exclusive agreement.

The only thing this post could possibly be is a Troll. Please go back
under the bridge.

Sincerely,

Joshua D. Drake



Re: Experiences with extensibility

From: "Joshua D. Drake"
Guido Neitzer wrote:
> On 08.01.2008, at 23:40, Joshua D. Drake wrote:
>
>> There are OS level things you can do here.
>
> They are normally not really easier and, more important, I don't have
> them on my deployment environment.
>
>> http://www.continuent.org/HomePage
>
> When I'm talking about two cheap machines you recommend a solution where
> I need four machines (Or can I use the uni/cluster machines also as db
> nodes?) and licenses for a couple of thousands bucks? Sorry, no option.
>

Did you even bother to read the page?

http://sequoia.continuent.org/HomePage

Open Source... Free...

Sequoia is a transparent middleware solution offering clustering, load
balancing and failover services for any database. Sequoia is the
continuation of the C-JDBC project.

It can be downloaded here:

https://forge.continuent.org/frs/?group_id=6


> And, I have my option ...
>

Great! I was just trying to show you that there was a JDBC layer
available for multi-mastering with PostgreSQL.

Sincerely,

Joshua D. Drake



Re: Experiences with extensibility

From: Ow Mun Heng
On Tue, 2008-01-08 at 23:37 -0700, Guido Neitzer wrote:
> On 08.01.2008, at 23:20, Joshua D. Drake wrote:

> Like, I have a situation where I need multi-master just for
> availability. Two small servers are good enough for that. But
> unfortunately with PostgreSQL the whole setup is a major pain in the ...
>

Isn't that the reason they hire DB admins and not the run-of-the-mill
guy?

I've not played with multi-master (sync/async) and I doubt I will, since
there's no requirement for it (yet).

In any case, based on my research there are lots of FOSS and (not-so-)FOSS
based solutions, and of course each comes with its own learning curve
and also depends on the complexity of the requirements. (Mind you, even
with MSSQL and all its polished point-and-click interface, you still have
times when you pull your hair out.)

I've done a simple master/slave configuration which is faring well, so
that's fine (for me).

Re: Experiences with extensibility

From: Guido Neitzer
On 09.01.2008, at 00:08, Joshua D. Drake wrote:

> Did you even bother to read the page?

Actually I tried, but I typed it in the browser and it resolved directly
to continuent.com (which I have as a bookmark); I wasn't aware of
the Sequoia stuff anymore and connected Continuent with uni/cluster
directly in my mind. That was the main problem here.

After reading a bit more now, I haven't seen it mentioned how many nodes
are required for a minimal, more or less safe, fail-over setup.

> Great! I was just trying to show you that there was a JDBC layer
> available for multi-mastering with PostgreSQL.

When I find some time, I might dig a bit deeper in the Sequoia stuff
again. We will see. Thanks for that hint.

cug

--
http://www.event-s.net


Re: Experiences with extensibility

From: Ow Mun Heng
On Tue, 2008-01-08 at 23:05 -0800, Joshua D. Drake wrote:
> Sim Zacks wrote:

>
> >
> > The reason companies go with the closed source, expensive solutions is
> > because they are better products.
>
> Sometimes, sometimes not. It depends on your needs.

This is total FUD. Everything has a place. And besides, from what I've read, nobody ever gets fired
for recommending an expensive solution that comes with expensive support contracts and whatnot.
(I wish I could google and insert the link to where I read that.)

>
> >
> > When evaluating a database for your company, it is better to look at
> > what the
> > closed source products offer that cause companies to shell out tons of
> > money and
> > decide if it is worth locking yourself into an expensive and/or
> > exclusive agreement.
>
> The only thing this post could possibly be is a Troll. Please go back
> under the bridge.


No, it's better to evaluate if the features which are being provided
will fit your needs. This is akin to buying a lamborghini only to drive
it down to the local 7-11, down the (same) road to buy some bread.

Take a walk instead, save my ears, save some petrol, save some money.

Otherwise, you end up paying X amount more for features you don't need.
(Me remembers vividly an episode of Simpsons where Homer was given free
rein to design the ultimate American Dream Car.)


Re: Experiences with extensibility

From: Guido Neitzer
On 09.01.2008, at 00:14, Ow Mun Heng wrote:

>> Like, I have a situation where I need multi-master just for
>> availability. Two small servers are good enough for that. But
>> unfortunately with PostgreSQL the whole setup is a major pain in
>> the ...
>>
>
> Isn't that the reason they hire DB admins and not the run of the mill
> guy.

Isn't that more the situation where it is preferable to have a working
fail-over with as little money and work as possible?

There is just no way I (personally) can afford hiring someone to set
that up, as I'm talking about something that hasn't brought in a dollar
yet and probably won't for a while ... and it is my own
project, but there is still some need for a reliable service to get
to a point where I can maybe hire someone.

cug

--
http://www.event-s.net


Re: Experiences with extensibility

From: Ow Mun Heng
On Wed, 2008-01-09 at 00:21 -0700, Guido Neitzer wrote:
> On 09.01.2008, at 00:08, Joshua D. Drake wrote:

> > Great! I was just trying to show you that there was a JDBC layer
> > available for multi-mastering with PostgreSQL.
>
> When I find some time, I might dig a bit deeper in the Sequoia stuff
> again. We will see. Thanks for that hint.
>


IIRC, there was a presentation on available replication solutions at one
of the PGCons. Perhaps you can search there.

Re: Experiences with extensibility

From: Ow Mun Heng
On Wed, 2008-01-09 at 00:24 -0700, Guido Neitzer wrote:
> On 09.01.2008, at 00:14, Ow Mun Heng wrote:
>
> >> Like, I have a situation where I need multi-master just for
> >> availability. Two small servers are good enough for that. But
> >> unfortunately with PostgreSQL the whole setup is a major pain in
> >> the ...
> >>
> >
> > Isn't that the reason they hire DB admins and not the run of the mill
> > guy.
>
> Isn't that more the situation where it is preferred to have a working
> fail-over with as less money and work as possible?

Yep... that's where FOSS comes in. But as mentioned, there's a
learning curve in everything, and granted, in FOSS the
documentation is sometimes sparse, etc.

I guess the other side of the coin is this -> if you want it cheap, you
have to do it yourself. I'd be rich if I got paid each time the
plumber/electrician/etc. comes around to fix something. Each time, the
itch is for me to learn how to do it myself.

>
> There is just no way I (personally) can afford hiring someone to set
> that up as I'm talking about something that hasn't brought a dollar
> yet and will probably not for the next time ... and it is my own
> project, but there is still some need for a reliable service to come
> to a point where I can maybe hire someone.

point taken.

Re: Experiences with extensibility

From: "Joshua D. Drake"
Ow Mun Heng wrote:
> On Tue, 2008-01-08 at 23:05 -0800, Joshua D. Drake wrote:
>> Sim Zacks wrote:
>
>>> The reason companies go with the closed source, expensive solutions is
>>> because they are better products.
>> Sometimes, sometimes not. It depends on your needs.
>
> This is total FUD. Everything has a place. And besides, as what I read, nobody ever gets fired
> for recommending an expensive solution that comes with expensive support contracts and what not.
> (wish I could google and insert the link to where I read that)

Exactly. It is amazing to me that companies are snookered into the idea
that per cpu pricing (or per client) for support contracts is a valid
method to determine actual costs to support the customer.

There are good closed source products but to suggest that just because
it is an expensive solution it is better is a little dumb.


>> The only thing this post could possibly be is a Troll. Please go back
>> under the bridge.
>
> No, it's better to evaluate if the features which are being provided
> will fit your needs. This is akin to buying a lamborghini only to drive
> it down to the local 7-11, down the (same) road to buy some bread.
>
> Take a walk instead, save my ears, save some petrol, save some money.

No kidding.

>
> Otherwise, you end up paying X amount more for features you don't need.
> (Me remembers vividly an episode of Simpsons where Homer was given free
> rein to design the ultimate American Dream Car.)
>

Heh...

Sincerely,

Joshua D. Drake



Re: Experiences with extensibility

From: Sim Zacks
You wrote that either it is not implemented well (catastrophic data loss), or it is
expensive (Oracle), or it is a monopoly (MSSQL), and that none of those are easy.
"Expensive" and "monopoly" don't seem to me to mean "not easy"; rather, they are
undesirable if you don't need to get into them.

When someone asks a question about a feature found in a commercial product and
the answer is that the feature is not available unless you accept
horrid possibilities, that is similar to Microsoft saying that sure, you can use
open source, but there is no support, it is unreliable, ... Pure FUD. You can
call it reverse FUD, but it is FUD nonetheless.

We use PostgreSQL because it is open source, we have in-house experience to deal
with it so we don't have any extra support costs, and we don't need the features
that are offered in commercial products that PostgreSQL does not have. We also
don't need the speed that commercial products offer that is missing in PostgreSQL.

Sim


Joshua D. Drake wrote:
> Sim Zacks wrote:
>>
>>  > That isn't really an extensibility argument. At least not in my mind.
>>  > Further I don't know of anyone that can "easily" do it. You either
>>  > suffer the possibility of catastrophic data loss (dolphins) or you
>>  > suffer guaranteed bank account drainage (Oracle), or you suffer the
>>  > willingness of Monopolies (MSSQL).
>>  >
>>  > None of those equate to "easy".
>>
>> That's a load of FUD. When looking at feature-sets that are available
>> or not
>> available in an open source product, you can't throw out all the
>> things that a
>> commercial, closed source project has because it isn't open source and
>> it costs
>> money.
>
> You obviously didn't read my post.
>
>>
>> The reason companies go with the closed source, expensive solutions is
>> because
>> they are better products.
>
> Sometimes, sometimes not. It depends on your needs.
>
>>
>> When evaluating a database for your company, it is better to look at
>> what the
>> closed source products offer that cause companies to shell out tons of
>> money and
>> decide if it is worth locking yourself into an expensive and/or
>> exclusive agreement.
>
> The only thing this post could possibly be is a Troll. Please go back
> under the bridge.
>
> Sincerely,
>
> Joshua D. Drake
>
>
>

Re: Experiences with extensibility

From: "Joshua D. Drake"
Sim Zacks wrote:

> We use postgresql because it is open source, we have in-house experience
> to deal with it so we don't have any extra support costs and we don't
> need the features that are offered in commercial products that
> PostGreSQL does not have. We also don't need the speed that commercial
> products offer that is missing in PostgreSQL.

I use PostgreSQL because it has a ton of features the closed source
products don't offer and is generally faster than the closed source
solutions.

Granted there are scenarios where others are FASTER (SELECT COUNT(*))
but I find that if you are doing those items, you normally have a weird
design anyway.

Sincerely,

Joshua D. Drake


count(*) and bad design was: Experiences with extensibility

From: Ivan Sergio Borgonovo
On Wed, 09 Jan 2008 00:06:45 -0800
"Joshua D. Drake" <jd@commandprompt.com> wrote:

> Granted there are scenarios where others are FASTER (SELECT
> COUNT(*)) but I find that if you are doing those items, you
> normally have a weird design anyway.

> Sincerely,

Sincerely, could you give an example of such a bad design?

Or did you just mean that count(*) is bad design in postgresql since
there are usually better alternatives in postgresql?

I'm not joking. I'd like to learn.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


Re: count(*) and bad design was: Experiences with extensibility

From: "Dann Corbit"
> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
> owner@postgresql.org] On Behalf Of Ivan Sergio Borgonovo
> Sent: Wednesday, January 09, 2008 1:30 AM
> To: pgsql-general@postgresql.org
> Subject: [GENERAL] count(*) and bad design was: Experiences with
> extensibility
>
> On Wed, 09 Jan 2008 00:06:45 -0800
> "Joshua D. Drake" <jd@commandprompt.com> wrote:
>
> > Granted there are scenarios where others are FASTER (SELECT
> > COUNT(*)) but I find that if you are doing those items, you
> > normally have a weird design anyway.
>
> > Sincerely,
>
> Sincerely, would you make an example of such a bad design?

A program that estimates cardinality by doing SELECT COUNT(*) is a bad
design.  Assuming you have the wherewithal to vacuum your tables (or
have autovacuum enabled) a query against the system tables will be a
much better estimate of cardinality.

Now (some may argue) what if we want an _EXACT_ value for COUNT(*)?  We
had better ask ourselves (in that circumstance) "Am I willing to lock
the entire table and scan it?" because that is what will be necessary to
get a truly exact value.  Otherwise, you can get totals that are wildly
off-base if someone is doing a bulk import or deleting a large number of
records.

So:
    SELECT reltuples FROM pg_class WHERE relname = <table_name>;

Is more often what is really wanted.

> Or did you just mean that count(*) is bad design in postgresql since
> there are usually better alternatives in postgresql?

If you are using COUNT(*) as an existence test, then substitute:

    SELECT EXISTS (SELECT 1 FROM <table_name> WHERE <condition_list>);


Use the indexes (if possible) by WHERE clause restriction:

    SELECT count(1) FROM <table_name> WHERE <condition_list>

Will use indexes if appropriate.
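
As a concrete instance of the pg_class estimate and the EXISTS rewrite
above (using a hypothetical orders table; the names here are invented):

    -- rough cardinality from the planner's statistics, no table scan
    SELECT reltuples::bigint AS approx_rows
      FROM pg_class
     WHERE relname = 'orders';

    -- existence test: stops at the first matching row instead of counting them all
    SELECT EXISTS (SELECT 1 FROM orders WHERE customer_id = 42);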


> I'm not joking. I'd like to learn.

I think this should be a FAQ because it is a (F)requently (A)sked
(Q)uestion.

IMO-YMMV.



> --
> Ivan Sergio Borgonovo
> http://www.webthatworks.it
>
>

Re: count(*) and bad design was: Experiences with extensibility

From: Ivan Sergio Borgonovo
On Wed, 9 Jan 2008 01:39:34 -0800
"Dann Corbit" <DCorbit@connx.com> wrote:

> > On Wed, 09 Jan 2008 00:06:45 -0800
> > "Joshua D. Drake" <jd@commandprompt.com> wrote:

> > > Granted there are scenarios where others are FASTER (SELECT
> > > COUNT(*)) but I find that if you are doing those items, you
> > > normally have a weird design anyway.

> > > Sincerely,

> > Sincerely, would you make an example of such a bad design?

> A program that estimates cardinality by doing SELECT COUNT(*) is a
> bad design.  Assuming you have the wherewithal to vacuum your
> tables (or have autovacuum enabled) a query against the system
> tables will be a much better estimate of cardinality.
>
> Now (some may argue) what if we want an _EXACT_ value for
> COUNT(*)?  We had better ask ourselves (in that circumstance) "Am I
> willing to lock the entire table and scan it?" because that is what
> will be necessary to get a truly exact value.  Otherwise, you can
> get totals that are wildly off-base if someone is doing a bulk
> import or deleting a large number of records.

Please forgive my naivety in this field, but what does an
"exact count" mean, what do other DBs mean by "an exact count", and how
do other DBs deal with it?

How is "count" defined in the SQL standard?

Is there a real situation where you really need the "exact"
count?

Am I right in saying that:

select count(*) from ...
-- here the count may already be different

and that:
select count(*) from ... for update

could be very expensive? Or what would it even mean to do a
select count(*) ... for update?


> I think this should be a FAQ because it is a (F)requently (A)sked
> (Q)uestion.

After reading your email I think the real problem is not how to do
otherwise, but to understand what count(*) really means, when and whether it
is really useful, and when it can be avoided.

I'd write in the FAQ something along the lines of:

- What does count(*) really do?
- When can it be avoided?
- When can't it be avoided?

In my everyday use of count(*), after your email I can hardly spot a
place where I need an exact count.
But to better understand, and to be convinced that using count(*) is
bad design, I think the last question could help a lot.

How does count(*) with a where clause generally perform in PostgreSQL
compared to other DBs?
I'd expect it to perform as well as or better than other DBs, since now the
bottleneck should be how efficiently it can filter records... but
still, a count(*) with a where clause will run into the same problem
of what "exact" means.

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


Re: count(*) and bad design was: Experiences with extensibility

From: "Harald Armin Massa"
Ivan,

> Please forgive my naiveness in this field but what does it mean an
> "exact count" and what other DB means with "an exact count" and how
> other DB deal with it?

PostgreSQL will give you an exact count of the contents of the
database as it is at the moment you begin your count (i.e. when the
transaction starts).

BUT as the table is not locked, somebody can bulk-load MANY items
into the database in parallel, so at the moment (start of your transaction)
+ 1 msec your count may already be invalid.

> I'd expect it perform as good or better than other DB since now the
> bottleneck should be how efficiently it can filter records... but
> still a count(*) with a where clause will incur in the same problem
> of what "exact" means.

I know of 3 concepts to answer count() faster than PostgreSQL:

1) just lie, present an estimate
2) do not have MVCC
3) store "record deleted info" in index, so you can answer count()
with only scanning the index

Harald


--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Spielberger Straße 49
70435 Stuttgart
0173/9409607
fx 01212-5-13695179
-
EuroPython 2008 will take place in Vilnius, Lithuania - Stay tuned!

Re: count(*) and bad design was: Experiences with extensibility

From: Ivan Sergio Borgonovo
On Wed, 9 Jan 2008 13:04:39 +0100
"Harald Armin Massa" <haraldarminmassa@gmail.com> wrote:

> Ivan,

> > Please forgive my naiveness in this field but what does it mean an
> > "exact count" and what other DB means with "an exact count" and
> > how other DB deal with it?

> PostgreSQL will give you an exact count of the contents of the
> database as it is in the moment you begin your count. (i.e. the
> transaction starts)

> BUT as the table is not locked, in parallel somebody can bulkload
> MANY items into the database, so at the moment (start of your
> transaction)
> + 1msec your count may be invalid allready.

That's reasonable. What do other DBs do, and what is the SQL standard
requirement for count(*)?

> > I'd expect it perform as good or better than other DB since now
> > the bottleneck should be how efficiently it can filter records...
> > but still a count(*) with a where clause will incur in the same
> > problem of what "exact" means.
>
> I know of 3 concepts to answer count() faster then PostreSQL:
>
> 1) just lie, present an estimate
> 2) do not have MVCC
> 3) store "record deleted info" in index, so you can answer count()
> with only scanning the index

Sorry if I insist, but I think this is a well-worn selling point against
PostgreSQL.
One of the most recurring arguments about why someone else's DB is
better than pg is that count is slow.

Who lies?
If it is possible to make count run faster, what are the trade-offs
that make it unreasonable to implement that in pg?

This is not a very useful question, but does 3) imply that a select
scans the index, returns the rows, and only later checks if they are
still there? And since indexes aren't updated "on the fly", you may get
back a larger number than the real value.

Let me consider an everyday use where count() looks like the most
obvious solution: paging.

I search through a table and I need to know which is the last page.
Concurrently someone is deleting a ton of records.
No matter whether count() is fast or not, when I output the pager it will
be "wrong". But I'll still need an estimate of the number of pages, and
it is not fun if that estimate is wrong *and* slow.
And once you add the where clauses there is no way to cache the count.
Is there a way to count based on indexes, without taking into account
deleted rows, so as to "count" faster?

I can make the search faster using indexes as Dann Corbit suggested,
but as you imply, that won't make count as fast as the
"competition"[1] that lies, doesn't use MVCC, or stores deleted-row info
in indexes.

SELECT reltuples FROM pg_class WHERE relname = <table_name>;

doesn't apply, since you can't add "conditions".

Please be patient. I hear this over and over and over: PostgreSQL is
faulty because it can't count fast.
And the only replies I've seen are: it is bad design to use count,
you're a dumb ass. I admit I may be a dumb ass, but it is hard to
sell Postgres if I continue to be a dumb ass ;)

- What does count(*) really do?
Now I understand that count(*) returns the # of rows as it sees them
at the moment it was invoked. That should be true for other DBs as
well. That means that unless other DBs lock the table, they can't take
into account records that are going to be deleted once the count has
been issued.

- When can count be avoided?
Well, since even other DBs may report the "wrong" number, this makes
count() look less like the Holy Grail. But still, if you need an estimate,
wouldn't it be better to have it quickly?
How does PostgreSQL compare to other DBs when you run
select count(*) from table where conditions
once you use indexes?
If such a query will have to scan the results anyway to see if
they are still there, since info about deleted records isn't stored
in indexes, is there a way to ignore this and just get a faster
estimate?
I still can't see why it is bad design to use count().

- When can't count() be avoided?
In all the situations where you may really need count(), I think you also
need to lock the table, but well, I'd be curious to see an example
where you really need count().

Still, can somebody give an example of bad design, and one where
count() couldn't be avoided, if any?

Consider that while it makes little sense to rely on "wrong" numbers in
a "business" environment where data integrity/coherence matters *a lot*,
it is not so clear in the CMS world, where most of those
criticisms come from.

I know that the arguments for promoting Postgres in the "business" world
where DB2, Oracle and MS SQL play (?) may be different; count()
may lose its importance in that context, and you could say that other
advantages amply pay for the "slowness" of an operation that in such
a context is rare.


thanks


[1] or does postgres perform as well as the competition once you add where
clauses?


--
Ivan Sergio Borgonovo
http://www.webthatworks.it


Re: Experiences with extensibility

From: Clodoaldo
2008/1/9, Sim Zacks <sim@compulab.co.il>:
>
> The reason companies go with the closed source, expensive solutions is because
> they are better products.

Not necessarily. FOSS products don't have a sales team to persuade
and bribe people. Expensive solutions, and that is in part what makes
them expensive, can spend lots of time persuading and can offer good
money to those who decide which is the "best" product. Those who
decide are not the coders or DB admins and in general don't really
care much.

Regards, Clodoaldo Pinto Neto

Re: count(*) and bad design was: Experiences with extensibility

From: Sim Zacks
Using count(*) is not bad design, though generally it makes sense to use it with
a where.

Saying using count(*) is bad design means that the only design that you can
visualize is the specific one that you are using.

There are tons of real world examples where you need count. That is why so many
people use it as a benchmark.

Obviously if you have an application where millions of rows are added and
subtracted every minute, then the value of count is kind of vague.

However, if you are querying a data warehouse that gets populated once a month,
then count has a huge value. You could use statistics in that case, but it is
much harder to join against other tables when you are using statistics. It is
also less intuitive than using the SQL standard for finding the number of rows.

Sim

Re: Experiences with extensibility

From: Sim Zacks
I believe I was misunderstood. The fact that a product is closed source does not
make it a better product. Some companies that are using Oracle would be better
off using PostgreSQL. Other companies that need the features that Oracle offers
would not be better off using PostgreSQL.

However, there are a lot of closed source products that are better than their
open source counterparts.
Oracle vs. PostgreSQL is one of them. Obviously if you don't need the
feature set provided by Oracle, then you would be foolish to pay for it. But
if you do need the extra features, then it is worth it.

Sim




Clodoaldo wrote:
> 2008/1/9, Sim Zacks <sim@compulab.co.il>:
>> The reason companies go with the closed source, expensive solutions is because
>> they are better products.
>
> Not necessarily. FOSS products don't have a selling team to persuade
> and bribe people. Expensive solutions, and that is in part what make
> them expensive, can spend lots of time persuading and can offer good
> money to those who decide which is the "best" product. Those who
> decide are not the coders or db admins and in general don't really
> care much.
>
> Regards, Clodoaldo Pinto Neto
>

Re: Experiences with extensibility

From: Andrew Sullivan
On Tue, Jan 08, 2008 at 11:37:38PM -0700, Guido Neitzer wrote:
> Like, I have a situation where I need multi-master just for
> availability. Two small servers are good enough for that. But
> unfortunately with PostgreSQL the whole setup is a major pain in the ...

Really?  I don't think a RAID array with heartbeat and failover is that big
a pain.  It requires some careful implementation, but it can be made to work
well, I think.

A


Re: count(*) and bad design was: Experiences with extensibility

From: Ivan Sergio Borgonovo
On Wed, 09 Jan 2008 16:33:54 +0200
Sim Zacks <sim@compulab.co.il> wrote:

> Using count(*) is not bad design, though generally it makes sense
> to use it with a where.

I got the impression from others' comments that PostgreSQL
underperforms other DBs even when a where clause on an indexed column is
involved.

I may have misinterpreted this, but still, could someone clarify?

Harald Armin Massa wrote:

> 1) just lie, present an estimate
> 2) do not have MVCC
> 3) store "record deleted info" in index, so you can answer count()
> with only scanning the index

I got the impression that even counting with clauses on indexed
columns means you'll have to check if the rows are still there. That
seems to imply that the extra cost makes pg underperform compared to
other DBs even in that scenario.

I beg Harald's pardon if I misinterpreted his words.

> Saying using count(*) is bad design means that the only design that
> you can visualize is the specific one that you are using.

I'd be interested in some example from Joshua, otherwise I can't
understand what he meant.

If you're interested in all the records in a table, there is no way to
have an "engraved in stone" answer, and since there are no "where
clauses" you can cache that info and update it once in a while.
If you have a where clause I'm not expecting an "engraved in stone"
answer, but I'd expect a quick way to get an estimate, and I
still haven't understood whether:
a) I could improve my design to avoid count
b) postgres performs as well as other DBs on count when there is a
where clause
c) there is a way to get a quick estimate, avoiding count, when there
is a where clause

> There are tons of real world examples where you need count. That is
> why so many people use it as a benchmark.

> Obviously if you have an application where millions of rows are
> added and subtracted every minute, then the value of count is kind
> of vague.

> However, if you are querying a data warehouse that gets populated
> once a month, then count has a huge value. You could use statistics
> in that case, but it is much harder to join against other tables
> when you are using statistics. It is also less intuitive then using
> the SQL standard for finding the number of rows.

Again: paging records. You can't do statistics.
Surely you could optimise and lie... but that comes at a cost
compared to the simplicity of count.

Still, everybody knows that a frequent complaint about PostgreSQL is that it
has a slow count.

I can understand grey answers provided they are coherent:
a) the above claim is false
b) this claim is true just in cases where you could opt for a better
design
c) this claim is false for count without a where clause
d) this claim is true
e) ...

Details on b) would be much appreciated. The other cases require just a
yes/no answer.

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


Re: count(*) and bad design was: Experiences with extensibility

From: "Scott Marlowe"
On Jan 9, 2008 8:33 AM, Sim Zacks <sim@compulab.co.il> wrote:
> Using count(*) is not bad design, though generally it makes sense to use it with
> a where.
>
> Saying using count(*) is bad design means that the only design that you can
> visualize is the specific one that you are using.
>
> There are tons of real world examples where you need count. That is why so many
> people use it as a benchmark.
>
> Obviously if you have an application where millions of rows are added and
> subtracted every minute, then the value of count is kind of vague.
>
> However, if you are querying a data warehouse that gets populated once a month,
> then count has a huge value. You could use statistics in that case, but it is
> much harder to join against other tables when you are using statistics. It is
> also less intuitive then using the SQL standard for finding the number of rows.

I think part of the problem is people think of count() as something
other than an aggregate function.

If I ran "select avg(i) from table" on a 20M row table, I'd expect it
to take a few seconds, after all, I'm running a function across 20
Million rows.

Some databases have the ability to short-circuit count(*) without a
where clause, some with a where clause.  But the basic model of
count(*) is that it's an aggregate function, and what you're asking
the db to do is to count every single row in the db that matches your
where clause.

Now, everything's a tradeoff.  If PostgreSQL had visibility
information in the indexes, it would have to lock both the table and
index for every write, thus slowing down all the other queries that
are trying to access the table.  It would be a tradeoff that
sacrificed write speed for read speed. In a db that was used mostly
for reading, it would likely be a fair trade.  In a db that did a lot
of writing, it might slow the whole thing to a crawl.

So, the slow count(*) performance of postgresql, especially count(*)
without a where clause, is a result of the TANSTAAFL principle (there
ain't no such thing as a free lunch).

Now, if there's a where clause that's selective enough, then a
count(*) query may use the index and be a bit faster, but remember, in
pgsql, it's still got to actually hit the table to see if each tuple
really is visible to this transaction, so the index needs to be fairly
selective to be a win.

A possible workaround is to have something like a separate table with
nothing but the IDs and whatever would be in your where clause for the
tables you're accessing with a foreign key to it, and use THAT for a
count(*).  Since the rows are skinnier, the count(*) will be faster.
Another alternative is to have a trigger fire that keeps track of
the size of the table in a summary table when rows are added and
deleted.
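
A rough sketch of that trigger idea, using the bigtable from the queries
further down as the tracked table (the rowcount table and track_count
function are invented names, and note that the single counter row becomes
a contention point under heavy concurrent writes):

    -- summary table holding one running count per tracked table
    CREATE TABLE rowcount (tablename text PRIMARY KEY, cnt bigint NOT NULL);

    CREATE FUNCTION track_count() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'INSERT' THEN
            UPDATE rowcount SET cnt = cnt + 1 WHERE tablename = TG_TABLE_NAME;
        ELSIF TG_OP = 'DELETE' THEN
            UPDATE rowcount SET cnt = cnt - 1 WHERE tablename = TG_TABLE_NAME;
        END IF;
        RETURN NULL;  -- AFTER trigger: return value is ignored
    END;
    $$ LANGUAGE plpgsql;

    -- seed the counter once, then let the trigger keep it current
    INSERT INTO rowcount SELECT 'bigtable', count(*) FROM bigtable;
    CREATE TRIGGER bigtable_count AFTER INSERT OR DELETE ON bigtable
        FOR EACH ROW EXECUTE PROCEDURE track_count();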

Each of these methods "costs" you something, in time and effort or
performance, and that's why they're not automatic.  For instance, I
have an 80M row stats db that grows by about 1M rows a week.  I do NOT
need to count the whole thing, hardly ever, and can live with the fact
that I don't know exactly how many rows it has at any given time.
When I do a select count(*) with a where clause it's usually
restricted to < 1 week's data and can use an index and come back pretty
quickly.

select count(*) from bigtable where inserttime > now() - interval '1 week';
  count
---------
 1254269
(1 row)

Time: 21422.368 ms

Second run:
select count(*) from bigtable where inserttime > now() - interval '1 week';
  count
---------
 1254320
(1 row)

Time: 2001.204 ms

With the data loaded into shared_buffers / linux kernel cache, that's
not too bad.

Now, I run the same query against our production oracle machine, which
is a MUCH more powerful server...

SQL> select count(*) from bigtable where inserttime > SYSDATE-7;

  COUNT(*)
----------
   1255972

Elapsed: 00:00:18.62

second run:
SQL> select count(*) from bigtable where inserttime > SYSDATE-7;

  COUNT(*)
----------
   1255973
Elapsed: 00:00:00.98

Now, obviously, Oracle's got some optimizations for what it's got in
the buffer there, but the first run isn't really any faster. In fact,
for a much more powerful machine, the performance was, relatively
speaking, pretty bad compared to my little 1 CPU 1 sw RAID-10
reporting server.

So, while PostgreSQL's count(*) performance isn't blindingly fast,
it's not the dog some people make it out to be either.

Re: count(*) and bad design was: Experiences with extensibility

From: "Scott Marlowe"
On Jan 9, 2008 10:21 AM, Ivan Sergio Borgonovo <mail@webthatworks.it> wrote:
> On Wed, 09 Jan 2008 16:33:54 +0200
> Sim Zacks <sim@compulab.co.il> wrote:
>
> > Using count(*) is not bad design, though generally it makes sense
> > to use it with a where.
>
> I got the impression from others comments that postgresql
> under perform other DB even when a where clause on indexed column is
> involved.

Correct, see my other post.  Again though, it's a tradeoff.  This one
lower performing action allows other actions to be faster.

> Again: paging records. You can't do statistics.

Sure you can.  When you're near the front, no one cares how exact it
is.  Ever search for a common term on google?  You don't get an exact
count, you get an approximation, and you get it for a reason.

Other sites often lie, and give what looks like an exact count, but if
it's in the 100,000s who really cares?  Seriously, are you going to the
10,000th page on google for a search term?  If someone does start
going that far out, the cost of limit/offset is going to kick in, and
not just in postgresql, and queries are going to take longer and
longer.  At that point, you can switch to an exact count(*) if you
need to have it.  It won't be the most expensive thing you're doing.

> Still everybody knows that a frequent complain about postgresql is it
> has a slow count.

I've never heard it before (just kidding).  I do think it's frequent.
I also think it's overstated.

Re: count(*) and bad design was: Experiences with extensibility

From: Andrew Sullivan
On Wed, Jan 09, 2008 at 05:21:24PM +0100, Ivan Sergio Borgonovo wrote:
>
> I got the impression that even counting with clauses on on indexed
> columns means you'll have to check if columns are still there. That
> seems to imply that the extra cost make pg under perform compared to
> other DB even in that scenario.

You have to do this for any row you need to see, for _any_ database
operation in Postgres.  But that's no different from any other database
system: they all have to locate all the rows that satisfy the condition, and
then tell you how many there are.

Many other systems, however, "know" how many rows there are in the table.
In some sense, they have optimised for that case at the expense of other
cases (like, for instance, more aggressive locks than Postgres takes, or
failures due to rollback segment exhaustion, or whatever your favourite
limitation of your system of choice is).  When you build a system, you're
going to trade some features for others more than likely, and the real
question is what things you trade away.  The speed of counting all the rows
in the table seems to me to be a good thing to trade away, because it's very
rare that you actually need to know that.

> If you're interested in all the record in a table, there is no way to
> have an "engraved in stone" answer

Sure there is: SELECT count(*) from table.  That tells you how many rows
there were in the table when your transaction started (modulo read
visibility rules), just like in any other database system.

A

Re: count(*) and bad design was: Experiences with extensibility

From: "Scott Marlowe"
On Jan 9, 2008 10:46 AM, Andrew Sullivan <ajs@crankycanuck.ca> wrote:
> On Wed, Jan 09, 2008 at 05:21:24PM +0100, Ivan Sergio Borgonovo wrote:
> >
> > I got the impression that even counting with clauses on on indexed
> > columns means you'll have to check if columns are still there. That
> > seems to imply that the extra cost make pg under perform compared to
> > other DB even in that scenario.
>
> You have to do this for any row you need to see, for _any_ database
> operation in Postgres.  But that's no different from any other database
> system: they all have to locate all the rows that satisfy the condition, and
> then tell you how many there are.
>
> Many other systems, however, "know" how many rows there are in the table.
> In some sense, they have optimised for that case at the expense of other
> cases (like, for instance, more aggressive locks than Postgres takes, or
> failures due to rollback segment exhaustion, or whatever your favourite
> limitation of your system of choice is).  When you build a system, you're
> going to trade some features for others more than likely, and the real
> question is what things you trade away.  The speed of counting all the rows
> in the table seems to me to be a good thing to trade away, because it's very
> rare that you actually need to know that.
>
> > If you're interested in all the record in a table, there is no way to
> > have an "engraved in stone" answer
>
> Sure there is: SELECT count(*) from table.  That tells you how many rows
> there were in the table when your transaction started (modulo read
> visibility rules), just like in any other database system.

And if, for some god forsaken reason, you need to operate on that
number, there's always "lock table"...

I feel dirty. :)

Re: count(*) and bad design was: Experiences with extensibility

From: Andrew Sullivan
On Wed, Jan 09, 2008 at 11:03:59AM -0600, Scott Marlowe wrote:
>
> And if, for some god forsaken reason, you need to operate on that
> number, there's always "lock table"...

Yes.  You could also store the data in ISAM :-P

> I feel dirty. :)

You should.  Go wash your brain out with soap.  LOCK TABLE indeed.

A


Re: count(*) and bad design was: Experiences with extensibility

From: Ivan Sergio Borgonovo
On Wed, 9 Jan 2008 10:30:45 -0600
"Scott Marlowe" <scott.marlowe@gmail.com> wrote:

> Now, everything's a tradeoff.  If PostgreSQL had visibility
> information in the indexes, it would have to lock both the table and
> index for every write, thus slowing down all the other queries that
> are trying to access the table.  It would be a tradeoff that
> sacrificed write speed for read speed. In a db that was used mostly
> for writing, it would likely be a fair trade.  In a db that did a
> lot of writing, it might slow the whole thing to a crawl.

OK... we are getting near to the point. I understand the trade-off
problem in storing in the indexes whether the row is still there.
Is there a way to get the count of the rows that *may be* there, I
mean the ones that satisfy the where clauses, no matter whether in the
meanwhile they were deleted or not, and reindex on a regular basis?
That would be simple, and I think fast, and an acceptable solution for
most CMSs.

> Now, if there's a where clause that's selective enough, then a
> count(*) query may use the index and be a bit faster, but remember,
> in pgsql, it's still got to actually hit the table to see if each
> tuple really is visible to this transaction, so the index needs to
> be fairly selective to be a win.

But then, if the query is not selective enough, I think the problem is
shared with other DBs as well.

> A possible workaround is to have something like a separate table
> with nothing but the IDs and whatever would be in your where clause
> for the tables you're accessing with a foreign key to it, and use
> THAT for a count(*).  Since the rows are skinnier, the count(*)
> will be faster. Another alternative is to have a trigger fire that
> keeps a track of the size of the table in a summary table when rows
> are added and deleted.

If the where clause is dynamic, how can it help?

> select count(*) from bigtable where inserttime > now() - interval
> '1 week'; count
> ---------
>  1254269
> (1 row)
>
> Time: 21422.368 ms
>
> Second run:
> select count(*) from bigtable where inserttime > now() - interval
> '1 week'; count
> ---------
>  1254320
> (1 row)
>
> Time: 2001.204 ms
>
> With the data loaded into shared_buffers / linux kernel cache,
> that's not too bad.
>
> Now, I run the same query against our production oracle machine,
> which is a MUCH more powerful server...
>
> SQL> select count(*) from bigtable where inserttime > SYSDATE-7;
>
>   COUNT(*)
> ----------
>    1255972
>
> Elapsed: 00:00:18.62
>
> second run:
> SQL> select count(*) from bigtable where inserttime > SYSDATE-7;
>
>   COUNT(*)
> ----------
>    1255973
> Elapsed: 00:00:00.98
>
> Now, obviously, Oracle's got some optimizations for what it's got in
> the buffer there, but the first run isn't really any faster. In
> fact, for a much more powerful machine, the performance was,
> relatively speaking, pretty bad compared to my little 1 CPU 1 sw
> RAID-10 reporting server.

Interesting... oh let's put some emotions in it: cool ;)

> So, while PostgreSQL's count(*) performance isn't blindingly fast,
> it's not the dog some people make it out to be either.

Continuing here from your previous post... while you guys are
concerned with competing with big boxes running Oracle or thinking
about Google's web farms, I'm thinking about stuff that is in between a
mere CMS for mortals and stuff that deals with money, where you don't
have enough budget and hits to justify complex logic for stats
analysis or collecting enough stats to make any forecast reasonable.

In this context a simpler, faster, even if less accurate, count may be
very handy.

thanks again

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


Re: count(*) and bad design was: Experiences with extensibility

From: "Joshua D. Drake"

On Wed, 9 Jan 2008 20:01:05 +0100
Ivan Sergio Borgonovo <mail@webthatworks.it> wrote:

> On Wed, 9 Jan 2008 10:30:45 -0600
> "Scott Marlowe" <scott.marlowe@gmail.com> wrote:
> 
> > Now, everything's a tradeoff.  If PostgreSQL had visibility
> > information in the indexes, it would have to lock both the table and
> > index for every write, thus slowing down all the other queries that
> > are trying to access the table.  It would be a tradeoff that
> > sacrificed write speed for read speed. In a db that was used mostly
> > for writing, it would likely be a fair trade.  In a db that did a
> > lot of writing, it might slow the whole thing to a crawl.
> 
> OK... we are getting near to the point. I understand the trade-off
> problem of storing in the indexes whether the row is still there.
> Is there a way to get the count of the rows that *may be* there, 

If you analyze regularly you can use pg_class. It isn't exact but is
usually close enough (especially if you are just using it for something
like pagination).
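
For example (untested; the table name is only a placeholder, and the
figure is only refreshed by ANALYZE / autovacuum):

SELECT reltuples::bigint AS approx_rows
FROM pg_class
WHERE relname = 'bigtable';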

Joshua D. Drake



- -- 
The PostgreSQL Company: Since 1997, http://www.commandprompt.com/ 
Sales/Support: +1.503.667.4564   24x7/Emergency: +1.800.492.2240
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
SELECT 'Training', 'Consulting' FROM vendor WHERE name = 'CMD'



Re: count(*) and bad design was: Experiences with extensibility

От
"Scott Marlowe"
Дата:
On Jan 9, 2008 12:58 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On Wed, 9 Jan 2008 20:01:05 +0100
> Ivan Sergio Borgonovo <mail@webthatworks.it> wrote:
>
> > On Wed, 9 Jan 2008 10:30:45 -0600
> > "Scott Marlowe" <scott.marlowe@gmail.com> wrote:
> >
> > > Now, everything's a tradeoff.  If PostgreSQL had visibility
> > > information in the indexes, it would have to lock both the table and
> > > index for every write, thus slowing down all the other queries that
> > > are trying to access the table.  It would be a tradeoff that
> > > sacrificed write speed for read speed. In a db that was used mostly
> > > for writing, it would likely be a fair trade.  In a db that did a
> > > lot of writing, it might slow the whole thing to a crawl.
> >
> > OK... we are getting near to the point. I understand the trade-off
> > problem of storing in the indexes whether the row is still there.
> > Is there a way to get the count of the rows that *may be* there,
>
> If you analyze regularly you can use pg_class. It isn't exact but is
> usually close enough (especially if you are just using it for something
> like pagination).

Yeah, but the OP's point was that it doesn't work if you have a where clause.

I could see a use for an approximate count(*) with where clause, just
like I could see a use for the ability to retrieve random rows from a
table without using order by random() on it.  And those are both
things that would require some form of hacking in the db that I'm
certainly not capable of pulling off...
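
One very rough way to get an approximate count for an arbitrary query,
as a sketch only (the function name and details are mine, untested), is
to let the planner do the guessing and parse the row estimate out of
EXPLAIN:

create or replace function approx_count(query text) returns bigint as $$
declare
    rec record;
    cnt bigint;
begin
    -- grab the planner's row estimate from the top line of the plan
    for rec in execute 'EXPLAIN ' || query loop
        cnt := substring(rec."QUERY PLAN" from ' rows=([0-9]+)');
        exit when cnt is not null;
    end loop;
    return cnt;
end;
$$ language plpgsql;

select approx_count('select * from bigtable where inserttime > now() - interval ''1 week''');

The answer is only as good as the planner's statistics, but it costs a
plan, not a scan.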

Re: count(*) and bad design was: Experiences with extensibility

От
Ivan Sergio Borgonovo
Дата:
On Wed, 9 Jan 2008 10:58:29 -0800
"Joshua D. Drake" <jd@commandprompt.com> wrote:

> > OK... we are getting near to the point. I understand the trade-off
> > problem of storing in the indexes whether the row is still there.
> > Is there a way to get the count of the rows that *may be* there,

> If you analyze regularly you can use pg_class. It isn't exact but is
> usually close enough (especially if you are just using it for
> something like pagination).

But what if I have a

select count(*) from table where condition;

where the condition involves just indexed columns, and I want to trust
the indexes and am not concerned about deleted rows?
Just an estimate between reindexing cycles would be perfect for paging.

pg_class only seems to return the count for the whole table.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


Re: count(*) and bad design was: Experiences with extensibility

От
Zoltan Boszormenyi
Дата:
Ivan Sergio Borgonovo írta:
> On Wed, 9 Jan 2008 13:04:39 +0100
> "Harald Armin Massa" <haraldarminmassa@gmail.com> wrote:
>
>
>> Ivan,
>>
>
>
>>> Please forgive my naiveness in this field, but what does an "exact
>>> count" mean, what do other DBs mean by "an exact count", and how do
>>> other DBs deal with it?
>>>
>
>
>> PostgreSQL will give you an exact count of the contents of the
>> database as it is in the moment you begin your count. (i.e. the
>> transaction starts)
>>
>
>
>> BUT as the table is not locked, in parallel somebody can bulkload
>> MANY items into the database, so at the moment (start of your
>> transaction)
>> + 1msec your count may be invalid already.
>>
>
> That's reasonable. What do other DBs do, and what is the SQL standard
> requirement for count(*)?
>
>
>>> I'd expect it to perform as well as or better than other DBs since
>>> now the bottleneck should be how efficiently it can filter
>>> records... but still, a count(*) with a where clause will run into
>>> the same problem of what "exact" means.
>>>
>> I know of 3 concepts to answer count() faster than PostgreSQL:
>>
>> 1) just lie, present an estimate
>> 2) do not have MVCC
>> 3) store "record deleted info" in index, so you can answer count()
>> with only scanning the index
>>
>
> Sorry if I insist, but I think this is a well-worn selling point
> against postgresql.
> One of the most recurring arguments about why someone else's db is
> better than pg is that count is slow.
>
> Who lies?
> If it is possible to do otherwise and have count run faster, what are
> the trade-offs that make it unreasonable to implement it in pg?
>

The decision to use MVCC in PostgreSQL makes the point moot.
What your transaction cannot see (e.g. a new record that another
transaction inserted but hasn't COMMITted yet) cannot be counted as
visible. This is not only theoretical but also practical. Compare that
to the MaxDB way. At the very lowest level MaxDB keeps the records
in a balanced b-tree based on the primary key. Look at this URL:
http://blog.ulf-wendel.de/?p=76
MaxDB keeps track of how many records are in the tree at all times,
which is handy if your database is close to idle. Consequences:

- very fast COUNT(*) when idle
- records not yet COMMITted by other transactions are counted by
  COUNT(*) if your transaction is in "read committed" isolation level,
  even though the actual records are not visible
- COUNT(*) waits for all other transactions that modified the table
  in question to finish if your transaction is in "repeatable read" or
  "serializable" isolation level. Consider that transactions can take
  a loooong time to finish if they process many things. This way your
  SELECT COUNT(*) doesn't respond instantly but doesn't slow down
  your server either. But the end user perception is the same:
  COUNT(*) is slow!

In PostgreSQL, COUNT(*) responds at close to the same speed regardless
of other transactions. Which way do you prefer?

> This is not a very useful question, but does 3) imply that select
> scans the index, returns the rows and only later checks if they are
> still there?
> And since indexes aren't updated "on the fly", you may get back a
> larger number than the real value.
>
> Let me consider an everyday use where count() looks like the most
> obvious solution: paging.
>
> I search through a table and I need to know which is the last page.
> Concurrently someone is deleting a ton of records.
> No matter whether count() is fast or not, when I output the pager it
> will be "wrong". But I'll still need an estimate of the number of
> pages; it is not fun if that estimate is wrong *and* slow.
> And once you add the where clauses there is no way to cache the count.
> Is there a way to count based on indexes without taking into account
> deleted rows so to "count" faster?
>
> I can make the search faster using indexes as Dann Corbit suggested,
> but as you imply that won't make count as fast as the
> "concurrence"[1] that lie or don't use MVCC or store deleted info in
> indexes.
>
> SELECT reltuples FROM pg_class WHERE relname = <table_name>;
>
> doesn't apply since you can't add "conditions".
>
> Please be patient. I hear this over and over and over. Postgresql is
> faulty because it can't count fast.
> And the only replies I've seen are: it is bad design to use count,
> you're a dumb ass. I admit I may be a dumb ass, but it is hard to
> sell Postgres if I continue to be a dumb ass ;)
>
> - What does count(*) really do?
> Now I understand that count(*) returns the # of rows as it sees them
> at the moment it was invoked. That should be true for other DBs as
> well. That means that unless other DBs lock the table, they can't
> take into account records that are going to be deleted once the count
> has been issued.
>
> - When can count be avoided?
> Well, since even other DBs may report the "wrong" number, this makes
> count() look less like a Holy Grail. But still, if you need an
> estimate, wouldn't it be better to have it quickly?
> How does postgresql compare to other DBs when you run
> select count(*) from table where conditions
> once you use indexes?
> If such a query will have to scan the results anyway to see if they
> are still there, since info about deleted records isn't stored in
> indexes, is there a way to ignore this and just get a faster
> estimate?
> I still can't see why it is bad design to use count().
>
> - When can't count() be avoided?
> In all the situations where you may really need count() I think you
> also need to lock the table, but well, I'd be curious to see an
> example where you need count().
>
> Still, can somebody give an example of bad design, and one where
> count() couldn't be avoided, if any?
>
> Consider that while it makes little sense to rely on "wrong" numbers
> in a "business" environment where data integrity/coherence matters
> *a lot*, that is not so clear in the CMS world, where most of these
> criticisms come from.
>
> I know that the arguments to promote postgres in the "business" world
> where DB2, Oracle and MS SQL play (?) may be different, and count()
> may lose its importance in that context; you could say that other
> advantages more than pay off the "slowness" of an operation that is
> rare in such a context.
>
>
> thanks
>
>
> [1] or does postgres perform as the concurrence once you add where
> clauses?
>
>
>


--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
http://www.postgresql.at/



Re: Experiences with extensibility

От
Andrew Sullivan
Дата:
On Tue, Jan 08, 2008 at 10:59:56PM -0700, Guido Neitzer wrote:
>
> Easy multi-master clustering with just two machines.

To my knowledge, _nobody_ actually offers that.

There are three companies I know of that have done effective marketing of
systems.

Company O has a very advanced system with plenty of features.  When you see
it functioning, it is very impressive.  Casual acquaintance with anyone who
has attempted to implement it, however, will yield many stories that give
the lie to any claims of "easy" multi-master.  Some implementors would be
happy to get to "hard to do, but working" multi-master, as far as I've been
able to ascertain.

Company M has a nifty 80% solution for in-memory clustering.  It's a cool
hack.  But it has a remarkably large number of failure modes and corner
cases that make it a risky answer for really high-value data.  If my bank
were using this technology, I would transfer my money to another bank.

Company I actually has the most mature technology in this area, if you're
willing to use VMS.  It relies on the features of VMS to do this.  Given
that those features were delivered precisely for the finance and insurance
industries where extremely valuable data was being stored, there is a long
history of reliable field deployments.  Unfortunately, the continued life of
VMS is in some doubt, and skilled VMS operators are increasingly hard to
find and expensive.

There are other systems, including PostgreSQL, that can do a kind of
"clustering" with multiple machines, shared disk, and some sort of heartbeat
arrangement.

A

Re: Experiences with extensibility

От
Guido Neitzer
Дата:
On 09.01.2008, at 09:05, Andrew Sullivan wrote:

>> Easy multi-master clustering with just two machines.
>
> To my knowledge, _nobody_ actually offers that.

As I said: FrontBase is offering that.

cug

--
http://www.event-s.net


Re: Experiences with extensibility

От
"Scott Marlowe"
Дата:
On Jan 9, 2008 10:05 AM, Andrew Sullivan <ajs@crankycanuck.ca> wrote:
> On Tue, Jan 08, 2008 at 10:59:56PM -0700, Guido Neitzer wrote:
> >
> > Easy multi-master clustering with just two machines.
>
> To my knowledge, _nobody_ actually offers that.
>
> There are three companies I know of that have done effective marketing of
> systems.
>
> Company O has a very advanced system with plenty of features.  When you see
> it functioning, it is very impressive.  Casual acquaintance with anyone who
> has attempted to implement it, however, will yield many stories that give
> the lie to any claims of "easy" multi-master.  Some implementors would be
> happy to get to "hard to do, but working" multi-master, as far as I've been
> able to ascertain.

But my account rep told me it was easy, and he'd never lie to me,
would he?  <@_@>

Re: Experiences with extensibility

От
Ivan Sergio Borgonovo
Дата:
On Wed, 9 Jan 2008 13:45:10 -0600
"Scott Marlowe" <scott.marlowe@gmail.com> wrote:

> But my account rep told me it was easy, and he'd never lie to me,
> would he?  <@_@>

If he uses count(*) maybe, otherwise he is locking your $.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


Re: count(*) and bad design was: Experiences with extensibility

От
Ivan Sergio Borgonovo
Дата:
On Wed, 09 Jan 2008 20:29:39 +0100
Zoltan Boszormenyi <zb@cybertec.at> wrote:

> The decision to use MVCC in PostgreSQL makes the point moot.

...

thanks.

> In PostgreSQL, COUNT(*) responds closely at the same speed
> regardless of other transactions. Which way do you prefer?

Considering the relative value of count, my interest was in something
that is even less precise than the "usual" count but performs better.
I'm not proposing to turn Postgres into MySQL.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


Re: Experiences with extensibility

От
Martin
Дата:
In article <92404875-3F83-43A2-B9B1-88C662A077FD@event-s.net>,
Guido Neitzer <lists@event-s.net> wrote:

>FrontBase. It has an incredibly easy to configure replication and
>multi master clustering support, is very reliable and can also handle
>really big databases.

I've been working with FrontBase a lot lately and I wouldn't say
anything about it qualifies as "incredibly easy", and reliable it
is not. Performance of FrontBase is just plain terrible. One of
our reports takes 9 minutes on FrontBase and 10 seconds on
Postgres.

Then there's the documentation issue...


Re: count(*) and bad design was: Experiences with extensibility

От
Zoltan Boszormenyi
Дата:
Ivan Sergio Borgonovo írta:
> On Wed, 09 Jan 2008 20:29:39 +0100
> Zoltan Boszormenyi <zb@cybertec.at> wrote:
>
>
>> The decision to use MVCC in PostgreSQL makes the point moot.
>>
>
> ...
>
> thanks.
>
>
>> In PostgreSQL, COUNT(*) responds closely at the same speed
>> regardless of other transactions. Which way do you prefer?
>>
>
> Considering the relative value of count my interest was for something
> that is even less precise than the "usual" count but performs better.
> I'm not proposing to turn Postgres into MySQL.
>

The example below might be a good compromise, although every
INSERT/DELETE will be a bit slower because of the additional UPDATE
on the administrative table.

create table rec_count (tablename text unique, rec_cnt bigint)
    with (fillfactor = 50);

Add any tables you want to it with their current record count and
for any tables you want to watch:

create or replace function inc_tablename_rec_cnt()
returns trigger as $$
begin
    update rec_count set rec_cnt = rec_cnt + 1
        where tablename = 'tablename';
    return new;
end; $$ language plpgsql;

create or replace function dec_tablename_rec_cnt()
returns trigger as $$
begin
    update rec_count set rec_cnt = rec_cnt - 1
        where tablename = 'tablename';
    return old;
end; $$ language plpgsql;

create trigger ins_tablename_rec_cnt after insert on tablename
    for each row execute procedure inc_tablename_rec_cnt();
create trigger del_tablename_rec_cnt after delete on tablename
    for each row execute procedure dec_tablename_rec_cnt();

The administrative table will be a highly updated one if you want
to watch a high-traffic table, hence the FILLFACTOR setting.
You may need to adjust max_fsm_pages. Later, you can do a

SELECT rec_cnt FROM rec_count WHERE tablename = 'tablename';

which will be fast and depending on the initial value of COUNT(*)
it will be very close to the exact figure. You can extend the example
with more columns if you know your SELECT COUNT(*) ... WHERE
conditions in advance but this way you have to keep several administrative
tables for different monitored tables. Again, this trades some disk space
and INSERT/DELETE operation speed on the monitored tables for
quicker count.

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
http://www.postgresql.at/



Re: count(*) and bad design was: Experiences with extensibility

От
Chris Browne
Дата:
zb@cybertec.at (Zoltan Boszormenyi) writes:
> which will be fast and depending on the initial value of COUNT(*)
> it will be very close to the exact figure. You can extend the example
> with more columns if you know your SELECT COUNT(*) ... WHERE
> conditions in advance but this way you have to keep several administrative
> tables for different monitored tables. Again, this trades some disk space
> and INSERT/DELETE operation speed on the monitored tables for
> quicker count.

Actually, this approach will be Really Terrible for any cases where
multiple connections are adding/deleting tuples concurrently, as it
will force ALL updates to serialize behind the update to the central
table.

Occasionally, you'll have something even worse, namely a deadlock,
where two or more of the updates fighting over the single summary
tuple fall into a bad state, and one of them is forced to give up,
potentially rolling back its whole transaction.

[Waving hands for a moment]

What I would do *instead* would be for each INSERT to add a tuple with
a count of 1, and for each DELETE to add a tuple with count of -1, and
then to periodically have a single process walk through to summarize
the table.  There may be a further optimization to be had by doing a
per-statement trigger that counts the number of INSERTs/DELETEs done,
so that inserting 30 tuples (in the table being tracked) leads to
adding a single tuple with count of 30 in the summary table.

That changes the tradeoffs, again...

 - Since each INSERT/DELETE is simply doing an INSERT into the summary
   table, the ongoing activity is *never* blocking anything

 - You get the count by requesting
    SELECT SUM(rec_cnt) as record_count from rec_count where tablename = 'foo';

 - Note that the query is MVCC-consistent with the table!

 - Once in a while, you'll want to run a single procedure that, for
   each table, deletes all the existing records, and replaces them
   with a single one consisting of the sum of the individual values.

 - You can re-sync a table by running the query:
     begin;
         delete from rec_count where tablename = 'foo';
         insert into rec_count(tablename, rec_cnt) select 'foo', (select count(*) from foo);
     commit;
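
A minimal sketch of the trigger side of this (names are mine, untested;
note that rec_count must NOT have a unique constraint on tablename in
this scheme, since every change inserts a new delta row):

create or replace function foo_count_delta() returns trigger as $$
begin
    if TG_OP = 'INSERT' then
        insert into rec_count values ('foo', 1);
    else  -- TG_OP = 'DELETE'
        insert into rec_count values ('foo', -1);
    end if;
    return null;  -- return value is ignored for AFTER triggers
end;
$$ language plpgsql;

create trigger foo_track_count after insert or delete on foo
    for each row execute procedure foo_count_delta();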
--
output = reverse("ofni.sesabatadxunil" "@" "enworbbc")
http://www3.sympatico.ca/cbbrowne/languages.html
Rules of  the Evil Overlord #174.  "If I am dangling  over a precipice
and the hero reaches  his hand down to me, I will  not attempt to pull
him down with  me. I will allow him to rescue  me, thank him properly,
then return to the safety of my fortress and order his execution."
<http://www.eviloverlord.com/>

Re: Experiences with extensibility

От
Guido Neitzer
Дата:
On 09.01.2008, at 13:51, Martin wrote:

> I've been working with FrontBase a lot lately and I wouldn't say
> anything about it qualifies as "incredibly easy" and reliable it
> is not.

We have never ever had any reliability issues with FrontBase as long
as we didn't try to insert garbage. It really doesn't like that.

> Performance of FrontBase is just plain terrible. One of
> our reports takes 9 minutes on FrontBase and 10 seconds on
> Postgres.

As I said: depends on what you are doing.

Which version did you use where you got that terrible performance? The
latest one? There was a bug in the query planner in an old version
that totally killed some queries.

There is also the or-query problem (not sure whether that one is still
in there): if your report does something like "select bar from foo
where a = 1 or b = 1;" it doesn't use indexes on a or b, which is
terrible. If you have that, use separate selects combined with a union
if you can.

In my cases it was never as fast as PostgreSQL as soon as there were
more than two tables involved. But except for a couple of cases where
I had to use PostgreSQL for performance reasons, I was okay with it.
Performance is not everything, though.

PostgreSQL is for most cases the better product, but for some cases,
FrontBase is. It depends on what you're doing which side has the
edge...

> Then there's the documentation issue...

PostgreSQL documentation is better, but what is your issue exactly?
The main problem I have with the FrontBase documentation is that it
mostly gives no useful examples.

As this is off-topic here: we can transfer that either to the
FrontBase list or to private mail if you like.

cug

--
http://www.event-s.net


Re: Experiences with extensibility

От
Andrew Sullivan
Дата:
On Wed, Jan 09, 2008 at 12:38:43PM -0700, Guido Neitzer wrote:
> >>Easy multi-master clustering with just two machines.

> As I said: FrontBase is offering that.

It looks like a two-phase commit answer, if I'm reading correctly.  You can
do this today on many systems (including Postgres), but the classical
problem with 2PC clustering is that it is very slow compared to single-node
systems.  Anyway, FrontBase looks interesting.  Thanks for the reference.

A


Re: count(*) and bad design was: Experiences with extensibility

От
Sim Zacks
Дата:
It would be an administrative nightmare unless you had very few where clauses
that you were tracking.

Instead of using a trigger, you could use Listen/Notify to call a daemon on the
server to run the procedure and then you have no insert/delete overhead.
Or you could call the function on a cron job every 10 minutes...
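
A rough sketch of the Listen/Notify variant (the channel name is only a
placeholder): whatever modifies the table fires

    NOTIFY refresh_counts;

and the daemon keeps a session with

    LISTEN refresh_counts;

open, re-running the summarization (for example the re-sync query
quoted below) whenever a notification arrives rather than on a fixed
schedule.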

Chris Browne wrote:
> zb@cybertec.at (Zoltan Boszormenyi) writes:
>> which will be fast and depending on the initial value of COUNT(*)
>> it will be very close to the exact figure. You can extend the example
>> with more columns if you know your SELECT COUNT(*) ... WHERE
>> conditions in advance but this way you have to keep several administrative
>> tables for different monitored tables. Again, this trades some disk space
>> and INSERT/DELETE operation speed on the monitored tables for
>> quicker count.
>
> Actually, this approach will be Really Terrible for any cases where
> multiple connections are adding/deleting tuples concurrently, as it
> will force ALL updates to serialize behind the update to the central
> table.
>
> Occasionally, you'll have something even worse, namely a deadlock,
> where two or more of the updates fighting over the single summary
> tuple fall into a bad state, and one of them is forced to give up,
> potentially rolling back its whole transaction.
>
> [Waving hands for a moment]
>
> What I would do *instead* would be for each INSERT to add a tuple with
> a count of 1, and for each DELETE to add a tuple with count of -1, and
> then to periodically have a single process walk through to summarize
> the table.  There may be a further optimization to be had by doing a
> per-statement trigger that counts the number of INSERTs/DELETEs done,
> so that inserting 30 tuples (in the table being tracked) leads to
> adding a single tuple with count of 30 in the summary table.
>
> That changes the tradeoffs, again...
>
>  - Since each INSERT/DELETE is simply doing an INSERT into the summary
>    table, the ongoing activity is *never* blocking anything
>
>  - You get the count by requesting
>     SELECT SUM(rec_cnt) as record_count from rec_count where tablename = 'foo';
>
>  - Note that the query is MVCC-consistent with the table!
>
>  - Once in a while, you'll want to run a single procedure that, for
>    each table, deletes all the existing records, and replaces them
>    with a single one consisting of the sum of the individual values.
>
>  - You can re-sync a table by running the query:
>      begin;
>         delete from record_count where tablename = 'foo';
>         insert into record_count(tablename, rec_cnt) select 'foo', (select count(*) from foo);
>      commit;

Re: count(*) and bad design was: Experiences with extensibility

От
Martijn van Oosterhout
Дата:
On Wed, Jan 09, 2008 at 03:28:04PM +0100, Ivan Sergio Borgonovo wrote:
> Let me consider an everyday use where count() looks as the most
> obvious solution: paging.
>
> I search trough a table and I need to know which is the last page.

There's an often overlooked solution to this. Let's say your count
returns 100,000 records: are you going to give them links to 1000
different pages? Not really. Probably about 10, so really you are only
interested in an exact result below 100, or in knowing that it's
more than 100.

By placing the where clause in a subselect with a limit of 101 and a
count() around it you have an upper bound on the cost of the count; a
result of 101 simply means "more than 100". Depending on the data you
might just put the limit on the query that fetches the data and use
everything after the 10th record to determine your count on the client
side, skipping the extra round trip.
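
For instance (table and column names are only illustrative):

select count(*)
from (select 1
      from articles
      where title ilike '%postgres%'
      limit 101) as capped;

At most 101 matching rows are ever visited, and a result of 101 just
means "more than 100".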

> Is there a way to count based on indexes without taking into account
> deleted rows so to "count" faster?

I've also often pondered whether the SQL standard support for table
sampling would be good here. Sure, you still need to check visibility,
but if you specify that the DB only needs to check 10% of the tuples
and to extrapolate the results from that, you could get a fast yet
reasonably accurate result. IIRC patches for this have been floated on
the lists.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Those who make peaceful revolution impossible will make violent revolution inevitable.
>  -- John F Kennedy


Re: count(*) and bad design was: Experiences with extensibility

От
Harald Fuchs
Дата:
In article <60ejcqy6j0.fsf@dba2.int.libertyrms.com>,
Chris Browne <cbbrowne@acm.org> writes:

> There may be a further optimization to be had by doing a
> per-statement trigger that counts the number of INSERTs/DELETEs done,
> so that inserting 30 tuples (in the table being tracked) leads to
> adding a single tuple with count of 30 in the summary table.

This would be nice, but at least the 8.2.4 docs say

  Statement-level triggers do not currently have any way to examine
  the individual row(s) modified by the statement.

Is this restriction removed in a later version?

Re: count(*) and bad design was: Experiences with extensibility

От
Alvaro Herrera
Дата:
Harald Fuchs wrote:
> In article <60ejcqy6j0.fsf@dba2.int.libertyrms.com>,
> Chris Browne <cbbrowne@acm.org> writes:
>
> > There may be a further optimization to be had by doing a
> > per-statement trigger that counts the number of INSERTs/DELETEs done,
> > so that inserting 30 tuples (in the table being tracked) leads to
> > adding a single tuple with count of 30 in the summary table.
>
> This would be nice, but at least the 8.2.4 docs say
>
>   Statement-level triggers do not currently have any way to examine
>   the individual row(s) modified by the statement.
>
> Is this restriction removed in a later version?

Nope.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: count(*) and bad design was: Experiences with extensibility

От
Ron Mayer
Дата:
Chris Browne wrote:
> zb@cybertec.at (Zoltan Boszormenyi) writes:
>> SELECT COUNT(*)
> [Waving hands for a moment]

Would what Chris describes below be a good candidate for
a pgfoundry project that has functions that'll create the
triggers for you?  (yeah, I might be volunteering, but would
undoubtedly need help)

Then when people ask it again and again the response could
be "install http://pgfoundry.org/fast_count_star" rather
than "go read the mailing list archives and roll
your own - but remember to worry about deadlock and
contention on the table containing the counts".


> What I would do *instead* would be for each INSERT to add a tuple with
> a count of 1, and for each DELETE to add a tuple with count of -1, and
> then to periodically have a single process walk through to summarize
> the table.  There may be a further optimization to be had by doing a
> per-statement trigger that counts the number of INSERTs/DELETEs done,
> so that inserting 30 tuples (in the table being tracked) leads to
> adding a single tuple with count of 30 in the summary table.
>
> That changes the tradeoffs, again...
>
>  - Since each INSERT/DELETE is simply doing an INSERT into the summary
>    table, the ongoing activity is *never* blocking anything
>
>  - You get the count by requesting
>     SELECT SUM(rec_cnt) as record_count from rec_count where tablename = 'foo';
>
>  - Note that the query is MVCC-consistent with the table!
>
>  - Once in a while, you'll want to run a single procedure that, for
>    each table, deletes all the existing records, and replaces them
>    with a single one consisting of the sum of the individual values.
>
>  - You can re-sync a table by running the query:
>      begin;
>         delete from record_count where tablename = 'foo';
>         insert into record_count(tablename, rec_cnt) select 'foo', (select count(*) from foo);
>      commit;

Re: count(*) and bad design was: Experiences with extensibility

От
Alban Hertroys
Дата:
On Jan 9, 2008, at 8:07 PM, Scott Marlowe wrote:
> I could see a use for an approximate count(*) with where clause, just
> like I could see a use for the ability to retrieve random rows from a
> table without using order by random() on it.  And those are both
> things that would require some form of hacking in the db that I'm
> certainly not capable of pulling off...

About returning random rows... I've successfully applied a scrolling
cursor for that.

You need to scroll to the last row to find the size of the result
set, but after that it's pretty easy to return random rows by
scrolling to them (and marking them 'read' in some way to prevent
accidentally returning the same row again).

It does require some specific application code though - doing it
server side would mean passing the query as a function argument
(which still requires unnatural SQL statements in your application
code) or writing a function for each query (*cough*).

Performance was quite adequate (a few 100 ms) for a query returning
5 random rows from 3 or more joined tables, some of which had a few
100k rows. Calculating random() for each record in the result set (to
sort on) took much longer. That was on a dual 64-bit opteron
with 4GB RAM, iirc.

Of course a built-in statement would be preferable, I just felt like
pointing out that order by random() isn't necessarily the best
alternative ;)

Regards,
Alban Hertroys.




Re: count(*) and bad design was: Experiences with extensibility

От
Ivan Sergio Borgonovo
Дата:
On Tue, 15 Jan 2008 14:43:35 +0100
Alban Hertroys <dalroi@solfertje.student.utwente.nl> wrote:

> You need to scroll to the last row to find the size of the result
> set, but after that it's pretty easy to return random rows by
> scrolling to them (and marking them 'read' in some way to prevent
> accidentally returning the same row again).

Could you post a snippet of code or something giving a more detailed
idea of it?

BTW since cursors support offset if you're not interested if the
order of the retrieved rows is random too you don't even have to
remember which one you read I think.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


Re: count(*) and bad design was: Experiences with extensibility

От
Alban Hertroys
Дата:
On Jan 15, 2008, at 3:03 PM, Ivan Sergio Borgonovo wrote:

> On Tue, 15 Jan 2008 14:43:35 +0100
> Alban Hertroys <dalroi@solfertje.student.utwente.nl> wrote:
>
>> You need to scroll to the last row to find the size of the result
>> set, but after that it's pretty easy to return random rows by
>> scrolling to them (and marking them 'read' in some way to prevent
>> accidentally returning the same row again).
>
> Could you post a snippet of code or something giving a more detailed
> idea of it?
>
> BTW since cursors support offset if you're not interested if the
> order of the retrieved rows is random too you don't even have to
> remember which one you read I think.

I posted it on this list a while ago when I came up with this
solution. I had some trouble finding my old post in the pgsql-general
archives though - I could find the thread, just not my final posting,
and searching didn't even turn up the thread.

I did find it here:
http://www.mail-archive.com/pgsql-general@postgresql.org/msg103670.html
The thread contains several other approaches to the problem, it
really depends on your problem domain which one fits your bill.

I think the function in my original posting could do with clearer
comments though, so here's the function again:

/*
 * Return $limit random rows from the result set of SQL query $query
 */
function randomSet(
    $query,    // The query to execute
    $limit     // The (max) number of random rows required
) {
    // SQL to declare the cursor
    query("DECLARE _cur SCROLL CURSOR WITHOUT HOLD FOR $query");

    /* Get the range for random(1, n)
     *
     * Determined by scrolling the cursor to the last row.
     * Equivalent to select count(*), but without a separate query.
     */
    query("MOVE FORWARD ALL IN _cur");
    $count = pg_affected_rows();

    $uniques = array();    // A list of used cursor offsets
    $resultSet = array();

    // Fetch random rows until we have enough or there are no more
    while ($limit > 0 && count($uniques) < $count) {
        // Determine a random scroll offset
        $idx = random(1, $count);

        // Skip offsets we already used
        if (in_array($idx, $uniques))
            continue;

        // Fetch the random row
        $record = query("FETCH ABSOLUTE $idx FROM _cur");

        // Add the row offset to the list of used offsets
        $uniques[] = $idx;

        $resultSet[] = $record;
        $limit--;
    }

    // Close the cursor
    query("CLOSE _cur");

    return $resultSet;
}



Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.

