Обсуждение: Postgres vs other Postgres based MPP implementations

Поиск
Список
Период
Сортировка

Postgres vs other Postgres based MPP implementations

От
Ondrej Ivanič
Дата:
Hi,

I have simple question (I think which is not easy to answer): why
Postgres is so slow comparing to other Postgres based MPP products
(even on the same box in single node configuration)?

I'm mot talking about multi node setup; all benchmarks were done on
single box (CentOS 5.5, 16 cores, 80GB ram, fast SAS local disks) and
Postgres is slower by order of magnitude (10 or sometimes 100 times
slower). Secondly, I've run simple selects and aggregations on
vertically partitioned star schema and I haven't used features like
columnar tables or replicated dimension tables. I believe that my
Postgres configuration is close to optimal one.

Another interesting experiment was to parallelise query by hand:
select sum(count) from T where org_id = ... and  date_in_tz >=
'2011-08-01' and date_in_tz < '2011-11-01'

The query above was revritten as series of queries like this:
create table t00 as select sum(count) from T where (... previous where
...)  and date_in_tz >= hist[1] < date_in_tz >= hist[2]
create table t01 as select sum(count) from T where (... previous where
...)  and date_in_tz >= hist[2] < date_in_tz >= hist[3]...
create table t08 as select sum(count) from T where (... previous where
...)  and date_in_tz >= hist[8] < date_in_tz >= hist[9]
create table t09 as select sum(count) from T where (... previous where
...)  and date_in_tz >= hist[9] < date_in_tz >= hist[10]

"hist" is similar to pg_stat.histogram_bounds (A list of values that
divide the column's values into groups of approximately equal
population). The idea is to slice T table by primary key (date_in_tz)
into 10 groups (or N groups) with similar row count and execute those
CTAS queries thru multiple connections.

Final query just sums:
select sum(count) from (
select count from t00
union all
...
union all
select count from t09
) as x

All above were faster than single query at the begging.

Disk activity was lower and spanned over longer timespan for original
query comparing to sliced query scenario. Maybe there is no need for
quite complex query rewrite and queries can be speed up by async
parallel reads...

--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)

Re: Postgres vs other Postgres based MPP implementations

От
Craig Ringer
Дата:
On 11/08/2011 09:28 AM, Ondrej Ivanič wrote:

> I have simple question (I think which is not easy to answer): why
> Postgres is so slow comparing to other Postgres based MPP products

Which one(s) are you referring to? In what kind of workloads?

Are you talking about Greenplum or similar?

> Another interesting experiment was to parallelise query by hand:
> select sum(count) from T where org_id = ... and  date_in_tz>=
> '2011-08-01' and date_in_tz<  '2011-11-01'

Pg isn't very good at parallelism within a single query. It handles lots
of small queries concurrently fairly well, but isn't as good at using
all the resources of a box on one big query because it can only use one
CPU per query and has only a very limited ability to do concurrent I/O
on a single query too.

That said, you should be tuning effective_io_concurrency to match your
storage; if you're not, then you aren't getting the benefit of the
concurrent I/O that PostgreSQL *is* capable of. You'll also need to have
tweaked your shared_buffers, work_mem etc appropriately for your query
workload.

Since Pg needs some kind of pooling and admission control to perform
very well in OLTP, it's only highly performant without addons in the
middle of the range - medium numbers of medium-complexity queries. For
huge numbers of simple queries it needs a pooler, and for small numbers
of hugely complex queries it won't perform all that well without
something to try to parallelise the queries outside Pg.

> All above were faster than single query at the begging.

I'm not at all surprised by that. PostgreSQL couldn't use the full
resources of your system when it was expressed as just one query.

--
Craig Ringer

Re: Postgres vs other Postgres based MPP implementations

От
Ondrej Ivanič
Дата:
Hi,

On 8 November 2011 16:58, Craig Ringer <ringerc@ringerc.id.au> wrote:
> Which one(s) are you referring to? In what kind of workloads?
>
> Are you talking about Greenplum or similar?

Yes, mainly Geenplum and nCluster (AsterData). I haven't played with
gridSQL and pgpool-II's parallel query mode too much. Queries are
simple aggregations/drill downs/roll ups/... -- mostly heavy read
workloads but OLTP performance is required (like run query over 100m+
dataset in 15 sec)

> Pg isn't very good at parallelism within a single query. It handles lots of
> small queries concurrently fairly well, but isn't as good at using all the
> resources of a box on one big query because it can only use one CPU per
> query and has only a very limited ability to do concurrent I/O on a single
> query too.

Usually CPU is not bottleneck but I it was when I put Pustgres on
FusionIO. The problem is that PG spreads reads too much . iostat
reports very low drive utilisation and very low queue size.

> That said, you should be tuning effective_io_concurrency to match your
> storage; if you're not, then you aren't getting the benefit of the
> concurrent I/O that PostgreSQL *is* capable of. You'll also need to have
> tweaked your shared_buffers, work_mem etc appropriately for your query
> workload.

I've played with effective_io_concurrency (went thru entire range: 1,
2, 5, 10, 20, 50, 100, 200, 500, 1000) but nothing improved. Is there
a way to get PG backed IO stats using stock CentOS (5.7) kernel and
tools? (I can't change my env easily)

> queries it won't perform all that well without something to try to
> parallelise the queries outside Pg.

yeah, I have one moster query which needs half a day to finish but it
finishes in less than two hours on the same hw if is executed in
parallel...

> I'm not at all surprised by that. PostgreSQL couldn't use the full resources
> of your system when it was expressed as just one query.

This is very interesting area to work in but my lack of C/C++ and PG
internals puts me out of the game :)

--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)

Re: Postgres vs other Postgres based MPP implementations

От
John R Pierce
Дата:
On 11/07/11 10:49 PM, Ondrej Ivanič wrote:
>   mostly heavy read
> workloads but OLTP performance is required (like run query over 100m+
> dataset in 15 sec)

that isn't OLTP, its OLAP.  Online Analytic Processing rather than
Online Transaction Processing....   large complex reporting queries that
have to aggregate many rows is classic OLAP.

OLTP transactions tend to refer to a few rows at a time of a bunch of
tables, and update a few rows of a various tables, and you execute many
of them per second, often for a large number of concurrent clients.
classic example is a web store system (adding/modifying items in a
shopping cart, then purchasing the items in the cart).



--
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast



Re: Postgres vs other Postgres based MPP implementations

От
Craig Ringer
Дата:
On 11/08/2011 02:49 PM, Ondrej Ivanič wrote:
> Usually CPU is not bottleneck but I it was when I put Pustgres on
> FusionIO. The problem is that PG spreads reads too much . iostat
> reports very low drive utilisation and very low queue size.
"Spreads reads too much" ?

Are you saying there's too much random I/O? Is it possible it'd benefit
from a column store?

When you're using Greenplum are you using "Polymorphic Data Storage"
column storage "WITH (orientation=column)" ?

Or is the performance different just in better utilisation of the
hardware under Greenplum?
> I've played with effective_io_concurrency (went thru entire range: 1,
> 2, 5, 10, 20, 50, 100, 200, 500, 1000) but nothing improved.
Hm, ok. Scratch that then.

> Is there
> a way to get PG backed IO stats using stock CentOS (5.7) kernel and
> tools? (I can't change my env easily)

Dunno; check postgresql high performance (book), the manual, etc. Useful
tools are the pg_stat_ tables, "vmstat", "iostat", "iotop", etc.
>
>> I'm not at all surprised by that. PostgreSQL couldn't use the full resources
>> of your system when it was expressed as just one query.
> This is very interesting area to work in but my lack of C/C++ and PG
> internals puts me out of the game :)

That's a cop-out! I say that as someone who _used_ to have no C
knowledge, but learned it to get things done on code I use.

That said, Pg's codebase isn't exactly trivial :S and trying to get
involved in major re-engineering like parallelisation isn't going to be
practical when you're just getting started.

--
Craig Ringer

Re: Postgres vs other Postgres based MPP implementations

От
Ondrej Ivanič
Дата:
Hi,
>>  mostly heavy read
>> workloads but OLTP performance is required (like run query over 100m+
>> dataset in 15 sec)
>
> that isn't OLTP, its OLAP.  Online Analytic Processing rather than Online
> Transaction Processing....   large complex reporting queries that have to
> aggregate many rows is classic OLAP.

I didn't say OLTP. Our workload is "aggregations/drill downs/roll
ups/... " (= OLAP) but we need OLTP like performance i.e. our users
are not keen to wait more than several seconds for the result.
Greenplum or Postgres + Fusion IO can deliver this performance for us.

--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)

Re: Postgres vs other Postgres based MPP implementations

От
"Tomas Vondra"
Дата:
On 8 Listopad 2011, 10:49, Ondrej Ivanič wrote:
> Hi,
>>>  mostly heavy read
>>> workloads but OLTP performance is required (like run query over 100m+
>>> dataset in 15 sec)
>>
>> that isn't OLTP, its OLAP.  Online Analytic Processing rather than
>> Online
>> Transaction Processing....   large complex reporting queries that have
>> to
>> aggregate many rows is classic OLAP.
>
> I didn't say OLTP. Our workload is "aggregations/drill downs/roll
> ups/... " (= OLAP) but we need OLTP like performance i.e. our users
> are not keen to wait more than several seconds for the result.
> Greenplum or Postgres + Fusion IO can deliver this performance for us.

Sure you did - you've stated that "mostly heavy read
workloads but OLTP performance is required (like run query over 100m+
dataset in 15 sec)." That clearly mentions OLTP  ...

And OLTP has nothing to do with the amount of time the user is willing to
wait, it's rather about the amount of data the user read/modifies and the
number of tables. OLTP workload usually consists of transactions accessing
small amount of data, usually by primary key (get customer with id X,
update account with id Y, ...).

OLAP/DSS workload is somehow opposite - read large amounts of data,
perform complex analysis etc.

Tomas


Re: Postgres vs other Postgres based MPP implementations

От
Ondrej Ivanič
Дата:
Hi,

2011/11/8 Craig Ringer <ringerc@ringerc.id.au>:
> "Spreads reads too much" ?
>
> Are you saying there's too much random I/O? Is it possible it'd benefit from
> a column store?
> When you're using Greenplum are you using "Polymorphic Data Storage" column
> storage "WITH (orientation=column)" ?

yes, exactly. Column store and compression  speed up queries even more
(sometimes beyond 100x times) comparing to postgres.

>
> Or is the performance different just in better utilisation of the hardware
> under Greenplum?
>>

Yes, looks like that they can better utilise available hardware.

>> Is there
>> a way to get PG backed IO stats using stock CentOS (5.7) kernel and
>> tools? (I can't change my env easily)
>
> Dunno; check postgresql high performance (book), the manual, etc. Useful
> tools are the pg_stat_ tables, "vmstat", "iostat", "iotop", etc.

Yeah, I know about those.. I like iotop but enterprise distributions
do not ship fresh kernels... I need something which can I "safely"
(slightly worse performance is acceptable but machine must survie) run
in production for several hours and then cross reference it with
postgres and other system logs

> That said, Pg's codebase isn't exactly trivial :S and trying to get involved
> in major re-engineering like parallelisation isn't going to be practical
> when you're just getting started.

That's what I meant ;)

--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)

Re: Postgres vs other Postgres based MPP implementations

От
Ondrej Ivanič
Дата:
Hi,

2011/11/8 Tomas Vondra <tv@fuzzy.cz>:
> Sure you did - you've stated that "mostly heavy read
> workloads but OLTP performance is required (like run query over 100m+
> dataset in 15 sec)." That clearly mentions OLTP  ...

Whatever :) Let's make it clear: I need to run aggregates/roll
ups/drill downs on large dataset (100m+) but the query should return
result (less than 1000 rows, mostly around 100) under 30 sec

My point is that  MPP (same box used) solution can deliver required
performance in most cases (>75%).

--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)

Re: Postgres vs other Postgres based MPP implementations

От
John R Pierce
Дата:
On 11/08/11 1:49 AM, Ondrej Ivanič wrote:
> Greenplum or Postgres + Fusion IO can deliver this performance for us.

then, thats your answer!   it ain't free, oh well.



--
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast



Re: Postgres vs other Postgres based MPP implementations

От
John R Pierce
Дата:
On 11/08/11 2:36 AM, Ondrej Ivanič wrote:
> Yeah, I know about those.. I like iotop but enterprise distributions
> do not ship fresh kernels... I need something which can I "safely"
> (slightly worse performance is acceptable but machine must survie) run
> in production for several hours and then cross reference it with
> postgres and other system logs

nmon in data collection mode, with a suitable sample interval.
generates a .CSV file that you can pick apart and graph with a
spreadsheet or whatever.

--
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast



Re: Postgres vs other Postgres based MPP implementations

От
Ondrej Ivanič
Дата:
Hi,

On 9 November 2011 04:53, John R Pierce <pierce@hogranch.com> wrote:
> On 11/08/11 1:49 AM, Ondrej Ivanič wrote:
>>
>> Greenplum or Postgres + Fusion IO can deliver this performance for us.
>
> then, thats your answer!   it ain't free, oh well.

FusionIO is little bit problematic: smaller card (2.4TB) has serious
scalability issues (not card but PG and card); bigger one can't fit
our server due to thermal restrictions. Anyway, both cards just do
scale. Geenplum DCA is on site already.

What I see is that single node Greenplum can utilise available
resources more efficiently...

--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)

Re: Postgres vs other Postgres based MPP implementations

От
Craig Ringer
Дата:
On 11/08/2011 06:36 PM, Ondrej Ivanič wrote:

>> Are you saying there's too much random I/O? Is it possible it'd benefit from
>> a column store?
>> When you're using Greenplum are you using "Polymorphic Data Storage" column
>> storage "WITH (orientation=column)" ?
>
> yes, exactly. Column store and compression  speed up queries even more
> (sometimes beyond 100x times) comparing to postgres.

Yup, that'll happen. Right tool, meet right job.

PostgreSQL isn't a column-oriented DB optimised for OLAP workloads. One
that is will outperform it in OLAP workloads. This is not surprising. If
a free compressed column store implementation becomes available for
PostgreSQL I'm sure that'll change, but it's a lot of work and right now
the only people who've done that work aren't giving it away for free -
or not in any form that can be integrated into PostgreSQL without
removing other capabilities other users need.

A non-ACID-compliant database will outperform PostgreSQL in transaction
processing (OLTP) workloads, too; if you don't need ACID compliance then
PostgreSQL won't be the best choice there either. You won't beat a
document database for fetching and retrieving free-form records for
similar reasons. Right tool, right job.

That's not to say Pg can't improve. It can, and not just by adding
column store or index-structured table support. Improved parallelism
capabilities are needed in Pg, but like column store support are a LOT
of HARD work and not something anyone can whip up overnight. Especially
someone doing it for fun in their spare time who we have no right to
make any demands of, or someone working on the project for a company
that has its customers' priorities to think of first.

> Yeah, I know about those.. I like iotop but enterprise distributions
> do not ship fresh kernels... I need something which can I "safely"
> (slightly worse performance is acceptable but machine must survie) run
> in production for several hours and then cross reference it with
> postgres and other system logs

I don't do "enterprise" distributions, so I can't help there. Ask their
support maybe, presumably you pay for it for a reason.

--
Craig Ringer

Re: Postgres vs other Postgres based MPP implementations

От
Ondrej Ivanič
Дата:
Hi,

> it's a lot of work and right now the only people
> who've done that work aren't giving it away for free - or not in any form
> that can be integrated into PostgreSQL without removing other capabilities
> other users need.

One MPP vendor implemented columnar store in roughly six months --
lot's of work is involved there!. Anyway, all implementation what I
came across took several shortcuts like no updates(append only) or no
foreign keys, ... but it works!

> That's not to say Pg can't improve. It can, and not just by adding column
> store or index-structured table support. Improved parallelism capabilities
> are needed in Pg

I see most benefits coming from parallelism: 12hr query can finish in
2hr if sliced properly


--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)