Discussion: Four issues why "old elephants" lack performance: Explanation sought

Hi,

Recently Mike Stonebraker identified four areas where "old elephants"
lack performance [1]:

1. Buffering/paging
2. Locking/Multithreading
3. WAL logging
4. Latches (aka memory locks for concurrent access of btree structures
in buffer pool?).

He claims to have solved these issues while retaining SQL and ACID.
But the only approach I understood is for #1: loading all tuples in memory.
=> Are there any ideas on how to tell Postgres to aggressively load
all data into memory (issue #1)?

The remaining issues puzzle me.
I actually doubt that there are alternatives even theoretically.
=> Can anyone help explain issues 2, 3 and 4 to me, their solutions,
and why Postgres would be unable to resolve them?

Yours, Stefan

[1] "NewSQL vs NoSQL for New OLTP", NoSQLNow! Conference, August 2011.
http://voltdb.com/resources/videos
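For the record, one concrete answer to the "aggressively load into memory" question is the pg_prewarm contrib module, which was added in PostgreSQL 9.4, i.e. after this thread. A sketch, with placeholder relation names:

```sql
-- Sketch: preloading a table and its index into the buffer cache.
-- pg_prewarm ships as a contrib extension from PostgreSQL 9.4 onward;
-- "my_table" and "my_table_pkey" are placeholders.
CREATE EXTENSION IF NOT EXISTS pg_prewarm;

-- Load the whole relation into shared_buffers (returns blocks read):
SELECT pg_prewarm('my_table');
SELECT pg_prewarm('my_table_pkey');

-- Alternatively, only ask the OS to pull it into the page cache:
SELECT pg_prewarm('my_table', 'prefetch');
```

This warms the cache but does not pin anything: the usual buffer replacement still applies afterwards.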

On Sat, Feb 25, 2012 at 5:54 PM, Stefan Keller <sfkeller@gmail.com> wrote:
> Hi,
>
> Recently Mike Stonebraker identified four areas where "old elephants"
> lack performance [1]:
>
> 1. Buffering/paging
> 2. Locking/Multithreading
> 3. WAL logging
> 4. Latches (aka memory locks for concurrent access of btree structures
> in buffer pool?).
>
> He claims to have solved these issues while retaining SQL and ACID.
> But the only approach I understood is for #1: loading all tuples in memory.
> => Are there any ideas on how to tell Postgres to aggressively load
> all data into memory (issue #1)?
>
> The remaining issues puzzle me.
> I actually doubt that there are alternatives even theoretically.
> => Can anyone help explain issues 2, 3 and 4 to me, their solutions,
> and why Postgres would be unable to resolve them?
>
> Yours, Stefan
>
> [1] "NewSQL vs NoSQL for New OLTP", NoSQLNow! Conference, August 2011.
> http://voltdb.com/resources/videos

Here's a great speech he gave at the USENIX conference:

http://www.youtube.com/watch?v=uhDM4fcI2aI

Basically he makes the point that IF your dataset fits in memory and
you need fast performance, then using multiple machines like a RAID
array with everything in memory beats everything out there, and that's
the methodology he's shooting for.

For super fast transactional systems that fit in memory, I can see the
advantage of moving everything into memory and using redundant
servers, possibly geographically distant from each other, to ensure
durability.

But he does make the point as well that for LARGE systems that need
transactional integrity, there's still nothing that beats an elephant
like system.

BTW, there are some other great presentations at that conference as
well.  The one or two about btrfs from an Oracle guy are quite
fascinating.

On 02/25/2012 06:54 PM, Stefan Keller wrote:
> Hi,
>
> Recently Mike Stonebraker identified four areas where "old elephants"
> lack performance [1]:
>
> 1. Buffering/paging
> 2. Locking/Multithreading
> 3. WAL logging
> 4. Latches (aka memory locks for concurrent access of btree structures
> in buffer pool?).
>
> He claims to have solved these issues while retaining SQL and ACID.
> But the only approach I understood is for #1: loading all tuples in memory.
> =>  Are there any ideas on how to tell Postgres to aggressively load
> all data into memory (issue #1)?
> The remaining issues puzzle me.
> I actually doubt that there are alternatives even theoretically.
> =>  Can anyone help explain issues 2, 3 and 4 to me, their solutions,
> and why Postgres would be unable to resolve them?
>

> 1. Buffering/paging

PG, and your operating system, already do this for reads.  It also keeps things that are hit harder and lets things go
that are not hit as much.  On the writing side, you can configure PG from "write it! write it NOW!" to "running with
scissors", depending on how safe you want to feel.
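The dial Andy describes corresponds roughly to a couple of real settings; a sketch (which combination is appropriate depends entirely on your durability requirements):

```sql
-- Illustrative settings, from safest to riskiest.

-- "write it NOW!": flush WAL to disk at every commit (the default).
SET synchronous_commit = on;

-- Middle ground: the commit may return before the WAL flush; a crash
-- can lose the most recent commits (up to about 3x wal_writer_delay)
-- but cannot corrupt the database.
SET synchronous_commit = off;

-- "running with scissors": fsync = off skips flushes entirely and can
-- corrupt data on a crash.  It cannot be SET per session; it goes in
-- postgresql.conf:
--   fsync = off
```

Note that synchronous_commit can be toggled per transaction, so a mixed workload can choose safety only where it matters.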


> 2. Locking/Multithreading

PG does have some internal structures that it needs to lock (and any time you read "lock", think single-user access, or
one at a time, or slow).  Any time you hear about lock contention, it's multiple processes waiting in a line for a lock.
If you only had one client, then you really would not need locks.  That is where multithreading comes in; in PG
we use multiple processes instead of multiple threads, but it's the same thing.  Two (or more) people need to lock
something so they can really screw with it.  PG does not need as many locks as other DBs, however.  It uses an MVCC
architecture, so under normal operations (insert, update, select, delete) people don't block each other (i.e. readers don't
block writers and vice versa).

I don't see locking going away, but there are not many loads that are lock bound.  Most database loads are I/O bound,
and then you'd probably be CPU bound before you are lock bound.  (Although it can be hard to tell if it's a spinlock
that's making you CPU bound.)  I'm sure there are loads that hit lock contention, but there are probably ways to
mitigate it.  Say you have a process that alters the table and adds a new column every two seconds, then updates a
single row to add data to the new column just added.  I can see that being lock bound.  And a really stupid
implementation.
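One way to check whether a workload is actually lock bound is to ask the server who is waiting. A sketch of such a query (it assumes a release where pg_stat_activity exposes pid and query, i.e. 9.2 or later):

```sql
-- Sketch: list sessions currently waiting on a lock, and what they run.
SELECT w.pid,
       w.locktype,
       w.mode,
       a.query
FROM pg_locks w
JOIN pg_stat_activity a ON a.pid = w.pid
WHERE NOT w.granted;
```

If this query is routinely empty under load, lock contention is not your bottleneck.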

> 3. WAL logging

PG writes a transaction twice: once to WAL and once to the DB.  WAL is a simple and quick write, and is only ever used
if your computer crashes and PG has to replay transactions to get the db into a good/known state.  It's a safety measure
that doesn't really take much time, and I don't think I've heard of anyone being WAL bound.  Although it does increase
I/O ops, it's not the biggest usage of I/O.  This one falls under "let's be safe", which is something NoSQL did away with.
It's not something I want to give up, personally.  I like using a net.

> 4. Latches

I can only guess at this one.  It's similar to locks, I think.  Data structures come in different types.  In the old days
we only had single-user access to data structures; then, when we wanted two users to access one, we just locked it to
serialize access (one-at-a-time mode).  But that does not scale well at all, so we invented two new types: lock-free and
wait-free.

An index is stored as a btree.  To insert a new record into the index you have to reorganize it (rotate it, sort it,
add/delete nodes, etc.), and while one client is doing that it can make it hard for another to search it.  Lock-free
(and wait-free) structures let multiple people work on a btree at the same time with much less contention.  Wikipedia does a
better job of explaining them than I could:

http://en.wikipedia.org/wiki/Non-blocking_algorithm

I have no idea if PG uses single-user locks or some kind of lock-free structure for its internals.  I can see different
parts of the internals needing different levels.

Maybe I'm old fashioned, but I don't see how you'll get rid of these.  You have to insert a record.  You have
12 people hitting the db at the same time.  You have to organize that read/write access somehow so they don't blow each
other up.

-Andy

On Sun, Feb 26, 2012 at 08:37:54AM -0600, Andy Colson wrote:

>> 3. WAL logging
>
> PG writes a transaction twice: once to WAL and once to
> the DB.  WAL is a simple and quick write, and is only ever
> used if your computer crashes and PG has to replay
> transactions to get the db into a good/known state.  It's a
> safety measure that doesn't really take much time, and I
> don't think I've heard of anyone being WAL bound.  Although
> it does increase I/O ops, it's not the biggest usage of I/O.
> This one falls under "let's be safe", which is something NoSQL
> did away with.  It's not something I want to give up,
> personally.  I like using a net.

And, one could still effectively disable WAL by using
unlogged tables.
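For reference, a minimal sketch of Karsten's suggestion (table and column names are placeholders):

```sql
-- Sketch: an unlogged table skips WAL, so writes avoid the double write.
-- The trade-off: its contents are truncated after a crash, and it is
-- not carried over streaming replication.
CREATE UNLOGGED TABLE session_cache (
    session_id  text PRIMARY KEY,
    payload     text,
    touched_at  timestamptz DEFAULT now()
);
```

That makes unlogged tables a fit for caches and other reconstructible data, not for the system of record.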

Karsten
--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346

Thanks to all who responded so far. I got some more insights from Mike
Stonebraker himself in the USENIX talk Scott pointed to before.
I'd like to revise a little the four points I enumerated in my
initial question, and to sort out what PG already does or could do:

1. Buffering Pool

To get rid of I/O bounds Mike proposes in-memory database structures.
He argues that it's impossible for the "old elephants" to implement this,
because it would be a huge code rewrite: there is also a need to
store memory-oriented structures (instead of disk-oriented structures).
Now I'm still wondering why PG couldn't realize that, probably in
combination with unlogged tables? I don't have an overview of the respective
code, but I think it's worthwhile to discuss even if implementing
memory-oriented structures would be too difficult.

2. Locking

This critique obviously doesn't hold for PG since we have MVCC here already.

3. WAL logging

Here Mike proposes replication over several nodes as an alternative to
WAL, which fits nicely with High Availability. PG 9 has built-in
replication, but just not for unlogged tables :-<

4. Latches

This is an issue I had never heard of before. I found some notion of latches
in the code, but it doesn't seem to be related to concurrently accessing
btree structures as Mike suggests.
So if anyone could confirm that this problem exists and produces overhead,
I'd be interested to hear.
Mike proposes single threads running on many cores, where each core
processes a non-overlapping shard.
But he also calls for ideas for inventing btrees which can be processed
concurrently with as few memory locks as possible (instead of looking
to make btrees faster).

So to me the bottom line is that PG has already reduced the overhead at
least for issue #2 and perhaps for #4.
That leaves the issues of in-memory optimization (#1) and replication (#3)
together with High Availability to be investigated in PG.

Yours, Stefan


2012/2/26 Karsten Hilbert <Karsten.Hilbert@gmx.net>:
> On Sun, Feb 26, 2012 at 08:37:54AM -0600, Andy Colson wrote:
>
>>> 3. WAL logging
>>
>> PG writes a transaction twice: once to WAL and once to
>> the DB.  WAL is a simple and quick write, and is only ever
>> used if your computer crashes and PG has to replay
>> transactions to get the db into a good/known state.  It's a
>> safety measure that doesn't really take much time, and I
>> don't think I've heard of anyone being WAL bound.  Although
>> it does increase I/O ops, it's not the biggest usage of I/O.
>> This one falls under "let's be safe", which is something NoSQL
>> did away with.  It's not something I want to give up,
>> personally.  I like using a net.
>
> And, one could still effectively disable WAL by using
> unlogged tables.
>
> Karsten
> --
> GPG key ID E4071346 @ gpg-keyserver.de
> E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general

On Sun, Feb 26, 2012 at 1:11 PM, Stefan Keller <sfkeller@gmail.com> wrote:

> So to me the bottom line is that PG has already reduced the overhead at
> least for issue #2 and perhaps for #4.
> That leaves the issues of in-memory optimization (#1) and replication (#3)
> together with High Availability to be investigated in PG.

Yeah, the real "problem" pg has to deal with is that it writes to
disk, and expects that to provide durability, while voltdb (Mike's db
project) writes to multiple machines in memory and expects that to be
durable.  No way a disk subsystem is gonna compete with an in memory
cluster for performance.



On Sun, Feb 26, 2012 at 12:11 PM, Stefan Keller <sfkeller@gmail.com> wrote:
> Thanks to all who responded so far. I got some more insights from Mike
> Stonebraker himself in the USENIX talk Scott pointed to before.
> I'd like to revise a little the four points I enumerated in my
> initial question, and to sort out what PG already does or could do:
>
> 1. Buffering Pool
>
> To get rid of I/O bounds Mike proposes in-memory database structures.
> He argues that it's impossible for the "old elephants" to implement this,
> because it would be a huge code rewrite: there is also a need to
> store memory-oriented structures (instead of disk-oriented structures).
> Now I'm still wondering why PG couldn't realize that, probably in
> combination with unlogged tables? I don't have an overview of the respective
> code, but I think it's worthwhile to discuss even if implementing
> memory-oriented structures would be too difficult.

The reason is that the existing data structures assume disk-based storage, so they are written to be efficient to look up on disk but not as efficient in memory.

Note that VoltDB is a niche product and Stonebraker makes this pretty clear.  However, the more interesting question is what the tradeoffs are when looking at VoltDB vs Postgres-XC.
 

> 2. Locking
>
> This critique obviously doesn't hold for PG since we have MVCC here already.
>
> 3. WAL logging
>
> Here Mike proposes replication over several nodes as an alternative to
> WAL, which fits nicely with High Availability. PG 9 has built-in
> replication, but just not for unlogged tables :-<

I find it interesting that two of the four areas he identifies have to do with durability.....
 

> 4. Latches
>
> This is an issue I had never heard of before. I found some notion of latches
> in the code, but it doesn't seem to be related to concurrently accessing
> btree structures as Mike suggests.
> So if anyone could confirm that this problem exists and produces overhead,
> I'd be interested to hear.
> Mike proposes single threads running on many cores, where each core
> processes a non-overlapping shard.
> But he also calls for ideas for inventing btrees which can be processed
> concurrently with as few memory locks as possible (instead of looking
> to make btrees faster).
>
> So to me the bottom line is that PG has already reduced the overhead at
> least for issue #2 and perhaps for #4.
> That leaves the issues of in-memory optimization (#1) and replication (#3)
> together with High Availability to be investigated in PG.


If he were looking at PostgreSQL for #4, I think that would be stuff like waiting for semaphores.  I suspect that since this work is probably really minimal and PostgreSQL is single-threaded per process, the overhead in this area would be low.

The issue seems to be concurrent access to shared data structures, which is a problem particularly when you start looking at multithreaded backends.

Best Wishes,
Chris Travers
Hi,

2012/2/27 Chris Travers <chris.travers@gmail.com> wrote:
>> 1. Buffering Pool
>>
>> To get rid of I/O bounds Mike proposes in-memory database structures.
...
>> Now I'm still wondering why PG could'nt realize that probably in
>> combination with unlogged tables? I don't overview the respective code
>> but I think it's worthwhile to discuss even if implementation of
>> memory-oriented structures would be to difficult.
>
>
> The reason is that the data structures assume disk-based data structures, so
> they are written to be efficient to look up on disk but not as efficient in
> memory.

That means that this could be enhanced in PG.
Is there really no research or implementation project going on in the
direction of holding all table content in memory? This could be
enhanced in many ways (besides optimized in-memory structures),
including index-only scans.
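As a sketch of the index-only-scan idea (the feature did arrive in PostgreSQL 9.2; table and index names here are placeholders):

```sql
-- Sketch: from 9.2 onward, an index covering every referenced column can
-- answer a query without touching the heap, provided the visibility map
-- marks the pages as all-visible.
CREATE INDEX measurements_city_temp_idx
    ON measurements (city, temperature);

VACUUM measurements;  -- keeps the visibility map current

EXPLAIN SELECT city, temperature
FROM measurements
WHERE city = 'Zurich';
-- A plan node reading "Index Only Scan using measurements_city_temp_idx"
-- indicates the heap fetch was skipped.
```

So part of the "hold everything in memory" benefit can already be approximated by keeping a covering index hot in cache.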

> Note that VoltDB is a niche product and Stonebraker makes this pretty
> clear.  However, the more interesting question is what the tradeoffs are
> when looking at VoltDB vs Postgres-XC.

Ok, that's interesting too. Postgres-XC (eXtensible
Cluster) is a multi-master, write-scalable PostgreSQL cluster based on a
shared-nothing architecture.
But I'm thinking more about enhancing the PostgreSQL core.

VoltDB a niche product? Look at his USENIX conference speech
(http://www.youtube.com/watch?v=uhDM4fcI2aI ) at minute 11:53: there he
puts VoltDB into "high OLTP" and the "old elephants" into the "low" category,
which only get the crevices and which have "The Innovator's Dilemma".

My thesis is that PG doesn't necessarily have that problem because
it's open source and can be (and has been) refactored over time.

Yours, Stefan



On Mon, Feb 27, 2012 at 3:46 AM, Stefan Keller <sfkeller@gmail.com> wrote:
> Hi,
>
> 2012/2/27 Chris Travers <chris.travers@gmail.com> wrote:
>>> 1. Buffering Pool
>>>
>>> To get rid of I/O bounds Mike proposes in-memory database structures.
> ...
>>> Now I'm still wondering why PG couldn't realize that, probably in
>>> combination with unlogged tables? I don't have an overview of the respective
>>> code, but I think it's worthwhile to discuss even if implementing
>>> memory-oriented structures would be too difficult.
>>
>>
>> The reason is that the existing data structures assume disk-based storage, so
>> they are written to be efficient to look up on disk but not as efficient in
>> memory.
>
> That means that this could be enhanced in PG.
> Is there really no research or implementation project going on in the
> direction of holding all table content in memory? This could be
> enhanced in many ways (besides optimized in-memory structures),
> including index-only scans.


So if I want to read a row in the middle of a table and it is currently on disk, I have to read the whole thing into memory first?  I see so many problems with trying to keep in-memory structures on disk that I just don't see that happening.

>> Note that VoltDB is a niche product and Stonebraker makes this pretty
>> clear.  However, the more interesting question is what the tradeoffs are
>> when looking at VoltDB vs Postgres-XC.
>
> Ok, that's interesting too. Postgres-XC (eXtensible
> Cluster) is a multi-master, write-scalable PostgreSQL cluster based on a
> shared-nothing architecture.
> But I'm thinking more about enhancing the PostgreSQL core.
>
> VoltDB a niche product? Look at his USENIX conference speech
> (http://www.youtube.com/watch?v=uhDM4fcI2aI ) at minute 11:53: there he
> puts VoltDB into "high OLTP" and the "old elephants" into the "low" category,
> which only get the crevices and which have "The Innovator's Dilemma".

Strangely, he doesn't consider PostgreSQL to be an elephant.  His point is, as I understood it, that open source databases like PostgreSQL will drive the proprietary equivalents out of the market because those cannot reach these specific challenges.  That's why the crevices exist between open source and the niche corners of his triangle.
 

> My thesis is that PG doesn't necessarily have that problem because
> it's open source and can be (and has been) refactored over time.


If you watch the Usenix speech closely you will notice something.

Two of the major sources of overhead have to do with durability (you know, writing to the hard disk), and the other two have to do with concurrency.  His solution is to get rid of intra-machine durability (durability only exists distributed throughout the cluster in VoltDB), and also to get rid of concurrency.  As he puts it, there's no need to deschedule a process that's going to run for only 50 microseconds.  This is why this is a niche product.  It requires a specialized environment to run with real ACID compliance.  This isn't something like hydraulics in backhoes.  It's a fundamental attribute of their innovative architecture.

There are a heck of a lot of environments where this sort of approach really doesn't make sense.  In fact I would say that the majority of databases deployed in the future will be in areas where it doesn't make sense.  I'd say, realistically, that you have to be at least close to the edge of what you can reasonably do on modern hardware with new versions of PostgreSQL before it makes sense to consider that tradeoff.  That means, what, 100,000 transactions per second?

For the sorts of applications I write, the costs of going with something like VoltDB would easily eclipse the benefits in every single deployment, and moreover this is not due to questions of the maturity of the technology but rather of fundamental design choices.  There are a lot of really, really cool things VoltDB would be good at doing (from state handling of MMORPGs to handling of huge numbers of inputs from a million sensors in real time, and providing reporting data on these).

But those are niche markets.

Best Wishes,
Chris Travers
On 02/27/2012 06:19 AM, Chris Travers wrote:
>
>
> For the sorts of applications I write, the costs of going with something like VoltDB would easily eclipse the
> benefits in every single deployment, and moreover this is not due to questions of the maturity of the technology but
> rather of fundamental design choices.  There are a lot of really, really cool things VoltDB would be good at doing (from
> state handling of MMORPGs to handling of huge numbers of inputs from a million sensors in real time, and providing
> reporting data on these).
>
> But those are niche markets.
>
> Best Wishes,
> Chris Travers

Yeah, same with me.  I use PG as both OLTP and data warehouse.  My problems are not Facebook problems, and PG's
transaction rate is much faster than I need (now or in the foreseeable future).

I could not really use VoltDB because I'd like to have a very long history (5 gig and counting).  In his little
triangle I sit pretty squarely in the middle.  I want some OLTP and some warehouse, and PG is fast enough at both.

I just watched the video about CERN (also in the USENIX list alongside Stonebraker's), and they describe their DB usage,
where they collect 4 billion rows of history a day on the heat and alignment of magnets.  VoltDB won't work there
either.  (As Mike himself says, it's not for warehousing; don't use it as such.)

I'm just saying these are different tools for different jobs.  Your in-memory tables won't help me; I don't really want
them.  I'd rather have core development time on other things.

It's just like different programming languages... use the right tool for the right job.

-Andy

On Mon, Feb 27, 2012 at 04:19:10AM -0800, Chris Travers wrote:
>
> Strangely, he doesn't consider PostgreSQL to be an elephant.

The last presentation I saw from the Volt guys explicitly mentioned
Postgres as one of the elephants.

Also, if you read their literature carefully, you discover that the
reason there's no locking overhead is that there are no locks;
basically, if you're joining a lot, you're thrown back on
old-fashioned locks of some sort.  They also don't permit
in-transaction round trips to the application, so that source of lock
contention is also gone.

A

--
Andrew Sullivan
ajs@crankycanuck.ca

Hi Scott

2012/2/26 Scott Marlowe <scott.marlowe@gmail.com>:
> On Sun, Feb 26, 2012 at 1:11 PM, Stefan Keller <sfkeller@gmail.com> wrote:
>
>> So to me the bottom line is that PG has already reduced the overhead at
>> least for issue #2 and perhaps for #4.
>> That leaves the issues of in-memory optimization (#1) and replication (#3)
>> together with High Availability to be investigated in PG.
>
> Yeah, the real "problem" pg has to deal with is that it writes to
> disk, and expects that to provide durability, while voltdb (Mike's db
> project) writes to multiple machines in memory and expects that to be
> durable.  No way a disk subsystem is gonna compete with an in memory
> cluster for performance.

That's the point where I'd like to ask for ideas on how to extend PG
to manage "in-memory tables"!

To me it's obvious that memory becomes cheaper and cheaper, while PG
is still designed with low memory in mind.

In my particular scenario I can even set durability aside, since I
write once and read 1000 times. My main problem is heavy geometry
calculations on geospatial data (like the ST_Relate or ST_Intersection
functions), which I expect to run close to the data and in memory. I don't
want PG to let table rows be pushed to disk just to free
memory beforehand (because of the "low memory" assumption).

-Stefan

On Mon, Feb 27, 2012 at 1:31 PM, Stefan Keller <sfkeller@gmail.com> wrote:
> Hi Scott
>
> 2012/2/26 Scott Marlowe <scott.marlowe@gmail.com>:
>> On Sun, Feb 26, 2012 at 1:11 PM, Stefan Keller <sfkeller@gmail.com> wrote:
>>
>>> So to me the bottom line is that PG has already reduced the overhead at
>>> least for issue #2 and perhaps for #4.
>>> That leaves the issues of in-memory optimization (#1) and replication (#3)
>>> together with High Availability to be investigated in PG.
>>
>> Yeah, the real "problem" pg has to deal with is that it writes to
>> disk, and expects that to provide durability, while voltdb (Mike's db
>> project) writes to multiple machines in memory and expects that to be
>> durable.  No way a disk subsystem is gonna compete with an in memory
>> cluster for performance.
>
> That's the point where I'd like to ask for ideas on how to extend PG
> to manage "in-memory tables"!
>
> To me it's obvious that memory becomes cheaper and cheaper, while PG
> is still designed with low memory in mind.
>
> In my particular scenario I can even set durability aside, since I
> write once and read 1000 times. My main problem is heavy geometry
> calculations on geospatial data (like the ST_Relate or ST_Intersection
> functions), which I expect to run close to the data and in memory. I don't
> want PG to let table rows be pushed to disk just to free
> memory beforehand (because of the "low memory" assumption).

I would imagine unlogged tables in a tablespace mounted in memory
would get you close.
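A sketch of that combination (mount point, tablespace, table, and column names are all placeholders; anything in the tablespace is gone after a reboot, which may be acceptable for write-once, reloadable data like this):

```sql
-- Assumes the administrator has already mounted a tmpfs, e.g.:
--   mount -t tmpfs -o size=8G tmpfs /mnt/pg_ram
-- with the directory owned by the postgres user.
CREATE TABLESPACE ramspace LOCATION '/mnt/pg_ram';

-- Unlogged: no WAL.  In tmpfs: no disk I/O at all.
CREATE UNLOGGED TABLE osm_polygons (
    id   bigint PRIMARY KEY,
    way  text   -- would be a PostGIS geometry column in the real schema
) TABLESPACE ramspace;
```

The table would need to be repopulated after any crash or restart, so a reload script belongs next to this setup.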