Re: I'd like to discuss scaleout at PGCon

From Konstantin Knizhnik
Subject Re: I'd like to discuss scaleout at PGCon
Date
Msg-id a147739a-dd03-73e1-0187-1bafa14dec5e@postgrespro.ru
In reply to Re: I'd like to discuss scaleout at PGCon  ("MauMau" <maumau307@gmail.com>)
Responses Re: I'd like to discuss scaleout at PGCon  (Pavel Stehule <pavel.stehule@gmail.com>)
RE: I'd like to discuss scaleout at PGCon  ("Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com>)
List pgsql-hackers

On 05.06.2018 20:17, MauMau wrote:
> From: Merlin Moncure
>> FWIW, Distributed analytical queries is the right market to be in.
>> This is the field in which I work, and this is where the action is
>> at.  I am very, very, sure about this.  My view is that many of the
>> existing solutions to this problem (in particular hadoop class
>> solutions) have major architectural downsides that make them
>> inappropriate in use cases that postgres really shines at; direct
>> hookups to low latency applications for example.  postgres is
>> fundamentally a more capable 'node' with its multiple man-millennia
>> of engineering behind it.  Unlimited vertical scaling (RAC etc) is
>> interesting too, but this is not the way the market is moving as
>> hardware advancements have reduced or eliminated the need for that
>> in many spheres.
> I'm feeling the same.  As Moore's Law ceases to hold, software
> needs to make the most of the processor power.  Hadoop and Spark are
> written in Java and Scala.  According to Google [1] (see Fig. 8), Java
> is slower than C++ by 3.7x - 12.6x, and Scala is slower than C++ by
> 2.5x - 3.6x.
>
> Won't PostgreSQL be able to cover the workloads of Hadoop and Spark
> someday, when PostgreSQL supports scaleout, in-memory database,
> multi-model capability, and in-database filesystem?  That may be a
> pipedream, but why do people have to tolerate the separation of the
> relational-based data warehouse and Hadoop-based data lake?
>
>
> [1]    Robert Hundt. "Loop Recognition in C++/Java/Go/Scala".
> Proceedings of Scala Days 2011
>
> Regards
> MauMau
>
>
I cannot completely agree with this. I have done a lot of benchmarking
of PostgreSQL, CitusDB, SparkSQL and native C/Scala code generated for
TPC-H queries.
The picture is not so obvious... All these systems scale differently
and so show their best performance on different hardware
configurations.
Also, the Java JIT has made good progress since 2011.
Computation-intensive code (like matrix multiplication) implemented in
Java is about 2 times slower than optimized C code.
But DBMSes are rarely CPU bound. Even if the whole database fits in
memory (which is not such a common scenario for big data applications),
a modern CPU is much faster than RAM access... Java applications are
slower than C/C++ mostly because of garbage collection. This is why
SparkSQL is moving to an off-heap approach, where objects are allocated
outside the Java heap and so do not affect the Java GC.  New versions of
SparkSQL with off-heap memory and native code generation show very good
performance. And high scalability has always been one of the major
features of SparkSQL.

So it is naive to expect that Postgres will be 4 times faster than
SparkSQL on analytic queries just because it is written in C and
SparkSQL in Scala.
Postgres has made very good progress in OLAP support in recent
releases: it now supports parallel query execution, JIT, partitioning...
But its scalability is still very limited compared with SparkSQL. I am
not sure about Greenplum with its sophisticated distributed query
optimizer, but most other OLAP solutions for Postgres are not able to
efficiently handle complex queries (with a lot of joins on
non-partitioning keys).
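
As a rough illustration of why joins on non-partitioning keys are hard
for sharded systems, here is a minimal Python sketch. It is not taken
from any of the systems named above; the cluster size NODES, the toy
orders table and the helpers partition() and rows_to_move() are all
hypothetical. It hash-distributes rows by one column and counts how many
rows would have to cross the network before a join on a given column
could run locally on each node.

```python
from collections import defaultdict

NODES = 2  # hypothetical cluster size


def partition(rows, key):
    """Hash-distribute rows across NODES by the given column."""
    shards = defaultdict(list)
    for row in rows:
        shards[hash(row[key]) % NODES].append(row)
    return shards


def rows_to_move(shards, key):
    """Count rows that are not on the node a join on `key` needs them on."""
    return sum(
        1
        for node, rows in shards.items()
        for row in rows
        if hash(row[key]) % NODES != node
    )


# Toy table, distributed by order_id (the partitioning key).
orders = [
    {"order_id": 1, "customer_id": 2},
    {"order_id": 2, "customer_id": 1},
    {"order_id": 3, "customer_id": 4},
    {"order_id": 4, "customer_id": 3},
]
shards = partition(orders, "order_id")

local_join_moves = rows_to_move(shards, "order_id")       # join on partitioning key
reshuffle_moves = rows_to_move(shards, "customer_id")     # join on another key
```

In this toy data a join on order_id needs no data movement, while a join
on customer_id would need every row to be sent to another node first. A
real distributed optimizer has more options (for example broadcasting a
small table), which this sketch does not model.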

I do not want to say that it is impossible to implement a good analytic
platform for OLAP on top of Postgres. But it is a very challenging task.
And IMHO the choice of programming language is not so important. What is
more important is the data storage format. The best systems for data
analytics (Vertica, HyPer, KDB, ...) use a vertical (columnar) data
model. SparkSQL also uses the Parquet file format, which provides
efficient extraction and processing of data.
With the abstract storage API, Postgres also gets a chance to implement
efficient storage for OLAP data processing. But a huge amount of work
has to be done here.
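
To make the storage format point concrete, here is a minimal Python
sketch of the two layouts. It is only an illustration, not how Postgres,
Parquet or any of the systems above store data; the toy table and the
names rows, columns, sum_price_rowwise() and sum_price_columnar() are
made up. The same table is kept row by row and column by column, and one
aggregate is computed over each.

```python
from array import array

# Hypothetical toy table: (order_id, quantity, price).
rows = [(1, 5, 10.0), (2, 3, 20.0), (3, 7, 30.0)]


def sum_price_rowwise(rows):
    """Row (horizontal) layout: every attribute of every record is read."""
    total = 0.0
    for order_id, quantity, price in rows:
        total += price
    return total


# Column (vertical) layout: one contiguous typed array per attribute.
columns = {
    "order_id": array("q", (r[0] for r in rows)),
    "quantity": array("q", (r[1] for r in rows)),
    "price": array("d", (r[2] for r in rows)),
}


def sum_price_columnar(columns):
    """Only the price array is scanned; other columns are never touched."""
    return sum(columns["price"])
```

Both functions return the same sum, but the columnar version reads only
the data the query needs, which is what matters for analytic queries
that touch a few columns of a wide table.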

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


