Re: New Copy Formats - avro/orc/parquet

From: Nicolas Paris
Subject: Re: New Copy Formats - avro/orc/parquet
Date: 2018-02-11 22:02:36
Msg-id: 20180211220236.nwskn6nnrpe3zvyf@gmail.com
In reply to: Re: New Copy Formats - avro/orc/parquet  (Adrian Klaver <adrian.klaver@aklaver.com>)
List: pgsql-general
On 11 Feb 2018 at 22:19, Adrian Klaver wrote:
> On 02/11/2018 12:57 PM, Nicolas Paris wrote:
> > On 11 Feb 2018 at 21:53, Andres Freund wrote:
> > > On 2018-02-11 21:41:26 +0100, Nicolas Paris wrote:
> > > > I also have the storage and network transfer overhead in mind:
> > > > all those new formats are compressed; this is not true for the current
> > > > postgres BINARY format and obviously not for the text-based formats. In
> > > > my experience, the binary format is 10 to 30% larger than the text one.
> > > > By contrast, an ORC file can be up to 10 times smaller than a
> > > > text-based format.
> > > 
> > > That seems largely irrelevant when arguing about using PROGRAM though,
> > > right?
> > > 
> > 
> > Indeed, those storage and network transfer savings are only relevant
> > compared with the CSV/BINARY formats. They have no link with the PROGRAM
> > aspect.
> > 
> 
> Just wondering what your time frame is on this? Asking because this would be
> considered a new feature and so would need to be added to a major release of
> Postgres. Currently work is going on for Postgres version 11, to be
> released (just a guess) late Fall 2018/early Winter 2019. The
> CommitFest (https://commitfest.postgresql.org/) for this release is currently
> approximately 3/4 of the way through. Not sure that new code could make it
> in at this point. This means it would be bumped to version 12 for 2019/2020.
> 

Right now, exporting (billions of rows * hundreds of columns) from Postgres to
distributed tools such as Spark is feasible, but only at the cost of parsing,
transfer, tooling and workaround overhead, as sketched below.
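
For illustration, a minimal sketch of the kind of workaround meant here,
assuming a hypothetical table big_table and an export directory that Spark can
read from; an ORC/Parquet COPY format would avoid this text round-trip:

    -- Server-side export as compressed CSV through PROGRAM; every row still
    -- goes through text serialization and is parsed again on the Spark side.
    COPY big_table TO PROGRAM 'gzip > /data/export/big_table.csv.gz'
         WITH (FORMAT csv, HEADER);

Spark then re-parses those CSV files with its own reader, which is exactly
where the parsing and workaround overhead comes from.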

Waiting until 2020 for the opportunity to write COPY extensions would mean
actually using this feature around 2022. I mean, writing the ORC COPY
extension, extending the Postgres JDBC driver, extending the Spark JDBC
connector, all within different communities: this will be a long process.
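
To make the target concrete, this is roughly the syntax being asked for, with
the caveat that FORMAT orc is purely hypothetical and accepted by no released
Postgres version (only text, csv and binary are):

    -- Hypothetical future syntax, not an existing feature:
    COPY big_table TO '/data/export/big_table.orc' WITH (FORMAT orc);

Everything downstream (JDBC driver, Spark connector) would then still need to
learn to stream that format, which is why the end-to-end timeline looks so
long.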

But again, Postgres would be the most advanced RDBMS here, because AFAIK no
other DB deals with those distributed formats at the moment. Having in mind
that such a feature will be released one day lets one plan the place of
Postgres in a data warehouse architecture accordingly.

