Discussion: Columnar format export in Postgres
Hey Postgres team,
I have been working on adding support for columnar format export to Postgres to speed up analytics queries.
I've created an extension that achieves this functionality here.
I"m looking to improve the performance of this extension to enable drop-in analytics support for Postgres. Some immediate improvements I have in mind are:
- Reduce memory consumption when exporting table data to columnar format
- Create a native planner / execution hook that can read columnar data with vectorised operations.
It would be very helpful if you could take a look and suggest improvements to the extension.
Hopefully, this extension can be shipped by default with Postgres at some point in the future.
Thanks,
Sushrut
On Wed, Jun 12, 2024 at 1:56 PM Sushrut Shivaswamy <sushrut.shivaswamy@gmail.com> wrote:
If you want to have any hope, the license must be BSD.
GPL is incompatible.
best regards,
Ranier Vilela
Hi,

In <CAH5mb9_YW76_0xBU2T4C7HF33b+b2w3QBtV50_ZZMy8SA8ChjA@mail.gmail.com>
"Columnar format export in Postgres" on Wed, 12 Jun 2024 22:26:30 +0530,
Sushrut Shivaswamy <sushrut.shivaswamy@gmail.com> wrote:

> I have been working on adding support for columnar format export to
> Postgres to speed up analytics queries.

FYI: I'm proposing making the COPY format extendable:

* https://www.postgresql.org/message-id/flat/20231204.153548.2126325458835528809.kou@clear-code.com
* https://commitfest.postgresql.org/48/4681/

If it's accepted, we can implement extensions for COPY FORMAT arrow and COPY FORMAT parquet. With these extensions, we can use file_fdw to read Apache Arrow and Apache Parquet files, because file_fdw is based on COPY FROM: https://www.postgresql.org/docs/current/file-fdw.html

If you're interested in this proposal, you can review the latest proposed patch set to help it proceed.

> - Reduce memory consumption when exporting table data to columnar format

The COPY support above will help with this.

Thanks,
--
kou
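For a sense of what the proposal enables, exporting to an extension-provided format could look roughly like the sketch below. Both the extension name `copy_parquet` and the `parquet` format keyword are assumptions for illustration; nothing like this exists in core today.

```sql
-- Hypothetical: assumes an extension registers a 'parquet' COPY format
-- through the proposed extensible COPY API (not available in core today).
CREATE EXTENSION copy_parquet;  -- hypothetical extension name

-- The format name would be resolved to the handler the extension registered.
COPY sales TO '/var/lib/postgresql/export/sales.parquet' (FORMAT 'parquet');
```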
Thanks for the response.
I had considered using COPY TO to export columnar data but gave up on it since the formats weren't extensible.
It's great to see that you are making it extensible.
I'm still going through the thread of comments on your patch but I have some early thoughts about using it for columnar data export.
- To maintain data freshness there would need to be a way to schedule exports with `COPY TO` in the `parquet` format periodically.
- pg_analytica already has the scheduling logic; once this is available, `COPY TO` can be used to export the data instead of the chunked table reads used currently.
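As a sketch of the scheduling piece, a periodic export could be driven by pg_cron (`cron.schedule()` is pg_cron's actual API; the `parquet` COPY format remains hypothetical and depends on the extensible-COPY proposal):

```sql
-- Refresh the columnar export nightly at 03:00 using pg_cron.
-- cron.schedule() is real pg_cron API; the 'parquet' COPY format is a
-- hypothetical placeholder pending the extensible-COPY work.
SELECT cron.schedule(
  'export-sales-parquet',
  '0 3 * * *',
  $$COPY sales TO '/var/lib/postgresql/export/sales.parquet' (FORMAT 'parquet')$$
);
```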
- To facilitate efficient querying it would help to export multiple parquet files for the table instead of a single file.
Having multiple files allows queries to skip chunks when the key range in a chunk does not match the query's filter criteria.
Even within a chunk, it would help to be able to configure the size of a row group.
- I'm not sure how these parameters will be exposed within `COPY TO`.
Or maybe the extension implementing the `COPY TO` handler will allow this configuration?
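To make the question concrete, if the handler extension were allowed to accept custom options, the configuration might look like this. Both option names are invented purely to illustrate the kind of knobs meant here:

```sql
-- Hypothetical: 'row_group_size' and 'max_file_size' are invented option
-- names illustrating how a format handler might expose tuning parameters.
COPY sales TO '/var/lib/postgresql/export/sales' (
  FORMAT 'parquet',
  row_group_size '100000',
  max_file_size  '256MB'
);
```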
- Regarding using file_fdw to read Apache Arrow and Apache Parquet file because file_fdw is based on COPY FROM:
- I'm not too clear on this. file_fdw seems to allow creating a table from data on disk exported using COPY TO.
But is the newly created table still backed by the data on disk (perhaps in columnar format or CSV), or does it just read that data once to create a row-based table?
I'm not aware of any capability in the postgres planner to read columnar files currently without using an extension like parquet_fdw.
- For your use case, how do you plan to query the Arrow / Parquet data?
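For reference, today's file_fdw flow over a CSV file produced by COPY TO looks like this (this is real, current syntax; a `format 'parquet'` variant would only become possible once the COPY format is extensible):

```sql
-- Real file_fdw usage against a CSV file written by COPY TO.
CREATE EXTENSION file_fdw;
CREATE SERVER export_files FOREIGN DATA WRAPPER file_fdw;

CREATE FOREIGN TABLE sales_export (
  id    integer,
  total numeric
) SERVER export_files
  OPTIONS (filename '/var/lib/postgresql/export/sales.csv', format 'csv');

-- The file is parsed on each scan; the data is not copied into a
-- row-store table.
SELECT * FROM sales_export WHERE total > 100;
```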
> If you want to have any hope, the license must be BSD.
> GPL is incompatible.
Ack, will update the license to BSD. Thanks.
On Wed, Jun 12, 2024 at 10:49 PM Ranier Vilela <ranier.vf@gmail.com> wrote:
Hi,

In <CAH5mb98Dq7ssrQq9n5yW3G1YznH=Q7VvOZ20uhG7Vxg33ZBLDg@mail.gmail.com>
"Re: Columnar format export in Postgres" on Thu, 13 Jun 2024 22:30:24 +0530,
Sushrut Shivaswamy <sushrut.shivaswamy@gmail.com> wrote:

> - To facilitate efficient querying it would help to export multiple
>   parquet files for the table instead of a single file.
>   Having multiple files allows queries to skip chunks if the key range in
>   the chunk does not match query filter criteria.
>   Even within a chunk it would help to be able to configure the size of a
>   row group.
> - I'm not sure how these parameters will be exposed within `COPY TO`.
>   Or maybe the extension implementing the `COPY TO` handler will
>   allow this configuration?

Yes. But adding support for custom COPY TO options is out of scope for the first version. We will focus on only the minimal features in the first version. We can improve it later based on use cases. See also: https://www.postgresql.org/message-id/20240131.141122.279551156957581322.kou%40clear-code.com

> - Regarding using file_fdw to read Apache Arrow and Apache Parquet files
>   because file_fdw is based on COPY FROM:
>   - I'm not too clear on this. file_fdw seems to allow creating a table
>     from data on disk exported using COPY TO.

Correct.

> But is the newly created table still using the data on disk (maybe in
> columnar format or csv) or is it just reading that data to create a row
> based table.

The former.

> I'm not aware of any capability in the postgres planner to read
> columnar files currently without using an extension like parquet_fdw.

Correct. We still need another approach, such as parquet_fdw combined with the extensible COPY format feature, to optimize queries against Apache Parquet data. file_fdw can just read Apache Parquet data via SELECT. Sorry for confusing you.

Thanks,
--
kou