Re: Make COPY format extendable: Extract COPY TO format implementations
От | Masahiko Sawada |
---|---|
Тема | Re: Make COPY format extendable: Extract COPY TO format implementations |
Дата | |
Msg-id | CAD21AoBa0Wm3C2H12jaqkvLidP2zEhsC+gf=3w7XiA4LQnvx0g@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: Make COPY format extendable: Extract COPY TO format implementations (Masahiko Sawada <sawada.mshk@gmail.com>) |
Список | pgsql-hackers |
On Mon, Jul 28, 2025 at 12:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Fri, Jul 18, 2025 at 3:05 AM Sutou Kouhei <kou@clear-code.com> wrote: > > > > Hi, > > > > In <CAD21AoAZL2RzPM4RLOJKm_73z5LXq2_VOVF+S+T0tnbjHdWTFA@mail.gmail.com> > > "Re: Make COPY format extendable: Extract COPY TO format implementations" on Thu, 17 Jul 2025 13:44:11 -0700, > > Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > >> > How about adding accessors instead of splitting > > >> > Copy{From,To}State to Copy{From,To}ExecutionData? If we use > > >> > the accessors approach, we can export only needed > > >> > information step by step without breaking ABI. > > > > > > Yeah, while it can export required fields without breaking ABI, I'm > > > concerned that setter and getter functions could be bloated if we need > > > to have them for many fields. > > > > In general, I choose this approach in my projects even when > > I need to define many accessors. Because I can hide > > implementation details from users. I can change > > implementation details without breaking API/ABI. > > > > But PostgreSQL isn't my project. Is there any guideline for > > PostgreSQL API(/ABI?) design that we can refer for this > > case? > > As far as I know there is no official guideline for PostgreSQL API and > ABI design, but I've never seen structs having more than 10 getter and > setter in PostgreSQL source code. > > > > > FYI: We need to export at least the following fields: > > > > https://www.postgresql.org/message-id/flat/20250714.173803.865595983884510428.kou%40clear-code.com#78fdbccf89742f856aa2cf95eaf42032 > > > > > FROM: > > > > > > - attnumlist (*) > > > - bytes_processed > > > - cur_attname > > > - escontext > > > - in_functions (*) > > > - input_buf > > > - input_reached_eof > > > - line_buf > > > - opts (*) > > > - raw_buf > > > - raw_buf_index > > > - raw_buf_len > > > - rel (*) > > > - typioparams (*) > > > > > > TO: > > > > > > - attnumlist (*) > > > - fe_msgbuf > > > - opts (*) > > I think we can think of the minimum list of fields that we need to > expose. For instance, fields used for buffered reads for COPY FROM > such as input_buf and raw_buf related fields don't necessarily need to > be exposed as extension can implement it in its own way. We can start > with the supporting simple copy format extensions like that read and > parse the binary data from the data source and fill 'values' and > 'nulls' arrays as output. Considering these facts, it might be > sufficient for copy format extensions if they could access 'rel', > 'attnumlist', and 'opts' in both COPY FROM and COPY TO (and > CopyFromErrorCallback related fields for COPY FROM). > > Apart from this, we might want to reorganize CopyFromStateData fields > and CopyToStateData fields since they have mixed fields of general > purpose fields for COPY operations (e.g., num_defaults, whereClause, > and range_table) and built-in format specific fields (e.g., line_buf > and input_buf). Text and CSV formats are using some fields for parsing > fields with buffered reads so one idea is that we move related fields > to another struct so that both built-in formats (text and CSV) and > external extensions that want to use the buffered reads for text > parsing can use this functionality. So probably it might be worth refactoring the codes in terms of: 1. hiding internal data from format callbacks 2. separating format-specific fields from the main state data. I categorized the fields in CopyFromStateData. I think there are roughly three different kind of fields mixed there: 1. fields used only the core (not by format callback) - filename - is_program - whereClause - cur_relname - copycontext - defmap - num_defaults - volatile_defexprs - range_table - rtrperminfos - qualexpr - transition_capture 2. fields used by both the core and format callbacks - rel - attnumlist - cur_lineno - cur_attname - cur_attval - relname_only - num_errors - opts - in_functions - typioparams - escontext - defexprs - Input-related fields - copy_src - copy_file - fe_msgbuf - data_source_cb - byteprocessed 3. built-in format specific fields (mostly for text and csv) - eol_type - defaults - Encoding related fields - file_encoding - need_transcoding - conversion_proc - convert_select_flags - raw data pointers - max_fields - raw_fields - attribute_buf - line_buf related fields - line_buf - line_buf_valid - input_buf related fields - input_buf - input_buf_index - input_buf_len - input_reached_eof - input_reached_error - raw_buf related fields - raw_buf - raw_buf_index - raw_buf_len - raw_reached_eof The fields in 1 are mostly static fields, and the fields in 2 and 3 are likely to be accessed in hot functions during COPY FROM. Would it be a good idea to restructure these fields so that we can hide the fields in 1 from callback functions and having the fields in 3 in a separate format-specific struct that can be accessed via an opaque pointer? But could the latter change potentially cause performance overheads? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
В списке pgsql-hackers по дате отправления: