Discussion: Idea to enhance pgbench by more modes to generate data (multi-TXNs, UNNEST, COPY BINARY)

Hello hackers,

For some of my specific hardware tests I needed to generate big databases well beyond RAM size, so I turned to the pgbench tool and its two default modes of client- and server-side data generation for TPC-B tests. With a "scale" factor in the range of a few thousand (e.g., 3000 - 5000), the data generation phase takes quite some time. I saw this as an opportunity to prove or disprove a few hypotheses:

- will INSERT mode work faster if we commit once per "scale" unit and turn the single INSERT into a "for" loop with commits for the 3 tables at the end of each loop
- will "INSERT .. SELECT FROM unnest" be faster than "INSERT .. SELECT FROM generate_series"
- will BINARY mode work faster than TEXT even though we send much more data
- and so on
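To illustrate the first hypothesis, here is a rough sketch of committing once per scale unit instead of once for the whole load. This is illustrative Python with the stdlib sqlite3 module standing in for PostgreSQL (the actual patch is C over libpq, and the function and table names here are made up for the example); the 100,000 rows per scale unit matches pgbench's accounts table:

```python
import sqlite3

def generate_accounts(conn, scale, rows_per_unit=100_000):
    """Populate a toy accounts table, committing once per "scale" unit.

    In pgbench, one scale unit corresponds to 100,000 pgbench_accounts
    rows; committing per unit keeps each transaction bounded instead of
    holding one huge transaction for the entire data set.
    """
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS accounts(aid INTEGER, abalance INTEGER)")
    for unit in range(scale):
        base = unit * rows_per_unit
        cur.executemany(
            "INSERT INTO accounts VALUES (?, 0)",
            ((base + i,) for i in range(rows_per_unit)),
        )
        conn.commit()  # one commit at the end of each scale unit
```

Whether per-unit commits actually win depends on WAL and checkpoint behavior on the server, which is exactly what the benchmark is meant to measure.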
As a result of my experiments I produced a substantial patch for the pgbench utility and thought that it might be of interest not just to me. Therefore I'm sending a draft version of it as a diff against the current development tree on GitHub. As of November 11, 2025 it merges cleanly with the project's main branch on GitHub.

Spoiler alert: "COPY FROM BINARY" is significantly faster than current "COPY FROM TEXT"
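For context on what "COPY FROM BINARY" means on the wire: PostgreSQL's COPY binary format starts with a fixed 11-byte signature, a flags field, and a header-extension length, followed by one tuple per row (int16 field count, then per field an int32 byte length and the big-endian payload) and a -1 trailer. A minimal sketch of encoding one pgbench_accounts-style row (aid/bid/abalance as int4 plus the filler column sent as text bytes) might look like this in Python; this is an illustration of the documented format, not the patch itself, which does the equivalent in C:

```python
import struct

# Fixed signature from the COPY binary-format documentation
PGCOPY_SIGNATURE = b"PGCOPY\n\xff\r\n\x00"

def copy_binary_header() -> bytes:
    # signature + int32 flags (0) + int32 header-extension length (0)
    return PGCOPY_SIGNATURE + struct.pack("!ii", 0, 0)

def encode_row(aid: int, bid: int, abalance: int, filler: str) -> bytes:
    # int16 field count, then per field: int32 length + big-endian payload
    parts = [struct.pack("!h", 4)]
    for v in (aid, bid, abalance):
        parts.append(struct.pack("!i", 4))   # int4 occupies 4 bytes on the wire
        parts.append(struct.pack("!i", v))
    f = filler.encode("ascii")               # char(84) payload sent as raw bytes
    parts.append(struct.pack("!i", len(f)))
    parts.append(f)
    return b"".join(parts)

def copy_binary_trailer() -> bytes:
    return struct.pack("!h", -1)             # end-of-data marker
```

Note that each int4 costs 8 bytes here (length word plus payload) versus a few digits in text mode, which is why it is not obvious in advance that binary wins; the server-side parsing it saves is apparently what makes the difference.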

I would be happy to polish it if there is interest in such a change.

Cheers
Attachments
Hi Boris,

On Wed, Nov 12, 2025 at 3:25 AM Boris Mironov <boris_mironov@outlook.com> wrote:
>
> Hello hackers,
>
> For some of my specific hardware tests I needed to generate big databases well beyond RAM size. Hence I turned to
> pgbench tool and its default 2 modes for client- and server-side generation for TPC-B tests. When I use "scale" factor
> in range of few thousands (eg, 3000 - 5000) data generation phase takes quite some time. I looked at it as opportunity
> to prove/disprove 2 hypothesises:
>
> will INSERT mode work faster if we commit once every "scale" and turn single INSERT into "for" loop with commits for
> 3 tables in the end of each loop
> will "INSERT .. SELECT FROM unnest" be faster than "INSERT .. SELECT FROM generate_series"
> will BINARY mode work faster than TEXT even though we send much more data
> and so on
>
> As a result of my experiments I produced significant patch for pgbench utility and though that it might be of
> interest not just for me. Therefore I'm sending draft version of it in diff format for current development tree on
> GitHub. As of November 11, 2025 I can merge with main branch of the project on GitHub.
>
> Spoiler alert: "COPY FROM BINARY" is significantly faster than current "COPY FROM TEXT"
>
> Would be happy to polish it if there is interest to such change.

Making pgbench data initialization faster at higher scales is
desirable, and the community might be willing to accept such a change.
Running very large benchmarks is becoming common these days. However,
it's not clear what exactly you are proposing and what the performance
improvement is. Answering the following questions may help: Does your
patch implement all of the above methods? Do all of them provide a
performance improvement? If each of them performs better under certain
conditions, what are those conditions? If there is one method which
performs better than all the others, which is it, and why not implement
just that method? What performance numbers are we looking at? Can the
methods which use batch commits also run those batches in parallel?

--
Best Wishes,
Ashutosh Bapat



Hi Ashutosh,

> If there is one method that is better than all others, community will
> be more willing to accept implementation of that one method than
> multiple implementations so as to reduce maintenance burden.

Ok then. Of the three, I'll keep only the "COPY FROM STDIN BINARY" implementation.
Would you prefer to replace the original "COPY FROM STDIN TEXT" with this
code, or to add it as a new "init-step" (e.g., with code "c")?

I have also noticed that the current code doesn't reject a pgbench parameter
like "--init-steps=dtgG", which allows the data generation step to run twice.
Each of the "g" and "G" steps then reports its own timing in the status line.
Is this an oversight or intentional?

> The code in the patch does not have enough comments. It's hard to
> understand the methods just from the code. Each of the generateData*
> functions could use a prologue explaining the data generation method
> it uses.

Adding comments is not a problem at all. So far it was just "code for myself",
and I was checking whether there is any interest in the community in including it.

>> I'm sure that much more testing is required to run this code under different
>> conditions and hardware to get a better picture. So far it looks very promising.

> Sure.

Cheers,
Boris