Discussion: [psycopg] speed concerns with executemany()


[psycopg] speed concerns with executemany()

From
mike bayer
Date:
I'm getting more and more regular complaints among users of SQLAlchemy
of the relatively slow speed of the cursor.executemany() call in
psycopg2.   In almost all cases, these users have discovered that
Postgresql is entirely capable of running an INSERT or UPDATE of many
values with a high degree of speed using a single statement with a form
like this:

INSERT INTO table (a, b, c)
VALUES  (1, 2, 3), (4, 5, 6), (7, 8, 9), ...

whereas if they run the same form using a single VALUES insert and
sending the parameters as a sequence via executemany(), they will
observe hyperbolically slow speeds - today, someone claims they can run
approximately 200 sets of three integers each using the multiple VALUES
approach in approximately .02 seconds, whereas running 200 values into a
single executemany() call of the otherwise identical INSERT statement,
they are claiming takes 20 seconds; that is, 100000% slower.   I'm not
really sure how that's even possible, considering the single INSERT with
many VALUES is a much larger string to send over the network and be
parsed by the server; if the overhead of a single INSERT is .02 seconds,
we would think an executemany() of 200 INSERT statements, each with a
single parameter set, would take at most four seconds.
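
For concreteness, a minimal sketch of the two forms being compared, against a
hypothetical table demo(a, b, c) (the multi-VALUES string is built client side
with mogrify(); this is only an illustration, not SQLAlchemy's code):

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # hypothetical connection string
    cur = conn.cursor()
    rows = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]  # imagine ~200 of these

    # Form 1: executemany() -- psycopg2 issues one INSERT per parameter set.
    cur.executemany("INSERT INTO demo (a, b, c) VALUES (%s, %s, %s)", rows)

    # Form 2: a single INSERT with a multi-row VALUES list, built client side.
    values = ",".join(
        cur.mogrify("(%s, %s, %s)", row).decode("utf-8") for row in rows)
    cur.execute("INSERT INTO demo (a, b, c) VALUES " + values)

    conn.commit()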

Obviously something is wrong with these users' environment, although I
will note that the relative speed of psycopg2 executemany() over a 1G
network is still pretty bad, compared both to sending a single INSERT
with a large VALUES clause and to the executemany() speed of DBAPIs
(even pure Python) for other databases like MySQL. I can provide some
quick benchmarks if that's helpful.

I understand that psycopg2 does not use prepared statements, and I have
dim recollections that internal use of prepared statements for
executemany is not on the roadmap for psycopg2.  However, I'm still not
sure what I should be telling my users when I get reports of these
vastly slower results with executemany().

I'm not asking that psycopg2 change anything, I'm just looking to
understand what the heck is going on when people are reporting this.
Should I:

1. tell them they have a network issue that is causing executemany() to
have a problem?  (even though I can also observe executemany() is kind
of slow, though not as slow as these people are reporting)

2. tell them there's some known issue, vacuuming / indexes/ or something
that is known to have this effect?

3. tell them that yes, they should use multiple-VALUES within a single
INSERT (which would eventually lead to strong pressure on me to reinvent
executemany() within SQLAlchemy's psycopg2 dialect to use this form) ?
(also if so, why is this the case?   can this claimed 100000% slowdown
be real?)

4. use a hack to actually make my own prepared statements within
executemany() (I vaguely recall some recipe that you can get a prepared
statement going with psycopg2 by rolling it on the outside) ?

5. Other reasons that executemany() is known to sometimes be extremely slow?

I'm purposely trying to stay out of the realm of picking apart the libpq
internals, assuming psycopg2 devs can shed some light on what's going on
here.   Thanks for any guidance you can offer!





Re: [psycopg] speed concerns with executemany()

From
Christophe Pettus
Date:
Are you running with the transaction isolation level set to ISOLATION_LEVEL_AUTOCOMMIT?  If so, each of those INSERTs
will be in its own transaction, and thus will go through the COMMIT overhead.  That by itself wouldn't explain a jump
that large (in most environments), but it will definitely be *much* slower.
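
(For reference, a minimal sketch of the setting in question, with a hypothetical
connection string; both spellings below are standard psycopg2:)

    import psycopg2
    from psycopg2.extensions import ISOLATION_LEVEL_AUTOCOMMIT

    conn = psycopg2.connect("dbname=test")  # hypothetical connection string
    conn.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT)
    # equivalent, on psycopg2 >= 2.4.2:
    # conn.autocommit = True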

> On Dec 23, 2016, at 16:05, mike bayer <mike_mp@zzzcomputing.com> wrote:
>
> I'm getting more and more regular complaints among users of SQLAlchemy of the relatively slow speed of the
> cursor.executemany() call in psycopg2.   In almost all cases, these users have discovered that Postgresql is entirely
> capable of running an INSERT or UPDATE of many values with a high degree of speed using a single statement with a form
> like this:
>
> INSERT INTO table (a, b, c)
> VALUES  (1, 2, 3), (4, 5, 6), (7, 8, 9), ...
>
> whereas if they run the same form using a single VALUES insert and sending the parameters as a sequence via
> executemany(), they will observe hyperbolically slow speeds - today, someone claims they can run approximately 200 sets
> of three integers each using the multiple VALUES approach in approximately .02 seconds, whereas running 200 values into
> a single executemany() call of the otherwise identical INSERT statement, they are claiming takes 20 seconds; that is,
> 100000% slower.   I'm not really sure how that's even possible, considering the single INSERT with many VALUES is a much
> larger string to send over the network and be parsed by the server, if the overhead of a single INSERT is .02 seconds,
> we would think an executemany() of 200 INSERT statements each with a single parameter set would be at most four seconds.
>
> Obviously something is wrong with these users' environment, although I will note that the relative speed of psycopg2
> executemany() over a 1G network is still pretty bad, compared to both sending a single INSERT with a large VALUES clause
> as well as compared to the executemany() speed of DBAPIs (even pure Python) for other databases like MySQL, I can
> provide some quick benchmarks if that's helpful.
>
> I understand that psycopg2 does not use prepared statements, and I have dim recollections that internal use of
> prepared statements for executemany is not on the roadmap for psycopg2.  However, I'm still not sure what I should be
> telling my users when I get reports of these vastly slower results with executemany().
>
> I'm not asking that psycopg2 change anything, I'm just looking to understand what the heck is going on when people
> are reporting this. Should I:
>
> 1. tell them they have a network issue that is causing executemany() to have a problem?  (even though I can also
> observe executemany() is kind of slow, though not as slow as these people are reporting)
>
> 2. tell them there's some known issue, vacuuming / indexes/ or something that is known to have this effect?
>
> 3. tell them that yes, they should use multiple-VALUES within a single INSERT (which would eventually lead to strong
> pressure on me to reinvent executemany() within SQLAlchemy's psycopg2 dialect to use this form) ? (also if so, why is
> this the case?   can this claimed 100000% slowdown be real?)
>
> 4. use a hack to actually make my own prepared statements within executemany() (I vaguely recall some recipe that you
> can get a prepared statement going with psycopg2 by rolling it on the outside) ?
>
> 5. Other reasons that executemany() is known to sometimes be extremely slow?
>
> I'm purposely trying to stay out of the realm of picking apart the libpq internals, assuming psycopg2 devs can shed
> some light what's going on here.   Thanks for any guidance you can offer!
>
>
>
>
>
> --
> Sent via psycopg mailing list (psycopg@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/psycopg

--
-- Christophe Pettus
   xof@thebuild.com



Re: [psycopg] speed concerns with executemany()

From
Daniele Varrazzo
Date:
On Sat, Dec 24, 2016 at 1:09 AM, Christophe Pettus <xof@thebuild.com> wrote:
> Are you running with the transaction isolation level set to ISOLATION_LEVEL_AUTOCOMMIT?  If so, each of those INSERTs
> will be in its own transaction, and thus will go through the COMMIT overhead.  That by itself wouldn't explain a jump
> that large (in most environments), but it will definitely be *much* slower.

Why do you say this? Psycopg doesn't wrap statements in BEGIN/COMMIT
when in autocommit mode. Are you referring to some implicit
transaction created by the database?

-- Daniele


Re: [psycopg] speed concerns with executemany()

From
Adrian Klaver
Date:
On 12/23/2016 04:29 PM, Daniele Varrazzo wrote:
> On Sat, Dec 24, 2016 at 1:09 AM, Christophe Pettus <xof@thebuild.com> wrote:
>> Are you running with the transaction isolation level set to ISOLATION_LEVEL_AUTOCOMMIT?  If so, each of those
>> INSERTs will be in its own transaction, and thus will go through the COMMIT overhead.  That by itself wouldn't explain
>> a jump that large (in most environments), but it will definitely be *much* slower.
>
> Why do you say this? Psycopg doesn't wrap statements in BEGIN/COMMIT
> when in autocommit mode. Are you referring about some implicit

I understood it did:

http://initd.org/psycopg/docs/usage.html#transactions-control

"It is possible to set the connection in autocommit mode: this way all
the commands executed will be immediately committed and no rollback is
possible. A few commands (e.g. CREATE DATABASE, VACUUM...) require to be
run outside any transaction: in order to be able to run these commands
from Psycopg, the connection must be in autocommit mode: you can use the
autocommit property (set_isolation_level() in older versions)."

> transaction created by the database?
>
> -- Daniele
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [psycopg] speed concerns with executemany()

From
Christophe Pettus
Date:
> On Dec 23, 2016, at 16:58, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
> I understood it did:
>
> http://initd.org/psycopg/docs/usage.html#transactions-control
>
> "It is possible to set the connection in autocommit mode: this way all the commands executed will be immediately
> committed and no rollback is possible. A few commands (e.g. CREATE DATABASE, VACUUM...) require to be run outside any
> transaction: in order to be able to run these commands from Psycopg, the connection must be in autocommit mode: you can
> use the autocommit property (set_isolation_level() in older versions)."

My somewhat garbled thought was that each of the component INSERTs in the .executemany would be getting its own
transaction unless the connection was set to autocommit... but I'll admit I haven't tested it.

--
-- Christophe Pettus
   xof@thebuild.com



Re: [psycopg] speed concerns with executemany()

From
Adrian Klaver
Date:
On 12/23/2016 04:59 PM, Christophe Pettus wrote:
>
>> On Dec 23, 2016, at 16:58, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
>> I understood it did:
>>
>> http://initd.org/psycopg/docs/usage.html#transactions-control
>>
>> "It is possible to set the connection in autocommit mode: this way all the commands executed will be immediately
>> committed and no rollback is possible. A few commands (e.g. CREATE DATABASE, VACUUM...) require to be run outside any
>> transaction: in order to be able to run these commands from Psycopg, the connection must be in autocommit mode: you can
>> use the autocommit property (set_isolation_level() in older versions)."
>
> My somewhat garbled thought was that each of the component INSERTs in the .executemany would be getting its own
> transaction unless the connection was set to autocommit... but I'll admit I haven't tested it.

Don't you mean?:
My somewhat garbled thought was that each of the component INSERTs in
the .executemany would be getting its own transaction if the connection
was set to autocommit... but I'll admit I haven't tested it.

>
> --
> -- Christophe Pettus
>    xof@thebuild.com
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [psycopg] speed concerns with executemany()

From
Daniele Varrazzo
Date:
Hi Mike

On Sat, Dec 24, 2016 at 1:05 AM, mike bayer <mike_mp@zzzcomputing.com> wrote:
> I'm getting more and more regular complaints among users of SQLAlchemy of
> the relatively slow speed of the cursor.executemany() call in psycopg2.   In
> almost all cases, these users have discovered that Postgresql is entirely
> capable of running an INSERT or UPDATE of many values with a high degree of
> speed using a single statement with a form like this:
>
> INSERT INTO table (a, b, c)
> VALUES  (1, 2, 3), (4, 5, 6), (7, 8, 9), ...
>
> whereas if they run the same form using a single VALUES insert and sending
> the parameters as a sequence via executemany(), they will observe
> hyperbolically slow speeds - today, someone claims they can run
> approximately 200 sets of three integers each using the multiple VALUES
> approach in approximately .02 seconds, whereas running 200 values into a
> single executemany() call of the otherwise identical INSERT statement, they
> are claiming takes 20 seconds; that is, 100000% slower.   I'm not really
> sure how that's even possible, considering the single INSERT with many
> VALUES is a much larger string to send over the network and be parsed by the
> server, if the overhead of a single INSERT is .02 seconds, we would think an
> executemany() of 200 INSERT statements each with a single parameter set
> would be at most four seconds.

A x1000 slowdown sounds dodgy, yes.

> Obviously something is wrong with these users' environment, although I will
> note that the relative speed of psycopg2 executemany() over a 1G network is
> still pretty bad, compared to both sending a single INSERT with a large
> VALUES clause as well as compared to the executemany() speed of DBAPIs (even
> pure Python) for other databases like MySQL, I can provide some quick
> benchmarks if that's helpful.

You know something that could be worked out quickly? Currently
executemany boils down to this pseudocode:

    def executemany(self, stmt, argslist):
        for args in argslist:
            self.execute(stmt, args)

This is not efficient because it's not prepared and because it does a
round trip per parameter set. We could save on the second by running
something like:

    def executemany(self, stmt, argslist):
        # TODO: Do this in pages of 100 args, not all together
        stmts = []
        for args in argslist:
            stmts.append(self.mogrify(stmt, args))
        self.execute(";".join(stmts))

If you are in a mood for benchmarks, could you please check if this
approach is so noticeably faster that we ought to change the
executemany implementation in 2.7?
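
As a rough illustration of the paging mentioned in the TODO above, a
standalone helper could look like the following sketch (the function name and
paging scheme are just assumptions for illustration, not anything psycopg2
ships today):

    def executemany_joined(cur, stmt, argslist, page_size=100):
        # Sketch: mogrify each parameter set, join the statements with ';'
        # in pages of page_size, and send each page in a single round trip.
        page = []
        for args in argslist:
            page.append(cur.mogrify(stmt, args))
            if len(page) >= page_size:
                cur.execute(b";".join(page))
                del page[:]
        if page:
            cur.execute(b";".join(page))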

This is something that only works with the old protocol, i.e. what
psycopg 2.x implements. In a version implementing the "new protocol"
we would likely implement executemany with a prepared statement, so
this shouldn't be a concern.

> I understand that psycopg2 does not use prepared statements, and I have dim
> recollections that internal use of prepared statements for executemany is
> not on the roadmap for psycopg2.  However, I'm still not sure what I should
> be telling my users when I get reports of these vastly slower results with
> executemany().
>
> I'm not asking that psycopg2 change anything, I'm just looking to understand
> what the heck is going on when people are reporting this. Should I:
>
> 1. tell them they have a network issue that is causing executemany() to have
> a problem?  (even though I can also observe executemany() is kind of slow,
> though not as slow as these people are reporting)

As I said, while executemany is not a screaming arrow, I wouldn't
expect it to be x1000 slower either.

> 2. tell them there's some known issue, vacuuming / indexes/ or something
> that is known to have this effect?

That could be, but rather than with you or me, they should have a chat
on pgsql-general or -performance to sort that out.

> 3. tell them that yes, they should use multiple-VALUES within a single
> INSERT (which would eventually lead to strong pressure on me to reinvent
> executemany() within SQLAlchemy's psycopg2 dialect to use this form) ? (also
> if so, why is this the case?   can this claimed 100000% slowdown be real?)

Even if we had a better-implemented executemany, I doubt it would be
implemented with an INSERT ... VALUES list, because that would mean
manipulating the SQL anyway. What we would do is prepare the statement
and then execute the prepared statement repeatedly. To my knowledge
there is no libpq function to repeatedly execute a prepared statement
with an array of arrays of values
(https://www.postgresql.org/docs/9.3/static/libpq-exec.html). I'd be
very happy if someone could prove me wrong on that.

My bet is that `cur.execute(";".join(cur.mogrify(stmt, args) for args
in argslist))` on the old protocol will be faster than prepare + `for
args in argslist: execute(prepared, args)` on the new protocol: we would
be trading parsing time for network round trips and, while I don't have
benchmarks, the latter smells more expensive than the former.

> 4. use a hack to actually make my own prepared statements within
> executemany() (I vaguely recall some recipe that you can get a prepared
> statement going with psycopg2 by rolling it on the outside) ?

Yes, I have a sort of recipe for a preparing cursor here:
https://gist.github.com/dvarrazzo/3797445 but I don't think it would
make executemany faster, because it's still one network round trip per
parameter set and one statement to parse (likely an EXECUTE instead of
an INSERT).
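
For reference, the "rolling it on the outside" idea amounts to something like
the following sketch, issuing plain PREPARE/EXECUTE statements through
psycopg2 (the statement name "ins" and the table "demo" are hypothetical):

    def executemany_prepared(cur, argslist):
        # Sketch: a server-side prepared statement driven by EXECUTE calls;
        # still one round trip and one statement to parse per parameter set.
        cur.execute("PREPARE ins (int, int, int) AS "
                    "INSERT INTO demo (a, b, c) VALUES ($1, $2, $3)")
        for args in argslist:
            cur.execute("EXECUTE ins (%s, %s, %s)", args)
        cur.execute("DEALLOCATE ins")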


> 5. Other reasons that executemany() is known to sometimes be extremely slow?
>
> I'm purposely trying to stay out of the realm of picking apart the libpq
> internals, assuming psycopg2 devs can shed some light what's going on here.
> Thanks for any guidance you can offer!

Hope my notes are useful. I haven't really thrown much brain juice at
the slow executemany problem (because COPY will always be faster and
that's what I tend to use... but that has adaptation and interface
problems of its own). If you can help me with some testing, and if
nobody thinks that my ';'.join() solution is totally daft, I think we
could roll out this trick quite quickly.

As for a definition of "quickly": I am currently on holiday, which
means that I'm working on wrapping up psycopg 2.7 instead of $JOB. My
plan is to release 2.7 within the first days of January.


-- Daniele


Re: [psycopg] speed concerns with executemany()

From
Adrian Klaver
Date:
On 12/23/2016 04:05 PM, mike bayer wrote:
> I'm getting more and more regular complaints among users of SQLAlchemy
> of the relatively slow speed of the cursor.executemany() call in
> psycopg2.   In almost all cases, these users have discovered that
> Postgresql is entirely capable of running an INSERT or UPDATE of many
> values with a high degree of speed using a single statement with a form
> like this:
>
> INSERT INTO table (a, b, c)
> VALUES  (1, 2, 3), (4, 5, 6), (7, 8, 9), ...
>
> whereas if they run the same form using a single VALUES insert and
> sending the parameters as a sequence via executemany(), they will
> observe hyperbolically slow speeds - today, someone claims they can run
> approximately 200 sets of three integers each using the multiple VALUES
> approach in approximately .02 seconds, whereas running 200 values into a
> single executemany() call of the otherwise identical INSERT statement,
> they are claiming takes 20 seconds; that is, 100000% slower.   I'm not
> really sure how that's even possible, considering the single INSERT with
> many VALUES is a much larger string to send over the network and be
> parsed by the server, if the overhead of a single INSERT is .02 seconds,
> we would think an executemany() of 200 INSERT statements each with a
> single parameter set would be at most four seconds.
>
> Obviously something is wrong with these users' environment, although I
> will note that the relative speed of psycopg2 executemany() over a 1G
> network is still pretty bad, compared to both sending a single INSERT
> with a large VALUES clause as well as compared to the executemany()
> speed of DBAPIs (even pure Python) for other databases like MySQL, I can
> provide some quick benchmarks if that's helpful.
>
> I understand that psycopg2 does not use prepared statements, and I have
> dim recollections that internal use of prepared statements for
> executemany is not on the roadmap for psycopg2.  However, I'm still not
> sure what I should be telling my users when I get reports of these
> vastly slower results with executemany().
>
> I'm not asking that psycopg2 change anything, I'm just looking to
> understand what the heck is going on when people are reporting this.
> Should I:
>
> 1. tell them they have a network issue that is causing executemany() to
> have a problem?  (even though I can also observe executemany() is kind
> of slow, though not as slow as these people are reporting)
>
> 2. tell them there's some known issue, vacuuming / indexes/ or something
> that is known to have this effect?
>
> 3. tell them that yes, they should use multiple-VALUES within a single
> INSERT (which would eventually lead to strong pressure on me to reinvent
> executemany() within SQLAlchemy's psycopg2 dialect to use this form) ?
> (also if so, why is this the case?   can this claimed 100000% slowdown
> be real?)

I have to go with Christophe's explanation. They are seeing the effects
of 200 separate transactions, though like he stated later this is more
an assumption than something I have tested.

>
> 4. use a hack to actually make my own prepared statements within
> executemany() (I vaguely recall some recipe that you can get a prepared
> statement going with psycopg2 by rolling it on the outside) ?
>
> 5. Other reasons that executemany() is known to sometimes be extremely
> slow?
>
> I'm purposely trying to stay out of the realm of picking apart the libpq
> internals, assuming psycopg2 devs can shed some light what's going on
> here.   Thanks for any guidance you can offer!
>
>
>
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [psycopg] speed concerns with executemany()

From
Daniele Varrazzo
Date:
On Sat, Dec 24, 2016 at 2:05 AM, Adrian Klaver
<adrian.klaver@aklaver.com> wrote:

> I have to go with Christophe's explanation. They are seeing the effects of
> 200 separate transactions, though like he stated later this more an
> assumption then something I have tested.

My wild wild wild assumption is that they have a trigger on the table
that gets triggered 200 times instead of once.

The only thing I am sure of is that psycopg doesn't use any BEGIN
specifically in executemany.


-- Daniele


Re: [psycopg] speed concerns with executemany()

From
Adrian Klaver
Date:
On 12/23/2016 05:09 PM, Daniele Varrazzo wrote:
> On Sat, Dec 24, 2016 at 2:05 AM, Adrian Klaver
> <adrian.klaver@aklaver.com> wrote:
>
>> I have to go with Christophe's explanation. They are seeing the effects of
>> 200 separate transactions, though like he stated later this more an
>> assumption then something I have tested.
>
> My wild wild wild assumption is that they have a trigger on the table
> that gets triggered 200 times instead of one.
>
> The only thing I am sure of is that psycopg doesn't use any BEGIN
> specifically in executemany.

I see that now:

In [5]: val_list = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
In [12]: sql = "INSERT INTO psycopg_table VALUES(%s, %s, %s)"
In [16]: con.set_session(autocommit=True)
In [17]: cur = con.cursor()
In [18]: cur.executemany(sql, val_list)


[unknown]-2016-12-23 17:17:10.829 PST-0 LOG:  connection received:
host=::1 port=37352
aklaver-2016-12-23 17:17:10.830 PST-0 LOG:  connection authorized:
user=aklaver database=production
aklaver-2016-12-23 17:17:35.373 PST-0 LOG:  statement: INSERT INTO
psycopg_table VALUES(1, 2, 3)
aklaver-2016-12-23 17:17:35.416 PST-0 LOG:  statement: INSERT INTO
psycopg_table VALUES(4, 5, 6)
aklaver-2016-12-23 17:17:35.457 PST-0 LOG:  statement: INSERT INTO
psycopg_table VALUES(7, 8, 9)

Yet the data was available immediately in another session without an
explicit COMMIT, so how is psycopg2 committing the INSERTs?


>
>
> -- Daniele
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [psycopg] speed concerns with executemany()

From
Christophe Pettus
Date:
> On Dec 23, 2016, at 17:23, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
> Yet the data was available immediately in another session without an explicit COMMIT, so how is psycopg2 committing
> the INSERTs?

autocommit = True means psycopg2 doesn't issue an initial BEGIN, so each INSERT commits when done (just like a
top-level psql session)... thus "autocommit."

--
-- Christophe Pettus
   xof@thebuild.com



Re: [psycopg] speed concerns with executemany()

From
Adrian Klaver
Date:
On 12/23/2016 05:37 PM, Christophe Pettus wrote:
>
>> On Dec 23, 2016, at 17:23, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
>> Yet the data was available immediately in another session without an explicit COMMIT, so how is psycopg2 committing
>> the INSERTs?
>
> autocommit = True means psycopg2 doesn't issue an initial BEGIN, so each INSERT commits when done (just like a
> top-level psql session)... thus "autocommit."

Alright that I get. Still the practical outcome is each INSERT is being
done in a transaction (an implicit one) so the transaction overhead
comes into play. Or am I missing something?

>
> --
> -- Christophe Pettus
>    xof@thebuild.com
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [psycopg] speed concerns with executemany()

From
Christophe Pettus
Date:
> On Dec 23, 2016, at 18:55, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
> Alright that I get. Still the practical outcome is each INSERT is being done in a transaction (an implicit one) so
thetransaction overhead comes into play. Or am I missing something? 

Nope, not missing a thing.  The theory (and it is only that) is that when they do the .executemany(), each of those
INSERTs pays the transaction overhead, while if they do one big INSERT, just that one statement does.

--
-- Christophe Pettus
   xof@thebuild.com



Re: [psycopg] speed concerns with executemany()

From
Jim Nasby
Date:
On 12/23/16 7:04 PM, Daniele Varrazzo wrote:
> In my knowledge there is no libpq
> function to execute repeatedly a prepared statement with an array of
> arrays of values

I don't think there is, but the community might welcome one. Though, I
think the first question you'll get is "why not just use async commands".

Somewhat related to this, I recently heard a complaint that
Pandas.DataFrame.from_sql() was so incredibly slow on Postgres that
people were moving to storing data in things like Hadoop just for doing
data science modelling. I'm not looking into that right now because it
turns out moving data into plpython is almost as slow as via libpq in
some simple testing (which strikes me as rather absurd), but it'd be
worth looking into improving the libpq case as well. I suspect the way
to do that would be to create a tuple of lists, instead of a list of
tuples (or dicts, or whatever). Single row mode[1] would probably be
helpful too...

1: https://www.postgresql.org/docs/current/static/libpq-single-row-mode.html
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)


Re: [psycopg] speed concerns with executemany()

From
Adrian Klaver
Date:
On 12/23/2016 06:57 PM, Christophe Pettus wrote:
>
>> On Dec 23, 2016, at 18:55, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
>> Alright that I get. Still the practical outcome is each INSERT is being done in a transaction (an implicit one) so
>> the transaction overhead comes into play. Or am I missing something?
>
> Nope, not missing a thing.  The theory (and it is only that) is that when they do the .executemany(), each of those
> INSERTs pays the transaction overhead, while if they do one big INSERT, just that one statement does.

Just ran a quick and dirty test using IPython %timeit.

With a list of 200 tuples, each of which had 3 integers, INSERTing into:
test=> \d psycopg_table
  Table "public.psycopg_table"
  Column |  Type   | Modifiers
--------+---------+-----------
  a      | integer |
  b      | integer |
  c      | integer |


The results were:

sql = "INSERT INTO psycopg_table VALUES(%s, %s, %s)"

Without autocommit:

In [65]: timeit -n 10 cur.executemany(sql, l)
10 loops, best of 3: 12.5 ms per loop


With autocommit:

In [72]: timeit -n 10 cur.executemany(sql, l)
10 loops, best of 3: 1.71 s per loop


>
> --
> -- Christophe Pettus
>    xof@thebuild.com
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [psycopg] speed concerns with executemany()

From
Dorian Hoxha
Date:
Sending stuff in big batches + autocommit (fast transactions) + few network calls is performance 101, I thought. I think it should be documented what "executemany" actually does (it looked suspicious when I saw it a long time ago, which is why I didn't use it).

On Sat, Dec 24, 2016 at 6:00 AM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 12/23/2016 06:57 PM, Christophe Pettus wrote:

On Dec 23, 2016, at 18:55, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
Alright that I get. Still the practical outcome is each INSERT is being done in a transaction (an implicit one) so the transaction overhead comes into play. Or am I missing something?

Nope, not missing a thing.  The theory (and it is only that) is that when they do the .executemany(), each of those INSERTs pays the transaction overhead, while if they do one big INSERT, just that one statement does.

Just ran a quick and dirty test using IPython %timeit.

With a list of 200 tuples each which had 3 integers INSERTing into:
test=> \d psycopg_table
 Table "public.psycopg_table"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer |
 b      | integer |
 c      | integer |


The results where:

sql = "INSERT INTO psycopg_table VALUES(%s, %s, %s)"

Without autocommit:

In [65]: timeit -n 10 cur.executemany(sql, l)
10 loops, best of 3: 12.5 ms per loop


With autocommit:

In [72]: timeit -n 10 cur.executemany(sql, l)
10 loops, best of 3: 1.71 s per loop



--
-- Christophe Pettus
   xof@thebuild.com



--
Adrian Klaver
adrian.klaver@aklaver.com



--
Sent via psycopg mailing list (psycopg@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/psycopg

Re: [psycopg] speed concerns with executemany()

From
Daniele Varrazzo
Date:
The implementation of executemany I described a few days ago is
available in this gist, not heavily tested:

https://gist.github.com/dvarrazzo/4204cca5d1cd7c9c95bc814d81e6b83e

I would like to know if anyone sees any shortcoming in this new implementation.

-- Daniele

On Sun, Dec 25, 2016 at 10:11 AM, Dorian Hoxha <dorian.hoxha@gmail.com> wrote:
> Sending stuff in big-batches + autocommit (fast transactions) + few network
> calls is performance 101 I thought. I think the "executemany" should be
> documented what it does (it looked suspicious when I saw it long time ago,
> why I didn't use it).
>
> On Sat, Dec 24, 2016 at 6:00 AM, Adrian Klaver <adrian.klaver@aklaver.com>
> wrote:
>>
>> On 12/23/2016 06:57 PM, Christophe Pettus wrote:
>>>
>>>
>>>> On Dec 23, 2016, at 18:55, Adrian Klaver <adrian.klaver@aklaver.com>
>>>> wrote:
>>>> Alright that I get. Still the practical outcome is each INSERT is being
>>>> done in a transaction (an implicit one) so the transaction overhead comes
>>>> into play. Or am I missing something?
>>>
>>>
>>> Nope, not missing a thing.  The theory (and it is only that) is that when
>>> they do the .executemany(), each of those INSERTs pays the transaction
>>> overhead, while if they do one big INSERT, just that one statement does.
>>
>>
>> Just ran a quick and dirty test using IPython %timeit.
>>
>> With a list of 200 tuples each which had 3 integers INSERTing into:
>> test=> \d psycopg_table
>>  Table "public.psycopg_table"
>>  Column |  Type   | Modifiers
>> --------+---------+-----------
>>  a      | integer |
>>  b      | integer |
>>  c      | integer |
>>
>>
>> The results where:
>>
>> sql = "INSERT INTO psycopg_table VALUES(%s, %s, %s)"
>>
>> Without autocommit:
>>
>> In [65]: timeit -n 10 cur.executemany(sql, l)
>> 10 loops, best of 3: 12.5 ms per loop
>>
>>
>> With autocommit:
>>
>> In [72]: timeit -n 10 cur.executemany(sql, l)
>> 10 loops, best of 3: 1.71 s per loop
>>
>>
>>>
>>> --
>>> -- Christophe Pettus
>>>    xof@thebuild.com
>>>
>>
>>
>> --
>> Adrian Klaver
>> adrian.klaver@aklaver.com
>>
>>
>>
>> --
>> Sent via psycopg mailing list (psycopg@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/psycopg
>
>


Re: [psycopg] speed concerns with executemany()

From
Christophe Pettus
Date:
> On Dec 30, 2016, at 14:24, Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote:
>
> The implementation of executemany as described by me a few days ago is
> available in this gist, not heavily tested:
>
> https://gist.github.com/dvarrazzo/4204cca5d1cd7c9c95bc814d81e6b83e
>
> I would like to know if anyone sees any shortcoming in this new implementation.

Seems fine to me!  I wish there was a way of feeding the queries down the pipe asynchronously, rather than having to
build a gigantic string, but that's probably more trouble than the feature is worth.

--
-- Christophe Pettus
   xof@thebuild.com



Re: [psycopg] speed concerns with executemany()

From
Adrian Klaver
Date:
On 12/30/2016 02:24 PM, Daniele Varrazzo wrote:
> The implementation of executemany as described by me a few days ago is
> available in this gist, not heavily tested:
>
> https://gist.github.com/dvarrazzo/4204cca5d1cd7c9c95bc814d81e6b83e
>
> I would like to know if anyone sees any shortcoming in this new implementation.

A quick test. I added an argument to change the page_size on the command
line:

With NRECS=10000:

aklaver@tito:~> python psycopg_executemany.py -p 10
classic: 0.800544023514 sec
joined: 0.514330863953 sec
aklaver@tito:~> python psycopg_executemany.py -p 100
classic: 0.780461072922 sec
joined: 0.473304986954 sec
aklaver@tito:~> python psycopg_executemany.py -p 1000
classic: 0.820818901062 sec
joined: 0.488647937775 sec


With NRECS=100000:

aklaver@tito:~> python psycopg_executemany.py -p 10
classic: 7.78319811821 sec
joined: 4.18683385849 sec
aklaver@tito:~> python psycopg_executemany.py -p 100
classic: 7.75992202759 sec
joined: 4.06096816063 sec
aklaver@tito:~> python psycopg_executemany.py -p 1000
classic: 7.76269102097 sec
joined: 4.12301802635 sec


The relative difference between the classic and joined approaches seems
to hold; you just do not seem to get much benefit from changing the
page_size. Not sure how much that matters, and you do get a benefit from
the joined solution either way.

>
> -- Daniele
>
> On Sun, Dec 25, 2016 at 10:11 AM, Dorian Hoxha <dorian.hoxha@gmail.com> wrote:
>> Sending stuff in big-batches + autocommit (fast transactions) + few network
>> calls is performance 101 I thought. I think the "executemany" should be
>> documented what it does (it looked suspicious when I saw it long time ago,
>> why I didn't use it).
>>
>> On Sat, Dec 24, 2016 at 6:00 AM, Adrian Klaver <adrian.klaver@aklaver.com>
>> wrote:
>>>
>>> On 12/23/2016 06:57 PM, Christophe Pettus wrote:
>>>>
>>>>
>>>>> On Dec 23, 2016, at 18:55, Adrian Klaver <adrian.klaver@aklaver.com>
>>>>> wrote:
>>>>> Alright that I get. Still the practical outcome is each INSERT is being
>>>>> done in a transaction (an implicit one) so the transaction overhead comes
>>>>> into play. Or am I missing something?
>>>>
>>>>
>>>> Nope, not missing a thing.  The theory (and it is only that) is that when
>>>> they do the .executemany(), each of those INSERTs pays the transaction
>>>> overhead, while if they do one big INSERT, just that one statement does.
>>>
>>>
>>> Just ran a quick and dirty test using IPython %timeit.
>>>
>>> With a list of 200 tuples each which had 3 integers INSERTing into:
>>> test=> \d psycopg_table
>>>  Table "public.psycopg_table"
>>>  Column |  Type   | Modifiers
>>> --------+---------+-----------
>>>  a      | integer |
>>>  b      | integer |
>>>  c      | integer |
>>>
>>>
>>> The results where:
>>>
>>> sql = "INSERT INTO psycopg_table VALUES(%s, %s, %s)"
>>>
>>> Without autocommit:
>>>
>>> In [65]: timeit -n 10 cur.executemany(sql, l)
>>> 10 loops, best of 3: 12.5 ms per loop
>>>
>>>
>>> With autocommit:
>>>
>>> In [72]: timeit -n 10 cur.executemany(sql, l)
>>> 10 loops, best of 3: 1.71 s per loop
>>>
>>>
>>>>
>>>> --
>>>> -- Christophe Pettus
>>>>    xof@thebuild.com
>>>>
>>>
>>>
>>> --
>>> Adrian Klaver
>>> adrian.klaver@aklaver.com
>>>
>>>
>>>
>>> --
>>> Sent via psycopg mailing list (psycopg@postgresql.org)
>>> To make changes to your subscription:
>>> http://www.postgresql.org/mailpref/psycopg
>>
>>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [psycopg] speed concerns with executemany()

From
Dorian Hoxha
Date:
Since lists are fixed-size arrays that grow when they fill up, it's better to build the list in one go instead of appending to it (which fills the array, then creates a new, bigger one, copies the elements over, and appends again until it's full, and so on), for example:
`sqls = [self.mogrify(sql, args) for args in argslist]`
Even better, make it a tuple instead of a list if it's immutable.

On Sat, Dec 31, 2016 at 12:55 AM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 12/30/2016 02:24 PM, Daniele Varrazzo wrote:
The implementation of executemany as described by me a few days ago is
available in this gist, not heavily tested:

https://gist.github.com/dvarrazzo/4204cca5d1cd7c9c95bc814d81e6b83e

I would like to know if anyone sees any shortcoming in this new implementation.

A quick test. I added an argument to change the page_size on the command line:

With NRECS=10000:

aklaver@tito:~> python psycopg_executemany.py -p 10
classic: 0.800544023514 sec
joined: 0.514330863953 sec
aklaver@tito:~> python psycopg_executemany.py -p 100
classic: 0.780461072922 sec
joined: 0.473304986954 sec
aklaver@tito:~> python psycopg_executemany.py -p 1000
classic: 0.820818901062 sec
joined: 0.488647937775 sec


With NRECS=100000:

aklaver@tito:~> python psycopg_executemany.py -p 10
classic: 7.78319811821 sec
joined: 4.18683385849 sec
aklaver@tito:~> python psycopg_executemany.py -p 100
classic: 7.75992202759 sec
joined: 4.06096816063 sec
aklaver@tito:~> python psycopg_executemany.py -p 1000
classic: 7.76269102097 sec
joined: 4.12301802635 sec


The relative difference between the classic and joined seems to hold, you just do not seem to get much benefit from changing the page_size. Not sure how much that matters and you do get a benefit from the joined solution.



-- Daniele

On Sun, Dec 25, 2016 at 10:11 AM, Dorian Hoxha <dorian.hoxha@gmail.com> wrote:
Sending stuff in big-batches + autocommit (fast transactions) + few network
calls is performance 101 I thought. I think the "executemany" should be
documented what it does (it looked suspicious when I saw it long time ago,
why I didn't use it).

On Sat, Dec 24, 2016 at 6:00 AM, Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

On 12/23/2016 06:57 PM, Christophe Pettus wrote:


On Dec 23, 2016, at 18:55, Adrian Klaver <adrian.klaver@aklaver.com>
wrote:
Alright that I get. Still the practical outcome is each INSERT is being
done in a transaction (an implicit one) so the transaction overhead comes
into play. Or am I missing something?


Nope, not missing a thing.  The theory (and it is only that) is that when
they do the .executemany(), each of those INSERTs pays the transaction
overhead, while if they do one big INSERT, just that one statement does.


Just ran a quick and dirty test using IPython %timeit.

With a list of 200 tuples each which had 3 integers INSERTing into:
test=> \d psycopg_table
 Table "public.psycopg_table"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer |
 b      | integer |
 c      | integer |


The results where:

sql = "INSERT INTO psycopg_table VALUES(%s, %s, %s)"

Without autocommit:

In [65]: timeit -n 10 cur.executemany(sql, l)
10 loops, best of 3: 12.5 ms per loop


With autocommit:

In [72]: timeit -n 10 cur.executemany(sql, l)
10 loops, best of 3: 1.71 s per loop



--
-- Christophe Pettus
   xof@thebuild.com



--
Adrian Klaver
adrian.klaver@aklaver.com



--
Sent via psycopg mailing list (psycopg@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/psycopg




--
Adrian Klaver
adrian.klaver@aklaver.com

Re: [psycopg] speed concerns with executemany()

From
mike bayer
Date:

On 12/30/2016 06:42 PM, Christophe Pettus wrote:
>
>> On Dec 30, 2016, at 14:24, Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote:
>>
>> The implementation of executemany as described by me a few days ago is
>> available in this gist, not heavily tested:
>>
>> https://gist.github.com/dvarrazzo/4204cca5d1cd7c9c95bc814d81e6b83e
>>
>> I would like to know if anyone sees any shortcoming in this new implementation.
>
> Seems fine to me!  I wish there was a way of feeding the queries down the pipe asynchronously, rather than having to
> build a gigantic string, but that's probably more trouble than the feature is worth.

you'd need to be careful with that as a series of multiple parameter
sets may have dependencies on each other, not to mention people might
find it surprising that sequences / defaults / SERIAL etc. aren't firing
off in the order in which parameter sets were given.




>
> --
> -- Christophe Pettus
>    xof@thebuild.com
>


Re: [psycopg] speed concerns with executemany()

From
mike bayer
Date:

On 12/24/2016 12:00 AM, Adrian Klaver wrote:
> On 12/23/2016 06:57 PM, Christophe Pettus wrote:
>>
>>> On Dec 23, 2016, at 18:55, Adrian Klaver <adrian.klaver@aklaver.com>
>>> wrote:
>>> Alright that I get. Still the practical outcome is each INSERT is
>>> being done in a transaction (an implicit one) so the transaction
>>> overhead comes into play. Or am I missing something?
>>
>> Nope, not missing a thing.  The theory (and it is only that) is that
>> when they do the .executemany(), each of those INSERTs pays the
>> transaction overhead, while if they do one big INSERT, just that one
>> statement does.
>
> Just ran a quick and dirty test using IPython %timeit.
>
> With a list of 200 tuples each which had 3 integers INSERTing into:
> test=> \d psycopg_table
>  Table "public.psycopg_table"
>  Column |  Type   | Modifiers
> --------+---------+-----------
>  a      | integer |
>  b      | integer |
>  c      | integer |
>
>
> The results where:
>
> sql = "INSERT INTO psycopg_table VALUES(%s, %s, %s)"
>
> Without autocommit:
>
> In [65]: timeit -n 10 cur.executemany(sql, l)
> 10 loops, best of 3: 12.5 ms per loop
>
>
> With autocommit:
>
> In [72]: timeit -n 10 cur.executemany(sql, l)
> 10 loops, best of 3: 1.71 s per loop


Please ensure you run this test with statements passing over a real
network connection and not localhost; it makes a significant difference.





>
>
>>
>> --
>> -- Christophe Pettus
>>    xof@thebuild.com
>>
>
>


Re: [psycopg] speed concerns with executemany()

From
Adrian Klaver
Date:
On 01/01/2017 11:14 AM, mike bayer wrote:
>
>
> On 12/24/2016 12:00 AM, Adrian Klaver wrote:
>> On 12/23/2016 06:57 PM, Christophe Pettus wrote:
>>>
>>>> On Dec 23, 2016, at 18:55, Adrian Klaver <adrian.klaver@aklaver.com>
>>>> wrote:
>>>> Alright that I get. Still the practical outcome is each INSERT is
>>>> being done in a transaction (an implicit one) so the transaction
>>>> overhead comes into play. Or am I missing something?
>>>
>>> Nope, not missing a thing.  The theory (and it is only that) is that
>>> when they do the .executemany(), each of those INSERTs pays the
>>> transaction overhead, while if they do one big INSERT, just that one
>>> statement does.
>>
>> Just ran a quick and dirty test using IPython %timeit.
>>
>> With a list of 200 tuples each which had 3 integers INSERTing into:
>> test=> \d psycopg_table
>>  Table "public.psycopg_table"
>>  Column |  Type   | Modifiers
>> --------+---------+-----------
>>  a      | integer |
>>  b      | integer |
>>  c      | integer |
>>
>>
>> The results where:
>>
>> sql = "INSERT INTO psycopg_table VALUES(%s, %s, %s)"
>>
>> Without autocommit:
>>
>> In [65]: timeit -n 10 cur.executemany(sql, l)
>> 10 loops, best of 3: 12.5 ms per loop
>>
>>
>> With autocommit:
>>
>> In [72]: timeit -n 10 cur.executemany(sql, l)
>> 10 loops, best of 3: 1.71 s per loop
>
>
> please ensure you run this test with statements passing over a real
> network connection and not localhost.  makes a significant difference.

Same code across network, client in Bellingham WA, server in Fremont CA:

Without autocommit:

In [51]: %timeit -n 10 cur.executemany(sql, l)
10 loops, best of 3: 8.22 s per loop


With autocommit:

In [56]: %timeit -n 10 cur.executemany(sql, l)
10 loops, best of 3: 8.38 s per loop



>
>
>
>
>
>>
>>
>>>
>>> --
>>> -- Christophe Pettus
>>>    xof@thebuild.com
>>>
>>
>>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [psycopg] speed concerns with executemany()

From
Daniele Varrazzo
Date:
On Sun, Jan 1, 2017 at 8:12 PM, mike bayer <mike_mp@zzzcomputing.com> wrote:
>
> you'd need to be careful with that as a series of multiple parameter sets
> may have dependencies on each other, not to mention people might find it
> surprising that sequences / defaults / SERIAL etc. aren't firing off in the
> order in which parameter sets were given.

I don't expect the parameters to be interpreted in a different way by
the server: I think the order would be maintained.

-- Daniele


Re: [psycopg] speed concerns with executemany()

From
Daniele Varrazzo
Date:
On Sun, Jan 1, 2017 at 9:33 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:

> Same code across network, client in Bellingham WA, server in Fremont CA:
>
> Without autocommit:
>
> In [51]: %timeit -n 10 cur.executemany(sql, l)
> 10 loops, best of 3: 8.22 s per loop
>
>
> With autocommit:
>
> In [56]: %timeit -n 10 cur.executemany(sql, l)
> 10 loops, best of 3: 8.38 s per loop

Adrian, have you got a benchmark "classic vs. joined" on remote
network? Thank you.

-- Daniele


Re: [psycopg] speed concerns with executemany()

From
Adrian Klaver
Date:
On 01/02/2017 05:05 AM, Daniele Varrazzo wrote:
> On Sun, Jan 1, 2017 at 9:33 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
>
>> Same code across network, client in Bellingham WA, server in Fremont CA:
>>
>> Without autocommit:
>>
>> In [51]: %timeit -n 10 cur.executemany(sql, l)
>> 10 loops, best of 3: 8.22 s per loop
>>
>>
>> With autocommit:
>>
>> In [56]: %timeit -n 10 cur.executemany(sql, l)
>> 10 loops, best of 3: 8.38 s per loop
>
> Adrian, have you got a benchmark "classic vs. joined" on remote
> network? Thank you.

Knew I was forgetting something, thanks for the reminder. Will set that
up and report back.

>
> -- Daniele
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [psycopg] speed concerns with executemany()

From
Adrian Klaver
Date:
On 01/02/2017 05:05 AM, Daniele Varrazzo wrote:
> On Sun, Jan 1, 2017 at 9:33 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
>
>> Same code across network, client in Bellingham WA, server in Fremont CA:
>>
>> Without autocommit:
>>
>> In [51]: %timeit -n 10 cur.executemany(sql, l)
>> 10 loops, best of 3: 8.22 s per loop
>>
>>
>> With autocommit:
>>
>> In [56]: %timeit -n 10 cur.executemany(sql, l)
>> 10 loops, best of 3: 8.38 s per loop
>
> Adrian, have you got a benchmark "classic vs. joined" on remote
> network? Thank you.

With NRECS=10000 and page size=100:

aklaver@tito:~> python psycopg_executemany.py -p 100
classic: 427.618795156 sec
joined: 7.55754685402 sec

>
> -- Daniele
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [psycopg] speed concerns with executemany()

From
Karsten Hilbert
Date:
On Mon, Jan 02, 2017 at 07:35:46AM -0800, Adrian Klaver wrote:

>>> In [56]: %timeit -n 10 cur.executemany(sql, l)

Someone along the way was asking for a way to "feed
in" the 'l' values.

Any chance 'l' could be supported as a generator expression?

Thanks for these two great developments !

Karsten
--
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346


Re: [psycopg] speed concerns with executemany()

From
Daniele Varrazzo
Date:
On Mon, Jan 2, 2017 at 4:35 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
>
> With NRECS=10000 and page size=100:
>
> aklaver@tito:~> python psycopg_executemany.py -p 100
> classic: 427.618795156 sec
> joined: 7.55754685402 sec

Ugh! :D

-- Daniele


Re: [psycopg] speed concerns with executemany()

From
Daniele Varrazzo
Date:
On Mon, Jan 2, 2017 at 4:46 PM, Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote:
> On Mon, Jan 02, 2017 at 07:35:46AM -0800, Adrian Klaver wrote:
>
>>>> In [56]: %timeit -n 10 cur.executemany(sql, l)
>
> Someone along the way was asking for a possibility to "feed
> in" the 'l' values.
>
> Any chance 'l' can be supported to be a generator expression ?

The function would consume only page_size records at a time. This is to
avoid creating a humongous query in memory and then on the server, but
it has the nice effect of requesting only that many records from the
sequence at once.
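
As a rough sketch of that pagination (an illustration only; the gist's actual
paginate() may differ in detail):

    import itertools

    def paginate(seq, page_size):
        # Consume any iterable -- including a generator -- in pages of
        # page_size items, without ever needing len() on it.
        it = iter(seq)
        while True:
            page = list(itertools.islice(it, page_size))
            if not page:
                return
            yield page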

-- Daniele


Re: [psycopg] speed concerns with executemany()

From
Adrian Klaver
Date:
On 01/02/2017 08:07 AM, Daniele Varrazzo wrote:
> On Mon, Jan 2, 2017 at 4:35 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
>>
>> With NRECS=10000 and page size=100:
>>
>> aklaver@tito:~> python psycopg_executemany.py -p 100
>> classic: 427.618795156 sec
>> joined: 7.55754685402 sec
>
> Ugh! :D

Well it does show the benefit of the joined approach.

>
> -- Daniele
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [psycopg] speed concerns with executemany()

From
Karsten Hilbert
Date:
On Mon, Jan 02, 2017 at 05:10:10PM +0100, Daniele Varrazzo wrote:

> > Any chance 'l' can be supported to be a generator expression ?
>
> The function would consume only page_size records at time. This is to
> avoid creating a humongous query in memory and then on the server, but
> has the nice effect of requesting only as many records from the
> sequence.

Nice.

Can said sequence be a generator (IOW not needing to support
len() upfront) ?

Karsten
--
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346


Re: [psycopg] speed concerns with executemany()

From
Daniele Varrazzo
Date:
On Mon, Jan 2, 2017 at 5:16 PM, Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote:
> On Mon, Jan 02, 2017 at 05:10:10PM +0100, Daniele Varrazzo wrote:
>
>> > Any chance 'l' can be supported to be a generator expression ?
>>
>> The function would consume only page_size records at time. This is to
>> avoid creating a humongous query in memory and then on the server, but
>> has the nice effect of requesting only as many records from the
>> sequence.
>
> Nice.
>
> Can said sequence be a generator (IOW not needing to support
> len() upfront) ?

Correct: see the implementation of paginate() in the gist at
<https://gist.github.com/dvarrazzo/4204cca5d1cd7c9c95bc814d81e6b83e>:
the sequence is only consumed via next(). Final implementation may
change but accessing the sequence only as a generator is a desired
feature for me too.

-- Daniele


Re: [psycopg] speed concerns with executemany()

From
Karsten Hilbert
Date:
> On 01/02/2017 08:07 AM, Daniele Varrazzo wrote:
> > On Mon, Jan 2, 2017 at 4:35 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
> > >
> > > With NRECS=10000 and page size=100:
> > >
> > > aklaver@tito:~> python psycopg_executemany.py -p 100
> > > classic: 427.618795156 sec
> > > joined: 7.55754685402 sec
> >
> > Ugh! :D

While 7 as such may be a lot, a decrease from 427 to 7 is
simply brilliant.

Karsten
--
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346


Re: [psycopg] speed concerns with executemany()

From
Karsten Hilbert
Date:
On Mon, Jan 02, 2017 at 05:24:47PM +0100, Daniele Varrazzo wrote:

> > Can said sequence be a generator (IOW not needing to support
> > len() upfront) ?
>
> Correct: see the implementation of paginate() in the gist at
> <https://gist.github.com/dvarrazzo/4204cca5d1cd7c9c95bc814d81e6b83e>:
> the sequence is only consumed via next().

Ah, missed that, sorry.

Great work,
Karsten
--
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346


Re: [psycopg] speed concerns with executemany()

From
Adrian Klaver
Date:
On 01/02/2017 08:27 AM, Karsten Hilbert wrote:
>> On 01/02/2017 08:07 AM, Daniele Varrazzo wrote:
>>> On Mon, Jan 2, 2017 at 4:35 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
>>>>
>>>> With NRECS=10000 and page size=100:
>>>>
>>>> aklaver@tito:~> python psycopg_executemany.py -p 100
>>>> classic: 427.618795156 sec
>>>> joined: 7.55754685402 sec
>>>
>>> Ugh! :D
>
> While 7 as such may be a lot, a decrease from 427 to 7 is
> simply brilliant.

Also, the server I am running this against is a stock install running on
a VM that just backs an issue tracker. No optimizations. Also, on the
client end we are experiencing a winter storm that is making things sort
of flaky. So I would take this as a relative comparison, not an absolute
truth.

>
> Karsten
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [psycopg] speed concerns with executemany()

From
Oswaldo
Date:
On 02/01/17 at 17:07, Daniele Varrazzo wrote:
> On Mon, Jan 2, 2017 at 4:35 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
>>
>> With NRECS=10000 and page size=100:
>>
>> aklaver@tito:~> python psycopg_executemany.py -p 100
>> classic: 427.618795156 sec
>> joined: 7.55754685402 sec
>

Hello,

There is a third option that provides a small improvement: generate a
single SQL statement with multiple VALUES.

- Test with local database:

classic: 1.53970813751 sec
joined: 0.564052820206 sec
joined values: 0.175103187561 sec

- Test with db on an internet server

classic: 236.342775822 sec
joined: 6.08789801598 sec
joined values: 4.49090409279 sec

I often need to move data between different internet servers (SQL Server
<-> Postgresql). In my experience this is the fastest way to move
hundreds of thousands of data records.
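
A minimal sketch of what such an executemany-with-joined-VALUES could look
like (an illustration only, not the attached executemany3; the prefix and
row_template arguments are assumptions for the example):

    def executemany_values(cur, prefix, row_template, argslist, page_size=100):
        # Sketch: one INSERT per page, with all of the page's rows in a single
        # VALUES list.  prefix would be e.g. b"INSERT INTO demo (a, b, c) VALUES "
        # and row_template "(%s, %s, %s)"; the names are illustrative only.
        rows = [cur.mogrify(row_template, args) for args in argslist]
        for i in range(0, len(rows), page_size):
            cur.execute(prefix + b",".join(rows[i:i + page_size]))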

I attach the sample, modified with an executemany3 function.

(Sorry for my bad English)
Regards.




Re: [psycopg] speed concerns with executemany()

From
Oswaldo
Date:
On 02/01/17 at 20:33, Oswaldo wrote:
> On 02/01/17 at 17:07, Daniele Varrazzo wrote:
>> On Mon, Jan 2, 2017 at 4:35 PM, Adrian Klaver
>> <adrian.klaver@aklaver.com> wrote:
>>>
>>> With NRECS=10000 and page size=100:
>>>
>>> aklaver@tito:~> python psycopg_executemany.py -p 100
>>> classic: 427.618795156 sec
>>> joined: 7.55754685402 sec
>>
>
> Hello,
>
> There is a third option that provides a small improvement: generate a
> single sql with multple values.
>
> - Test with local database:
>
> classic: 1.53970813751 sec
> joined: 0.564052820206 sec
> joined values: 0.175103187561 sec
>
> - Test with db on an internet server
>
> classic: 236.342775822 sec
> joined: 6.08789801598 sec
> joined values: 4.49090409279 sec
>
> I often need to move data between different internet servers (sql server
> <-> Postgresql). In my experience this is the fastest way to move
> hundreds of thousands of data records.
>
> I attach the sample modified with it executemany3 function.
>

Sorry, I forgot the file.



Attachments

Re: [psycopg] speed concerns with executemany()

From
Federico Di Gregorio
Date:
On 02/01/17 17:07, Daniele Varrazzo wrote:
> On Mon, Jan 2, 2017 at 4:35 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
>> With NRECS=10000 and page size=100:
>>
>> aklaver@tito:~> python psycopg_executemany.py -p 100
>> classic: 427.618795156 sec
>> joined: 7.55754685402 sec
> Ugh! :D

That's great. Just a minor point: I wouldn't overload executemany() with
this feature but would add a new method, UNLESS the semantics are exactly
the same, especially regarding session isolation. Also, right now psycopg
keeps track of the number of affected rows over executemany() calls: I'd
like not to lose that, because losing it would be a breaking change to
the API.

federico

--
Federico Di Gregorio                         federico.digregorio@dndg.it
DNDG srl                                                  http://dndg.it
   We should forget about small efficiencies, say about 97% of the
    time: premature optimization is the root of all evil.    -- D.E.Knuth


Re: [psycopg] speed concerns with executemany()

From
Daniele Varrazzo
Date:
On Thu, Jan 5, 2017 at 5:32 PM, Federico Di Gregorio <fog@dndg.it> wrote:
> On 02/01/17 17:07, Daniele Varrazzo wrote:
>>
>> On Mon, Jan 2, 2017 at 4:35 PM, Adrian Klaver <adrian.klaver@aklaver.com>
>> wrote:
>>>
>>> With NRECS=10000 and page size=100:
>>>
>>> aklaver@tito:~> python psycopg_executemany.py -p 100
>>> classic: 427.618795156 sec
>>> joined: 7.55754685402 sec
>>
>> Ugh! :D
>
>
> That's great. Just a minor point: I won't overload executemany() with this
> feature but add a new method UNLESS the semantics are exactly the same
> especially regarding session isolation. Also, right now psycopg keeps track
> of the number of affected rows over executemany() calls: I'd like to not
> lose that because it is a breaking change to the API.

It seems to me that the semantics would stay the same, even in the
presence of volatile functions. However, unfortunately rowcount would
break. That's just sad.

We can easily add an extra argument to executemany: page_size,
defaulting to 1 (the previous behaviour), which could be bumped. It's
sad the default cannot be 100.

Mike Bayer reported (https://github.com/psycopg/psycopg2/issues/491)
that SQLAlchemy actually uses the aggregated rowcount for concurrency
control.

So, how much of a deal-breaker is it? Can we afford to lose the aggregated
rowcount to obtain a juicy speedup in default usage, or would we rather
leave the behaviour untouched and have people "opt in for speed"?

ponder, ponder...

Pondered: as the feature has had little testing and I don't want to delay
releasing 2.7 further, I'd rather release it with a page_size default of 1.
People could use it and report any failures they hit with a page_size > 1.
If tests turn out to be positive and the database behaves ok, we could
think about changing the default in the future. We may also want to drop
the aggregated rowcount in the future, but with better planning, e.g. to
allow SQLAlchemy to ignore the aggregated rowcount from psycopg >= 2.8...

How does it sound?

-- Daniele


Re: [psycopg] speed concerns with executemany()

От
Adrian Klaver
Дата:
On 01/05/2017 11:00 AM, Daniele Varrazzo wrote:
> On Thu, Jan 5, 2017 at 5:32 PM, Federico Di Gregorio <fog@dndg.it> wrote:
>> On 02/01/17 17:07, Daniele Varrazzo wrote:
>>>
>>> On Mon, Jan 2, 2017 at 4:35 PM, Adrian Klaver <adrian.klaver@aklaver.com>
>>> wrote:
>>>>
>>>> With NRECS=10000 and page size=100:
>>>>
>>>> aklaver@tito:~> python psycopg_executemany.py -p 100
>>>> classic: 427.618795156 sec
>>>> joined: 7.55754685402 sec
>>>
>>> Ugh! :D
>>
>>
>> That's great. Just a minor point: I won't overload executemany() with this
>> feature but add a new method UNLESS the semantics are exactly the same
>> especially regarding session isolation. Also, right now psycopg keeps track
>> of the number of affected rows over executemany() calls: I'd like to not
>> lose that because it is a breaking change to the API.
>
> It seems to me that the semantics would stay the same, even in
> presence of volatile functions. However unfortunately rowcount would
> break. That's just sad.
>
> We can have no problem an extra argument to executemany: page_size
> defaulting to 1 (previous behaviour) which could be bumped. It's sad
> the default cannot be 100.
>
> Mike Bayer reported (https://github.com/psycopg/psycopg2/issues/491)
> that SQLAlchemy actually uses the aggregated rowcount for concurrency
> control.
>
> So, how much it is of a deal-breaker? Can we afford losing aggregated
> rowcount to obtain a juicy speedup in default usage, or we'd rather
> leave the behaviour untouched but having people "opting in for speed"?
>
> ponder, ponder...
>
> Pondered: as the features had little test and I don't want to delay
> releasing 2.7 further, I'd rather release the feature with a page_size
> default of 1. People could use it and report eventual failures if they
> use a page_size > 1. If tests turn out to be positive that the
> database behaves ok we could think about changing the default in the
> future. We may want to drop the aggregated rowcount in the future but
> with better planning, e.g. to allow SQLAlchemy to ignore aggregated
> rowcount from psycopg >= 2.8...
>
> How does it sound?

Works for me.
>
> -- Daniele
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [psycopg] speed concerns with executemany()

От
Federico Di Gregorio
Дата:
On 05/01/17 20:00, Daniele Varrazzo wrote:
> On Thu, Jan 5, 2017 at 5:32 PM, Federico Di Gregorio <fog@dndg.it> wrote:
>> On 02/01/17 17:07, Daniele Varrazzo wrote:
>>> On Mon, Jan 2, 2017 at 4:35 PM, Adrian Klaver <adrian.klaver@aklaver.com>
>>> wrote:
>>>> With NRECS=10000 and page size=100:
>>>>
>>>> aklaver@tito:~> python psycopg_executemany.py -p 100
>>>> classic: 427.618795156 sec
>>>> joined: 7.55754685402 sec
>>> Ugh! :D
>>
>> That's great. Just a minor point: I won't overload executemany() with this
>> feature but add a new method UNLESS the semantics are exactly the same
>> especially regarding session isolation. Also, right now psycopg keeps track
>> of the number of affected rows over executemany() calls: I'd like to not
>> lose that because it is a breaking change to the API.
> It seems to me that the semantics would stay the same, even in
> presence of volatile functions. However unfortunately rowcount would
> break. That's just sad.
>
> We can have no problem an extra argument to executemany: page_size
> defaulting to 1 (previous behaviour) which could be bumped. It's sad
> the default cannot be 100.
>
> Mike Bayer reported (https://github.com/psycopg/psycopg2/issues/491)
> that SQLAlchemy actually uses the aggregated rowcount for concurrency
> control.
>
> So, how much it is of a deal-breaker? Can we afford losing aggregated
> rowcount to obtain a juicy speedup in default usage, or we'd rather
> leave the behaviour untouched but having people "opting in for speed"?
>
> ponder, ponder...
>
> Pondered: as the features had little test and I don't want to delay
> releasing 2.7 further, I'd rather release the feature with a page_size
> default of 1. People could use it and report eventual failures if they
> use a page_size > 1. If tests turn out to be positive that the
> database behaves ok we could think about changing the default in the
> future. We may want to drop the aggregated rowcount in the future but
> with better planning, e.g. to allow SQLAlchemy to ignore aggregated
> rowcount from psycopg >= 2.8...
>
> How does it sound?

Fine for me.

federico


--
Federico Di Gregorio                         federico.digregorio@dndg.it
DNDG srl                                                  http://dndg.it
            Purtroppo i creazionisti non si sono ancora estinti. -- vodka


Re: [psycopg] speed concerns with executemany()

От
mike bayer
Дата:

On 01/05/2017 02:00 PM, Daniele Varrazzo wrote:
> On Thu, Jan 5, 2017 at 5:32 PM, Federico Di Gregorio <fog@dndg.it> wrote:
>> On 02/01/17 17:07, Daniele Varrazzo wrote:
>>>
>>> On Mon, Jan 2, 2017 at 4:35 PM, Adrian Klaver <adrian.klaver@aklaver.com>
>>> wrote:
>>>>
>>>> With NRECS=10000 and page size=100:
>>>>
>>>> aklaver@tito:~> python psycopg_executemany.py -p 100
>>>> classic: 427.618795156 sec
>>>> joined: 7.55754685402 sec
>>>
>>> Ugh! :D
>>
>>
>> That's great. Just a minor point: I won't overload executemany() with this
>> feature but add a new method UNLESS the semantics are exactly the same
>> especially regarding session isolation. Also, right now psycopg keeps track
>> of the number of affected rows over executemany() calls: I'd like to not
>> lose that because it is a breaking change to the API.
>
> It seems to me that the semantics would stay the same, even in
> presence of volatile functions. However unfortunately rowcount would
> break. That's just sad.
>
> We can have no problem an extra argument to executemany: page_size
> defaulting to 1 (previous behaviour) which could be bumped. It's sad
> the default cannot be 100.
>
> Mike Bayer reported (https://github.com/psycopg/psycopg2/issues/491)
> that SQLAlchemy actually uses the aggregated rowcount for concurrency
> control.
>
> So, how much it is of a deal-breaker? Can we afford losing aggregated
> rowcount to obtain a juicy speedup in default usage, or we'd rather
> leave the behaviour untouched but having people "opting in for speed"?
>
> ponder, ponder...
>
> Pondered: as the features had little test and I don't want to delay
> releasing 2.7 further, I'd rather release the feature with a page_size
> default of 1. People could use it and report eventual failures if they
> use a page_size > 1. If tests turn out to be positive that the
> database behaves ok we could think about changing the default in the
> future. We may want to drop the aggregated rowcount in the future but
> with better planning, e.g. to allow SQLAlchemy to ignore aggregated
> rowcount from psycopg >= 2.8...

SQLAlchemy can definitely ignore the aggregated rowcount, as most DBAPIs
don't support it anyway, so we can flip the flag off once we know exactly
which psycopg version breaks it.  The ORM prefers to use executemany in
any case, unless the mapping has specified a versioning column, in which
case it has to use the method that supplies an accurate rowcount.

Ideally, being able to control whether we get "aggregated rowcount" or
"speed" via an alternate API / flags / etc. would be nice.  It seems
SQLAlchemy will need downstream changes to support this in any case.


>
> How does it sound?
>
> -- Daniele
>
>


Re: [psycopg] speed concerns with executemany()

От
Jim Nasby
Дата:
On 1/9/17 10:04 AM, mike bayer wrote:
> SQLAlchemy can definitely ignore the aggregated rowcount as most DBAPIs
> don't support it anyway, so we can flip the flag off if we know exactly
> what psycopg version breaks it.

ISTM it'd be better to add an attribute to indicate whether rowcount was
available or not, especially since it would be available with a page
size of 1?
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)


Re: [psycopg] speed concerns with executemany()

От
mike bayer
Дата:

On 01/09/2017 11:45 AM, Jim Nasby wrote:
> On 1/9/17 10:04 AM, mike bayer wrote:
>> SQLAlchemy can definitely ignore the aggregated rowcount as most DBAPIs
>> don't support it anyway, so we can flip the flag off if we know exactly
>> what psycopg version breaks it.
>
> ISTM it'd be better to add an attribute to indicate whether rowcount was
> available or not, especially since it would be available with a page
> size of 1?


so it's a dialect-level flag which on a SQLAlchemy engine you'd see as
engine.dialect.supports_sane_multi_rowcount.  When psycopg2 first adds
the page size API, we can begin supporting it by detecting the
availability of the flag via psycopg2 version info, as well as the
default for this value - if psycopg2 is setting it to a number greater
than 1 by default, we'd want to keep that default.   If these two things
are true then the supports_sane_multi_rowcount flag would be turned off
for the dialect that imports this version of psycopg2.    SQLAlchemy at
the very least needs to respond if/when psycopg2 changes this default
upstream since if we don't, it will break our own system and tests.

The next addition to SQLAlchemy would be an explicit API that allows this
value to be configured both at dialect startup time and as a per-statement
execution option.  supports_sane_multi_rowcount would be flagged based on
how this is set up.  There may need to be changes to ensure
supports_sane_multi_rowcount is consulted local to a statement execution
context, if we allow on-the-fly modifications to it.


Re: [psycopg] speed concerns with executemany()

От
Aryeh Leib Taurog
Дата:
On Mon, Jan 2, 2017 at 3:35 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
>>> Same code across network, client in Bellingham WA, server in Fremont CA:
>>>
>>> Without autocommit:
>>>
>>> In [51]: %timeit -n 10 cur.executemany(sql, l)
>>> 10 loops, best of 3: 8.22 s per loop
>>>
>>>
>>> With autocommit:
>>>
>>> In [56]: %timeit -n 10 cur.executemany(sql, l)
>>> 10 loops, best of 3: 8.38 s per loop
>>
>> Adrian, have you got a benchmark "classic vs. joined" on remote
>> network? Thank you.
>
> With NRECS=10000 and page size=100:
>
> aklaver(at)tito:~> python psycopg_executemany.py -p 100
> classic: 427.618795156 sec
> joined: 7.55754685402 sec

This is really interesting.  I have long been using a utility I put
together to insert using BINARY COPY.  In fact I just brushed it up a
bit and put it on PyPI: <https://pypi.python.org/pypi/pgcopy>
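
For comparison, the plain-text COPY route is already possible with
psycopg2's copy_expert(); a rough sketch with made-up table and column
names (pgcopy speaks the binary COPY format instead, which is faster still):

import csv
import io

def copy_rows(cur, rows):
    # Serialise the records to CSV in memory and stream them in with COPY.
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    buf.seek(0)
    cur.copy_expert("COPY testmany (num, data) FROM STDIN WITH CSV", buf)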

I'm curious to run a benchmark against the improved executemany.  I'd
hoped that pgcopy would be generally useful, but it may no longer be
necessary.  A fast executemany() certainly suits more use cases.


Best,
Aryeh Leib Taurog


Re: [psycopg] speed concerns with executemany()

От
Aryeh Leib Taurog
Дата:
On Thu, Jan 19, 2017 at 02:23:15PM +0200, Aryeh Leib Taurog wrote:
> >>
> >> Adrian, have you got a benchmark "classic vs. joined" on remote
> >> network? Thank you.
> >
> > With NRECS=10000 and page size=100:
> >
> > aklaver(at)tito:~> python psycopg_executemany.py -p 100
> > classic: 427.618795156 sec
> > joined: 7.55754685402 sec
>
> This is really interesting.  I have long been using a utility I put
> together to insert using BINARY COPY.  In fact I just brushed it up a
> bit and put it on PyPi: <https://pypi.python.org/pypi/pgcopy>

I added binary copy to the benchmark:
<https://gist.github.com/altaurog/84668e034646fb354b5de81bb86a580d>

On local machine, with NRECS=100000 and page size=100:
classic: 4.26264309883 sec
joined: 2.34096288681 sec
pgcopy: 0.412513971329 sec


Over network, with NRECS=10000 and page size=100:
classic: 716.759769917 sec
joined: 15.141461134 sec
pgcopy: 3.70594096184 sec


Re: [psycopg] speed concerns with executemany()

От
Daniele Varrazzo
Дата:
On Thu, Jan 19, 2017 at 12:23 PM, Aryeh Leib Taurog
<python@aryehleib.com> wrote:
> On Mon, Jan 2, 2017 at 3:35 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:

>> aklaver(at)tito:~> python psycopg_executemany.py -p 100
>> classic: 427.618795156 sec
>> joined: 7.55754685402 sec
>
> This is really interesting.  I have long been using a utility I put
> together to insert using BINARY COPY.  In fact I just brushed it up a
> bit and put it on PyPi: <https://pypi.python.org/pypi/pgcopy>
>
> I'm curious to run a benchmark against the improved executemany.  I'd
> hoped that pgcopy would be generally useful, but it may no longer be
> necessary.  A fast executemany() certainly suits more use cases.

(Sorry, meant to write this message earlier but forgot it in my drafts.)

It's always the case that a sequence of:

1) psycopg executemany as is now: separate statements
- insert into table values (...);
- insert into table values (...);
- insert into table values (...);

is slower than

2) psycopg executemany as proposed: a single statement containing
- insert into table values (...);
  insert into table values (...);
  insert into table values (...);

which is slower than

3) a single insert with many params. Plays well with PQexecParams
   but would need some form of generation by the client
- insert into table values (...), (...), (...);

which is slower than

4) copy.

While the proposed executemany is nice low-hanging fruit, it will
break on PQexecParams and it's far from optimal anyway. I wonder
if there is a way to help users at least get to 3 without bothering
with mogrify (which would also break with the PQexecParams switch).

Brainstorming from here: expect no consistency.

Another good property of 2 is that it supports any statement: update,
delete, select (which may call a stored procedure repeatedly to
perform some data manipulation), whereas the rewrite into form 3 is
specific to inserts (which can be a starting point for a fast
update/select anyway: fast-insert into a temp table, then select or
update with a join).
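
For the record, that temp-table trick would look roughly like this
(sketch only, with invented table and column names):

# rows is assumed to be a list of (id, data) tuples holding the new values
cur.execute("CREATE TEMP TABLE new_vals (id int, data text) ON COMMIT DROP")
cur.executemany("INSERT INTO new_vals VALUES (%s, %s)", rows)  # or a COPY, for speed
cur.execute("""
    UPDATE testmany t
       SET data = n.data
      FROM new_vals n
     WHERE t.id = n.id
""")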

fast-forward to last message:

On Sun, Jan 29, 2017 at 5:44 PM, Aryeh Leib Taurog <python@aryehleib.com> wrote:

> Over network, with NRECS=10000 and page size=100:
> classic: 716.759769917 sec
> joined: 15.141461134 sec
> pgcopy: 3.70594096184 sec

Aryeh, thank you for this benchmark. Could you please add a test like
this (code untested):

    def insertmany(self, sql, argslist, page_size=100):
        # paginate() is assumed to split argslist into batches of page_size items
        tmpl = None
        for page in paginate(argslist, page_size=page_size):
            if tmpl is None and page:
                tmpl = '(%s)' % ','.join(['%s'] * len(page[0]))
            self.execute(sql % ",".join(self.mogrify(tmpl, args) for args in page))

where sql should be an insert with a single %s placeholder, which
would be replaced by one or more argument records, and see where we
end up? I expect somewhere between "joined" and "pgcopy", but I wonder
which it is closer to.
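
For concreteness, assuming the function above were attached to a cursor
subclass, the call would look something like this (illustrative names):

# data is a list of (num, data) parameter tuples; the single %s in the
# statement is where the joined "(...), (...)" records get spliced in
cur.insertmany("insert into testmany (num, data) values %s", data, page_size=100)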

-- Daniele


Re: [psycopg] speed concerns with executemany()

От
Aryeh Leib Taurog
Дата:
On Mon, Jan 30, 2017 at 11:18:36AM +0000, Daniele Varrazzo wrote:
> On Thu, Jan 19, 2017 at 12:23 PM, Aryeh Leib Taurog
> <python@aryehleib.com> wrote:
> > On Mon, Jan 2, 2017 at 3:35 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
>
> >> aklaver(at)tito:~> python psycopg_executemany.py -p 100
> >> classic: 427.618795156 sec
> >> joined: 7.55754685402 sec
> >
> > This is really interesting.  I have long been using a utility I put
> > together to insert using BINARY COPY.  In fact I just brushed it up a
> > bit and put it on PyPi: <https://pypi.python.org/pypi/pgcopy>
> >
> > I'm curious to run a benchmark against the improved executemany.  I'd
> > hoped that pgcopy would be generally useful, but it may no longer be
> > necessary.  A fast executemany() certainly suits more use cases.
>
> (Sorry, mant to write this message earlier but forgot it in my drafts.)
>
> There's always the case that a sequence of:
>
> 1) psycopg executemany as is now: separate statements
> - insert into table values (...);
> - insert into table values (...);
> - insert into table values (...);
>
> is slower than
>
> 2) psycopg executemany as proposed: a single statement containing
> - insert into table values (...);
>   insert into table values (...);
>   insert into table values (...);
>
> which is slower than
>
> 3) a single insert with many params. Plays well with PQexecParams
>    but would need some form of generation by the client
> - insert into table values (...), (...), (...);
>
> which is slower than
>
> 4) copy.
>
> While the proposed executemany is a nice low hanging fruit it will
> break on PQexecParams and it's far from being optimal anyway. Wonder
> if there is a way to help users at least to have 3 without bothering
> with mogrify (due to break too with the PQexecParams switch).
>
> Brainstorming from here: expect no consistency.
>
> Another good property of 2 is that it supports any statements: update,
> delete, select (which may call a stored procedure repeatedly to
> perform some data manipulation) whereas a manipulation into form 3 is
> specific to inserts (which can be a starting point for a fast
> update/select anyway: fast-insert into a temp table, then select or
> update with a join).

I haven't dug into PQexecParams, but it seems to me that these
alternate strategies could be made available outside of psycopg2.
I modified the benchmark script so they are just functions which
operate on a standard cursor object.  They are simple enough to be
included as "recipes" in the psycopg2 docs and/or in a separate
package (like pgcopy).

<https://gist.github.com/altaurog/84668e034646fb354b5de81bb86a580d>

> fast-forward to last message:
>
> On Sun, Jan 29, 2017 at 5:44 PM, Aryeh Leib Taurog <python@aryehleib.com> wrote:
>
> > Over network, with NRECS=10000 and page size=100:
> > classic: 716.759769917 sec
> > joined: 15.141461134 sec
> > pgcopy: 3.70594096184 sec
>
> Aryeh thank you for this benchmark. Could you please add a test like
> (code untested):
>
>     def insertmany(self, sql, argslist, page_size=100):
>         tmpl = None
>         for page in paginate(argslist, page_size=page_size):
>             if tmpl is None and page:
>                 tmpl = '(%s)' % ','.join([%s] * len(page[0]))
>             self.execute(sql % ",".join(self.mogrify(tmpl, args) for
> args in page))
>
> where sql should be an insert with a single %s placeholder, which
> would be replaced by one or more arguments records, and see where we
> end up? I expect somewhere between "joined" and "pgcopy" but wonder
> closer to which.

This one was over a transatlantic connection, NRECS=5000
classic: 761.322767019 sec
joined: 14.6529989243 sec
folded: 12.4037430286 sec
pgcopy: 2.85529208183 sec


Re: [psycopg] speed concerns with executemany()

От
Aryeh Leib Taurog
Дата:
On Mon, Jan 30, 2017 at 11:18:36AM +0000, Daniele Varrazzo wrote:
> 3) a single insert with many params. Plays well with PQexecParams
>    but would need some form of generation by the client
> - insert into table values (...), (...), (...);
>
> While the proposed executemany is a nice low hanging fruit it will
> break on PQexecParams and it's far from being optimal anyway. Wonder
> if there is a way to help users at least to have 3 without bothering
> with mogrify (due to break too with the PQexecParams switch).

Supporting a general case would surely require parsing the sql
statement to some extent, but a simple insert could be done quite
easily, particularly if you change the call signature so the caller
does your work for you:


import operator
from functools import reduce  # a builtin on Python 2

def insert_batch(cur, sql, template, args):
    argslist = list(args)
    # repeat the placeholder template once per record and flatten the parameters
    sql_full = sql + ','.join([template] * len(argslist))
    cur.execute(sql_full, reduce(operator.add, argslist))

insert_batch(cur, "insert into testmany (num, data) values ", "(%s, %s)", data)


Re: [psycopg] speed concerns with executemany()

От
Daniele Varrazzo
Дата:
On Mon, Jan 30, 2017 at 9:51 PM, Aryeh Leib Taurog <python@aryehleib.com> wrote:
> I haven't dug into PQexecParams, but it seems to me that these
> alternate strategies could be made available outside of psycopg2.

I got to the same conclusion, so I've implemented the functions
execute_batch() and execute_values() (resp. what you call "joined" and
"folded" in your benchmarks) in the extras module, leaving the semantics
of executemany() untouched (no extra parameters, no rowcount breaking
only when going batched... just too confusing).

Implementation, docs, tests in `this commit`__.

.. __: https://github.com/psycopg/psycopg2/commit/a95fd3df1abc0282f1c47fa2170191f037c3c8de
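
Usage ends up looking like this (made-up table, page_size shown explicitly):

from psycopg2 import extras

data = [(1, "a"), (2, "b"), (3, "c")]

# "joined": many statements, sent to the server in pages to save round trips
extras.execute_batch(cur,
    "INSERT INTO testmany (num, data) VALUES (%s, %s)", data, page_size=100)

# "folded": one INSERT ... VALUES (...), (...), ... per page; the single %s
# is replaced by the joined records
extras.execute_values(cur,
    "INSERT INTO testmany (num, data) VALUES %s", data, page_size=100)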

I also thought about implementing an execute_prepared() function,
which would have run PREPARE, then EXECUTE in a loop (possibly batched
as in execute_batch), finally DEALLOCATE. Implementation doesn't seem
trivial because:

- parameters %s and %(name)s should be replaced with $1, $2, ..., with
matching unescaping of literal % and escaping of literal $;
- in case of error, calling DEALLOCATE is tricky: if the connection is
not in autocommit mode the transaction is in a failed state; the function
could use a savepoint, but then it would clear the error state too; not
calling DEALLOCATE would leave the prepared statement there and, depending
on the name chosen for the statement (e.g. 'psycopg_%s' % id(cursor)), it
could make another execute_prepared() fail... it should be possible to
wrap the PREPARE in a savepoint to deal with this problem;
- if the connection is in autocommit mode, none of the above is needed.

ISTM that it's easier to leave it to users to call
PREPARE/execute_batch('EXECUTE')/DEALLOCATE: knowing the context in
which the sequence is called, they wouldn't need any parameter mangling,
and the error management would be simpler for them.
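
Spelled out, the sequence left to users would be roughly this
(illustrative table and argument types):

from psycopg2 import extras

data = [(1, "a"), (2, "b")]  # list of parameter tuples

cur.execute("PREPARE ins (int, text) AS "
            "INSERT INTO testmany (num, data) VALUES ($1, $2)")
extras.execute_batch(cur, "EXECUTE ins (%s, %s)", data, page_size=100)
cur.execute("DEALLOCATE ins")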

Thoughts? Shall we merge this stuff? Any feedback is welcome.

-- Daniele


Re: [psycopg] speed concerns with executemany()

От
Aryeh Leib Taurog
Дата:
On Wed, Feb 01, 2017 at 02:21:37AM +0000, Daniele Varrazzo wrote:
> On Mon, Jan 30, 2017 at 9:51 PM, Aryeh Leib Taurog <python@aryehleib.com> wrote:
> > I haven't dug into PQexecParams, but it seems to me that these
> > alternate strategies could be made available outside of psycopg2.
>
> I got to the same conclusion, so I've implemented functions to
> implement the functions execute_batch() and execute_values() (resp.
> what you call "joined" and "folded" in your benchmarks) in the extras
> module, and leave the semantics of executemany() untouched (no extra
> parameters or rowcount breaking only if going batched... just too
> confusing).
>
> Implementation, docs, tests in `this commit`__.
>
> .. __: https://github.com/psycopg/psycopg2/commit/a95fd3df1abc0282f1c47fa2170191f037c3c8de
>
> I also thought about implementing an execute_prepared() function,
> which would have run PREPARE, then EXECUTE in a loop (possibly batched
> as in execute_batch), finally DEALLOCATE. Implementation doesn't seem
> trivial because:
>
> - parameters %s and %(name)s should be replaced with $1, $2, ..., with
> matching unescaping of literal % and escaping of literal $;
> - in case of error, calling DEALLOCATE is tricky: if the connection is
> not autocommit the transaction is failed, the function could use a
> savepoint but then it would clear the error state too; not calling
> DEALLOCATE would leave the prepared statement there and according to
> the name chosen for the statement (e.g. 'psycopg_%s' % id(cursor))
> would make another execute_prepared() fail... it should be possible to
> wrap the PREPARE in a savepoint to deal with this problem;
> - if the connection is autocommit all the above is not needed.
>
> ISTM that it's easier to leave the users to call
> PREPARE/execute_batch('EXECUTE')/DEALLOCATE: knowing the context in
> which the sequence is called wouldn't require parameters mangling and
> the error management would be simpler for them.
>
> Thoughts? Shall we merge this stuff? Any feedback is welcome.

I think it looks great.  I agree that execute_prepared() would be
tricky, and it doesn't seem to provide much performance benefit over
the methods you've already implemented.

I have run the following benchmarks for UPDATE across the Atlantic:

1. classic executemany

2. "joined" - statements executed in batches

3. "prepared/joined" - as you suggested above

4. pgcopy - copy into a temp table, then UPDATE FROM temp table.

<https://gist.github.com/altaurog/84668e034646fb354b5de81bb86a580d#file-updatemany-py-L69>

With NRECS = 5000
classic: 794.658465862 sec
joined: 11.6829760075 sec
prepared: 10.1489500999 sec
pgcopy: 2.68695497513 sec


Re: [psycopg] speed concerns with executemany()

От
Jim Nasby
Дата:
On 1/31/17 6:21 PM, Daniele Varrazzo wrote:
> ISTM that it's easier to leave the users to call
> PREPARE/execute_batch('EXECUTE')/DEALLOCATE: knowing the context in
> which the sequence is called wouldn't require parameters mangling and
> the error management would be simpler for them.

I think a context manager for prepare would be very handy, so that you
didn't need to remember to dealloc. There might be a way to avoid the
separate prepare too.
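
Something along these lines, perhaps (just a sketch, not psycopg2 API):

from contextlib import contextmanager

@contextmanager
def prepared(cur, name, argtypes, statement):
    # PREPARE on entry, DEALLOCATE on exit (even if the block raises).
    # As noted earlier in the thread, the DEALLOCATE will itself fail if
    # the transaction has already been aborted by an error.
    cur.execute("PREPARE %s (%s) AS %s" % (name, ", ".join(argtypes), statement))
    try:
        yield
    finally:
        cur.execute("DEALLOCATE " + name)

# with prepared(cur, "ins", ["int", "text"],
#               "INSERT INTO testmany (num, data) VALUES ($1, $2)"):
#     execute_batch(cur, "EXECUTE ins (%s, %s)", data)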

I agree that trying to magically convert % to $ and back is nothing but
trouble. If users are using prepared statements, they need to understand
the $ syntax.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)


Re: [psycopg] speed concerns with executemany()

От
Daniele Varrazzo
Дата:
I've merged the branch containing these functions to master.

Docs for the new functions are at
http://initd.org/psycopg/docs/extras.html#fast-execution-helpers

There is still room for some final tweaks, so feedback is still
welcome. Thank you everybody for the constructive discussion.

-- Daniele