Discussion: [psycopg] Turbo ODBC


[psycopg] Turbo ODBC

From: Jim Nasby
Date:
Interesting... https://github.com/blue-yonder/turbodbc
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)


Re: [psycopg] Turbo ODBC

From: Adrian Klaver
Date:
On 01/15/2017 12:32 PM, Jim Nasby wrote:
> Interesting... https://github.com/blue-yonder/turbodbc

Yes, interesting but you are still dependent on the underlying ODBC
implementation. Not sure how that impacts performance versus a method
with fewer hops?


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [psycopg] Turbo ODBC

From: Jim Nasby
Date:
On 1/15/17 3:04 PM, Adrian Klaver wrote:
> On 01/15/2017 12:32 PM, Jim Nasby wrote:
>> Interesting... https://github.com/blue-yonder/turbodbc
>
> Yes, interesting but you are still dependent on the underlying ODBC
> implementation. Not sure how that impacts performance versus a method
> with fewer hops?

Oh, I'd hope that a native libpq implementation would be faster than
going through ODBC. But, there's presumably useful info that can be
picked up here; the bit about buffering was certainly interesting.

BTW, the person who brought this to my attention mentioned that a
lot of people doing data science with data living in Postgres feel the
need to extract the data from Postgres into something like HDFS before
they can do anything useful, because apparently data access through HDFS
is 3x faster than through Postgres. My impression is that at least part
of that is due to using Pandas from_sql functionality (which AIUI
marshals everything through SQLAlchemy), but anything that can be done
on the psycopg2 side would help.

I'm also looking into speeding up SPI access through plpython; depending
on how you want to measure it, I've gotten a 30-600% improvement by removing
the buffering that SPI does by default.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)


Re: [psycopg] Turbo ODBC

From: Adrian Klaver
Date:
On 01/15/2017 01:19 PM, Jim Nasby wrote:
> On 1/15/17 3:04 PM, Adrian Klaver wrote:
>> On 01/15/2017 12:32 PM, Jim Nasby wrote:
>>> Interesting... https://github.com/blue-yonder/turbodbc
>>
>> Yes, interesting but you are still dependent on the underlying ODBC
>> implementation. Not sure how that impacts performance versus a method
>> with fewer hops?
>
> Oh, I'd hope that a native libpq implementation would be faster than
> going through ODBC. But, there's presumably useful info that can be
> picked up here; the bit about buffering was certainly interesting.
>
> BTW, the person that brought this to my attention had mentioned that a
> lot of people doing data science with data living in Postgres feel the
> need to extract the data from Postgres into something like HDFS before
> they can do anything useful, because apparently data access through HDFS
> is 3x faster than through Postgres. My impression is that at least part

Have you looked at asyncpg:

https://github.com/MagicStack/asyncpg

it is Python 3.5+ though.

> of that is due to using Pandas from_sql functionality (which AIUI
> marshals everything through SQL Alchemy), but anything that can be done
> on the psycopg2 side would help.
>
> I'm also looking into speeding up SPI access through plpython; depending
> on how you want to measure I've gotten a 30-600% improvement by removing
> the buffering that SPI does by default.


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [psycopg] Turbo ODBC

From: Jim Nasby
Date:
On 1/15/17 3:25 PM, Adrian Klaver wrote:
> Have you looked at asyncpg:

I'm pulling Matt Rocklin in, who I've been working with on some plpython
improvements. Here are his offlist replies to me:

> On 1/15/17 5:15 PM, Matthew Rocklin wrote:
>> You might also contact Wes McKinney and check out
>> Arrow: http://pyarrow.readthedocs.io/en/latest/
>>
>> This project was designed for inter-system data interchange.  It would
>> help with conversion to Spark and other popular tools as well.
>>
>> On Sun, Jan 15, 2017 at 6:01 PM, Matthew Rocklin <mrocklin@continuum.io
>> <mailto:mrocklin@continuum.io>> wrote:
>>
>>     I haven't taken a look at asyncpg.
>>
>>     Quick note, you might want to be careful about mixing HDFS and
>>     HDF5.  Postgres's competition for data storage is HDF5, not HDFS
>>     (which stands for the Hadoop Distributed File System).
>>
>>     I still think that the best thing to do here would be to dump out
>>     python arrays for each of the columns in a result set.  I suspect
>>     that you could beat /any/ system that doesn't do this handily.  This
>>     would avoid any performance pain of building up Python objects, and
>>     can continue to use just the standard library.
>>
>>     You can stick to Pure Python and still send arrays encoded with
>>     bytes.  We can convert these to NumPy or Pandas trivially with zero
>>     overhead.
>>
>>     In [1]: from array import array   # this is in the standard library
>>     In [2]: b = b'\x00' * 16  # 16 zeroes as a bytestring
>>     In [3]: a = array('L', b) # eight byte unsigned integer
>>     In [4]: a
>>     Out[4]: array('L', [0, 0])
>>
>>     We would only be bound by
>>
>>      1. The cost in Postgres to transpose the data from row-major to
>>         column major
>>      2. The cost to move bytes across a socket
>>
>>     This should run at gigabytes per second.  Everything else is likely
>>     to be competing around the 50-100 MB/s range at top speed, is my guess.

My guess is that the big gains for external communication would come from:

1) reducing latency effects
2) reducing data marshaling

1 could be accomplished in various ways, and some changes to the FE-BE
protocol might be helpful. If we can demonstrate a clear win we might be
able to get protocol changes. I do suspect this needs to be tunable
though. My 4x improvement to plpy.execute comes from changing from the
equivalent of one large batch to single-row operation. Moving data to
Python over a local filesystem socket would be hurt by a large batch
size, while something on a high-latency connection would presumably
benefit from moderately large batches, since that reduces overall TCP
overhead.

I think 2 would necessitate FE-BE changes, although some of that could
be tested without the changes. Right now, everything is going to get
marshaled into a simple text format (ie: ints become a string of
digits), but for test purposes there are ways you could run that through
the binary output functions (in the case of int4, you'd get 4 bytes in
network order, which should be faster to handle and could be passed
directly to something like ndarray).
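
For illustration, here's a rough sketch of what consuming that binary int4
format could look like on the Python side (the 16 bytes below are made up
and stand in for four int4 values received in network byte order):

    import numpy as np
    from struct import unpack

    # 16 made-up bytes standing in for four int4 values in network byte order
    raw = b'\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04'

    # '!' = network (big-endian) byte order, '4i' = four 4-byte signed ints
    values = unpack('!4i', raw)               # (1, 2, 3, 4)

    # or hand the buffer straight to NumPy with a big-endian int4 dtype
    arr = np.frombuffer(raw, dtype='>i4')     # array([1, 2, 3, 4])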

The two problems you run into with the array type are handling NULLs and
building columns from rows. You could allow the option of specifying
that NULLs won't be allowed in specific fields, or you could use
something like a Pandas Series that provides other ways of handling the
equivalent of NULL.

For the row to column conversion, there's simply no way to get around
the pattern of appending items one at a time to an array-like container
(list, ndarray, Series, etc.), and dealing with the memory allocation
problem. I've looked at Python's list code and it's certainly no worse
at that than what Postgres would do. The big reason I would do that work
in Python though is that you then have the ability to use any number of
types for that; they just have to support the equivalent of append().
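
For what it's worth, the pattern I mean is just this (rows and column names
are made up; anything with an append() could stand in for the lists):

    import numpy as np

    rows = [(1, 'a'), (2, 'b'), (3, 'c')]     # made-up result set

    # one append()-able container per column
    ids, labels = [], []
    for id_, label in rows:
        ids.append(id_)
        labels.append(label)

    id_col = np.array(ids)                    # array([1, 2, 3])
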
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)


Re: [psycopg] Turbo ODBC

From: Matthew Rocklin
Date:
CC'ing Wes McKinney in case he's interested in this problem (how do we efficiently move data from Postgres into Python).


Re: [psycopg] Turbo ODBC

From: Wes McKinney
Date:
Adding Uwe Korn -- he's been working on adding C++-level Arrow support
to TurbODBC:

https://github.com/blue-yonder/turbodbc/pull/26

We're using Arrow data structures as an efficient (column-oriented)
intermediary en route to pandas -- you could also implement normal
Python tuple iteration on top of Arrow record batches.

The main Python-compatible Arrow implementation is in C++11, so if you
wanted to use it in C you'd need to do a little work to create C
bindings.



Re: [psycopg] Turbo ODBC

From: Adrian Klaver
Date:
On 01/16/2017 05:20 PM, Wes McKinney wrote:
> Adding Uwe Korn -- he's been working on adding C++-level Arrow support
> to TurbODBC:
>
> https://github.com/blue-yonder/turbodbc/pull/26
>
> We're using Arrow data structures as an efficient (column-oriented)
> intermediary en route to pandas -- you could also implement normal
> Python tuple iteration on top of Arrow record batches.
>
> The main Python-compatible Arrow implementation is in C++11 so if you
> wanted to use in C you'd need to do a little work to create C
> bindings.


All of this is very interesting and definitely worth exploring, just not
sure how much of it ties back to psycopg2 and this list. Not trying to
rain on anyone's parade, but I am wondering if this might not be better
explored on a 'meta' list, something like the various Python projects
that deal with Excel do:

https://groups.google.com/forum/#!forum/python-excel


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [psycopg] Turbo ODBC

From: Jim Nasby
Date:
On 1/16/17 7:32 PM, Adrian Klaver wrote:
> All of this is very interesting and definitely worth exploring, just not
> sure how much of it ties back to psycopg2 and this list. Not trying to
> rain on anyone's parade, I am wondering if this might not be better
> explored on a 'meta' list, something like the various Python projects
> that deal with Excel do:

Since this is a user mailing list that might make sense. Though, I'm
getting the impression that there's some disconnect between what data
science users are doing and this list. Tuple-based results vs
vector-based (ie: columnar) results is an example of that.

I do think there are 3 items that would best be handled at the "bottom" of
the stack (namely, psycopg2), because they'll enable every higher level
as well as make life easier for direct users of psycopg2:

1) Performance, both in low-latency (ie: filesystem socket) and
high-latency environments.
2) Type conversion (in particular, getting rid of strings as the
intermediate representation).
3) Optionally providing a columnar result set.

#3 might be in direct opposition to the standard Python DB accessor
stuff, so maybe that would need to be a separate module on top of
psycopg2, but psycopg2 would certainly still need to support it. (IE:
you certainly do NOT want psycopg2 to build a list of dicts only to then
try and convert that to a columnar format).
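
To make the shape of #3 concrete, here's a rough sketch of what a wrapper on
top of today's psycopg2 would do (connection string and query are made up);
this pure-Python transposition is exactly the overhead you'd want pushed down:

    import psycopg2

    conn = psycopg2.connect('dbname=test')               # made-up connection string
    cur = conn.cursor()
    cur.execute('SELECT id, label FROM some_table')      # made-up query

    rows = cur.fetchall()                                 # list of row tuples
    names = [col[0] for col in cur.description]           # column names

    # transpose rows into one list per column -- the pure-Python step
    # that would ideally happen inside psycopg2 instead
    columns = {name: list(vals) for name, vals in zip(names, zip(*rows))}
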
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)


Re: [psycopg] Turbo ODBC

From: Adrian Klaver
Date:
On 01/16/2017 06:07 PM, Jim Nasby wrote:
> On 1/16/17 7:32 PM, Adrian Klaver wrote:
>> All of this is very interesting and definitely worth exploring, just not
>> sure how much of it ties back to psycopg2 and this list. Not trying to
>> rain on anyone's parade, I am wondering if this might not be better
>> explored on a 'meta' list, something like the various Python projects
>> that deal with Excel do:
>
> Since this is a user mailing list that might make sense. Though, I'm
> getting the impression that there's some disconnect between what data
> science users are doing and this list. Tuple-based results vs
> vector-based (ie: columnar) results is an example of that.

Well, we are talking about improving the flow of data in and out of a
database, and to the extent that can be improved, all users that depend on
it benefit. The issue then becomes the optimizations for the various
subsets of users. I suspect that will involve more than one solution and
quite possibly multiple solutions for any particular case. Seems to me a
place where that can be discussed on a broader scale would lead to a
cross-discipline conversation that would be advantageous to all. Because
in a sense we are all data scientists.

>
> I do think there's 3 items that would best be handled at the "bottom" of
> the stack (namely, psycopg2), because they'll enable every higher level
> as well as make life easier for direct users of psycopg2:
>
> 1) Performance, both in low-latency (ie: filesystem socket) and
> high-latency environments.
> 2) Type conversion (in particular, getting rid of strings as the
> intermediate representation).
> 3) Optionally providing a columnar result set.

Seems a good starting point, though I have to admit my needs are fairly
simple and are already met by the existing implementation.

>
> #3 might be in direct opposition to the standard Python DB accessor
> stuff, so maybe that would need to be a separate module on top of
> psycopg2, but psycopg2 would certainly still need to support it. (IE:
> you certainly do NOT want psycopg2 to build a list of dicts only to then
> try and convert that to a columnar format).


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [psycopg] Turbo ODBC

From: "Uwe L. Korn"
Date:
One important thing for fast columnar data access is that you don't want
to have the data as Python objects before they are turned into a
DataFrame. Besides much better buffering, this was one of the main
advantages we have with Turbodbc. Given that the ODBC drivers for
Postgres seem to be in a miserable state, it would be much preferable to
have such functionality directly in psycopg2. From meetings with people
at some PyData conferences where I showed turbodbc, I can definitely say
that there are some users out there who would like a fast path for
Postgres-to-Pandas.

In turbodbc, there are two additional functions added to the DB-API
cursor object: fetchallnumpy and fetchallarrow. These mostly suffice for
the typical pandas workloads. The experience from implementing this is
basically that with Arrow it was quite simple to add a columnar
interface, as most of the data conversions were handled by Arrow. Also,
there was no need for me to interface with any Python types, as the
language "barrier" was transparently handled by Arrow.
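
For anyone on the list who hasn't seen turbodbc, usage looks roughly like
this (DSN and query are made up):

    import turbodbc

    connection = turbodbc.connect(dsn='PostgresDSN')     # made-up DSN name
    cursor = connection.cursor()
    cursor.execute('SELECT id, value FROM some_table')   # made-up query

    # dict-like mapping of column name -> NumPy (masked) array,
    # without building per-row Python objects in between
    columns = cursor.fetchallnumpy()

    # fetchallarrow() returns a pyarrow.Table instead
    table = cursor.fetchallarrow()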

CC'ing Michael König, the creator of Turbodbc; he might be able to give
some more input.

--
  Uwe L. Korn
  uwelk@xhochy.com



Re: [psycopg] Turbo ODBC

From: Jim Nasby
Date:
On 1/17/17 4:51 AM, Uwe L. Korn wrote:
> One important thing for fast columnar data access is that you don't want
> to have the data as Python objects before they will be turned into a
> DataFrame. Besides much better buffering, this was one of the main
> advantages we have with Turbodbc. Given that the ODBC drivers for
> Postgres seem to be in a miserable state, it would be much preferable to
> have such functionality directly in pyscopg2. Given from meetings with
> people at some PyData conferences that I showed turbodbc to, I can
> definitely say that there are some users out there that would like a
> fast path for Postgres-to-Pandas.
>
> In turbodbc, there are two additional functions added to the DB-API
> cursor object: fetchallnumpy and fetchallarrow. These suffice mostly for
> the typical pandas workloads. The experience from implementing this is
> basically that with Arrow it was quite simple to add a columnar
> interface as most of the data conversions were handled by Arrow. Also
> there was no need for me to interface with any Python types as the
> language "barrier" was transparently handled by Arrow.

I certainly see the advantages to not creating objects. How do you end
up handling NULLs?
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)


Re: [psycopg] Turbo ODBC

From: "Koenig, Michael"
Date:
For the NumPy functionality, NULL values are handled by returning masked arrays
(https://docs.scipy.org/doc/numpy/reference/maskedarray.html) instead of plain arrays.
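
As a small illustration of what that looks like (made-up values, with the
middle element standing in for a NULL):

    import numpy.ma as ma

    # made-up column with the middle value standing in for a NULL
    col = ma.masked_array([1, 0, 3], mask=[False, True, False])

    col.mean()    # 2.0 -- masked (NULL) entries are simply ignored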
 

Regards

Michael



Re: [psycopg] Turbo ODBC

From: "Uwe L. Korn"
Date:
In Arrow, we have a bitmap for each column indicating whether a value is
NULL. We can convert this cleanly to NumPy masked arrays, but once the
data is converted to Pandas, integer columns with NULLs will be
converted to floats, with NaN representing NULL, as there is no explicit
NULL representation in Pandas 0.x.
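
A tiny example of that conversion (values are arbitrary):

    import pandas as pd

    pd.Series([1, 2, 3]).dtype      # dtype('int64')
    pd.Series([1, 2, None]).dtype   # dtype('float64'), the None became NaN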

--
  Uwe L. Korn
  uwelk@xhochy.com



Re: [psycopg] Turbo ODBC

From: Jim Nasby
Date:
On 1/17/17 9:16 AM, Koenig, Michael wrote:
> For the NumPy functionality, NULL values are handled by returning masked arrays
> (https://docs.scipy.org/doc/numpy/reference/maskedarray.html) instead of plain arrays.

Oh, interesting. That's the same representation Postgres uses internally
as well. I hope to explore this further in plpython at some point.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)


Re: [psycopg] Turbo ODBC

From: Daniele Varrazzo
Date:
ISTM there has to be a fundamental switch in how psycopg uses the
libpq: it should use, at least where these large objects are
concerned, a binary data type and ask for a binary result
(https://www.postgresql.org/docs/9.6/static/libpq-exec.html#LIBPQ-PQEXECPARAMS),
right?

-- Daniele



Re: [psycopg] Turbo ODBC

From: Jim Nasby
Date:
Possibly, if large objects are actually being returned from Postgres.
But I suspect that in most cases what you'd be getting from Postgres
would be a bunch of rows, which you then want to turn into a Python form
that's efficient for analytics, such as a DataFrame or Arrow.

The binary return option on PQexecParams might be useful, though. That
would allow ints to come over as 4 bytes instead of a variable-length
string, for example. Presumably that would provide a performance increase.

On 1/17/17 12:34 PM, Daniele Varrazzo wrote:
> ISTM there has to be a fundamental switch in how psycopg uses the
> libpq: it should use, at least for what these large objects are
> concerned, a binary data type and asking for a binary result
> (https://www.postgresql.org/docs/9.6/static/libpq-exec.html#LIBPQ-PQEXECPARAMS),
> right?
>
> -- Daniele
>
> On Tue, Jan 17, 2017 at 4:31 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> On 1/17/17 9:16 AM, Koenig, Michael wrote:
>>>
>>> For the NumPy functionality, NULL values are handled by returning masked
>>> arrays (https://docs.scipy.org/doc/numpy/reference/maskedarray.html) instead
>>> of plain arrays.
>>
>>
>> Oh, interesting. That's the same representation Postgres uses internally as
>> well. I hope to explore this further in plpython at some point.
>>
>> --
>> Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
>> Experts in Analytics, Data Architecture and PostgreSQL
>> Data in Trouble? Get it in Treble! http://BlueTreble.com
>> 855-TREBLE2 (855-873-2532)
>>
>>
>> --
>> Sent via psycopg mailing list (psycopg@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/psycopg


--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)