Discussion: [psycopg] Turbo ODBC
Interesting... https://github.com/blue-yonder/turbodbc -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532)
On 01/15/2017 12:32 PM, Jim Nasby wrote: > Interesting... https://github.com/blue-yonder/turbodbc Yes, interesting but you are still dependent on the underlying ODBC implementation. Not sure how that impacts performance versus a method with fewer hops? -- Adrian Klaver adrian.klaver@aklaver.com
On 1/15/17 3:04 PM, Adrian Klaver wrote: > On 01/15/2017 12:32 PM, Jim Nasby wrote: >> Interesting... https://github.com/blue-yonder/turbodbc > > Yes, interesting but you are still dependent on the underlying ODBC > implementation. Not sure how that impacts performance versus a method > with fewer hops? Oh, I'd hope that a native libpq implementation would be faster than going through ODBC. But there's presumably useful info that can be picked up here; the bit about buffering was certainly interesting. BTW, the person who brought this to my attention mentioned that a lot of people doing data science with data living in Postgres feel the need to extract the data from Postgres into something like HDFS before they can do anything useful, because apparently data access through HDFS is 3x faster than through Postgres. My impression is that at least part of that is due to using Pandas read_sql functionality (which AIUI marshals everything through SQLAlchemy), but anything that can be done on the psycopg2 side would help. I'm also looking into speeding up SPI access through plpython; depending on how you want to measure, I've gotten a 30-600% improvement by removing the buffering that SPI does by default.
On 01/15/2017 01:19 PM, Jim Nasby wrote: > On 1/15/17 3:04 PM, Adrian Klaver wrote: >> On 01/15/2017 12:32 PM, Jim Nasby wrote: >>> Interesting... https://github.com/blue-yonder/turbodbc >> >> Yes, interesting but you are still dependent on the underlying ODBC >> implementation. Not sure how that impacts performance versus a method >> with fewer hops? > > Oh, I'd hope that a native libpq implementation would be faster than > going through ODBC. But, there's presumably useful info that can be > picked up here; the bit about buffering was certainly interesting. > > BTW, the person that brought this to my attention had mentioned that a > lot of people doing data science with data living in Postgres feel the > need to extract the data from Postgres into something like HDFS before > they can do anything useful, because apparently data access through HDFS > is 3x faster than through Postgres. My impression is that at least part Have you looked at asyncpg: https://github.com/MagicStack/asyncpg it is Python 3.5+ though. > of that is due to using Pandas from_sql functionality (which AIUI > marshals everything through SQL Alchemy), but anything that can be done > on the psycopg2 side would help. > > I'm also looking into speeding up SPI access through plpython; depending > on how you want to measure I've gotten a 30-600% improvement by removing > the buffering that SPI does by default.
On 1/15/17 3:25 PM, Adrian Klaver wrote: > Have you looked at asyncpg: I'm pulling Matt Rocklin in, who I've been working with on some plpython improvements. Here are his offlist replies to me: > On 1/15/17 5:15 PM, Matthew Rocklin wrote: >> You might also contact Wes McKinney and check out >> Arrow: http://pyarrow.readthedocs.io/en/latest/ >> >> This project was designed for inter-system data interchange. It would >> help with conversion to Spark and other popular tools as well. >> >> On Sun, Jan 15, 2017 at 6:01 PM, Matthew Rocklin <mrocklin@continuum.io >> <mailto:mrocklin@continuum.io>> wrote: >> >> I haven't taken a look at asyncpg. >> >> Quick note, you might want to be careful about mixing HDFS and >> HDF5. Postgres's competition for data storage is HDF5 not HDFS >> (which stands for the Hadoop Distributed File System) >> >> I still think that the best thing to do here would be to dump out >> python arrays for each of the columns in a result set. I suspect >> that you could beat /any/ system that doesn't do this handily. This >> would avoid any performance pain of building up Python objects, and >> can continue to use just the standard library. >> >> You can stick to Pure Python and still send arrays encoded with >> bytes. We can convert these to NumPy or Pandas trivially with zero >> overhead. >> >> In [1]: from array import array # this is in the standard library >> In [2]: b = b'\x00' * 16 # 16 zeroes as a bytestring >> In [3]: a = array('L', b) # eight byte unsigned integer >> In [4]: a >> Out[4]: array('L', [0, 0]) >> >> We would only be bound by >> >> 1. The cost in postgres to transpose the data from row-major to >> column major >> 2. The cost to move bytes across a socket >> >> This should run at gigabytes per second. Everything else is likely >> to be competing around the 50-100MB/s range at top speed is my guess.
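To make the raw-bytes idea above concrete, here is a small stdlib-only sketch; the column payload and its layout are invented for illustration, not actual wire output. One column's worth of 8-byte integers arrives as a byte string and is reinterpreted in bulk, without creating a Python object per value:

```python
import sys
from array import array

# Hypothetical payload: one column of two 8-byte integers, shipped as raw
# bytes in the machine's native byte order (layout invented for this sketch).
payload = (1).to_bytes(8, sys.byteorder) + (2).to_bytes(8, sys.byteorder)

# 'q' = signed 8-byte int; the bytes are reinterpreted in one step.
col = array("q", payload)
assert col.itemsize == 8
assert list(col) == [1, 2]
```

NumPy can wrap the same buffer with numpy.frombuffer for a zero-copy view as an ndarray, which is what makes the "convert trivially with zero overhead" claim above work.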
My guess is that the big gains for external communication would come from: 1) reducing latency effects 2) reducing data marshaling 1 could be accomplished in various ways, and some changes to the FE-BE protocol might be helpful. If we can demonstrate a clear win we might be able to get protocol changes. I do suspect this needs to be tunable though. My 4x improvement to plpy.execute comes from changing from the equivalent of a large batch to a single-row operation. Moving data to python over a local filesystem socket would be hurt by a large batch size, while something on a high-latency connection would presumably benefit from moderately large batches since that reduces overall TCP overhead. I think 2 would necessitate FE-BE changes, although some of that could be tested without the changes. Right now, everything is going to get marshaled into a simple text format (ie: ints become a string of digits), but for test purposes there are ways you could run that through the binary output functions (in the case of int4, you'd get 4 bytes in network order, which should be faster to handle and could be passed directly to something like ndarray). The two problems you run into with the array type are handling NULLs and building columns from rows. You could allow the option of specifying that NULLs won't be allowed in specific fields, or you could use something like a Pandas Series that provides other ways of handling the equivalent of NULL. For the row to column conversion, there's simply no way to get around the pattern of appending items one at a time to an array-like container (list, ndarray, Series, etc), and dealing with the memory allocation problem. I've looked at python's list code and it's certainly no worse at that than what Postgres would do. The big reason I would do that work in python though is you then have the ability to use any number of types for that; they just have to support the equivalent of append().
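As a rough illustration of point 2, here is what handling int4 values in binary format (4 bytes each, network order) could look like on the Python side. The wire payload below is fabricated for the sketch, not actual libpq output:

```python
import struct
import sys
from array import array

# Fabricated stand-in for a binary-format column: three int4 values,
# each 4 bytes in network (big-endian) order.
wire = struct.pack("!3i", 1, 2, 3)

col = array("i", wire)        # bulk reinterpretation as native 4-byte ints
assert col.itemsize == 4      # 'i' is 4 bytes on common platforms
if sys.byteorder == "little":
    col.byteswap()            # network order -> native order, in one pass
assert list(col) == [1, 2, 3]
```

The whole column is converted with one buffer copy and one optional byteswap, versus parsing a digit string per value in the text format.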
CC'ing Wes McKinney in case he's interested in this problem (how do we efficiently move data from Postgres into Python).
Adding Uwe Korn -- he's been working on adding C++-level Arrow support to TurbODBC: https://github.com/blue-yonder/turbodbc/pull/26 We're using Arrow data structures as an efficient (column-oriented) intermediary en route to pandas -- you could also implement normal Python tuple iteration on top of Arrow record batches. The main Python-compatible Arrow implementation is in C++11, so if you wanted to use it in C you'd need to do a little work to create C bindings.
On 01/16/2017 05:20 PM, Wes McKinney wrote: > Adding Uwe Korn -- he's been working on adding C++-level Arrow support > to TurbODBC: > > https://github.com/blue-yonder/turbodbc/pull/26 > > We're using Arrow data structures as an efficient (column-oriented) > intermediary en route to pandas -- you could also implement normal > Python tuple iteration on top of Arrow record batches. > > The main Python-compatible Arrow implementation is in C++11 so if you > wanted to use in C you'd need to do a little work to create C > bindings. All of this is very interesting and definitely worth exploring, just not sure how much of it ties back to psycopg2 and this list. Not trying to rain on anyone's parade, I am wondering if this might not be better explored on a 'meta' list, something like the various Python projects that deal with Excel do: https://groups.google.com/forum/#!forum/python-excel
On 1/16/17 7:32 PM, Adrian Klaver wrote: > All of this is very interesting and definitely worth exploring, just not > sure how much of it ties back to psycopg2 and this list. Not trying to > rain on anyone's parade, I am wondering if this might not be better > explored on a 'meta' list, something like the various Python projects > that deal with Excel do: Since this is a user mailing list that might make sense. Though, I'm getting the impression that there's some disconnect between what data science users are doing and this list. Tuple-based results vs vector-based (ie: columnar) results is an example of that. I do think there are 3 items that would best be handled at the "bottom" of the stack (namely, psycopg2), because they'll enable every higher level as well as make life easier for direct users of psycopg2: 1) Performance, both in low-latency (ie: filesystem socket) and high-latency environments. 2) Type conversion (in particular, getting rid of strings as the intermediate representation). 3) Optionally providing a columnar result set. #3 might be in direct opposition to the standard Python DB accessor stuff, so maybe that would need to be a separate module on top of psycopg2, but psycopg2 would certainly still need to support it. (IE: you certainly do NOT want psycopg2 to build a list of dicts only to then try and convert that to a columnar format).
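A hypothetical sketch of item 3 (the function name and signature here are invented, not a real psycopg2 API): building per-column arrays directly from DB-API row tuples. A real driver would ideally do this in C, before any per-value Python objects exist, but the shape of the result is the same:

```python
from array import array

# Hypothetical columnar fetch on top of any DB-API cursor. 'rows' stands in
# for cursor.fetchall(); 'typecodes' gives one array typecode per column.
def columns_from_rows(rows, typecodes):
    cols = [array(tc) for tc in typecodes]
    for row in rows:
        for col, val in zip(cols, row):
            col.append(val)  # append() is the only container API needed
    return cols

# Two columns: 8-byte ints ('q') and doubles ('d').
a, b = columns_from_rows([(1, 1.5), (2, 2.5)], ["q", "d"])
assert list(a) == [1, 2]
assert list(b) == [1.5, 2.5]
```

This matches the point above about append(): any container supporting the equivalent of append() (list, ndarray wrapper, Series) could be swapped in.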
On 01/16/2017 06:07 PM, Jim Nasby wrote: > On 1/16/17 7:32 PM, Adrian Klaver wrote: >> All of this is very interesting and definitely worth exploring, just not >> sure how much of it ties back to psycopg2 and this list. Not trying to >> rain on anyone's parade, I am wondering if this might not be better >> explored on a 'meta' list, something like the various Python projects >> that deal with Excel do: > > Since this is a user mailing list that might make sense. Though, I'm > getting the impression that there's some disconnect between what data > science users are doing and this list. Tuple-based results vs > vector-based (ie: columnar) results is an example of that. Well we are talking about improving the flow of data in and out of a database and to the extent that can be improved all users that depend on that benefit. The issue then becomes the optimizations for the various subsets of users. I suspect that will involve more than one solution and quite possibly multiple solutions for any particular case. Seems to me a place where that can be discussed on a broader scale would lead to a cross discipline conversation that would be advantageous to all. Because in a sense we are all data scientists. > > I do think there's 3 items that would best be handled at the "bottom" of > the stack (namely, psycopg2), because they'll enable every higher level > as well as make life easier for direct users of psycopg2: > > 1) Performance, both in low-latency (ie: filesystem socket) and > high-latency environments. > 2) Type conversion (in particular, getting rid of strings as the > intermediate representation). > 3) Optionally providing a columnar result set. Seems a good starting point, though I have to admit my needs are fairly simple and are already met by the existing implementation.
One important thing for fast columnar data access is that you don't want to have the data as Python objects before they are turned into a DataFrame. Besides much better buffering, this was one of the main advantages we have with Turbodbc. Given that the ODBC drivers for Postgres seem to be in a miserable state, it would be much preferable to have such functionality directly in psycopg2. From meetings with people at some PyData conferences where I showed turbodbc, I can definitely say that there are some users out there who would like a fast path for Postgres-to-Pandas. In turbodbc, there are two additional functions added to the DB-API cursor object: fetchallnumpy and fetchallarrow. These suffice mostly for the typical pandas workloads. The experience from implementing this is basically that with Arrow it was quite simple to add a columnar interface, as most of the data conversions were handled by Arrow. Also there was no need for me to interface with any Python types, as the language "barrier" was transparently handled by Arrow. CC'ing Michael König, the creator of Turbodbc, he might be able to give some more input. -- Uwe L. Korn uwelk@xhochy.com
On 1/17/17 4:51 AM, Uwe L. Korn wrote: > One important thing for fast columnar data access is that you don't want > to have the data as Python objects before they will be turned into a > DataFrame. Besides much better buffering, this was one of the main > advantages we have with Turbodbc. Given that the ODBC drivers for > Postgres seem to be in a miserable state, it would be much preferable to > have such functionality directly in psycopg2. > > In turbodbc, there are two additional functions added to the DB-API > cursor object: fetchallnumpy and fetchallarrow. These suffice mostly for > the typical pandas workloads. I certainly see the advantages to not creating objects. How do you end up handling NULLs?
For the NumPy functionality, NULL values are handled by returning masked arrays (https://docs.scipy.org/doc/numpy/reference/maskedarray.html) instead of plain arrays. Regards Michael On 17/01/2017, 16:06, "Jim Nasby" <Jim.Nasby@BlueTreble.com> wrote: > I certainly see the advantages to not creating objects. How do you end > up handling NULLs?
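As a minimal illustration of the masked-array representation (the values below are made up): NumPy pairs a value array with a boolean mask, True where the value is NULL, and reductions skip the masked entries.

```python
import numpy as np

values = np.array([1, 0, 3], dtype=np.int64)  # 0 is just a placeholder slot
mask = np.array([False, True, False])         # True marks a NULL
col = np.ma.masked_array(values, mask=mask)

assert col.sum() == 4    # masked (NULL) entries are excluded: 1 + 3
assert col.count() == 2  # number of non-NULL values
```

This keeps the integer dtype intact while still representing NULLs, which is exactly what plain ndarrays cannot do.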
In Arrow, we have a bitmap for each column indicating if a value is NULL. We can convert this cleanly to NumPy masked arrays, but once this data is converted to Pandas, integer columns with NULLs will be converted to floats with NaN representing NULL, as there is no explicit NULL representation in Pandas 0.x. -- Uwe On Tue, Jan 17, 2017, at 04:06 PM, Jim Nasby wrote: > I certainly see the advantages to not creating objects. How do you end > up handling NULLs?
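The validity bitmap described above can be sketched in plain Python (the values are invented for the example): one bit per row, set when the value is non-NULL.

```python
# Conceptual sketch of an Arrow-style validity bitmap: bit i is set
# when row i holds a real value, clear when it is NULL.
values = [7, None, 9, None]

bitmap = 0
for i, v in enumerate(values):
    if v is not None:
        bitmap |= 1 << i

assert bitmap == 0b0101  # rows 0 and 2 are valid
assert [(bitmap >> i) & 1 for i in range(len(values))] == [1, 0, 1, 0]
```

One bit per row is why this representation stays cheap even for wide result sets, compared to a byte-per-row mask or NaN sentinels.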
On 1/17/17 9:16 AM, Koenig, Michael wrote: > For the NumPy functionality, NULL values are handled by returning masked arrays (https://docs.scipy.org/doc/numpy/reference/maskedarray.html) instead of plain arrays. Oh, interesting. That's the same representation Postgres uses internally as well. I hope to explore this further in plpython at some point.
ISTM there has to be a fundamental switch in how psycopg uses the libpq: it should use, at least where these large objects are concerned, a binary data type, asking for a binary result (https://www.postgresql.org/docs/9.6/static/libpq-exec.html#LIBPQ-PQEXECPARAMS), right? -- Daniele On Tue, Jan 17, 2017 at 4:31 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > Oh, interesting. That's the same representation Postgres uses internally as > well. I hope to explore this further in plpython at some point.
Possibly, if large objects are actually being returned from Postgres. But I suspect that in most cases what you'd be getting from Postgres would be a bunch of rows, which you then want to turn into a Python form that's efficient for analytics, such as a DataFrame or Arrow. The binary return option on PQexecParams might be useful though. That would allow ints to come over as 4 bytes instead of a variable-length string, for example. Presumably that would provide a performance increase. On 1/17/17 12:34 PM, Daniele Varrazzo wrote: > ISTM there has to be a fundamental switch in how psycopg uses the > libpq: it should use, at least for what these large objects are > concerned, a binary data type and asking for a binary result > (https://www.postgresql.org/docs/9.6/static/libpq-exec.html#LIBPQ-PQEXECPARAMS), > right?
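To put a number on the text-vs-binary difference for a single int4 (the value is chosen arbitrarily for the example):

```python
import struct

n = 123456789
text_form = str(n).encode()         # text protocol: b'123456789', digits to parse
binary_form = struct.pack("!i", n)  # binary protocol: 4 bytes, network order

assert len(text_form) == 9
assert len(binary_form) == 4
assert struct.unpack("!i", binary_form)[0] == n
assert int(text_form) == n
```

Beyond the size difference, the binary form needs only a byte-order fixup rather than a per-digit parse, which is what makes it attractive to hand directly to something like ndarray.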