Thread: datatype conversion thoughts

datatype conversion thoughts

From: Kris Jurka
Date:
I've been looking at a number of driver todo items that are all centered
around converting between datatypes.  Specifically adding COPY support,
array handling, composite types (SQLInput/Output), and binary data
transfer.  All of these center around converting pg datatypes to Java
types and back.

The current driver operates on everything (except bytea) as a
String, which is indeed easy to transform to and from with simple
operations like Integer.parseInt, but this only works if you're really
given an integer to begin with.  The jdbc spec allows you to call getInt
on a variety of types, like floats, so it needs a fallback to handle that.
Also the driver allows you to call getInt on the deprecated pg money type.
This means that every time you call getInt() it has to perform operations
on the String it's given to determine if it's actually a money
value "-$12.11" and strip away the dollar sign.  This can't be good for
performance.

Basically what I'd like to see is a datatype parsed only as the type it
actually is.  We know what type the underlying data is, but at the moment
it's not used.  For the money example above it would be parsed as money
and then converted to an int.  This means you only pay the cost of parsing
the money type when you actually have money.  This also cleans up the
other fallback parsing: if we had a float column and called getInt on it,
we wouldn't bother trying Integer.parseInt first and then falling back to
a float parsing method.

One way of doing this would be to provide a PGDataType abstract class that
provided every getXXX method (getInt, getFloat, getBoolean, ...) which had
a default implementation of saying "Invalid conversion...".  Then for each
datatype an implementation would be provided that could parse the
underlying data and perform whatever conversions are applicable to
it.  This would work the same way on the reverse side (converting from
java -> pg) as the implementation would provide a method for
converting to the format pg is expecting.

There are some issues about when and how this conversion happens regarding
excessive object creation or protecting a fixed set of these objects from
multithreaded access, but I just wanted to throw this basic concept out
there for any feedback.
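A minimal sketch of the idea (class and method names here are purely illustrative, not actual driver API):

```java
// Hypothetical sketch of the proposed scheme; not actual driver code.
abstract class PGDataType {
    // Default implementations reject the conversion; each subclass
    // overrides only the conversions that make sense for its type.
    int getInt(String value) {
        throw new IllegalArgumentException("Invalid conversion to int");
    }

    double getDouble(String value) {
        throw new IllegalArgumentException("Invalid conversion to double");
    }
}

// The money parsing cost is paid only when the column really is money,
// instead of every getInt() call probing the String for a dollar sign.
class PGMoneyType extends PGDataType {
    @Override
    int getInt(String value) {
        return (int) getDouble(value);
    }

    @Override
    double getDouble(String value) {
        boolean negative = value.startsWith("-");
        // Strip the currency symbol, sign and grouping characters.
        double magnitude = Double.parseDouble(value.replaceAll("[^0-9.]", ""));
        return negative ? -magnitude : magnitude;
    }
}
```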

Kris Jurka



Re: datatype conversion thoughts

From: "Barry Lind"
Date:
Kris,

How would PGDataType relate to PGObject?  I have always thought that
PGObject should be extended and used for all datatypes (not just
extensions).

--Barry

-----Original Message-----
From: Kris Jurka [mailto:books@ejurka.com]
Sent: Saturday, October 09, 2004 12:20 AM
To: pgsql-jdbc@postgresql.org
Subject: [JDBC] datatype conversion thoughts



Re: datatype conversion thoughts

From: Oliver Jowett
Date:
Barry Lind wrote:
> Kris,
>
> How would PGDataType relate to PGObject?  I have always thought that
> PGObject should be extended and used for all datatypes (not just
> extensions).

We have the problem that PGobject is a public interface, so we probably
need an adaptation layer at a minimum to handle older PGobject
implementations that don't know about binary formats, streaming, etc.
when they are introduced.

-O

Re: datatype conversion thoughts

From: Oliver Jowett
Date:
Kris Jurka wrote:
> I've been looking at a number of driver todo items that are all centered
> around converting between datatypes.  Specifically adding COPY support,
> array handling, composite types (SQLInput/Output), and binary data
> transfer.  All of these center around converting pg datatypes to Java
> types and back.

> There are some isses about when and how this conversion happens regarding
> excessive object creation or protecting a fixed set of these objects from
> multithreaded access, but I just wanted to throw this basic concept out
> there for any feedback.

The basic concept is good. I was going in roughly the same direction
with ParameterList and friends.

To avoid threading/object creation issues you might want to look at a
strategy/flyweight implementation where the per-type codec objects are
stateless and are passed all the context they need, i.e.:

interface ParameterCodec {
   // Render the parameter in the text wire format.
   String getTextFormat(Object parameter);
   // Render the parameter in the binary wire format.
   byte[] getBinaryFormat(Object parameter);
   // Write the binary form directly to the connection stream.
   void streamAsBinary(Object parameter, PGstream toStream);
   // etc.
}

This would also mean you can construct (for example) codec instances for
Array types fairly painlessly; you have one instance per element type,
configured to delegate to the element's codec for converting the elements.

Also, separating the codec from storing the actual parameter data seems
like a win when you need to handle Array/COPY/SQLOutput etc, since they
all have different requirements.
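A rough sketch of how such stateless codecs might compose (ParameterCodec is simplified here to the text format only, and all names are illustrative):

```java
import java.util.StringJoiner;

// Simplified stand-in for the ParameterCodec idea above; real codecs
// would also cover binary formats and streaming.
interface ParameterCodec {
    String getTextFormat(Object parameter);
}

// Stateless, so a single instance can be shared across threads.
class Int4Codec implements ParameterCodec {
    public String getTextFormat(Object parameter) {
        return Integer.toString((Integer) parameter);
    }
}

// One array codec instance per element type, configured to delegate
// to the element's codec for converting the elements.
class ArrayCodec implements ParameterCodec {
    private final ParameterCodec elementCodec;

    ArrayCodec(ParameterCodec elementCodec) {
        this.elementCodec = elementCodec;
    }

    public String getTextFormat(Object parameter) {
        StringJoiner sj = new StringJoiner(",", "{", "}");
        for (Object element : (Object[]) parameter) {
            sj.add(elementCodec.getTextFormat(element));
        }
        return sj.toString();
    }
}
```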

For the ResultSet side and binary data, I was looking at reading the
entire DataRow message into a single array and passing array offsets to
the codecs. That should help reduce the amount of garbage generation:
we'd have one byte array per row, not one per row per column.

For results where we do not yet have a binary-format parser, I realized
we can convert them to text on demand fairly easily: issue a "SELECT ?"
passing the uninterpretable binary data as the parameter value (with
correct type OID), and ask for results in text format. This should
hopefully mean we can do binary format results by default without having
to do an extra roundtrip to Describe the statement before every
execution (to work out which result columns to make binary); you only
pay the roundtrip cost when the user asks for a non-binary value, which
should hopefully be rare (e.g. external PGobject implementations that
haven't been updated for the changes).
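One way the text-on-demand fallback could be structured (a sketch only; the server round trip is abstracted behind an interface here, and in the real driver the "SELECT ?" would go through the internal query machinery):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the fallback: when no binary parser is registered for a
// column's type OID, re-send the raw bytes to the server with the
// correct parameter OID and ask for the text-format result.
interface ServerRoundTrip {
    String binaryToText(int typeOid, byte[] raw);
}

class ResultDecoder {
    interface BinaryParser { Object parse(byte[] raw); }

    private final Map<Integer, BinaryParser> parsers = new HashMap<>();
    private final ServerRoundTrip server;

    ResultDecoder(ServerRoundTrip server) { this.server = server; }

    void register(int typeOid, BinaryParser parser) {
        parsers.put(typeOid, parser);
    }

    // Parse locally when we can; otherwise pay one round trip to the
    // server to convert the binary value to its text representation.
    Object decode(int typeOid, byte[] raw) {
        BinaryParser p = parsers.get(typeOid);
        if (p != null) {
            return p.parse(raw);
        }
        return server.binaryToText(typeOid, raw);
    }
}
```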

I've done some more detailed design, mostly in my head at the moment; if
you want I can send you more details.

-O

Re: datatype conversion thoughts

From: Kris Jurka
Date:

On Tue, 12 Oct 2004, Oliver Jowett wrote:

> Barry Lind wrote:
> > Kris,
> >
> > How would PGDataType relate to PGObject?  I have always thought that
> > PGObject should be extended and used for all datatypes (not just
> > extensions).
>
> We have the problem that PGobject is a public interface, so we probably
> need an adaption layer at a minimum to handle older PGobject
> implementations that don't know about binary formats, streaming, etc
> when they are introduced.
>

I had envisioned giving PGDataType some methods like supportsBinary()
so that not every datatype is required to support binary transfer.  By
making PGobject extend PGDataType it would not need to be changed much, if
at all, to fit into this scheme.
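A sketch of how the capability check might look (PGType standing in for PGDataType; names illustrative, not actual driver API):

```java
// Hypothetical sketch: types opt in to binary transfer, so existing
// text-only types keep working unchanged.
abstract class PGType {
    boolean supportsBinary() {
        return false; // text-only unless a type opts in
    }

    abstract Object parseText(String value);

    Object parseBinary(byte[] value) {
        throw new UnsupportedOperationException("binary not supported");
    }
}

class Int4Type extends PGType {
    @Override
    boolean supportsBinary() { return true; }

    @Override
    Object parseText(String value) { return Integer.valueOf(value); }

    @Override
    Object parseBinary(byte[] value) {
        // int4 binary wire format: 4 bytes, network byte order.
        return java.nio.ByteBuffer.wrap(value).getInt();
    }
}
```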

Kris Jurka

Re: datatype conversion thoughts

From: Markus Schaber
Date:
Hi, Oliver,

On Tue, 12 Oct 2004 10:30:23 +1300
Oliver Jowett <oliver@opencloud.com> wrote:

> To avoid threading/object creation issues you might want to look at a
> strategy/flyweight implementation where the per-type codec objects are
> stateless and are passed all the context they need, i.e:
>
> interface ParameterCodec {
>    String getTextFormat(Object parameter);
>    byte[] getBinaryFormat(Object parameter);
>    void streamAsBinary(Object parameter, PGstream toStream);
>    // etc.
> }

Sounds like a good idea: lightweight, stateless factories.

By using an abstract superclass instead of an interface, we could gain
default implementations (that use .toString() for getTextFormat or throw
an exception for unsupported operations), and thus avoid code
duplication.

But on the other hand, an interface would be more flexible, and we could
still provide an implementing class containing such default code that
users could (but don't have to) use.
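A sketch of that combined approach (names illustrative): the driver would depend only on the interface, while an optional base class supplies the defaults described above:

```java
// Sketch of the interface-plus-optional-base-class idea.
interface Codec {
    String getTextFormat(Object parameter);
    byte[] getBinaryFormat(Object parameter);
}

// Implementors may (but don't have to) extend this to inherit the
// default behaviour.
abstract class DefaultCodec implements Codec {
    // Default text rendering via toString(); override when the wire
    // format differs from the Java representation.
    public String getTextFormat(Object parameter) {
        return parameter.toString();
    }

    // Unsupported operations throw by default.
    public byte[] getBinaryFormat(Object parameter) {
        throw new UnsupportedOperationException("no binary format");
    }
}
```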

> For the ResultSet side and binary data, I was looking at reading the
> entire DataRow message into a single array and passing array offsets to
> the codecs. That should help reduce the amount of garbage generation:
> we'd have one bytearray per row, not one per row per column.

Good idea. Fewer object instantiations and less garbage are always
a win.

> For results where we do not yet have a binary-format parser, I realized
> we can convert them to text on demand fairly easily: issue a "SELECT ?"
> passing the uninterpretable binary data as the parameter value (with
> correct type OID), and ask for results in text format.

This sounds ingenious. If we don't know how to handle a value, ask the
server, as it is guaranteed to have the knowledge.

> This should
> hopefully mean we can do binary format results by default without having
> to do an extra roundtrip to Describe the statement before every
> execution (to work out which result columns to make binary); you only
> pay the roundtrip cost when the user asks for a non-binary value, which
> should hopefully be rare (e.g. external PGobject implementations that
> haven't been updated for the changes).

I think this is a cost we can pay; all PGobject implementations I know of
are open source, so the new system can be adopted rather quickly.

> I've done some more detailed design, mostly in my head at the moment; if
> you want I can send you more details.

Please feel invited! :-)

Markus

--
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios@logi-track.com | www.logi-track.com

Re: datatype conversion thoughts

From: Markus Schaber
Date:
Hi, Kris,

[Sorry for the duplicate message - I really should learn how to
configure my mailer, this message should have gone to the list directly.]

On Sat, 9 Oct 2004 02:19:37 -0500 (EST)
Kris Jurka <books@ejurka.com> wrote:

> One way of doing this would be to provide a PGDataType abstract class
> that provided every getXXX method (getInt, getFloat, getBoolean, ...)
> which had a default implementation of saying "Invalid conversion...".
> Then for each datatype an implementation would be provided that could
> parse the underlying data and perform whatever conversions that are
> applicable on it.  This would work the same way on the reverse side
> (converting from java -> pg) as the implementation would provide a
> method for converting to the format pg is expecting.

That sounds great.

I assume the PGDataType subclasses would be used like factories.

When carefully crafted, this could even allow the PGDataType factory to
directly instantiate classes like org.postgis.Point or JTS geometry
classes to be returned by getObject(), without needing to subclass
PGobject. This would allow unifying custom and built-in data types,
including the ability to override built-ins.

<shameless rant - consume with a grain of salt>

One of the most annoying aspects of the current PGobject approach is
that the current architecture instantiates the custom data types and
then passes the data to them via setValue. The jdbc driver only knows
it's a geometry object, but not the appropriate subclass (Point, Polygon
etc.) which is encoded in the data itself. So currently, PGGeometry and
other PGobject subclasses are factories that create and remember the
appropriate geometry subclass in setValue. Then the user calls
getObject() to get the PGGeometry wrapper, and then the getGeometry()
method to get the actual geometry class.

Another drawback is the lack of support for binary transfer: getValue()
and setValue() only work with slow and inefficient String
representations. This even led to the development of wkb4j, which uses a
hex-encoded binary representation to speed up reading; see
http://wkb4j.sourceforge.net/

</rant>

This would need a "backwards-mapping" from setObject classes to the
PGDataType factory for writing data. This could be accomplished by
combining a Map where the PGDataTypes register the classes they know how
to handle with a probing scheme that auto-caches results for the case
where someone passes subclasses to setObject.
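A sketch of such a registry (illustrative only; it maps to type names rather than real PGDataType instances to stay self-contained):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the reverse mapping: each type registers the Java classes
// it handles; lookups for unregistered subclasses probe up the class
// hierarchy and cache the result for next time.
class TypeRegistry {
    private final Map<Class<?>, String> handlers = new HashMap<>();

    void register(Class<?> javaClass, String pgTypeName) {
        handlers.put(javaClass, pgTypeName);
    }

    String lookup(Class<?> javaClass) {
        // Walk up the hierarchy for subclasses passed to setObject.
        for (Class<?> c = javaClass; c != null; c = c.getSuperclass()) {
            String handler = handlers.get(c);
            if (handler != null) {
                handlers.put(javaClass, handler); // auto-cache the probe
                return handler;
            }
        }
        return null;
    }
}
```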

Your concept would render my "Register arbitrary types framework" demand
obsolete (which I posted here on March 10, and then abandoned
temporarily).

BTW, this would even allow someone to transparently emulate other
manufacturers' drivers by writing, e.g., a PGDataType implementation that
handles Oracle Spatial GIS objects. This could ease the transition to
PostgreSQL (provided the user has the appropriate license for the Oracle
Java classes that would still be needed).


So, I'd give a +10 on this suggestion.

Or am I completely misled?

Markus
--
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios@logi-track.com | www.logi-track.com