Discussion: Custom Data Type Question


Custom Data Type Question

From: Greg Mitchell
Date:
I'm trying to create a custom data type similar to an enumeration type. 
However, I'd like the mapping of the int<->string to be dynamic instead 
of hard coded. I'd like to have a table that contains this mapping that 
can be appended to. Creating this type is not very difficult. However, 
for performance reasons, I'd like to cache the mapping so that the table 
is only queried once every connection unless it changes. I'm thinking a 
combination of a flag that can be triggered on insert and a transaction 
id could be used to decide if the table needs to be reloaded. 
Unfortunately, I'm not exactly sure how to get started on this, any ideas?

Thanks,
Greg


Re: Custom Data Type Question

From: Andrew Dunstan
Date:
Greg Mitchell wrote:
> I'm trying to create a custom data type similar to an enumeration 
> type. However, I'd like the mapping of the int<->string to be dynamic 
> instead of hard coded. I'd like to have a table that contains this 
> mapping that can be appended to. Creating this type is not very 
> difficult. However, for performance reasons, I'd like to cache the 
> mapping so that the table is only queried once every connection unless 
> it changes. I'm thinking a combination of a flag that can be triggered 
> on insert and a transaction id could be used to decide if the table 
> needs to be reloaded. Unfortunately, I'm not exactly sure how to get 
> started on this, any ideas?
>
>

Are you aware that there is a patch for first class enumeration types 
waiting to be reviewed for 8.3? The mapping is kept entirely internal, 
and you should never see what it is kept as underneath. It does not 
provide for dynamically extending the enumeration set, for various 
reasons, but there is an easy workaround, namely to create a new type 
with the extra member(s) and then do:
  alter table foo alter column bar type newtype using bar::newtype;


My little enumkit tool allows you to create enumerations today very 
easily, but its values are completely hardcoded. However, the above 
trick still works. The downside is that each enumeration type requires a 
tiny bit of compilation.

cheers

andrew




Re: Custom Data Type Question

From: Tom Dunstan
Date:
Hi Greg

Greg Mitchell wrote:
> I'm trying to create a custom data type similar to an enumeration type. 
> However, I'd like the mapping of the int<->string to be dynamic instead 
> of hard coded. I'd like to have a table that contains this mapping that 
> can be appended to.  Creating this type is not very difficult. However,
> for performance reasons, I'd like to cache the mapping so that the table 
> is only queried once every connection unless it changes.

A simpler way to do this might be to only cache the list per query 
context. In your IO functions, you could whack a pointer to your cache 
onto fcinfo->flinfo->fn_extra, and the same flinfo gets passed in for 
e.g. all output function calls for that column for that query, IIRC. 
This was what I had in mind originally when I did the enum patch, but I 
ended up just using syscaches, which I think would be unavailable to you 
writing a UDT.

The upside of the above is that for a given query, the contents of your 
table shouldn't change, so there's no mucking about with trying to 
keep things in other backends up to date. The downside is that you have 
to do the lookup per query, but if you're dealing with lots of data then 
it'll get dwarfed by the actual query, and if not, who cares?
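The fn_extra trick can be sketched in miniature. The following is a standalone simulation, not server code: FlinfoLike, EnumCache, load_mapping and the hardcoded labels are all invented for illustration. In the backend you would hang the cache off fcinfo->flinfo->fn_extra and allocate it with palloc() in an appropriate memory context.

```c
#include <stdlib.h>

/* Hypothetical stand-in for PostgreSQL's FmgrInfo; only fn_extra
 * matters for this sketch. */
typedef struct FlinfoLike
{
    void       *fn_extra;
} FlinfoLike;

/* The cached int -> string mapping, loaded once per "query". */
typedef struct EnumCache
{
    int         nentries;
    const char *labels[8];
} EnumCache;

/* Stand-in for the single SELECT against the mapping table. */
static EnumCache *
load_mapping(void)
{
    EnumCache  *cache = malloc(sizeof(EnumCache));

    cache->nentries = 3;
    cache->labels[0] = "red";
    cache->labels[1] = "green";
    cache->labels[2] = "blue";
    return cache;
}

/* Output-function sketch: build the cache on the first call for this
 * query, reuse it on every later call that sees the same flinfo. */
const char *
enum_out(FlinfoLike *flinfo, int value)
{
    EnumCache  *cache;

    if (flinfo->fn_extra == NULL)           /* first call in this query */
        flinfo->fn_extra = load_mapping();  /* hit the table only once */

    cache = (EnumCache *) flinfo->fn_extra;
    if (value < 0 || value >= cache->nentries)
        return NULL;
    return cache->labels[value];
}
```

Because the same flinfo is handed in for every output call in a query, the table lookup happens once per query rather than once per row.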

The other question that leaps to mind is whether you want to have more 
than one of these types. If you do, you may have to have multiple 
versions of the IO functions, otherwise e.g. your output function might 
be passed the value 0, but was that the 0 representing the 'red' string 
from the rgb enum, or the 'northern' string from the hemisphere enum? 
You don't know, and postgresql won't tell you directly.

There are a few ways around this. In your case, it might be ok to 
compile different versions of the IO functions for each enum which point 
to different tables, or the same table with a discriminator. Or you 
could see the various different proposals when my patch was first 
discussed. See the thread starting at 
http://archives.postgresql.org/pgsql-hackers/2006-08/msg00979.php or if 
you want a peek at the patch, see 
http://archives.postgresql.org/pgsql-patches/2006-09/msg00000.php. A 
rather simpler starting point might be Andrew's enumkit 
http://www.oreillynet.com/pub/a/databases/2006/01/06/enumerated-fields-in-postgresql.html?page=last&x-showcontent=text, 
or possibly Martijn's tagged types at 
http://svana.org/kleptog/pgsql/taggedtypes.html.

Cheers

Tom



Re: Custom Data Type Question

From: Greg Mitchell
Date:
> A simpler way to do this might be to only cache the list per query
> context. In your IO functions, you could whack a pointer to your cache
> onto fcinfo->flinfo->fn_extra, and the same flinfo gets passed in for
> e.g. all output function calls for that column for that query, IIRC.
> This was what I had in mind originally when I did the enum patch, but I
> ended up just using syscaches, which I think would be unavailable to you
> writing a UDT.

If my understanding is correct, if fn_extra is null, I would palloc() my
data cache and store the pointer in fn_extra? What about freeing this
pointer? Or is cleanup automatic?

Also, are there any ADTs like a hash-map or tree-map in the server
libraries (my background is C++ and I am used to having std::map<>) or do I
need to roll my own?

I am using enumkit for some true enums I have in the DB and like it very
much. Though I tend to customize the C-code to optimize it for my use.

Thanks,
Greg


Re: Custom Data Type Question

From: Greg Mitchell
Date:
As far as memory management goes, do I just use hash_create() and assign 
that pointer to fn_extra and at the end of the query it will be freed? 
Or will it not be freed until the end of the transaction? I'm really 
having trouble understanding the memory management issues with Postgres.

Greg

Andrew Dunstan wrote:
> Greg Mitchell wrote:
>>
>>
>>
>> Also, are there any ADTs like a hash-map or tree-map in the server
>> libraries (my background is C++ and I am used to having std::map<>) or do I
>> need to roll my own?
>>
>>
> 
> Look at the dynahash code. I just used it for the first time in a plperl 
> patch, and it's reasonably straightforward.
> 
> cheers
> 
> andrew
> 


Re: Custom Data Type Question

From: Andrew Dunstan
Date:
Greg Mitchell wrote:
>
>
>
> Also, are there any ADTs like a hash-map or tree-map in the server
> libraries (my background is C++ and I am used to having std::map<>) or do I
> need to roll my own?
>
>

Look at the dynahash code. I just used it for the first time in a plperl 
patch, and it's reasonably straightforward.
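For comparison, rolling your own in C is doable but tedious. A minimal chained hash table mapping string labels to int codes might look like the sketch below; this is a standalone illustration of the shape of the problem, not dynahash itself, and every name in it is invented:

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 64

typedef struct Entry
{
    const char   *key;
    int           value;
    struct Entry *next;     /* chain for bucket collisions */
} Entry;

typedef struct Map
{
    Entry *buckets[NBUCKETS];
} Map;

/* djb2-style string hash, reduced to a bucket index. */
static unsigned
hash_str(const char *s)
{
    unsigned    h = 5381;

    while (*s)
        h = h * 33 + (unsigned char) *s++;
    return h % NBUCKETS;
}

void
map_put(Map *m, const char *key, int value)
{
    unsigned    h = hash_str(key);
    Entry      *e = malloc(sizeof(Entry));

    e->key = key;
    e->value = value;
    e->next = m->buckets[h];    /* push onto the bucket's chain */
    m->buckets[h] = e;
}

/* Returns 1 and sets *value on a hit, 0 on a miss. */
int
map_get(const Map *m, const char *key, int *value)
{
    for (Entry *e = m->buckets[hash_str(key)]; e; e = e->next)
        if (strcmp(e->key, key) == 0)
        {
            *value = e->value;
            return 1;
        }
    return 0;
}
```

dynahash's hash_create()/hash_search() give you the same thing with memory-context integration thrown in, which is why it is the better choice inside the server.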

cheers

andrew



Re: Custom Data Type Question

From: Tom Lane
Date:
Greg Mitchell <gmitchell@atdesk.com> writes:
> As far as memory management goes, do I just use hash_create() and assign 
> that pointer to fn_extra and at the end of the query it will be freed? 
> Or will it not be freed until this end of the transaction? I'm really 
> having trouble understanding the memory management issues with Postgres.

You have to be careful that the hashtable is created in the correct
"memory context" --- in this case you want it to be in a query-lifespan
context, not the short-term (per-tuple-lifespan) context that your
function will be called in.  The usual procedure for cases like this is
to use the context identified by fn_mcxt.  src/backend/utils/mmgr/README
might make useful reading for you.
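The idea can be pictured with a toy stand-in (this is not the real MemoryContext API; ToyContext and its functions are invented for illustration): everything allocated in a context lives until the context is reset, at which point it is all freed in one go. A hashtable built in the query-lifespan context therefore survives per-tuple calls and disappears with the query.

```c
#include <stdlib.h>

/* Header prepended to every allocation so the context can find it
 * again at reset time. */
typedef struct Chunk
{
    struct Chunk *next;
} Chunk;

typedef struct ToyContext
{
    Chunk *chunks;   /* singly linked list of live allocations */
    int    nlive;
} ToyContext;

/* Allocate memory that belongs to the context. */
void *
ctx_alloc(ToyContext *cxt, size_t size)
{
    Chunk *c = malloc(sizeof(Chunk) + size);

    c->next = cxt->chunks;
    cxt->chunks = c;
    cxt->nlive++;
    return c + 1;            /* caller's memory sits after the header */
}

/* Free everything allocated in the context at once, the way a
 * query-lifespan context is reset when the query ends. */
void
ctx_reset(ToyContext *cxt)
{
    while (cxt->chunks)
    {
        Chunk *next = cxt->chunks->next;

        free(cxt->chunks);
        cxt->chunks = next;
    }
    cxt->nlive = 0;
}
```

The practical consequence: you never free the fn_extra cache yourself; you just make sure it was allocated in fn_mcxt, and the context reset cleans it up.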
        regards, tom lane


Re: Custom Data Type Question

From: "Simon Riggs"
Date:
On Wed, 2006-11-15 at 16:38 -0500, Andrew Dunstan wrote:

> My little enumkit tool allows you to create enumerations today very 
> easily, but its values are completely hardcoded. However, the above 
> trick still works. The downside is that each enumeration type requires a 
> tiny bit of compilation.

Andrew,

Your enum sounds good, apart from the hardcoded/compilation thing. That
is a data management nightmare AFAICS and so restricts the usefulness of
the solution.

It would be much better to read things dynamically into an array, so
using an init function in *preload_libraries would work well.

I'd also love any suggestions as to how we might be able to use a
similar local-data-cacheing mechanism to work when we specify SQL
standard FOREIGN KEYs. It would be really cool to say USING LOCAL CACHE
or some way of avoiding the overhead of all those stored after triggers
and SPI SELECT statements when we've got checks against tables with only
a few rows where the values hardly ever change. The enum concept departs
radically from the declarative Referential Integrity concepts that many
of us are already used to. I'd like to be able to speed things up
without radical re-design of the database... so a few nicely sprinkled
ALTER TABLE statements would be a much better way of implementing this
IMHO.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Custom Data Type Question

From: Andrew Dunstan
Date:

Simon Riggs wrote:
> On Wed, 2006-11-15 at 16:38 -0500, Andrew Dunstan wrote:
>
>   
>> My little enumkit tool allows you to create enumerations today very 
>> easily, but its values are completely hardcoded. However, the above 
>> trick still works. The downside is that each enumeration type requires a 
>> tiny bit of compilation.
>>     
>
> Andrew,
>
> Your enum sounds good, apart from the hardcoded/compilation thing. That
> is a data management nightmare AFAICS and so restricts the usefulness of
> the solution.
>
>
>   

Simon, Tom Dunstan has submitted a patch for first class enum types that 
do not have the compilation requirement - it's in the queue for 8.3. You 
might want to review that.

cheers

andrew


Re: Custom Data Type Question

From: Andrew Dunstan
Date:
Simon Riggs wrote:
> On Sat, 2006-11-18 at 10:54 -0500, Andrew Dunstan wrote:
>
>   
>>> Your enum sounds good, apart from the hardcoded/compilation thing. That
>>> is a data management nightmare AFAICS and so restricts the usefulness of
>>> the solution.
>>>       
>
>   
>> Simon, Tom Dunstan has submitted a patch for first class enum types that 
>> do not have the compilation requirement - it's in the queue for 8.3. You 
>> might want to review that.
>>     
>
> Well, the link to previous discussion ends: How about being more
> specific about what you are trying to accomplish? My thoughts:
>
> IMHO we need to optimise FOREIGN KEY checks. One way of doing that is by
> having enums that build the allowable values into the datatypes, I can
> think of others. An overall evaluation of the various approaches should
> be made before we settle on a specific one.
>   

Using the submitted patch, FK checks should be VERY fast - the stored 
values are just oids.

> My requirements list would be to allow FOREIGN KEYS to
> - be specified in SQL standard syntax
> - work as fast as CHECK (col IN (1,2,3))
>   

If I understand it, you are really not talking about doing an FK check 
on an enum value, but rather using an FK check as a means of validating 
an enum. That is not what we are talking about. But the validity checks 
will be at least as fast as any check constraint.

> - use less memory and completely avoid any spill-to-disk nightmare(*)
>   

See above.

> - have the list of allowable values to be dynamically updateable,
> automatically as INSERTs/DELETEs occur on the referenced table
>   

Why? People seem so hung up on this. If you want a dynamically updatable 
set, then use a reference table. The whole point of this exercise was to 
provide first class enum types that work *just like other types*. If 
you want to change a column's type, you do 'alter table foo alter column 
bar type newtype'. And so you would if you need to change to a different 
enum type. What if you deleted a value in the allowed set? Boom goes 
data integrity.

> - optimize without needing to change/reload database, just by adding
> minimum number of statements (zero being the best)
>
> (*) doesn't exist yet, that's why it's a nightmare
>   

I should add that it would have been nice if this discussion had 
happened back in August when the work was being done.

cheers

andrew




Re: Custom Data Type Question

From: "Simon Riggs"
Date:
On Sat, 2006-11-18 at 10:54 -0500, Andrew Dunstan wrote:

> > Your enum sounds good, apart from the hardcoded/compilation thing. That
> > is a data management nightmare AFAICS and so restricts the usefulness of
> > the solution.

> Simon, Tom Dunstan has submitted a patch for first class enum types that 
> do not have the compilation requirement - it's in the queue for 8.3. You 
> might want to review that.

Well, the link to previous discussion ends: How about being more
specific about what you are trying to accomplish? My thoughts:

IMHO we need to optimise FOREIGN KEY checks. One way of doing that is by
having enums that build the allowable values into the datatypes, I can
think of others. An overall evaluation of the various approaches should
be made before we settle on a specific one.

My requirements list would be to allow FOREIGN KEYS to
- be specified in SQL standard syntax
- work as fast as CHECK (col IN (1,2,3))
- use less memory and completely avoid any spill-to-disk nightmare(*)
- have the list of allowable values to be dynamically updateable,
automatically as INSERTs/DELETEs occur on the referenced table
- optimize without needing to change/reload database, just by adding
minimum number of statements (zero being the best)

(*) doesn't exist yet, that's why it's a nightmare

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Custom Data Type Question

From: "Simon Riggs"
Date:
On Mon, 2006-11-20 at 16:30 -0500, Andrew Dunstan wrote:

> > IMHO we need to optimise FOREIGN KEY checks. One way of doing that is by
> > having enums that build the allowable values into the datatypes, I can
> > think of others. An overall evaluation of the various approaches should
> > be made before we settle on a specific one.
> >   
> 
> Using the submitted patch, FK checks should be VERY fast - the stored 
> values are just oids.

I see that they would be.

> > My requirements list would be to allow FOREIGN KEYS to
> > - be specified in SQL standard syntax
> > - work as fast as CHECK (col IN (1,2,3))
> >   
> 
> If I understand it, you are really not talking about doing an FK check 
> on an enum value, but rather using an FK check as a means of validating 
> an enum. That is not what we are talking about. But the validity checks 
> will be at least as fast as any check constraint.

Not really talking about enums at all. Just trying to understand how to
optimize the general case of Foreign Keys, which I have suddenly
realised is exactly what you are trying to achieve with enums.

> Boom goes 
> data integrity.

My point exactly (assuming I didn't quote you out of context).

> I should add that it would have been nice if this discussion had 
> happened back in August when the work was being done.

I'm happy to apologise. I hadn't made the connection at that point
between the role of an enum datatype and the need to optimise FKs.

Since MySQL has an enum type, it will aid portability to allow an enum
type. However, I'm interested in providing a mechanism to speed up FKs
without the *need* to adopt domain-specific datatypes.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Custom Data Type Question

From: "Andrew Dunstan"
Date:
Simon Riggs wrote:
> On Mon, 2006-11-20 at 16:30 -0500, Andrew Dunstan wrote:

>> I should add that it would have been nice if this discussion had
>> happened back in August when the work was being done.
>
> I'm happy to apologise. I hadn't made the connection at that point
> between the role of an enum datatype and the need to optimise FKs.
>
> Since MySQL has an enum type, it will aid portability to allow an enum
> type. However, I'm interested in providing a mechanism to speed up FKs
> without the *need* to adopt domain-specific datatypes.
>

No problem.

The original genesis of this work was an article I wrote for O'Reilly last
year that looked at the taxonomy of enums. Essentially, the idea was to
get the advantages of enums done using lookup tables without the
disadvantages. Enumkit was a first step, but had the disadvantage of
needing compilation and having values hardcoded. Tom's work is essentially
to provide the same facility but with grammar and catalog support and
without requiring compilation.

Note that MySQL enum types are essentially anonymous types, and have a
number of very odd quirks. What Tom has submitted is, in my opinion, enums
done right, and orthogonally with the rest of the type system. IOW, a very
typical piece of Postgres development :-)

cheers

andrew




Re: Custom Data Type Question

From: Tom Dunstan
Date:
Andrew Dunstan wrote:
> Simon Riggs wrote:
>> My requirements list would be to allow FOREIGN KEYS to
>> - be specified in SQL standard syntax
>> - work as fast as CHECK (col IN (1,2,3))
>>   
> 
> If I understand it, you are really not talking about doing an FK check 
> on an enum value, but rather using an FK check as a means of validating 
> an enum. That is not what we are talking about. But the validity checks 
> will be at least as fast as any check constraint.

Right. Enums (as implemented) require no explicit FK check. When you try 
to enter an enum, the input function does a syscache lookup using the 
typoid and the string value entered and returns the enum value OID. 
There's probably a point at which the syscache lookup becomes faster 
than a CHECK constraint, as I assume that the CHECK will get slower 
linearly as the number of allowed values increases. That number might be 
quite small.
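The linear-growth point can be made concrete with a toy illustration. This has nothing to do with the actual executor; it just shows the shape of the cost: a CHECK (col IN (...)) amounts to a chain of comparisons, so accepting a value costs more the further down the list it sits, while a hash-based (syscache-style) lookup is roughly constant. The color list and function names below are invented for the sketch.

```c
#include <string.h>

static const char *allowed[] = { "red", "orange", "yellow", "green",
                                 "blue", "indigo", "violet" };

/* Returns the number of comparisons performed to accept the value,
 * or -1 if the value is not in the allowed list (after trying all
 * entries).  The counter makes the linear cost visible. */
int
check_in_cost(const char *value)
{
    int n = sizeof(allowed) / sizeof(allowed[0]);

    for (int i = 0; i < n; i++)
        if (strcmp(allowed[i], value) == 0)
            return i + 1;      /* i+1 comparisons to accept */
    return -1;                 /* n comparisons, then rejection */
}
```

With seven values the difference is noise; with dozens or hundreds, the constant-time lookup wins.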

>> - have the list of allowable values to be dynamically updateable,
>> automatically as INSERTs/DELETEs occur on the referenced table
>>   
> 
> Why? People seem so hung up on this. If you want a dynamically updatable 
> set, then use a reference table. The whole point of this exercise was to 
> provide first class enum types that work *just like other types*. If 
> you want to change a column's type, you do 'alter table foo alter column 
> bar type newtype'. And so you would if you need to change to a different 
> enum type. What if you deleted a value in the allowed set? Boom goes 
> data integrity.

Well, if there really is demand, there are some things that we could do 
here to make life a bit easier. Firstly, just adding a new value to the 
end of the enum list should be trivial (just add a new row to pg_enum) 
except in the case of OID wraparound. When that happens, or if they want 
to add a value in the middle or start of the enum, we could possibly do 
the create new type, alter referenced tables, drop old type, rename new 
type automagically. Dropping a value from an enum would be a matter of 
checking that no referencing tables had the to-be-dropped value stored, 
and I suppose locking them while the delete from pg_enum is performed. 
Maybe that would be easy, maybe hard, but these things aren't 
impossible, just more work than it seemed it was worth at the time. If 
other people have use cases that require changing these more than we 
anticipated, however, maybe they'll leap forward with contributions. :)

What I *would* say, though, is that if people want these to be 
sufficiently dynamic that they can ever foresee using code rather than a 
schema script to change them, then they're using the wrong solution, and 
should go back to using a table.

Cheers

Tom



Re: Custom Data Type Question

From: Tom Dunstan
Date:
Simon Riggs wrote:
> I'd also love any suggestions as to how we might be able to use a
> similar local-data-cacheing mechanism to work when we specify SQL
> standard FOREIGN KEYs. It would be really cool to say USING LOCAL CACHE
> or some way of avoiding the overhead of all those stored after triggers
> and SPI SELECT statements when we've got checks against tables with only
> a few rows where the values hardly ever change.

Uh, sounds like an enum is a perfect fit. :) This is certainly one of 
the use-cases that I've encountered when I wished that I had had an enum 
type to use.

> The enum concept departs
> radically from the declarative Referential Integrity concepts that many
> of us are already used to.

I have to challenge this. It's *just another type*. Is using a boolean 
type a radical departure from RI just because you're not referencing 
some external table with the definitions for true and false in it? After 
all, from a functional point of view, booleans are just another 
enumerated type.

A major annoyance with SQL has been that it hasn't had a good solution 
for this pattern. I've seen any number of broken solutions, from lots of 
little mostly-static tables littered all over your data model, to single 
big code tables that every other table references, and for which you 
need triggers to enforce data integrity because standard RI doesn't 
work, to chars and varchars with incorrect ordering or meaningless names 
or which suck storage-wise. Don't even get me started on MySQL enums.

The reason that I wanted to do the enum patch was because *all* of those 
solutions suck. Requiring a table to represent a small fixed set of 
allowable values that a column should take is broken. But because it's 
the least ugly solution that we've had using vanilla SQL, it's what 
we've used, and dare I suggest that because we've all done it for so 
long, we start to think that *not* doing it that way is broken.

Enums, as implemented in the patch, are reasonably efficient, typesafe 
and properly ordered. Plus they make your data model look cleaner, your 
queries don't need to have lookups anymore and you use less disk space. 
Oh, and they also bring you coffee and put out the trash :)

> I'd like to be able to speed things up
> without radical re-design of the database... so a few nicely sprinked
> ALTER TABLE statements would be a much better way of implementing this
> IMHO.

OK, back to what you'd like to do. :)

If your external tables are so small and static, just how long does the 
FK check take? Are they really that slow?

I would have thought that it might be difficult to get rid of the FK 
check altogether, but perhaps, in the context of a single query (I'm 
thinking a bulk insert) you could have some sort of LRU cache. If you 
want the cache to stick around, you've got to deal with what happens 
when it goes out of date... notifying all the backends etc, and what 
happens when one of the other backends was halfway through a 
transaction.  Maybe you could set this "cached mode" on, but would have 
to switch it off before updating the tables in question or something 
like that. I dunno.  That stuff sounds hard; I found it easier to just 
implement my own type ;)

Cheers

Tom



Re: Custom Data Type Question

From: "Simon Riggs"
Date:
On Tue, 2006-11-21 at 02:51 +0000, Tom Dunstan wrote:

> Requiring a table to represent a small fixed set of 
> allowable values that a column should take is broken. But because
> it's 
> the least ugly solution that we've had using vanilla SQL, it's what 
> we've used, and dare I suggest that because we've all done it for so 
> long, we start to think that *not* doing it that way is broken.

I do support your goal of higher performance.

Putting data in tables is reasonably accepted practice, round here at
least. 

I see the strong need to optimise the case where people want/need to
follow the SQL standard and have defined their databases that way. There
is also the need to support DELETE RESTRICT functionality from the
referenced to the referencing table, as a protection against data
quality problems. A link between two tables is important - otherwise we
introduce another DBA task and the possibility of error that results.

If there is a body of opinion behind enums, then that's good. The MySQL
way is not something to be ignored and that is a good argument for
inclusion. I've got no problem with multiple ways of doing things. 

In the long run, as currently envisaged, enums don't do all that I would
like. I see the need to performance tune Referential Integrity more
directly.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Custom Data Type Question

From: Andrew Dunstan
Date:
Simon Riggs wrote:
> On Tue, 2006-11-21 at 02:51 +0000, Tom Dunstan wrote:
>
> In the long run, as currently envisaged, enums don't do all that I would
> like. 

In a sense they do more than you want. They will provide the ability to 
set values much faster than anything using an FK constraint, I expect, 
and without having to use any explicit constraint.

> I see the need to performance tune Referential Integrity more
> directly.
>
>   

Sure. Go for it. As far as enums go, the only cases I can think of where 
that will have any application are:
. you don't use enums because you want strictly vanilla SQL, or
. you don't use enums because you want to be able to alter the set of 
allowed values arbitrarily.

That still leaves lots of applications (e.g. those I work on in my day 
job) that will benefit from enums.

cheers

andrew