WIP: Generic functions for Node types using generated metadata

Поиск
Список
Период
Сортировка
От Andres Freund
Тема WIP: Generic functions for Node types using generated metadata
Дата
Msg-id 20190828234136.fk2ndqtld3onfrrp@alap3.anarazel.de
обсуждение исходный текст
Ответы Re: WIP: Generic functions for Node types using generated metadata  (Fabien COELHO <coelho@cri.ensmp.fr>)
Re: WIP: Generic functions for Node types using generated metadata  (Andres Freund <andres@anarazel.de>)
Список pgsql-hackers
Hi,

There's been a lot of complaints over the years about how annoying it is
to keep the out/read/copy/equalfuncs.c functions in sync with the actual
underlying structs.

There've been various calls for automating their generation, but no
actual patches that I am aware of.

There also recently has been discussion about generating more efficient
memory layout for node trees that we know are read only (e.g. plan trees
inside the plancache), and about copying such trees more efficiently
(e.g. by having one big allocation, and then just adjusting pointers).


One way to approach this problem would be to to parse the type
definitions, and directly generate code for the various functions. But
that does mean that such a code-generator needs to be expanded for each
such functions.

An alternative approach is to have a parser of the node definitions that
doesn't generate code directly, but instead generates metadata. And then
use that metadata to write node aware functions.  This seems more
promising to me.

In the attached, *very experimental WIP*, patch I've played around with
this approach.  I have written a small C program that uses libclang
(more on that later) to parse relevant headers, and generate metadata
about node types.

Currently the generated metadata roughly looks like this:

#define TYPE_CAT_SCALAR (1 << 0) // FIXME: this isn't well named rn
#define TYPE_CAT_POINTER (1 << 1)
#define TYPE_CAT_UNION (1 << 2) // FIXME: unused
#define TYPE_CAT_INCOMPLETE (1 << 3)

#define KNOWN_TYPE_UNKNOWN 0
#define KNOWN_TYPE_STRING 1
#define KNOWN_TYPE_INT 2
#define KNOWN_TYPE_NODE 3
#define KNOWN_TYPE_BITMAPSET 4

typedef struct NodeTypeInfo
{
    /* FIXME: intern me */
    const char *structname;
    int16 nelems;
    int16 start_at;
    int16 size;
} NodeTypeInfo;

typedef struct NodeTypeComponents
{
    /* FIXME: intern me */
    const char *fieldname;
    /* FIXME: intern me */
    const char *fieldtype;
    int16 offset;
    int16 size;
    int16 flags;
    int16 node_type_id;
    int16 known_type_id;
} NodeTypeComponents;

NodeTypeInfo node_type_info[]  = {
    {0},
    {.structname = "IndexInfo", .nelems = 22, .start_at = 2128, .size = sizeof(IndexInfo)},
...
    {.structname = "Append", .nelems = 4, .start_at = 1878, .size = sizeof(Append)},
        ...

NodeTypeComponents node_type_components[] = {
...
    {.fieldname = "plan", .fieldtype = "struct Plan", .offset = offsetof(struct Append, plan), .size = sizeof(struct
Plan),.flags = TYPE_CAT_SCALAR, .node_type_id = 9, .known_type_id = KNOWN_TYPE_UNKNOWN},
 
    {.fieldname = "appendplans", .fieldtype = "struct List *", .offset = offsetof(struct Append, appendplans), .size =
sizeof(structList *), .flags = TYPE_CAT_POINTER, .node_type_id = 223, .known_type_id = KNOWN_TYPE_UNKNOWN},
 
    {.fieldname = "first_partial_plan", .fieldtype = "int", .offset = offsetof(struct Append, first_partial_plan),
.size= sizeof(int), .flags = TYPE_CAT_SCALAR, .node_type_id = -1, .known_type_id = KNOWN_TYPE_UNKNOWN},
 
    {.fieldname = "part_prune_info", .fieldtype = "struct PartitionPruneInfo *", .offset = offsetof(struct Append,
part_prune_info),.size = sizeof(struct PartitionPruneInfo *), .flags = TYPE_CAT_POINTER, .node_type_id = 53,
.known_type_id= KNOWN_TYPE_UNKNOWN},
 
...

i.e. each node type is listed in node_type_info, indexed by the NodeTag
value. The fields building up the node are defined in
node_type_components.  By including the size for Node structs, and
components, and by including the offset for components, we can write
generic routines accessing node trees.


In the attached I used this metadata to write a function that can
compute the amount of memory a node tree needs (without currently
considering memory allocator overhead).  The function for doing so has a
pretty limited amount of type awareness - it needs to know about
strings, the various list types, value nodes and Bitmapset.

I'm fairly sure this metadata can also be used to write the other
currently existing node functions.

I *suspect* that it'll be quite OK performancewise, compared to bespoke
code, but I have *NOT* measured that.


The one set of fields this currently can not deal with is the various
arrays that we directly reference from nodes. For e.g.

typedef struct Sort
{
    Plan        plan;
    int            numCols;        /* number of sort-key columns */
    AttrNumber *sortColIdx;        /* their indexes in the target list */
    Oid           *sortOperators;    /* OIDs of operators to sort them by */
    Oid           *collations;        /* OIDs of collations */
    bool       *nullsFirst;        /* NULLS FIRST/LAST directions */
} Sort;

the generic code has no way of knowing that sortColIdx, sortOperators,
collations, nullsFirst are all numCols long arrays.

I can see various ways of dealing with that:

1) We could convert them all to lists, now that we have fast arbitrary
   access. But that'd add a lot of indirection / separate allocations.

2) We could add annotations to the sourcecode, to declare that
   association. That's probably not trivial, but wouldn't be that hard -
   one disadvantage is that we probably couldn't use that declaration
   for automated asserts etc.

3) We could introduce a few macros to create array type that include the
   length of the members.  That'd duplicate the lenght for each of those
   arrays, but I have a bit of a hard time believing that that's a
   meaningful enough overhead.

   I'm thinking of a macro that'd be used like
   ARRAY_FIELD(AttrNumber) *sortColIdx;
   that'd generate code like
   struct
   {
       size_t nmembers;
       AttrNumber members[FLEXIBLE_ARRAY_MEMBER];
   } *sortColIdx;

   plus a set of macros (or perhaps inline functions + macros) to access
   them.


With any of these we could add enough metadata for node type members to
be able to handle such arrays in a generic manner, rather than having to
write code for each of them.


With regards to using libclang for the parsing: I chose that because it
seemed the easiest to experiment with, compared to annotating all the
structs with enough metadata to be able to easily parse them from a perl
script. The node definitions are after all distributed over quite a few
headers.

I think it might even be the correct way forward, over inventing our own
mini-languages and writing ad-hoc parsers for those. It sure is easier
to understand plain C code, compared to having to understand various
embeded mini-languages consisting out of macros.

The obvious drawback is that it'd require more people to install
libclang - a significant imposition. I think it might be OK if we only
required it to be installed when changing node types (although it's not
necessarily trivial how to detect that node types haven't changed, if we
wanted to allow other code changes in the relevant headers), and stored
the generated code in the repository.

Alternatively we could annotate the code enough to be able to write our
own parser, or use some other C parser.


I don't really want to invest significantly more time into this without
first debating the general idea.


The attached patch really just is a fairly minimal prototype. It does
not:
- properly integrate into the buildsystem, to automatically re-generate
  the node information when necessary - instead one has to
  "make -C src/backend/catalog gennodes-run"
- solve the full problem (see discussion of referenced arrays above)
- actually use the metadata in a useful manner (there just is a function
  that estimates node tree sizes, which is used for parse/plan trees)
- have clean code


One interesting - but less important - bit is that the patch currently
has the declared node type id available for node members, e.g:

typedef struct FieldStore
{
    Expr        xpr;
    Expr       *arg;            /* input tuple value */
    List       *newvals;        /* new value(s) for field(s) */
    List       *fieldnums;        /* integer list of field attnums */
    Oid            resulttype;        /* type of result (same as type of arg) */
    /* Like RowExpr, we deliberately omit a typmod and collation here */
} FieldStore;

but we cannot actually rely on arg actually pointing to a node of type
Expr* - it'll always point to a "subclass".  Obviously we can handle
that by dispatching using nodeTag(arg), but it'd be more efficient if we
didn't need that.  And we also loose some error checking due to that.

I wonder if we could collect enough metadata to determine which
potential subclasses a node type has. And then dispatch more efficiently
when only one, and assert that it's a legal subclass for the other
cases.

Greetings,

Andres Freund

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Ryan Lambert
Дата:
Сообщение: Re: FETCH FIRST clause PERCENT option
Следующее
От: Michael Paquier
Дата:
Сообщение: Re: REINDEX filtering in the backend