Re: patch: function xmltable

Поиск
Список
Период
Сортировка
От Pavel Stehule
Тема Re: patch: function xmltable
Дата
Msg-id CAFj8pRB5NAOgzQPPZD0uXX3LBG6uE9N+PKyf0uf9NJ5VHiBs-w@mail.gmail.com
обсуждение исходный текст
Ответ на Re: patch: function xmltable  (Craig Ringer <craig@2ndquadrant.com>)
Ответы Re: patch: function xmltable  (Pavel Stehule <pavel.stehule@gmail.com>)
Re: patch: function xmltable  (Craig Ringer <craig@2ndquadrant.com>)
Список pgsql-hackers
Hi

2016-09-06 6:54 GMT+02:00 Craig Ringer <craig@2ndquadrant.com>:
On 4 September 2016 at 16:06, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> Hi
>
> minor update - using DefElem instead own private parser type

I'm really glad that you're doing this and I'll take a look at it for this CF.

It's quite a big patch so I expect this will take a few rounds of
review and updating.

Thank you for review
 


Patch applies cleanly and builds cleanly on master both with and
without --with-xml .

Overall, I think this needs to be revised with appropriate comments.
Whitespace/formatting needs fixing since it's all over the place.
Documentation is insufficient (per notes below).

I am not able to write documentation in English language :( - This function is pretty complex - so I hope so anybody with better language skills can help with this. It respects standard and it respects little bit different Oracle's behave too (different order of DEFAULT and PATH parts). 
 

Re identifier naming, some of this code uses XmlTable naming patterns,
some uses TableExpr prefixes. Is that intended to indicate a bounary
between things re-usable for other structured data ingesting
functions? Do you expect a "JSONEXPR" or similar in future? That's
alluded to by

This structure should be reused by JSON_TABLE function. Now, it is little bit strange, because there is only XMLTABLE implementation - and I have to choose between a) using two different names now, b) renaming some part in future.

And although XMLTABLE and JSON_TABLE functions are pretty similar - share 90% of data (input value, path, columns definitions), these functions has different syntax - so only middle level code should be shared.
 

+/*----------
+ * TableExpr - used for XMLTABLE function
+ *
+ * This can be used for json_table, jsonb_table functions in future
+ *----------
+ */
+typedef struct TableExpr
+{
...

If so, should this really be two patches, one to add the table
expression infrastructure and another to add XMLTABLE that uses it?
Also, why in that case does so much of the TableExpr code call
directly into XmlTable code? It doesn't look very generic.

Currently the common part is not too big - just the Node related part - I am not sure about necessity of two patches. I am agree, there is missing some TableExpBuilder, where can be better isolated the XML part.  
 

Overall I find identifier naming to be a bit inconsisent and think
it's necessary to make it clear that all the "TableExpr" stuff is for
XMLTABLE specifically, if that's the case, or make the delineation
clearer if not.

I'd also like to see tests that exercise the ruleutils get_rule_expr
parts of the code for the various XMLTABLE variants.

Similarly, since this seems to add a new xpath parser, that needs
comprehensive tests. Maybe re-purpose an existing xpath test data set?


sure
 



More detailed comments:
====

Docs comments:

      The <function>xmltable</function> produces [a] table based on
[the] passed XML value.

The docs are pretty minimal and don't explain the various clauses of
XMLTABLE. What is "BY REF" ? Is PATH an xpath expression? If so, is
there a good cross reference link available? The PASSING clause? etc.

How does XMLTABLE decide what to iterate over, and how to iterate over it?

Presumably the FOR ORDINALITY clause makes a column emit a numeric counter.

What standard, if any, does this conform to? Does it resemble
implementations elsewhere? What limitations or unsupported features
does it have relative to those standards?



execEvalTableExpr seems to be defined twice, with a difference in
case. This is probably not going to fly:


+static Datum
+execEvalTableExpr(TableExprState *tstate,
+                        ExprContext *econtext,
+                        bool *isNull, ExprDoneCond *isDone)
+{

+static Datum
+ExecEvalTableExpr(TableExprState *tstate,
+                        ExprContext *econtext,
+                        bool *isNull, ExprDoneCond *isDone)
+{


It looks like you've split the function into a "guts" and "wrapper"
part, with the error handling PG_TRY / PG_CATCH block in the wrapper.
That seems reasonable for readability, but the naming isn't.

I invite any idea how these functions should be named.
 

A comment is needed to explain what ExecEvalTableExpr is / does. If
it's XMLTABLE specific (which it looks like based on the code), its
name should reflect that. This pattern is repeated elsewhere; e.g.
TableExprState is really the state for an XMLTABLE expression. But
PostgreSQL actually has TABLE statements, and in future we might want
to support table-expressions, so I don't think this naming is
appropriate. This is made worse by the lack of comments on things like
the definition of TableExprState. Please use something that makes it
clear it's for XMLTABLE and add appropriate comments.

I understand, so using TableExpr can be strange (for XMLTABLE function). But when we will have JSON_TABLE function, then it will have a sense.

"TableExprState" is consistent with "TableExpr".

Any idea how it should be changed?

 

Formatting of variables, arguments, function signatures etc is
random/haphazard and doesn't follow project convention. It's neither
aligned or unaligned in the normal way, I don't understand the random
spacing at all. Maybe you should try to run pgindent and then extract
just the changes related to your patch? Or run your IDE/editor's
indent function on your changes? Right now it's actually kind of hard
to read. Do you edit with tabstop set to 1 normally or something like
that?

There's a general lack of comments throughout the added code.

In execEvalTableExpr, why are we looping over namespaces? What's that
for? Comment would be nice.

Typo: Path caclulation => Path calculation

What does XmlTableSetRowPath() do? It seems to copy its argument.
Nothing further is done with the row_path argument after it's called
by execEvalTableExpr, so what context is that memory in and do we have
to worry about it if it's large?

execEvalTableExpr says it's doing "path calculation". What it actually
appears to do is evaluate the path expressions, if provided, and
otherwise use the column name as the implied path expression. (The
docs should mention that).

It's wasn't immediately obvious to me what the branch around
tstate->for_ordinality_col   is for and what the alternate path's
purpose is in terms of XMLTABLE's behaviour, until I read the parser
definition. That's largely because the behaviour of XMLTABLE is
underspecified in the docs, since once you know ORDINALITY columns
exist it's pretty obvious what it's doing.

Similarly, for the alternate branch   tstate->ncols   , the
XmlTableGetRowValue call there is meant to do what exactly, and
why/under what conditions? Is it for situations where the field type
is a whole-row value? a composite type? (I'm deliberately not studying
this too deeply, these are points I'd like to see commented so it can
be understood to some reasonable degree at a skim-read).


                /* result is one more columns every time */
"one or more"



                /* when typmod is not valid, refresh it */
                if (te->typmod == -1)


Is this a cache? How is it valid or not valid and when? The comment
(thanks!) on TableExprGetTupleDesc says:

/*
 * When we skip transform stage (in view), then TableExpr's
 * TupleDesc should not be valid. Refresh is necessary.
 */

but I'm not really grasping what you're trying to explain here. What
transform stage? What view? This could well be my ignorance of this
part of the code; if it should be understandable by a reader who is
appropriately familiar with the executor that's fine, but if it's
specific to how XMLTABLE works some more explanation would be good.

This is most difficult part of this patch, and I am not sure it it is fully correctly implemented. I use TupleDesc cache. The TupleDesc is created in parser/transform stage. When the XMLTABLE is used in some view, then the transformed parser tree is materialized - and when the view is used in query, then this tree is loaded and the parser/transform stage is "skipped". I'll check this code against implementation of ROW constructor and I'll try to do more comments there.

 

Good that you've got all the required node copy/in/out funcs in place.

Please don't use the name "used_dns". Anyone reading that will read it
as "domain name service" and that's actually confusing with XML
because of XML schema lookups. Maybe used_defnamespace ?   used
def_ns?

good idea
 

I haven't looked closely at keyword/parser changes yet, but it doesn't
look like you added any reserved keywords, which is good. It does add
unreserved keywords PATH and COLUMNS ; I'm not sure what policy for
unreserved keywords is or the significance of that.

New ereport() calls specify ERRCODEs, which is good.

PostgreSQL already has XPATH support in the form of xmlexists(...)
etc. Why is getXPathToken() etc needed? What re-use is possible here?
There's no explanation in the patch header or comments. Should the new
xpath parser be re-used by the existing xpath stuff? Why can't we use
libxml's facilities? etc. This at least needs explaining in the
submission, and some kind of hint as to why we have two different ways
to do it is needed in the code. If we do need a new XML parser, should
it be bundled in adt/xml.c along with a lot of user-facing
functionality, or a separate file?


libxml2 and our XPATH function doesn't support default namespace ( http://plasmasturm.org/log/259/ ). This is pretty useful feature - so I implemented. This is the mayor issue of libxml2 library. Another difference between XPATH function and XMLTABLE function is using two phase searching and implicit prefix "./" and suffix ("/text()") in XMLTABLE. XMLTABLE using two XPATH expressions - for row data cutting and next for column data cutting (from row data). The our XPATH functions is pretty simple mapped to libxml2 XPATH API. But it is not possible with XMLTABLE function - due design of this function in standard (it is more user friendly and doesn't require exactly correct xpath expressions).

I didn't find any API in libxml2 for a work with parsed xpath expressions - I need some info about the first and last token of xpath expression - it is base for decision about using prefix or suffix.

This functionality (xpath expression parser) cannot be used for our XPATH function now - maybe default namespace in future.
 

How does XmlTableGetValue(...) and XmlTableGetRowValue(...) relate to
this? It doesn't look like they're intended to be called directly by
the user, and they're not documented (or commented).

Probably I used wrong names. XMLTABLE function is running in two different modes - with explicitly defined columns (XmlTableGetValue is used), and without explicitly defined columns - so result is one XML column and only one one step searching is used (there are not column related xpath expressions) ( XmlTableGetRowValue is used). The function XmlTableGetValue is used for getting one column value, the function XmlTableGetRowValue is used for getting one value too, but in special case, when there are not any other value.

I don't understand this at all:



+/*
+ * There are different requests from XMLTABLE, JSON_TABLE functions
+ * on passed data than has CREATE TABLE command. It is reason for
+ * introduction special structure instead using ColumnDef.
+ */
+typedef struct TableExprRawCol
+{
+    NodeTag     type;
+    char       *colname;
+    TypeName   *typeName;
+    bool        for_ordinality;
+    bool        is_not_null;
+    Node       *path_expr;
+    Node       *default_expr;
+    int         location;
+} TableExprRawCol;



I am sorry. It is my fault. Now we have very similar node ColumnDef. This node is designed for usage in utility commands - and it is not designed for usage inside a query. I had to decide between enhancing ColumnDef node or introduction new special node. Because there are more special attributes and it is hard to serialize current ColumnDef, I decided to use new node.
 

That's my first-pass commentary. I'll return to this once you've had a
chance to take a look at these and tell me all the places I got it
wrong ;)


Thank for this

Regard

Pavel


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: [PATCH] COPY vs \copy HINT
Следующее
От: Tom Lane
Дата:
Сообщение: Re: [PATCH] Alter or rename enum value