Re: Using a single standalone-backend run in initdb (was Re: Bootstrap DATA is a pita)

Поиск
Список
Период
Сортировка
От Craig Ringer
Тема Re: Using a single standalone-backend run in initdb (was Re: Bootstrap DATA is a pita)
Дата
Msg-id CAMsr+YEPN0R91U=bBeWHKT2Xi2aji6PXwFkVgvM88oLi5bOmAA@mail.gmail.com
обсуждение исходный текст
Ответ на Using a single standalone-backend run in initdb (was Re: Bootstrap DATA is a pita)  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: Using a single standalone-backend run in initdb (was Re: Bootstrap DATA is a pita)  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
On 13 December 2015 at 06:31, Tom Lane <tgl@sss.pgh.pa.us> wrote:
 
I'm not particularly wedded to this rule.  In principle we could go so
far as to import psql's code that parses commands and figures out which
semicolons are command terminators --- but that is a pretty large chunk
of code, and I think it'd really be overkill considering that initdb
deals only with fixed input scripts.  

Shouldn't that be a bison/flex job anyway, rather than hand-coded? Or a simple(ish) state machine?

Dollar-quoted strings are probably the only quite ugly bit due to their arbitrary delimiters. So I thought I'd sketch out how it'd look as a state machine. At which point I remembered that we allow $ in identifiers too. So the state machine would have to bother with unquoted identifiers. Of course $ marks parameters, so it has to keep track of if it's reading a parameter. At which point you have half an SQL parser.

This strikes me as a really good reason for making it re-usable, because it's horrid to write code that handles statement splitting in the PostgreSQL dialect.

Optional handling of psql \commands would be required, but that'd make it easier for PgAdmin to support psql backslash commands, so there'd be a win there too.

I figured I'd sketch it out for kicks. Comment: yuck.

States would be at least:

SQL_TEXT
SQL_UNQUOTED_IDENTIFIER_OR_KEYWORD
NUMBER
QUOTED_IDENTIFIER
QUOTED_IDENTIFIER_QUOTE
SQL_TEXT_DOLLAR
DOLLAR_STRING_START_DELIM
DOLLAR_STRING
DOLLAR_STRING_DOLLAR
DOLLAR_STRING_END_DELIM
STANDARD_STRING
STANDARD_STRING_QUOTE
SQL_TEXT_E
ESCAPE_STRING
ESCAPE_STRING_ESCAPE

Transitions

SQL_TEXT => { SQL_TEXT, SQL_UNQUOTED_IDENTIFIER_OR_KEYWORD, NUMBER, QUOTED_IDENTIFIER, SQL_TEXT_DOLLAR, STANDARD_STRING, SQL_TEXT_E, ESCAPE_STRING }

SQL_TEXT_E => { SQL_TEXT, ESCAPE_STRING, SQL_UNQUOTED_IDENTIFIER_OR_KEYWORD }

SQL_TEXT_DOLLAR => { SQL_TEXT, NUMBER, DOLLAR_STRING_START_DELIM }

QUOTED_IDENTIFIER => { QUOTED_IDENTIFIER, QUOTED_IDENTIFIER_QUOTE }

QUOTED_IDENTIFIER_QUOTE => { SQL_TEXT, QUOTED_IDENTIFIER }

DOLLAR_STRING_START_DELIM => { DOLLAR_STRING_START_DELIM, DOLLAR_STRING }

DOLLAR_STRING => { DOLLAR_STRING, DOLLAR_STRING_DOLLAR }

DOLLAR_STRING_END_DELIM => { DOLLAR_STRING_END_DELIM, SQL_TEXT, DOLLAR_STRING }

STANDARD_STRING => { STANDARD_STRING, STANDARD_STRING_QUOTE }

STANDARD_STRING_QUOTE => { SQL_TEXT, STANDARD_STRING }

ESCAPE_STRING => { ESCAPE_STRING, ESCAPE_STRING_ESCAPE }

ESCAPE_STRING_ESCAPE => { SQL_TEXT, ESCAPE_STRING }


NUMBER consumes sequential digits and period chars and returns to SQL_TEXT at any non-digit. (That way it can handle Pg's lazy parser quirks like SELECT 123"f" being legal, and equivalent to SELECT 123 AS "f").

SQL_UNQUOTED_IDENTIFIER_OR_KEYWORD is needed because a $ within an identifier is part of the identifier so it can't just be consumed as SQL_TEXT .

For dollar strings, when a $ is found when reading SQL text (not an identifier/keyword), enter SQL_TEXT_DOLLAR. What comes next must be a parameter or the start of a dollar string. If the next char is a digit then it's a parameter so switch to NUMBER, since dollar-quoted string delims follow identifier rules and unquoted identifiers can't start with a number. Otherwise switch to DOLLAR_STRING_START_DELIM and consume until a $ is found or something illegal in an identifier is found. Or of course it could be lone $ which is bogus syntax but as far as we're concerned still just SQL_TEXT. Really, this is just looking for a dollar-quote start and doesn't care what it finds if it isn't a valid dollar-quote start.

If a valid dollar-quote delim is found switch to DOLLAR_STRING and read until we find the matching delim using a similar process, entering DOLLAR_STRING_DOLLAR, looking for param vs end delim, etc. When a full delimiter is read compare to the start delimiter and switch back to SQL_TEXT mode if it matches, otherwise remain in DOLLAR_STRING.

If an invalid dollar string delim was found switch back to SQL_TEXT (since it wasn't a valid beginning of a dollar string) and continue.

For QUOTED_IDENTIFIER_QUOTE and STANDARD_STRING_QUOTE, it found a " or ' while reading a quoted identifier or standard string and transitioned into that state. If the next char doubles the quote it'll return to reading the string; otherwise it'll return to the SQL_TEXT state since the identifier or literal has ended.

Similarly with ESCAPE_STRING_ESCAPE. Having found an escape, consume the next char even if it's a quote and return to the string parsing.

All this ignores psql backslash commands.

Have I missed anything really obvious? Does it seem useful to have more re-usable statement splitting code? Is there any sane justification for doing anything but extracting what psql does verbatim while carefully changing nothing?

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Simon Riggs
Дата:
Сообщение: Re: Logical replication and multimaster
Следующее
От: Andres Freund
Дата:
Сообщение: Re: Logical replication and multimaster