Re: [PATCH] Initial progress reporting for COPY command

Поиск
Список
Период
Сортировка
От Tomas Vondra
Тема Re: [PATCH] Initial progress reporting for COPY command
Дата
Msg-id 20200622141900.5zjsi7ulvrr7i5cv@development
обсуждение исходный текст
Ответ на Re: [PATCH] Initial progress reporting for COPY command  (Josef Šimánek <josef.simanek@gmail.com>)
Список pgsql-hackers
On Mon, Jun 22, 2020 at 03:33:00PM +0200, Josef Šimánek wrote:
>po 22. 6. 2020 v 14:14 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com>
>napsal:
>
>> On Sun, Jun 21, 2020 at 01:40:34PM +0200, Josef Šimánek wrote:
>> >Thanks for all comments. I have updated code to support more options
>> >(including STDIN/STDOUT) and added some documentation.
>> >
>> >Patch is attached and can be found also at
>> >https://github.com/simi/postgres/pull/5.
>> >
>> >Diff version: https://github.com/simi/postgres/pull/5.diff
>> >Patch version: https://github.com/simi/postgres/pull/5.patch
>> >
>> >I'm also attaching screenshot of HTML documentation and html documentation
>> >file.
>> >
>> >I'll do my best to get this to commitfest now.
>> >
>>
>> I see we're not showing the total number of bytes the COPY is expected
>> to process, which makes it hard to estimate how far we actually are.
>> Clearly there are cases when we really don't know that (exports, import
>> from stdin/program), but why not to show file size for imports from a
>> file? I'd expect that to be the most common case.
>>
>
>For COPY FROM file fstat is done and info is available already at
>https://github.com/postgres/postgres/blob/fe186b4c200b76a5c0f03379fe8645ed1c70a844/src/backend/commands/copy.c#L1934.
>It should be easy to update some param (param6 for example) with file size
>and expose it in report view. When not available, this column can be NULL.
>
>Would that be enough?
>

Yes, I think that'd be fine. The rows without a file should have NULL,
because we literally don't know what the value is. And 0 is a valid file
size, so we can't use it anyway.

>On the other side everyone can check file size manually to get total value
>expected and just compare to reported bytes_processed. Alt. "wc -l" can be
>checked to get amount of lines and check lines_processed column to get
>progress. Should it check amount of lines and populate another column with
>lines total (using a configured separator) as well? AFAIK that would need
>full file scan which can be slow for huge files.
>

Sure, but the extra `wc -l` is less convenient and you then need to
combine that with pg_stat_progress_copy. With the information right in
the view, you can do (100.0 * bytes_processed / bytes_total) and you get
the progress as a percentage. (I've omitted the NULL handling.)

As for the number of lines, I certainly don't think we need to scan the
file - that'd be far too expensive. What we might do is estimate it as

    total_bytes / (processed_bytes / processed_rows)

but that's something people can easily do on their own. So I don't think
it needs to be part of the patch, and IMHO bytes_processed / bytes_total
is a sufficient measure of progress.

>
>> I wonder if it made sense to show some estimates in the other cases. For
>> example when exporting query result, maybe we could show the estimated
>> number of rows and size? Of course, that's prone to estimation errors
>> and it's more a wild idea for the future, I don't expect this patch to
>> implement that.
>>
>
>My plan here was to expose numbers not being currently available and let
>clients get the rest of info on their own.
>
>For example:
>- for "COPY (query) TO file" - EXPLAIN or COUNT variant of query could be
>executed before to get the amount of expected rows
>- for "COPY table FROM file" - file size or amount of lines in file can be
>inspected first to get amount of expected rows or bytes to be processed
>
>I see the current system view in my patch (and also all other report views
>currently available) more as a scaffold to build own tools.
>
>For example CLI tools can use this to provide some kind of progress.
>

True, but I'd advise against putting this into v1 of the patch. Let's
keep it simple, get it committed and then maybe improve it later.

Some of these stats (like the estimates from a query) may be quite
unreliable, so I think it needs more discussion. We might invent
lines_estimated or something like that, for example.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Ashutosh Bapat
Дата:
Сообщение: Asymmetry in opening and closing indices for partition routing
Следующее
От: Robert Haas
Дата:
Сообщение: Re: Parallel Seq Scan vs kernel read ahead