Re: BUG #12202: json operator ->>with offset

From: Matt Freeman
Subject: Re: BUG #12202: json operator ->>with offset
Date:
Msg-id: CAHEc+CDpuQKWcr-EiXGdbzGqDZ2zckHKzS8wYpca4huRvqnk0w@mail.gmail.com
In reply to: Re: BUG #12202: json operator ->>with offset  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List: pgsql-bugs
No problem,

Using PostgreSQL 9.4 we have a simple contacts table with (id text not null
(as pk), blob json) to experiment with porting a CouchDB CRM database. We
will eventually split the data out into more columns and handle it more
idiomatically for an RDBMS, but that's beside the point for the time being.

There are approximately 100k rows.
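Roughly this schema, for reference (a sketch reconstructed from the
description above, not the exact DDL; the non-indexed version column is
mentioned further down in the comments):

CREATE TABLE couchcontacts (
    id      text NOT NULL PRIMARY KEY,  -- CouchDB document id
    version int  NOT NULL,              -- plain non-indexed column
    blob    json                        -- full CouchDB document body
);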

I am aware that hardcore PostgreSQL performance experts advise against
using OFFSET; however, I can accept a small performance penalty (happy with
anything under 100 ms).

SELECT id FROM couchcontacts OFFSET 10000 LIMIT 10

As expected takes <10ms

SELECT blob->>'firstName' FROM couchcontacts LIMIT 10

Also takes < 10ms (presumably 10 json decode ops on the blob column here)

SELECT blob->>'firstName' FROM couchcontacts OFFSET 10000 LIMIT 10

Takes upwards of 10 seconds!! Noted inefficiencies of OFFSET aside, why is
this presumably causing 10,010 json decode ops? As the projection has no
side effects, I don't understand why this can't be fast.

Is this a limitation of json functionality being relatively new to
Postgres, and it is thus unable to determine that the ->> operator isn't
yielding side effects?
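For what it's worth, the operator's "side effects" should just come down to
the volatility of the function behind it, which can be checked in the
system catalogs. A rough sketch (the join through oprcode is what I believe
maps an operator to its underlying function):

SELECT o.oprname,
       o.oprleft::regtype  AS left_type,
       o.oprright::regtype AS right_type,
       p.proname,
       p.provolatile        -- 'i' = immutable, 's' = stable, 'v' = volatile
FROM pg_operator o
JOIN pg_proc p ON p.oid = o.oprcode
WHERE o.oprname = '->>';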

Interestingly, rewriting the query as follows brings it back under 10 milliseconds:

SELECT blob->>'firstName' FROM couchcontacts WHERE id IN (SELECT id
FROM couchcontacts OFFSET 10000 LIMIT 10)
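Another rewrite that avoids OFFSET altogether would be keyset pagination on
the primary key, assuming the application keeps track of the last id it
served (the :last_seen_id parameter below is just an illustrative
placeholder):

SELECT blob->>'firstName'
FROM couchcontacts
WHERE id > :last_seen_id   -- last id from the previous page, supplied by the app
ORDER BY id
LIMIT 10;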

Is there a way to ensure OFFSET doesn't json-decode the skipped records?
(i.e. don't evaluate the SELECT projection for rows that are discarded)

"Limit  (cost=3D1680.31..1681.99 rows=3D10 width=3D32) (actual
time=3D12634.674..12634.842 rows=3D10 loops=3D1)"
"  ->  Seq Scan on couchcontacts  (cost=3D0.00..17186.53 rows=3D102282
width=3D32) (actual time=3D0.088..12629.401 rows=3D10010 loops=3D1)"
"Planning time: 0.194 ms"
"Execution time: 12634.895 ms"

asked 13 hours ago by Matt Freeman - nonuby
That does look like a rough corner of this new feature. Please report it as
a bug to PostgreSQL. I guess the workaround is manual paging, just like you
showed yourself. –  Thilo 13 hours ago

EXPLAIN ANALYZE please? I'm not totally convinced by the explanation of the
discrepancy. Did you profile / perf top / etc. to see if your hypothesized
explanation fits observed behaviour? Though on second thoughts ... I think
that if you request a result set with an offset, PostgreSQL should evaluate
expressions in discarded rows unless it can prove they have no
side-effects. So maybe it is evaluating the json expressions... and
arguably it should be unless it can prove they can't abort the query with
an ERROR or change database state. –  Craig Ringer 11 hours ago

Given ->> is a built-in operator, shouldn't PostgreSQL know this isn't
causing side effects? Is there a way to hint that it is not
side-effect-causing? Updated with EXPLAIN ANALYZE. –  Matt Freeman - nonuby
11 hours ago

@CraigRinger: I just ran a few tests, and I actually found similar effects.
If you select a plain field, or if you select *, the performance is roughly
the same. But when you select foo->>'bar', it becomes slightly slower with
empty json data ('{}'), slower when the field actually exists
('{"bar":0}'), and increasingly slower as the json gets larger. It's
basically behaving as if it's unserializing the json for each row when the
operator is used. –  Denis 10 hours ago
Someone asked (and then removed the answer, possibly @Denis?) whether a
non-indexed, non-json field behaves the same. There is another non-indexed
field on the table called version (int not null); SELECT
couchcontacts."version" FROM couchcontacts OFFSET 10000 LIMIT 10 returns in
14ms, vs blob->>'field' taking 4158ms - 12,000ms. –  Matt Freeman - nonuby
10 hours ago


@MattFreeman-nonuby: Yeah, that was me. I added a new answer after running
a few tests. –  Denis 9 hours ago

1 Answer


I ran a few tests, and I'm seeing similar behaviors. Each of these shows an
immaterial difference in performance:

select id ...
select indexed_field ...
select unindexed_field ...
select json_field ...
select * ...

This one, however, does show a difference in performance:

select json_field->>'key' ...

When the json_field is null, the performance impact is negligible. When
it's empty, it degrades things very slightly. When it's filled in, it
degrades noticeably. And when the field is loaded with larger data, it
degrades materially.

In other words, Postgres seems to want to unserialize the json data for
every row it's visiting. (Which is probably a bug, and one that massively
affects RoR developers, given how they use json.)
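To make that concrete, a minimal reproduction sketch of this kind of test
(the table name, row count, and filler payload below are invented for
illustration, not taken from the question's data):

-- build ~100k rows with a modest json payload
CREATE TABLE json_offset_test (id serial PRIMARY KEY, blob json);
INSERT INTO json_offset_test (blob)
SELECT json_build_object('bar', i, 'filler', repeat('x', 200))
FROM generate_series(1, 100000) AS i;

-- compare: plain column vs json extraction, both skipping 10,000 rows
EXPLAIN ANALYZE SELECT id           FROM json_offset_test OFFSET 10000 LIMIT 10;
EXPLAIN ANALYZE SELECT blob->>'bar' FROM json_offset_test OFFSET 10000 LIMIT 10;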

FWIW, I noted that re-arranging the query so it uses a CTE works around
the problem:

with data as (
  select * from table offset 10000 limit 10
)
select json_field->>'key' from data;

(It might get an only-very-slightly better plan than the id IN (...) query
that you highlighted.)
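My guess as to why this helps: in this version of Postgres a CTE acts as an
optimization fence, so the ->> projection is only applied to the ten rows
the CTE returns rather than to every row scanned past. Comparing plans
should confirm it, e.g. (reusing the table from the question):

EXPLAIN ANALYZE
WITH data AS (
  SELECT * FROM couchcontacts OFFSET 10000 LIMIT 10
)
SELECT blob->>'firstName' FROM data;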

answered 10 hours ago by Denis

Thanks for the tests. Just to add, we are not using RoR but Clojure; as
many developers are now pushing Postgres as a NoSQL replacement (blogs,
conferences, microbenchmarks), I'm sure these scenarios will become more
common across a multitude of stacks. –  Matt Freeman - nonuby 9 hours ago

Yeah. Just in case, please take a minute to report the issue to the
pg-performance or the pg-bugs list. My guess is this is an oversight in the
code base: the various json functions related to reading are immutable
insofar as I can tell, so I fail to see any reason to bother with
evaluating them when discarding rows. –  Denis 9 hours ago

I have reported it to pg-bugs and will wait to see what happens (it's a
moderated list). I'll leave this marked unanswered for a day or two in
case, for some odd reason, it's by design.

On Fri, Dec 12, 2014 at 12:34 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
>
> matt@nonuby.com wrote:
>
> > I am just going to link to the stackoverflow post, as it contains both the
> > question and an answer which helps clarify the problem, I fear pasting as-is
> > here might be confusing.
> >
> > http://stackoverflow.com/questions/27415340/postgresql-offset-behavior-with-json-column?noredirect=1#comment43279635_27415340
> >
> > tl;dr When using offset, postgres seems to want to unserialize the json data
> > for every row it's visiting.
>
> Please don't tl;dr bug reports.  Paste the contents here.  We're not in
> love with links to external sites.
>
> --
> Álvaro Herrera                http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




--

Matt Freeman
