Discussion: json_to_recordset() and CTE performance
Hello,
I've written an SQL function that takes a single JSON parameter, in this case an array of objects, each with eight properties (example below). This function uses json_to_recordset() in a CTE to insert three rows across two tables. It takes nearly 7 minutes to insert my dataset of 8935 records (in JSON), using a small Python script (the script parses the JSON in < 1s) and PostgreSQL 13 on my MacBook Air.
As an experiment I wrote a second SQL function with 8 parameters, one for each property of the JSON objects, requiring a discrete function call per record. I wrote a Python script to insert all 8935 records (making as many calls to the SQL function), which executed in around 2 minutes.
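Roughly, that second function has this shape (a simplified sketch only; the name is made up, and the real body also inserts the journal.account2journal rows and handles client_id):
CREATE OR REPLACE FUNCTION journal.create_one(
    in_bank_account_id   BIGINT,
    in_transaction_id    TEXT,
    in_transaction_date  DATE,
    in_posted_date       DATE,
    in_amount            finance.monetary,
    in_description       TEXT,
    in_parent_account_id BIGINT,
    in_child_account_id  BIGINT
) RETURNS BIGINT
AS $$
    -- Sketch: only the journal row is shown; the real function also writes
    -- the two journal.account2journal rows for each record.
    INSERT INTO journal.journal (
        bank_account_id, transaction_id, transaction_date,
        posted_date, description
    )
    VALUES (
        in_bank_account_id, in_transaction_id, in_transaction_date,
        in_posted_date, in_description
    )
    RETURNING id::BIGINT;
$$ LANGUAGE sql;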
I'm very much a novice at interpreting EXPLAIN (ANALYZE) and am hoping someone can help me better optimize my original function. Both the function and the results of EXPLAIN (ANALYZE) are provided below. Is it perhaps a limitation of CTEs or json_to_recordset(), such that an entirely different approach is necessary (like the second function I wrote, with one call per record)? I was really hoping to make this work. I've written a small API to my database using SQL and PL/pgSQL functions, each taking a single JSON parameter, and my web backend acts almost as an HTTP proxy to the database. It's a different way of doing things as far as webdev goes, but (for me) an interesting experiment. I like PostgreSQL and really want to take advantage of it.
I've tried a few "optimizations". Removing the final SELECT and returning VOID saves around 1.5 minutes. Removing some extra JOINs saves a little time but nothing substantial (the joins against account.t_account in some places are there to check "ownership" of the record by a given client_id).
Using a subquery seems like it could be much faster than a CTE, but I don't know how to insert three rows across two tables using a subquery.
Any advice is appreciated, thank you in advance.
Matt
INPUT:
(The CTE illustrates all the properties and types)
[
  {
    "bank_account_id": 1324,
    "transaction_id": "abc123",
    "transaction_date": "2020-10-20",
    ...
  },
  ...
]
OUTPUT:
(Not sure what I've done to create the nested array, but it's unnecessary; as mentioned above, I can shave off ~1.5 minutes by returning VOID.)
[
  [
    {
      "id": 250185
    },
    {
      "id": 250186
    },
    ...
  ]
]
FUNCTION:
CREATE OR REPLACE FUNCTION journal.create_with_categories(in_json JSON)
RETURNS JSON
AS $$
WITH data AS (
    SELECT
        (in_json#>>'{context, client_id}')::BIGINT AS client_id,
        nextval(pg_get_serial_sequence('journal', 'id')) AS journal_id,
        bank_transaction_type,
        x.*
    FROM json_to_recordset(in_json->'data')
        AS x (
            bank_account_id BIGINT,
            transaction_id TEXT,
            transaction_date DATE,
            posted_date DATE,
            amount finance.monetary,
            description TEXT,
            parent_account_id BIGINT,
            child_account_id BIGINT
        ),
        LATERAL bank_account.get_transaction_type_by_id(x.bank_account_id, x.amount) AS bank_transaction_type
),
insert_journal_entry AS (
    INSERT INTO journal.journal (
        client_id, id, bank_account_id,
        transaction_id, transaction_date, posted_date,
        description
    )
    SELECT
        client_id, journal_id, bank_account_id,
        transaction_id, transaction_date, posted_date,
        description
    FROM data
),
insert_account_entries AS (
    INSERT INTO journal.account2journal (
        account_id, journal_id, amount, type
    )
    -- T account
    SELECT
        t.id,
        d.journal_id,
        @ d.amount,
        CASE WHEN d.bank_transaction_type = 'debit'::transaction_type
             THEN 'credit'::transaction_type
             ELSE 'debit'::transaction_type
        END
    FROM data d
    LEFT JOIN account.t_account t
        ON (t.id = COALESCE(d.child_account_id, d.parent_account_id))
    WHERE t.client_id = d.client_id OR t.id IS NULL
    UNION ALL
    -- bank account
    SELECT
        t.id, d.journal_id, @ d.amount, d.bank_transaction_type
    FROM data d
    JOIN bank_account.bank_account b
        ON (b.id = d.bank_account_id)
    JOIN account.t_account t
        ON (t.id = b.t_account_id)
    WHERE
        t.client_id = d.client_id
)
SELECT json_agg(d) FROM (SELECT d.journal_id AS id FROM data AS d) AS d;
$$ LANGUAGE sql;
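For completeness, the function is called with the array wrapped under a "data" key next to a "context" object (the INPUT example above shows only the contents of the "data" array); values below are illustrative apart from the ones taken from that example:
SELECT journal.create_with_categories('{
    "context": { "client_id": 42 },
    "data": [
        {
            "bank_account_id": 1324,
            "transaction_id": "abc123",
            "transaction_date": "2020-10-20",
            "posted_date": "2020-10-21",
            "amount": "12.34",
            "description": "example transaction",
            "parent_account_id": 10,
            "child_account_id": null
        }
    ]
}'::json);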
EXPLAIN ANALYZE:
(From logs)
Aggregate (cost=24.24..24.25 rows=1 width=32) (actual time=388926.249..388926.371 rows=1 loops=1)
  Buffers: shared hit=53877 dirtied=2
  CTE data
    -> Nested Loop (cost=0.26..4.76 rows=100 width=148) (actual time=183.906..388716.550 rows=8935 loops=1)
          Buffers: shared hit=53877 dirtied=2
          -> Function Scan on json_to_recordset x (cost=0.01..1.00 rows=100 width=128) (actual time=130.645..142.316 rows=8935 loops=1)
          -> Function Scan on get_transaction_type_by_id bank_transaction_type (cost=0.25..0.26 rows=1 width=4) (actual time=0.154..0.156 rows=1 loops=8935)
                Buffers: shared hit=18054
  CTE insert_journal_entry
    -> Insert on journal (cost=0.00..2.00 rows=100 width=96) (actual time=453.563..453.563 rows=0 loops=1)
          Buffers: shared hit=79242 dirtied=295
          -> CTE Scan on data (cost=0.00..2.00 rows=100 width=96) (actual time=0.006..10.001 rows=8935 loops=1)
  CTE insert_account_entries
    -> Insert on account2journal (cost=4.86..15.23 rows=2 width=52) (actual time=816.381..816.381 rows=0 loops=1)
          Buffers: shared hit=159273 dirtied=335 written=17
          -> Result (cost=4.86..15.23 rows=2 width=52) (actual time=0.206..109.222 rows=17870 loops=1)
                Buffers: shared hit=5
                -> Append (cost=4.86..15.20 rows=2 width=52) (actual time=0.197..95.060 rows=17870 loops=1)
                      Buffers: shared hit=5
                      -> Hash Left Join (cost=4.86..7.14 rows=1 width=52) (actual time=0.195..35.512 rows=8935 loops=1)
                            Hash Cond: (COALESCE(d_1.child_account_id, d_1.parent_account_id) = t.id)
                            Filter: ((t.client_id = d_1.client_id) OR (t.id IS NULL))
                            Buffers: shared hit=2
                            -> CTE Scan on data d_1 (cost=0.00..2.00 rows=100 width=68) (actual time=0.004..6.544 rows=8935 loops=1)
                            -> Hash (cost=3.27..3.27 rows=127 width=16) (actual time=0.137..0.137 rows=127 loops=1)
                                  Buffers: shared hit=2
                                  -> Seq Scan on t_account t (cost=0.00..3.27 rows=127 width=16) (actual time=0.026..0.073 rows=127 loops=1)
                                        Buffers: shared hit=2
                      -> Hash Join (cost=3.80..8.03 rows=1 width=52) (actual time=40.182..53.796 rows=8935 loops=1)
                            Hash Cond: ((t_1.id = b.t_account_id) AND (t_1.client_id = d_2.client_id))
                            Buffers: shared hit=3
                            -> Seq Scan on t_account t_1 (cost=0.00..3.27 rows=127 width=16) (actual time=0.022..0.079 rows=127 loops=1)
                                  Buffers: shared hit=2
                            -> Hash (cost=3.59..3.59 rows=14 width=60) (actual time=40.118..40.118 rows=8935 loops=1)
                                  Buffers: shared hit=1
                                  -> Hash Join (cost=1.32..3.59 rows=14 width=60) (actual time=0.071..17.863 rows=8935 loops=1)
                                        Hash Cond: (d_2.bank_account_id = b.id)
                                        Buffers: shared hit=1
                                        -> CTE Scan on data d_2 (cost=0.00..2.00 rows=100 width=60) (actual time=0.005..3.740 rows=8935 loops=1)
                                        -> Hash (cost=1.14..1.14 rows=14 width=16) (actual time=0.030..0.030 rows=14 loops=1)
                                              Buffers: shared hit=1
                                              -> Seq Scan on bank_account b (cost=0.00..1.14 rows=14 width=16) (actual time=0.012..0.016 rows=14 loops=1)
                                                    Buffers: shared hit=1
  -> CTE Scan on data d (cost=0.00..2.00 rows=100 width=8) (actual time=183.918..388812.950 rows=8935 loops=1)
        Buffers: shared hit=53877 dirtied=2
Trigger for constraint journal_client_id_fkey on journal: time=194.194 calls=8935
Trigger for constraint journal_bank_account_id_fkey on journal: time=204.014 calls=8935
Trigger trigger_journal_import_sequence on journal: time=373.344 calls=1
Trigger for constraint account2journal_account_id_fkey on account2journal: time=580.482 calls=17870
Trigger trigger_debits_equal_credits_on_insert on account2journal: time=116.653 calls=1
On Oct 21, 2020, at 1:25 PM, Michael Lewis <mlewis@entrata.com> wrote:
Version? What are the values for work_mem and any other non-default configs? I see some estimates that are rather off, like:
-> Nested Loop (cost=0.26..4.76 rows=100 width=148) (actual time=183.906..388716.550 rows=8935 loops=1)
Buffers: shared hit=53877 dirtied=2
-> Function Scan on json_to_recordset x (cost=0.01..1.00 rows=100 width=128) (actual time=130.645..142.316 rows=8935 loops=1)
-> Function Scan on get_transaction_type_by_id bank_transaction_type (cost=0.25..0.26 rows=1 width=4) (actual time=0.154..0.156 rows=1 loops=8935)
Buffers: shared hit=18054
Sometimes putting data into a temp table and analyzing it can be rather helpful to ensure the planner has statistics on the number of records, ndistinct, most common values, etc. I would try doing that with the result of json_to_recordset and skipping the function call to get_transaction_type_by_id until later, just to see how it performs.
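Something along these lines, for instance (just a sketch: the _tmp function name and temp table are made up, and it means moving to plpgsql so the temp table can be created and analyzed mid-function):
CREATE OR REPLACE FUNCTION journal.create_with_categories_tmp(in_json JSON)
RETURNS JSON
AS $$
BEGIN
    -- Stage the unpacked JSON so the planner has real statistics to work with.
    CREATE TEMP TABLE tmp_data (
        client_id BIGINT,
        journal_id BIGINT,
        bank_account_id BIGINT,
        transaction_id TEXT,
        transaction_date DATE,
        posted_date DATE,
        amount finance.monetary,
        description TEXT,
        parent_account_id BIGINT,
        child_account_id BIGINT
    ) ON COMMIT DROP;  -- dropped automatically at commit

    INSERT INTO tmp_data
    SELECT
        (in_json#>>'{context, client_id}')::BIGINT,
        nextval(pg_get_serial_sequence('journal', 'id')),
        x.*
    FROM json_to_recordset(in_json->'data')
        AS x (
            bank_account_id BIGINT,
            transaction_id TEXT,
            transaction_date DATE,
            posted_date DATE,
            amount finance.monetary,
            description TEXT,
            parent_account_id BIGINT,
            child_account_id BIGINT
        );

    ANALYZE tmp_data;

    -- The INSERTs into journal.journal and journal.account2journal would then
    -- read FROM tmp_data instead of the "data" CTE, and the call to
    -- bank_account.get_transaction_type_by_id() can be joined in at that point.

    RETURN (SELECT json_agg(json_build_object('id', journal_id)) FROM tmp_data);
END;
$$ LANGUAGE plpgsql;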
That said, it seems like the estimate from json_to_recordset may be hardcoded at 100. I haven't checked the source code, but I know that when defining a set-returning function there is a ROWS option, which gives the planner a static value to assume will come out of that function, so a similar built-in default would make sense.
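For what it's worth, here is roughly what that option looks like when defining your own set-returning function (the function below is just a made-up illustration, not something from your schema):
-- Wrap the unpacking in a set-returning SQL function and tell the planner
-- roughly how many rows to expect per call instead of its default estimate.
CREATE OR REPLACE FUNCTION unpack_data(in_json JSON)
RETURNS TABLE (bank_account_id BIGINT, amount NUMERIC)
AS $$
    SELECT x.bank_account_id, x.amount
    FROM json_to_recordset(in_json)
        AS x (bank_account_id BIGINT, amount NUMERIC);
$$ LANGUAGE sql
ROWS 10000;  -- planner will assume ~10000 rows per call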
PostgreSQL 13.0.
You'd have to be specific about the configs you're looking for; I'm using Postgres.app (postgresapp.com) and am uncertain whether it's distributed with non-default configs.
But a quick grep shows these items are configured:
max_wal_size = 1GB
min_wal_size = 80MB
shared_buffers = 128MB
work_mem is not configured so presumably it’s the default of 4MB.
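For what it's worth, I can confirm the effective value in a session with:
SHOW work_mem;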
I'll try the temp tables. That sounds similar to what I found searching online. Are you suggesting that as a permanent solution, or just as a means to better analyze performance?
Thanks,
Matt