Re: Populating large tables with occasional bad values

From: Craig Ringer
Subject: Re: Populating large tables with occasional bad values
Date:
Msg-id: 484FF7E9.4070104@postnewspapers.com.au
In response to: Populating large tables with occasional bad values  ("John T. Dow" <john@johntdow.com>)
Responses: Re: Populating large tables with occasional bad values
List: pgsql-jdbc
John T. Dow wrote:
> Question: why are inserts approximately 10 times slower over the Internet than locally?

Probable causes:

- Limited total bandwidth available;
- Delays caused by round trip latencies; or
- TCP/IP costs vs local UNIX domain socket costs

I'm betting on latency in your case.

> Postgres rejects the offending rows and my Java program writes the
> offending insert statement and the error message to a log file. This
> allows the user to take action, perhaps cleaning the original data and
> reloading, perhaps entering a few rows manually, whatever.

OK, so you're waiting for each INSERT to complete before issuing the
next. That means that over the Internet you add *at* *least* (2*latency)
to the time it takes to complete each insert, where `latency' is the
round trip time for a packet between the DB client and server.

That gets expensive fast. My latency from my home DSL to my work's DSL
is 12ms - and I'm in the same city and on the same ISP as work is, as
well as connected to the same peering point. I regularly see latencies
of > 150ms to hosts within Australia.

Even assuming only one round trip per INSERT (which is probably very
over-optimistic) at, say, 50ms latency, you're talking a MAXIMUM
throughput of 20 INSERTs per second even if the inserts themselves are
issued, executed and responded to instantly.

> In the case of duplicate key values, for certain tables the program
> is  told to modify the key (eg appending "A" or "B") to make it unique. In
> that case, the original insert is an error but after one or two retries,
> a computed unique key allows it to be inserted.

Can you do this server side? An appropriate CASE in the INSERT or the
use of a stored procedure might be helpful. Can your application provide
fallback options ahead of time in case they're needed, or generate them
server-side?

Doing this client-side is going to kill your insert rate completely.
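
As a rough sketch of the stored-procedure approach (untested, and the
table/function names are my own, borrowing the mytable(key, value) layout
used further down), the retry could live entirely server-side:

CREATE OR REPLACE FUNCTION insert_with_fallback(p_key text, p_value text)
RETURNS text AS $$
BEGIN
    -- Try the original key, then the "A" and "B" variants.
    IF NOT EXISTS (SELECT 1 FROM mytable WHERE key = p_key) THEN
        INSERT INTO mytable (key, value) VALUES (p_key, p_value);
        RETURN p_key;
    ELSIF NOT EXISTS (SELECT 1 FROM mytable WHERE key = p_key || 'A') THEN
        INSERT INTO mytable (key, value) VALUES (p_key || 'A', p_value);
        RETURN p_key || 'A';
    ELSE
        -- Still fails with a unique violation if the "B" variant is taken too.
        INSERT INTO mytable (key, value) VALUES (p_key || 'B', p_value);
        RETURN p_key || 'B';
    END IF;
END;
$$ LANGUAGE plpgsql;

-- Called as: SELECT insert_with_fallback('pk2', 'somevalue');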

> However, inserting rows is only fast enough if the database is on the
> same computer as the Java program. For example, 200,000 rows can be
> inserted locally in under an hour. Over the Internet (using ssh
> tunneling with Trilead's code) it took over six hours. That's the problem.

Yes, it's almost certainly latency. You'll probably find that if you
batched your inserts and did more server-side you'd get massively better
performance. For example, instead of:

INSERT INTO x (a,b) values ('pk1', 1);
INSERT INTO x (a,b) values ('pk2', 2); -- Fails
INSERT INTO x (a,b) values ('pk2b', 2); -- Reissue failed
INSERT INTO x (a,b) values ('pk3', 3);

you might issue:

INSERT INTO x (a,b) VALUES
('pk1', 1),
('pk2', 2),
('pk3', 3);

and have a trigger on `x' check for and handle insert conflicts.
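
For example (sketch only, names made up), a trigger on `x' that rewrites a
clashing key by appending 'A', much as your client does now:

CREATE OR REPLACE FUNCTION x_fix_pk() RETURNS trigger AS $$
BEGIN
    -- Rewrite the key before the insert proceeds. Only one retry here;
    -- a loop would be needed to cope with repeated clashes.
    IF EXISTS (SELECT 1 FROM x WHERE a = NEW.a) THEN
        NEW.a := NEW.a || 'A';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER x_fix_pk_trg
    BEFORE INSERT ON x
    FOR EACH ROW EXECUTE PROCEDURE x_fix_pk();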

If you only want the duplicate key logic to fire for this app but not
other inserts on the same table you could use a trigger on a dummy table
to rewrite the inserts, use rewrite rules on a view to convert the
inserts into stored proc calls, etc etc.
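
Roughly (untested, and handle_x_insert is a hypothetical function along the
lines of the one above):

CREATE VIEW x_import AS SELECT a, b FROM x;

CREATE RULE x_import_insert AS
    ON INSERT TO x_import
    DO INSTEAD SELECT handle_x_insert(NEW.a, NEW.b);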

Another option is to bulk-insert or COPY the raw data into a holding
table. The holding table need not even have any indexes, doesn't need a
primary key, etc. Once the holding table is populated you can bulk
INSERT ... SELECT the data, using appropriate CASE statements to handle
pkey conflicts. If you need user input, run one query to find all pkey
conflicts and get the user to make their decisions based on the results,
then issue batched inserts based on their responses.
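
Sketched out (again with made-up names, reusing the mytable(key, value)
layout):

-- No indexes, no primary key; it only exists to receive the bulk load.
CREATE TABLE mytable_holding (key text, value text);

-- Bulk load it with COPY (or big multi-row INSERTs if COPY isn't convenient):
COPY mytable_holding FROM STDIN;

-- One query shows the user every pkey conflict:
SELECT h.key, h.value
FROM mytable_holding h
JOIN mytable m ON m.key = h.key;

-- Then move the non-conflicting rows across in a single statement:
INSERT INTO mytable (key, value)
SELECT h.key, h.value
FROM mytable_holding h
WHERE NOT EXISTS (SELECT 1 FROM mytable m WHERE m.key = h.key);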

In any case, you need to eliminate the time your app spends waiting for
the command to be issued, processed by the DB server, and for the
response to reach the client. The main ways to do that are to do more
work server-side and to batch your operations together more.

> I've tinkered some with setting autocommit off and doing commits
> every  100 inserts, but if there's an error the entire set is lost, not just
> the offending row.

If you want performance, in this case I'd certainly turn autocommit off.
Instead of inserting a row and trapping the error if a unique constraint
is violated, do something like:

INSERT INTO mytable (key, value)
SELECT 'pk1', 'otherthing'
   WHERE NOT EXISTS (SELECT 1 FROM mytable WHERE key = 'pk1');

... and check to see if you inserted the row or not using the rows
affected count provided by the JDBC driver.

Alternately you can use a trigger or stored procedure and trap
exceptions on unique constraint violations. This will cause horrible
performance problems with versions of postgresql before 8.3, though.
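
The exception-trapping version might look something like this (sketch only,
same made-up names as before):

CREATE OR REPLACE FUNCTION insert_trapping(p_key text, p_value text)
RETURNS void AS $$
BEGIN
    INSERT INTO mytable (key, value) VALUES (p_key, p_value);
EXCEPTION WHEN unique_violation THEN
    -- Fall back to the "A" variant (or log the row somewhere for later).
    INSERT INTO mytable (key, value) VALUES (p_key || 'A', p_value);
END;
$$ LANGUAGE plpgsql;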

In either case, however, you're still issuing one INSERT per row
inserted, and incurring nasty round trip latency penalties for that.
You'll be much better off doing multiple-row inserts into a table/view
with a FOR EACH ROW ... BEFORE INSERT trigger that spots pk conflicts
and rewrites the insert to fix them (or redirects the insert into a
"conflicts" table for later processing).

> Also, postgres apparently doesn't allow setting the transaction
> isolation to NONE. Am I wrong about that? If it did, would that help?

As far as I know it's not possible. Nor would it help, because your
problem is network round trip latencies, not database execution time.

> I fail to understand why this is so much slower over the Internet.

Because the Internet is a high latency communication medium. The server
isn't taking any longer to execute your queries, as you'll find out if
you use EXPLAIN ANALYZE to measure their execution time.

--
Craig Ringer
