Re: Creating large database of MD5 hash values

Поиск

Список

Период

Сортировка

От	Florian Weimer
Тема	Re: Creating large database of MD5 hash values
Дата	11 апреля 2008 г. 14:04:08
Msg-id	82skxsqq67.fsf@mid.bfk.de обсуждение исходный текст
Ответ на	Re: Creating large database of MD5 hash values ("Jon Stewart" <jonathan.l.stewart@gmail.com>)
Список	pgsql-performance

Дерево обсуждения

* Jon Stewart:

>>  BYTEA is slower to load and a bit inconvenient to use from DBI, but
>>  occupies less space on disk than TEXT or VARCHAR in hex form (17 vs 33
>>  bytes with PostgreSQL 8.3).

> Can you clarify the "slower to load" point? Where is that pain point
> in the postgres architecture?

COPY FROM needs to read 2.5 bytes on average, instead 2, and a complex
form of double-decoding is necessary.

> Storing the values in binary makes intuitive sense to me since the
> data is twice as dense, thus getting you more bang for the buck on
> comparisons, caching, and streaming reads. I'm not too concerned about
> raw convenience, as there's not going to be a lot of code around my
> application.

The main issue is that you can't use the parameter-providing version
of $sth->execute (or things like $sth->selectarray, $sth->do), you
must use explicit binding by parameter index in order to specify the
type information.

> The idea is that you have named sets of hash values, and hash values
> can be in multiple sets.

The ID step is only going to help you if your sets are very large and
you use certain types of joins, I think.  So it's better to
denormalize in this case (if that's what you were alluding to in your
original post).

> The big operations will be to calculate the unions, intersections, and
> differences between sets. That is, I'll load a new set into the
> database and then see whether it has anything in common with another
> set (probably throw the results into a temp table and then dump it
> out).

In this case, PostgreSQL's in-memory bitmap indices should give you
most of the effect of your hash <-> ID mapping anyway.

> I will also periodically run queries to determine the size of
> the intersection of two sets for all pairs of sets (in order to
> generate some nice graphs).

I think it's very difficult to compute that efficiently, but I haven't
thought much about it.  This type of query might benefit from your
hash <-> ID mapping, however, because the working set is smaller.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99

В списке pgsql-performance по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Creating large database of MD5 hash values