Обсуждение: many to many performance

Поиск
Список
Период
Сортировка

many to many performance

От
"Chavdar Kopoev"
Дата:
Hello,

I have following common situation:

Category IDs: about 50 000
Document IDs: about 3 000 000
Many to many relationship.
A document id have a relation with 10 up to 1000 category ids
One query, with input set of document ids, resulting set of category ids, having relation with input ids. (very often
thisquery is called with more than 500k of input document ids) 

I use custom written datawarehouse, file storage, and memory kept "id offset" like indecies. So a query for all 3
milliondocument ids, resulting almost all category ids take less than a second on desktop machine. Abstracting from
concreterealization, query plan is like: 
1. for each input document id: look up an array by document id and retrive a list of ralated category ids.
1.1 for each category id in the list: look up an array value by category id and mark it as found
2. traverse category array to extract category ids marked as found

I want to use as a data storage postgresql. Tried several data structures, testing btree, gin, gist indecies over them,
butbest achieved performance for a 10 times smaller dataset (10k cat ids, 100k doc ids, 1m relations) is slower more
than5 times. 

I read about postgresql bitmap indecies and "data lookup" when scanning indecies to get a value for current
transaction.Downloaded latest v8.4 snapshot, compiled it, but as I see there is no bitmap index in it. Maybe if I
downloadHEAD revision I will find them there, dont know. 

Anyway, I want to ask, have anyone faced similar situation, and is there any way to achive closer to optimal
performanceusing postgresql functionality and extensibility? 

Regards,
Chavdar Kopoev


Re: many to many performance

От
Craig Ringer
Дата:
Chavdar Kopoev wrote:

> I want to use as a data storage postgresql. Tried several data structures, testing btree, gin, gist indecies over
them,but best achieved performance for a 10 times smaller dataset (10k cat ids, 100k doc ids, 1m relations) is slower
morethan 5 times. 

Can you post your queries and table definitions so people trying to help
you know what you did / tried to do? A downloadable self contained
example might also be useful.

Please also post the output of `EXPLAIN' on your queries, eg:

EXPLAIN SELECT blah, ... FROM blah;

> I read about postgresql bitmap indecies and "data lookup" when scanning indecies to get a value for current
transaction.Downloaded latest v8.4 snapshot, compiled it, but as I see there is no bitmap index in it. Maybe if I
downloadHEAD revision I will find them there, dont know. 

Bitmap index scans are an internal function that's used to combine two
indexes on the fly during a query (or at least use both of them in one
table scan). You don't make a bitmap index, you just make two ordinary
btree indexes and let Pg take care of this for you.

If you query on the columns of interest a lot, you might want to use a
multi-column index instead.

--
Craig Ringer