Re: gsoc, text search selectivity and dllist enhancments
От | Tom Lane |
---|---|
Тема | Re: gsoc, text search selectivity and dllist enhancments |
Дата | |
Msg-id | 6233.1215113144@sss.pgh.pa.us обсуждение исходный текст |
Ответ на | gsoc, text search selectivity and dllist enhancments (Jan Urbański <j.urbanski@students.mimuw.edu.pl>) |
Ответы |
Re: gsoc, text search selectivity and dllist enhancments
("Heikki Linnakangas" <heikki@enterprisedb.com>)
|
Список | pgsql-hackers |
Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes: > attached are two patches against HEAD. The smaller one is meant to be > commited - it adds some functions that manipulate double-linked lists, > namely inserting a new cell after or before another cell and swapping > two adjacent cells. > The gzipped one is WIP for my GSoC project. I've reworked the algorithm > for determing most common lexemes. I looked over this a bit. I'm not excited about adding functionality to Dllist --- that data structure is barely used at all in the backend, and I think a better case could be made for getting rid of it than adding code to it. The analyze patch doesn't change my mind on the point, because I don't think that Dllist is really helping you there anyway. The data structure I'd suggest is a simple array of pointers to the underlying hash table entries. Since you have a predetermined maximum number of lexemes to track, you can just palloc the array once --- you don't need the expansibility properties of a list. The only operations you need are "add an entry at the end" (if you keep the array sorted by descending count not ascending), "remove the end entry", and "swap adjacent entries", all of which are actually cheaper on an array than on a Dllist. Another point is that you don't really need the array to be sorted all the time. Instead of using what is basically an O(N^2) incremental sort, you could consider applying qsort() when you do need it to be sorted, which is at the end or when the table overflows and you need to discard some entries. If you are willing to discard multiple entries per overflow event, this could be quite cheap --- although I think in the worst case where there are many overflows, it'd go back to being O(N^2). Note BTW that adding a count=1 entry at the end cannot make the array unsorted if it was sorted before. The only event that renders the array unsorted is increasing an item's count to more than the count of its predecessor --- so it might be worth keeping a predecessor pointer in each hashtable entry that identifies its predecessor as of the time of the last array sort, so you could check on-the-fly and avoid unnecessary re-sorts. I'm not totally sure that this idea is a win, but it seems worth investigating. Other than that, the analyze patch looked generally sane to me, and I think you're on the right track. Please rework and resubmit. regards, tom lane
В списке pgsql-hackers по дате отправления: