Обсуждение: pg_trgm Memory Allocation logic

Поиск
Список
Период
Сортировка

pg_trgm Memory Allocation logic

От
Beena Emerson
Дата:
In the pg_trgm module, within function generate_trgm, the memory for trigrams
is allocated as follows:

trg = (TRGM *) palloc(TRGMHDRSIZE + sizeof(trgm) * (slen / 2 + 1) *3);

I have been trying to understand why this is so because it seems to be
allocating more space than that is required.

The following table shows the palloced size [(slen / 2 + 1) *3] and the
actual trgm count for different string length.

slen    palloc size    actual trgm count
2    6        3
26    42        27
38    60        39

Can somebody please explain this to me.


I had tried changing the allocation to slen + 1 and it seemed to be working
without any problem.

trg = (TRGM *) palloc(TRGMHDRSIZE + sizeof(trgm) * (slen + 1)); 


Maybe I am missisng some scenarios. 

Any help would be appreciated.

Thank you,

Beena Emerson



-----

--

Beena Emerson

--
View this message in context: http://postgresql.nabble.com/pg-trgm-Memory-Allocation-logic-tp5841088.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.



Re: pg_trgm Memory Allocation logic

От
Alvaro Herrera
Дата:
Beena Emerson wrote:
> In the pg_trgm module, within function generate_trgm, the memory for trigrams
> is allocated as follows:
> 
> trg = (TRGM *) palloc(TRGMHDRSIZE + sizeof(trgm) * (slen / 2 + 1) *3);
> 
> I have been trying to understand why this is so because it seems to be
> allocating more space than that is required.

Maybe it's considering a worst-case for multibyte characteres?  I don't
really know if trgm supports multibyte, but I assume it does.  If it
does, then probably the trigrams consist of chars, not bytes.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_trgm Memory Allocation logic

От
Heikki Linnakangas
Дата:
On 03/09/2015 02:54 PM, Alvaro Herrera wrote:
> Beena Emerson wrote:
>> In the pg_trgm module, within function generate_trgm, the memory for trigrams
>> is allocated as follows:
>>
>> trg = (TRGM *) palloc(TRGMHDRSIZE + sizeof(trgm) * (slen / 2 + 1) *3);
>>
>> I have been trying to understand why this is so because it seems to be
>> allocating more space than that is required.
>
> Maybe it's considering a worst-case for multibyte characteres?  I don't
> really know if trgm supports multibyte, but I assume it does.  If it
> does, then probably the trigrams consist of chars, not bytes.

Nope. Trigrams are always three bytes, even ones containing multibyte 
characters. If there are any multibyte characters in the trigram, we 
store a 3-byte checksum of the three characters instead. That loses some 
information, you can have a collision where one multibyte trigram 
incorrectly matches another one, but the trigram algorithms are 
generally not too concerned about exact results.

- Heikki




Re: pg_trgm Memory Allocation logic

От
Tom Lane
Дата:
Beena Emerson <memissemerson@gmail.com> writes:
> In the pg_trgm module, within function generate_trgm, the memory for trigrams
> is allocated as follows:

> trg = (TRGM *) palloc(TRGMHDRSIZE + sizeof(trgm) * (slen / 2 + 1) *3);

> I have been trying to understand why this is so because it seems to be
> allocating more space than that is required.

Consider input like 'X X X X X'.  Each X produces 3 trigrams.
        regards, tom lane



Re: pg_trgm Memory Allocation logic

От
Heikki Linnakangas
Дата:
On 03/09/2015 03:33 PM, Tom Lane wrote:
> Beena Emerson <memissemerson@gmail.com> writes:
>> In the pg_trgm module, within function generate_trgm, the memory for trigrams
>> is allocated as follows:
>
>> trg = (TRGM *) palloc(TRGMHDRSIZE + sizeof(trgm) * (slen / 2 + 1) *3);
>
>> I have been trying to understand why this is so because it seems to be
>> allocating more space than that is required.
>
> Consider input like 'X X X X X'.  Each X produces 3 trigrams.

No it won't. Only two:

postgres=# select show_trgm('a b c');               show_trgm
--------------------------------------- {"  a","  b","  c"," a "," b "," c "}
(1 row)

If you manually set RPADDING 2 in trgm.h, then it will, but the 
allocation probably should use LPADDING/RPADDING to get it right, rather 
than assume the max values.

- Heikki




Re: pg_trgm Memory Allocation logic

От
Beena Emerson
Дата:
Hello,

> If you manually set RPADDING 2 in trgm.h, then it will, but the 
> allocation probably should use LPADDING/RPADDING to get it right, rather 
> than assume the max values.

Yes you are right. For RPADDING = 2, the current formula is suitable but for
RPADDING =1, a lot of extra space is allocated.

IIUC, for each word the total number of trigrams is: (word_length + LPADDING
+ RPADDING - 2).
Also the maximum number of words a string can have is: (slen +1)/2
(considering each word has length of 1)

I think (slen + (slen + 1)/2 * (LPADDING + RPADDING - 3) + 1) allocates only
the necessary memory for different values of RPADDING.

Regards,

Beena Emerson







-----

--

Beena Emerson

--
View this message in context: http://postgresql.nabble.com/pg-trgm-Memory-Allocation-logic-tp5841088p5841238.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.