Обсуждение: pg_trgm Memory Allocation logic
In the pg_trgm module, within function generate_trgm, the memory for trigrams is allocated as follows: trg = (TRGM *) palloc(TRGMHDRSIZE + sizeof(trgm) * (slen / 2 + 1) *3); I have been trying to understand why this is so because it seems to be allocating more space than that is required. The following table shows the palloced size [(slen / 2 + 1) *3] and the actual trgm count for different string length. slen palloc size actual trgm count 2 6 3 26 42 27 38 60 39 Can somebody please explain this to me. I had tried changing the allocation to slen + 1 and it seemed to be working without any problem. trg = (TRGM *) palloc(TRGMHDRSIZE + sizeof(trgm) * (slen + 1)); Maybe I am missisng some scenarios. Any help would be appreciated. Thank you, Beena Emerson ----- -- Beena Emerson -- View this message in context: http://postgresql.nabble.com/pg-trgm-Memory-Allocation-logic-tp5841088.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
Beena Emerson wrote: > In the pg_trgm module, within function generate_trgm, the memory for trigrams > is allocated as follows: > > trg = (TRGM *) palloc(TRGMHDRSIZE + sizeof(trgm) * (slen / 2 + 1) *3); > > I have been trying to understand why this is so because it seems to be > allocating more space than that is required. Maybe it's considering a worst-case for multibyte characteres? I don't really know if trgm supports multibyte, but I assume it does. If it does, then probably the trigrams consist of chars, not bytes. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 03/09/2015 02:54 PM, Alvaro Herrera wrote: > Beena Emerson wrote: >> In the pg_trgm module, within function generate_trgm, the memory for trigrams >> is allocated as follows: >> >> trg = (TRGM *) palloc(TRGMHDRSIZE + sizeof(trgm) * (slen / 2 + 1) *3); >> >> I have been trying to understand why this is so because it seems to be >> allocating more space than that is required. > > Maybe it's considering a worst-case for multibyte characteres? I don't > really know if trgm supports multibyte, but I assume it does. If it > does, then probably the trigrams consist of chars, not bytes. Nope. Trigrams are always three bytes, even ones containing multibyte characters. If there are any multibyte characters in the trigram, we store a 3-byte checksum of the three characters instead. That loses some information, you can have a collision where one multibyte trigram incorrectly matches another one, but the trigram algorithms are generally not too concerned about exact results. - Heikki
Beena Emerson <memissemerson@gmail.com> writes: > In the pg_trgm module, within function generate_trgm, the memory for trigrams > is allocated as follows: > trg = (TRGM *) palloc(TRGMHDRSIZE + sizeof(trgm) * (slen / 2 + 1) *3); > I have been trying to understand why this is so because it seems to be > allocating more space than that is required. Consider input like 'X X X X X'. Each X produces 3 trigrams. regards, tom lane
On 03/09/2015 03:33 PM, Tom Lane wrote: > Beena Emerson <memissemerson@gmail.com> writes: >> In the pg_trgm module, within function generate_trgm, the memory for trigrams >> is allocated as follows: > >> trg = (TRGM *) palloc(TRGMHDRSIZE + sizeof(trgm) * (slen / 2 + 1) *3); > >> I have been trying to understand why this is so because it seems to be >> allocating more space than that is required. > > Consider input like 'X X X X X'. Each X produces 3 trigrams. No it won't. Only two: postgres=# select show_trgm('a b c'); show_trgm --------------------------------------- {" a"," b"," c"," a "," b "," c "} (1 row) If you manually set RPADDING 2 in trgm.h, then it will, but the allocation probably should use LPADDING/RPADDING to get it right, rather than assume the max values. - Heikki
Hello, > If you manually set RPADDING 2 in trgm.h, then it will, but the > allocation probably should use LPADDING/RPADDING to get it right, rather > than assume the max values. Yes you are right. For RPADDING = 2, the current formula is suitable but for RPADDING =1, a lot of extra space is allocated. IIUC, for each word the total number of trigrams is: (word_length + LPADDING + RPADDING - 2). Also the maximum number of words a string can have is: (slen +1)/2 (considering each word has length of 1) I think (slen + (slen + 1)/2 * (LPADDING + RPADDING - 3) + 1) allocates only the necessary memory for different values of RPADDING. Regards, Beena Emerson ----- -- Beena Emerson -- View this message in context: http://postgresql.nabble.com/pg-trgm-Memory-Allocation-logic-tp5841088p5841238.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.