Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in aninfinite loop
От | Todd A. Cook |
---|---|
Тема | Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in aninfinite loop |
Дата | |
Msg-id | 0f552dee-2b87-2a74-7862-7b0ff095fc82@blackducksoftware.com обсуждение исходный текст |
Ответ на | Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in aninfinite loop (Tomas Vondra <tomas.vondra@2ndquadrant.com>) |
Список | pgsql-bugs |
On 12/10/17 18:14, Tomas Vondra wrote: > > I've done a simple experiment today - computing a hash for every uint32 > value, and checking how many distinct hash values that produces. For the > 4.2 billion values the results look like this: > > second key ndistinct ndistinct/nkeys > 3.380 i=100000000 nset=98829159 0.99 > 6.240 i=200000000 nset=195351181 0.98 > 9.106 i=300000000 nset=289623103 0.97 > 11.932 i=400000000 nset=381695003 0.95 > 14.814 i=500000000 nset=471621355 0.94 > 17.706 i=600000000 nset=559452278 0.93 > 20.577 i=700000000 nset=645242762 0.92 > 23.496 i=800000000 nset=729036217 0.91 > 26.460 i=900000000 nset=810879821 0.90 > 29.380 i=1000000000 nset=890818778 0.89 > 32.331 i=1100000000 nset=968898478 0.88 > 35.282 i=1200000000 nset=1045164189 0.87 > 38.262 i=1300000000 nset=1119654810 0.86 > 41.251 i=1400000000 nset=1192424766 0.85 > 44.258 i=1500000000 nset=1263482593 0.84 > 47.268 i=1600000000 nset=1332897449 0.83 > 50.305 i=1700000000 nset=1400717923 0.82 > 53.356 i=1800000000 nset=1466962823 0.81 > 56.425 i=1900000000 nset=1531660191 0.81 > 59.489 i=2000000000 nset=1594856205 0.80 > 62.593 i=2100000000 nset=1656588855 0.79 > 65.706 i=2200000000 nset=1716895530 0.78 > 68.829 i=2300000000 nset=1775809288 0.77 > 71.966 i=2400000000 nset=1833353377 0.76 > 75.123 i=2500000000 nset=1889558599 0.76 > 78.271 i=2600000000 nset=1944463039 0.75 > 81.418 i=2700000000 nset=1998096496 0.74 > 84.574 i=2800000000 nset=2050490745 0.73 > 87.711 i=2900000000 nset=2101666393 0.72 > 90.852 i=3000000000 nset=2151669155 0.72 > 93.986 i=3100000000 nset=2200517580 0.71 > 97.084 i=3200000000 nset=2248232772 0.70 > 100.172 i=3300000000 nset=2294849331 0.70 > 103.232 i=3400000000 nset=2340389131 0.69 > 106.273 i=3500000000 nset=2384875319 0.68 > 109.272 i=3600000000 nset=2428339639 0.67 > 112.260 i=3700000000 nset=2470798655 0.67 > 115.199 i=3800000000 nset=2512271730 0.66 > 118.125 i=3900000000 nset=2552788321 0.65 > 121.029 i=4000000000 nset=2592379529 0.65 > 123.895 i=4100000000 nset=2631056297 0.64 > 126.726 i=4200000000 nset=2668843329 0.64 > 129.397 loops 4294967295 set 2703915179 (0.63) > > That means we only ever generate about 64% of the possible hash keys. > That doesn't seem particularly healthy I guess, but for small hash > tables (with fewer than 2^32 buckets) that may not be an issue, as some > of the values will wrap because of the modulo. FWIW, changing hashint8() to int64 val = PG_GETARG_INT64(0); if (val <= UINT32_MAX && val >= 0) return hash_uint32((uint32) val); else return hash_any((unsigned char *) &val, sizeof(val)); fixes the problem for me on all of my data sets. My conjecture is that the existing implementation uint32 lohalf = (uint32) val; uint32 hihalf = (uint32) (val >> 32); lohalf ^= (val >= 0) ? hihalf : ~hihalf; biases lohalf when hashing collections of large-magnitude negative numbers (like my data sets) because the individual bits of lohalf do not have equal probability of being toggled by the exclusive-or. For example, given a value of -9223261146507558080 (0x8000770b48a68340), 15 of the 16 most significant bits in lohalf will toggle, but only half of the least significant will toggle. -- todd
В списке pgsql-bugs по дате отправления: