Re: jsonb format is pessimal for toast compression

Поиск
Список
Период
Сортировка
От Gavin Flower
Тема Re: jsonb format is pessimal for toast compression
Дата
Msg-id 53ED33D4.2060600@archidevsys.co.nz
обсуждение исходный текст
Ответ на Re: jsonb format is pessimal for toast compression  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
On 15/08/14 09:47, Tom Lane wrote:
> Peter Geoghegan <pg@heroku.com> writes:
>> On Thu, Aug 14, 2014 at 10:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Maybe this is telling us it's not worth changing the representation,
>>> and we should just go do something about the first_success_by threshold
>>> and be done.  I'm hesitant to draw such conclusions on the basis of a
>>> single use-case though, especially one that doesn't really have that
>>> much use for compression in the first place.  Do we have other JSON
>>> corpuses to look at?
>> Yes. Pavel posted some representative JSON data a while back:
>> http://pgsql.cz/data/data.dump.gz (it's a plain dump)
> I did some quick stats on that.  206560 rows
>
>                     min    max    avg
>
> external text representation        220    172685    880.3
>
> JSON representation (compressed text)    224    78565    541.3
>
> pg_column_size, JSONB HEAD repr.    225    82540    639.0
>
> pg_column_size, all-lengths repr.    225    66794    531.1
>
> So in this data, there definitely is some scope for compression:
> just compressing the text gets about 38% savings.  The all-lengths
> hack is able to beat that slightly, but the all-offsets format is
> well behind at 27%.
>
> Not sure what to conclude.  It looks from both these examples like
> we're talking about a 10 to 20 percent size penalty for JSON objects
> that are big enough to need compression.  Is that beyond our threshold
> of pain?  I'm not sure, but there is definitely room to argue that the
> extra I/O costs will swamp any savings we get from faster access to
> individual fields or array elements.
>
>             regards, tom lane
>
>
Curious, would adding the standard deviation help in characterising the 
distribution of data values?

Also you might like to consider additionally using the median value, and 
possibly the 25% & 75% (or some such) values.  I assume the 'avg' in 
your table, refers to the arithmetic mean.  Sometimes the median is a 
better meaure of 'normal' than the arithmetic mean, and it can be useful 
to note the difference between the two!

Graphing the values may also be useful.  You might have 2, or more, 
distinct populations which might show up as several distinct peaks - in 
which case, this might suggest changes to the algorithm.

Many moons ago, I did a 400 level statistics course at University, of 
which I've forgotten most.  However, I'm aware of other potentially 
useful measure, but I suspect that they would be too esoteric for the 
current problem!


Cheers,
Gavin




В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: jsonb format is pessimal for toast compression
Следующее
От: Tom Lane
Дата:
Сообщение: Re: ALTER TABLESPACE MOVE command tag tweak