Speeding up unicode decoding

From: Daniele Varrazzo
Subject: Speeding up unicode decoding
Date:
Msg-id: CA+mi_8ayXAKRtNiEAv6sNA7T4vTd-EOAbO6eg2+RKuZniLBcrw@mail.gmail.com
List: psycopg
Hello,

I've taken a look at issue
https://github.com/psycopg/psycopg2/issues/473, where it is reported
that SQLAlchemy is faster than psycopg at decoding unicode (i.e. it is
faster for SQLAlchemy to have psycopg return byte strings and decode
them itself than to ask psycopg directly to return unicode). The
discussion linked in the ticket suggests that a relevant improvement
can come from caching the codec.

I've tried a quick test: storing in the connection a pointer to a fast
C decode function for known codecs (e.g. for a utf8 connection, storing
a pointer to PyUnicode_DecodeUTF8). The results seem totally worth
further work. This script
<https://gist.github.com/dvarrazzo/43b43d6ae96e13319cb085a3efe92ac8>
generates unicode data on the server and measures the decode time
(decoding happens on fetch*(), so the operation is not I/O bound but
CPU and memory access bound). Decoding 400K of 1KB strings shows a 17%
speedup:

$ PYTHONPATH=orig python ./timeenc.py -s 1000 -c 4096 -m 100
timing for strsize: 1000, chrrange: 4096, mult: 100
times: 2.588915, 2.310006, 2.308195, 2.305879, 2.304648 sec
best: 2.304648 sec

$ PYTHONPATH=fast python ./timeenc.py -s 1000 -c 4096 -m 100
timing for strsize: 1000, chrrange: 4096, mult: 100
times: 2.159055, 1.922977, 1.922651, 1.933926, 1.932110 sec
best: 1.922651 sec

Because the codec lookup overhead is paid per string, not per byte of
data, the improvement is larger when the same amount of data is split
into more, shorter strings: 55% for 4M of 100B strings:

$ PYTHONPATH=orig python ./timeenc.py -s 100 -c 4096 -m 1000
timing for strsize: 100, chrrange: 4096, mult: 1000
times: 5.997742, 5.909936, 5.914419, 5.967713, 6.779648 sec
best: 5.909936 sec

$ PYTHONPATH=fast python ./timeenc.py -s 100 -c 4096 -m 1000
timing for strsize: 100, chrrange: 4096, mult: 1000
times: 2.738192, 2.669642, 2.647298, 2.657130, 2.651866 sec
best: 2.647298 sec
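
The effect can be illustrated in pure Python too. This is only a
sketch of the lookup overhead, not the C patch, and the timings will
vary: decoding through codecs.decode() pays a registry lookup on every
call, while a decoder fetched once with codecs.getdecoder() skips it:

    import codecs
    import timeit

    # 40,000 short utf-8 byte strings, mimicking many small result-set values
    data = [("x" * 100).encode("utf-8") for _ in range(40_000)]

    def decode_with_lookup():
        # codecs.decode() goes through the codec registry on every call
        return [codecs.decode(b, "utf-8") for b in data]

    utf8_decoder = codecs.getdecoder("utf-8")  # codec looked up only once

    def decode_cached():
        # the cached decoder returns a (str, bytes_consumed) tuple
        return [utf8_decoder(b)[0] for b in data]

    print("per-call lookup:", min(timeit.repeat(decode_with_lookup, number=10)))
    print("cached decoder: ", min(timeit.repeat(decode_cached, number=10)))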

Other things to do:

- the lookup can also be cached for other encodings, not only the two
blessed ones for which there is a public C function in the Python API
(the Python-level equivalent of saving the result of
codecs.getdecoder() instead of calling codecs.decode() every time)

- encoding data to the connection can be optimised the same way; a
rough sketch covering both points follows.
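
In pure Python, such a generic cache could look like the following
sketch (the CodecCache class and its methods are hypothetical, not
psycopg's actual API):

    import codecs

    class CodecCache:
        # Hypothetical per-connection cache: look each codec up once in
        # the registry, then reuse the bound decoder/encoder functions.
        def __init__(self, encoding):
            self.decoder = codecs.getdecoder(encoding)
            self.encoder = codecs.getencoder(encoding)

        def decode(self, data):
            # the decoder returns a (str, bytes_consumed) tuple
            return self.decoder(data)[0]

        def encode(self, text):
            # the encoder returns a (bytes, chars_consumed) tuple
            return self.encoder(text)[0]

    cache = CodecCache("utf-8")
    assert cache.decode(cache.encode("héllo")) == "héllo"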

If someone wants to contribute to the idea, the first commit is in the
branch <https://github.com/psycopg/psycopg2/tree/fast-codecs>. Any
feedback or help is welcome.


-- Daniele

