Обсуждение: PostgreSQL SQL Tricks: faster urldecode

Поиск
Список
Период
Сортировка

PostgreSQL SQL Tricks: faster urldecode

От
Marc Mamin
Дата:
Hi,
here is a function which is about 8 x faster than the one described in the PostgreSQL SQL Tricks
( http://postgres.cz/wiki/PostgreSQL_SQL_Tricks#Function_for_decoding_of_url_code )

The idea is to handle each encoded/not_encoded parts in bulk rather than spliting on each character.

urldecode_arr:
Seq Scan on lt_referrer  (actual time=1.966..17623.979 rows=65717 loops=1)

urldecode:
Seq Scan on lt_referrer  (actual time=4.846..144445.292 rows=65717 loops=1)

regards,

Marc Mamin



CREATE OR REPLACE FUNCTION urldecode_arr(url text)
  RETURNS text AS
$BODY$
DECLARE ret text;

BEGIN
 BEGIN

    WITH STR AS (
      SELECT

      -- array with all non encoded parts, prepend with '' when the string start is encoded
      case when $1 ~ '^%[0-9a-fA-F][0-9a-fA-F]'
           then array['']
           end
      || regexp_split_to_array ($1,'(%[0-9a-fA-F][0-9a-fA-F])+', 'i') plain,

      -- array with all encoded parts
      array(select (regexp_matches ($1,'((?:%[0-9a-fA-F][0-9a-fA-F])+)', 'gi'))[1]) encoded
    )
    SELECT  string_agg(plain[i] || coalesce( convert_from(decode(replace(encoded[i], '%',''), 'hex'), 'utf8'),''),'')
    FROM STR,
      (SELECT  generate_series(1, array_upper(encoded,1)+2) i FROM STR)blah

    INTO ret;

  EXCEPTION WHEN OTHERS THEN
    raise notice 'failed: %',url;
    return $1;
  END;

  RETURN coalesce(ret,$1); -- when the string has no encoding;

END;

$BODY$
  LANGUAGE plpgsql IMMUTABLE STRICT


Re: PostgreSQL SQL Tricks: faster urldecode

От
Merlin Moncure
Дата:
On Fri, Sep 20, 2013 at 10:26 AM, Marc Mamin <M.Mamin@intershop.de> wrote:
> Hi,
> here is a function which is about 8 x faster than the one described in the PostgreSQL SQL Tricks
> ( http://postgres.cz/wiki/PostgreSQL_SQL_Tricks#Function_for_decoding_of_url_code )
>
> The idea is to handle each encoded/not_encoded parts in bulk rather than spliting on each character.
>
> urldecode_arr:
> Seq Scan on lt_referrer  (actual time=1.966..17623.979 rows=65717 loops=1)
>
> urldecode:
> Seq Scan on lt_referrer  (actual time=4.846..144445.292 rows=65717 loops=1)

very nice.  Basically it comes down to this: all non-trivial regex
replacements require decomposition of the string into an array because
regexp_replace() is unable to do any kind of transformation on the
string.  This is a crippling limitation relative to first-class regex
languages like perl; postgres string translation functions are
invisible to the regex engine.  I have no idea if this is fixable (I
dimly recall Tom explaining why it might not be).

merlin


Re: PostgreSQL SQL Tricks: faster urldecode

От
Marc Mamin
Дата:
> Von: Merlin Moncure [mmoncure@gmail.com]
> Gesendet: Freitag, 20. September 2013 17:43
>
> >  On Fri, Sep 20, 2013 at 10:26 AM, Marc Mamin <M.Mamin@intershop.de> wrote:
> > Hi,
> > here is a function which is about 8 x faster than the one described in the PostgreSQL SQL Tricks
> > ( http://postgres.cz/wiki/PostgreSQL_SQL_Tricks#Function_for_decoding_of_url_code )
> >
> > The idea is to handle each encoded/not_encoded parts in bulk rather than spliting on each character.
> >
> > urldecode_arr:
> > Seq Scan on lt_referrer  (actual time=1.966..17623.979 rows=65717 loops=1)
> >
> > urldecode:
> > Seq Scan on lt_referrer  (actual time=4.846..144445.292 rows=65717 loops=1)
>
> very nice.  Basically it comes down to this: all non-trivial regex
> replacements require decomposition of the string into an array because
> regexp_replace() is unable to do any kind of transformation on the
> string.  This is a crippling limitation relative to first-class regex
> languages like perl; postgres string translation functions are
> invisible to the regex engine.  I have no idea if this is fixable (I
> dimly recall Tom explaining why it might not be).
>
> merlin

yes, a possible(?) assistance for such problems would be a new variant of regexp_split_to_table
that would return two columns:
- the splitted parts (as currently)
- the separator matches (new)

Marc