Discussion: JSON vs Text + Regexp Index Searching


JSON vs Text + Regexp Index Searching

From: Eliot Gable
Date:
I am working on a project for which I require some assistance. Any input would be appreciated.

We have a table with millions of records and dozens of columns. On some systems it takes up >10GB of disk space, and it sits on a single, rather slow (7200 RPM) disk. We typically don't do any complicated queries on it, so performance has not posed a problem previously. The table is actually defined as 3 tables using inheritance: one table holds only the last 7 days of data, and a process moves the 8th day's data into the history table for archiving. For performance reasons, we typically only query the table with 7 days of data, which generally does not exceed 1 million rows. In fact, it is typically under 50k rows on most systems. However, a few systems may contain a few million records, at most. There are indices on each of the fields we query, and we also put limits on the number of records returned. The table is practically append-only.
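Roughly, the layout looks like this (simplified, with made-up names; the real table has dozens of columns):

CREATE TABLE records (
    id         bigserial PRIMARY KEY,
    created_at timestamptz NOT NULL
    -- ...dozens of additional columns in the real table
);
-- child holding only the last 7 days; this is the table we normally query
CREATE TABLE records_recent  () INHERITS (records);
-- child holding everything older; a background process moves day-8 rows here
CREATE TABLE records_history () INHERITS (records);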

We are looking at adding some additional columns to the table, and planning for some future features. One of those features will allow cross referencing the records in this table with a list of IDs. If the record's list of IDs contains the one we are looking for, we want to include the record in our query. I advocated creating a separate mapping table that maps the ID of these records to the other ID we are searching for, and performing a JOIN on the two tables with appropriate foreign key relationships and indices. However, I was asked to instead put the list into a single column on each row to reduce implementation complexity.

Assuming the list of IDs is in a column on each row as TEXT in the format of a JSON array, what is the best way to index the column so I can quickly find the rows with the given ID?


Limitations:

We run version 9.0.x on these systems right now. We plan on updating later this year, but for now we don't have the JSON type. Once we do, we plan to make use of it. The column value will also be passed to the UI, which is JavaScript-based, so a JSON format would be ideal.

We cannot modify the disks we run on to make them faster. It is cost prohibitive in our particular scenario.

The systems in question are real-time systems, and are carefully optimized. Any excessively lengthy queries which have high disk IO or CPU usage will adversely affect the entire system.


My Thoughts:

Is it possible to use a regexp-based index in this case? For example: column ~ '^\[67|,67,|,67\]$'. Would such an index perform well, and how would I create it? (A concrete sketch of the kind of predicate I mean follows below.)

How badly would a simple LIKE operation perform? Is there any way to optimize it?

Would using a JSON type in the future allow us to do any better type of index on the column?

If we wanted to put an index on a JSON data type column whose value is a simple array of IDs, what would the index look like, and how would we construct the query to make use of it?
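To make the question concrete, here is the kind of predicate I have in mind (made-up table and column names; the pattern requires a delimiter on both sides of the ID so that 67 does not also match 678 or 167):

-- id_list is TEXT holding something like '[12,67,345]'
SELECT *
FROM records_recent
WHERE id_list ~ '[[,]67[],]';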


Thanks in advance for any input / insight on this.

Re: JSON vs Text + Regexp Index Searching

From: Merlin Moncure
Date:
On Tue, Feb 25, 2014 at 9:08 AM, Eliot Gable
<egable+pgsql-general@gmail.com> wrote:
> If we wanted to put an index on a JSON data type column whose values was a
> simple array of IDs, what would the index look like, and how would we
> construct the query to make use of it?

For 9.0, your only option would be to use the intarray extension. If your IDs are all integers, this may even be your best option. In 9.4, hstore may gain the ability to support similar operations in a more general way. The json type may be able to optimize this case in the future (probably as jsonb), but as of 9.4 it can't.
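Something like this, for example (untested sketch, made-up names; note that on 9.0 intarray is installed by running its contrib SQL script, since CREATE EXTENSION only arrives in 9.1):

-- integer[] column holding the cross-reference IDs
ALTER TABLE records_recent ADD COLUMN id_xref integer[] NOT NULL DEFAULT '{}';

-- GIN index using intarray's int4-specific operator class
CREATE INDEX records_recent_id_xref_gin
    ON records_recent USING gin (id_xref gin__int_ops);

-- containment lookup the index can service
SELECT * FROM records_recent WHERE id_xref @> ARRAY[67];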

merlin


Re: JSON vs Text + Regexp Index Searching

From: David Johnston
Date:
Eliot Gable-4 wrote
> I advocated creating a separate mapping table which
> maps the ID of these records to the other ID we are searching for and
> performing a JOIN on the two tables with appropriate foreign key
> relationships and indices. However, I was ask to instead put the list into
> a single column on each row to reduce implementation complexity.
>
> Assuming the list of IDs is in a column on each row as TEXT in the format
> of a JSON array, what is the best way to index the column so I can quickly
> find the rows with the given ID?

I recommend benchmarking two implementations:

1) id_xref integer[] on the same table; use "search_id = ANY(id_xref)" as the WHERE condition

2) Your multi-table solution, but use "EXISTS (SELECT 1 FROM xref_master WHERE search_id = id_xref)"

(Both are sketched in SQL just below.)
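Rough, untested sketches of both (names are illustrative):

-- 1) array column on the same table
-- ALTER TABLE records_recent ADD COLUMN id_xref integer[] NOT NULL DEFAULT '{}';
SELECT * FROM records_recent WHERE 67 = ANY (id_xref);

-- 2) separate mapping table
CREATE TABLE xref_master (
    record_id bigint  NOT NULL,   -- id of the row in the main table
    id_xref   integer NOT NULL    -- the cross-reference ID being searched
);
CREATE INDEX xref_master_id_xref_idx ON xref_master (id_xref);

SELECT r.*
FROM records_recent r
WHERE EXISTS (
    SELECT 1 FROM xref_master x
    WHERE x.record_id = r.id AND x.id_xref = 67
);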

And I'd politely respond that implementation complexity is somewhat less important than performance in an embedded system - not that either of these solutions is particularly complex, and both can readily be encapsulated in functions to hide any such complexity from the application.

I would not introduce the added indirection of storing the values as a single JSON array - especially if the IDs are integer-based, but even if you end up representing the IDs as text.

The fact that you want to use LIKE/regex confuses me, but that may be because you are limiting yourself to text. Most cross-reference searches know the exact ID being looked for, so pattern matching is not required...

David J.






Re: JSON vs Text + Regexp Index Searching

From: David Johnston
Date:
David Johnston wrote
>
> Eliot Gable-4 wrote
>> I advocated creating a separate mapping table which
>> maps the ID of these records to the other ID we are searching for and
>> performing a JOIN on the two tables with appropriate foreign key
>> relationships and indices. However, I was ask to instead put the list
>> into
>> a single column on each row to reduce implementation complexity.
>>
>> Assuming the list of IDs is in a column on each row as TEXT in the format
>> of a JSON array, what is the best way to index the column so I can
>> quickly
>> find the rows with the given ID?
> I recommend benchmarking two implementations:
>
> 1) id_xref integer[] --on the same table, use "search_id = ANY(id_xref)"
> as the WHERE condition
>
> [...]

#1 can be greatly expanded in usefulness by making use of the "intarray" contrib/extension, as Merlin mentioned up-thread.
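For example (untested, same made-up names as earlier in the thread), beyond simple containment the extension adds query_int matching:

-- rows whose id_xref contains 67 and at least one of 68 or 69
SELECT * FROM records_recent WHERE id_xref @@ '67&(68|69)'::query_int;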

David J.







Re: JSON vs Text + Regexp Index Searching

From: Eliot Gable
Date:
On Tue, Feb 25, 2014 at 11:17 AM, David Johnston <polobo@yahoo.com> wrote:
> [...]
>
> #1 can be greatly expanded in usefulness by making use of the "intarray"
> contrib/extension, as Merlin mentioned up-thread.
>
> David J.



Thank you both for the responses. I will benchmark the options you suggested.
 

Re: JSON vs Text + Regexp Index Searching

From: Eliot Gable
Date:
On Tue, Feb 25, 2014 at 3:13 PM, Eliot Gable <egable+pgsql-general@gmail.com> wrote:
> [...]
>
> Thank you both for the responses. I will benchmark the options you suggested.
 

Is there some way to make intarray work with int8, or is there an int8 variant of intarray? The data type of our IDs is BIGINT.
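If there is no int8 variant, one fallback I am considering (untested sketch, made-up names) is a plain bigint[] column with the built-in GIN support for arrays, which does not need intarray:

ALTER TABLE records_recent ADD COLUMN id_xref bigint[] NOT NULL DEFAULT '{}';

-- default built-in GIN operator class for bigint[] arrays
CREATE INDEX records_recent_id_xref_gin8 ON records_recent USING gin (id_xref);

-- containment query the index can service; note the cast so both sides are bigint[]
SELECT * FROM records_recent WHERE id_xref @> ARRAY[67::bigint];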