Обсуждение: schema advice for event stream with tagging and filtering

Поиск
Список
Период
Сортировка

schema advice for event stream with tagging and filtering

От
Chris Withers
Дата:
Hi All,

What would be the best schema to use when looking to implement an event
stream with tagging and filtering?

An event is a row with a primary key along the lines of (colo, host,
category) and an associated set of tags, where each tag has a type and a
value (eg: {"color": "red", "owner": "fred", "status": "open"...}).

Events come in as a streams of creates/updates as a cluster of http
posts to a web app.

We want to display events as a faceted list, with the facets being colo,
host, category and each of the tag types that exist, with counts against
each value for each facet (in an ideal world).

The idea is to be able to click a facet to include or exclude it from
the results. Furthermore, we want to do some other filtering (only show
events tagged for the team of the current user, exclude everything for
this colo between time x and y, etc).

I've deliberately tried to be abstract here as I'm trying to ask a
question rather than proposing a solution that might have problems, if
there's any more information that would help, please let me know!

cheers,

Chris




Re: schema advice for event stream with tagging and filtering

От
"Ilya Kazakevich"
Дата:
Hi,

>An event is a row with a primary key along the lines of (colo, host,
>category) and an associated set of tags, where each tag has a type and a value
>(eg: {"color": "red", "owner": "fred", "status": "open"...}).

What about  simple table with several columns and hstore  field for tags?
You may also normalize it (move hosts and categories to separate table).

indexes should help you with fast filtering, or you may load data from this table to denormalized olap table once a day
andbuild index there to speed-up queries. 

Ilya Kazakevich

JetBrains
http://www.jetbrains.com
The Drive to Develop




Re: schema advice for event stream with tagging and filtering

От
Chris Withers
Дата:
On 16/08/2016 14:29, Ilya Kazakevich wrote:
> Hi,
>
>> An event is a row with a primary key along the lines of (colo, host,
>> category) and an associated set of tags, where each tag has a type and a value
>> (eg: {"color": "red", "owner": "fred", "status": "open"...}).
>
> What about  simple table with several columns and hstore  field for tags?

Interesting!

> You may also normalize it (move hosts and categories to separate table).

Why? These form part of the primary key for the event...

> indexes should help you with fast filtering, or you may load data from this table to denormalized olap table once a
dayand build index there to speed-up queries. 

What kind of index is recommended here? The kind of queries would be:

- show me a list of tag types and the count of the number of events of
that type

- show me all events that have tag1=x, tag2=y and does not have tag3

cheers,

Chris


Re: schema advice for event stream with tagging and filtering

От
"Ilya Kazakevich"
Дата:
>>> An event is a row with a primary key along the lines of (colo, host,
>>> category) and an associated set of tags, where each tag has a type
>>> and a value
>>> (eg: {"color": "red", "owner": "fred", "status": "open"...}).
>>
>> What about  simple table with several columns and hstore  field for tags?

BTW, "pure SQL" approach here is to use separate tables: Tags(TagId, TagName) and TagValues(EventId,TagId,Value).
But in this case it will be painful to filter events by tag values directly, so only separate denormalized OLAP table
shouldbe used in "pure SQL":) 
PostgreSQL, however, supports key-value based hstore.

>> You may also normalize it (move hosts and categories to separate table).
>
>Why? These form part of the primary key for the event...

Host and category could be part of PK, but it may be better to have "HostId" field and "Hosts(HostId, Host)" table than
"text"field with many "www.foo.bar" values)  

>What kind of index is recommended here? The kind of queries would be:
>
>- show me a list of tag types and the count of the number of events of that
>type
>
>- show me all events that have tag1=x, tag2=y and does not have tag3

Hstore supports GIN and GIST (https://www.postgresql.org/docs/9.1/static/textsearch-indexes.html). I'd start with  GIN.

Do you need real-time data or, say, one-day old data is ok? In latter case it is better to load data to denormalized
tableto speed-up queries and use no index on normalized(OLTP) table because index update operation is not free) 


Ilya Kazakevich

JetBrains
http://www.jetbrains.com
The Drive to Develop



Re: schema advice for event stream with tagging and filtering

От
Venkata B Nagothi
Дата:

On Tue, Aug 16, 2016 at 6:38 PM, Chris Withers <chris@simplistix.co.uk> wrote:
Hi All,

What would be the best schema to use when looking to implement an event stream with tagging and filtering?

An event is a row with a primary key along the lines of (colo, host, category) and an associated set of tags, where each tag has a type and a value (eg: {"color": "red", "owner": "fred", "status": "open"...}).

Events come in as a streams of creates/updates as a cluster of http posts to a web app.

Not sure which version of PostgreSQL you are using. Did you consider JSON/JSONB data-types which is strongly supported by PostgreSQL ? You can consider having a table with JSON or JSONB column type. If you can use JSONB, then, it supports indexing as well.

That should make web app easy to push JSON format data to PostgreSQL.

Regards,
Venkata B N

Fujitsu Australia

Re: [GENERAL] schema advice for event stream with tagging andfiltering

От
Chris Withers
Дата:
On 16/08/2016 15:10, Ilya Kazakevich wrote:
>>>> An event is a row with a primary key along the lines of (colo, host,
>>>> category) and an associated set of tags, where each tag has a type
>>>> and a value
>>>> (eg: {"color": "red", "owner": "fred", "status": "open"...}).
>>>
>>> What about  simple table with several columns and hstore  field for tags?
>
> BTW, "pure SQL" approach here is to use separate tables: Tags(TagId, TagName) and TagValues(EventId,TagId,Value).

Well, maybe, but none of us wants to do that ;-)

> But in this case it will be painful to filter events by tag values directly, so only separate denormalized OLAP table
shouldbe used in "pure SQL":) 

I don't understand the second half of this I'm afraid...

> PostgreSQL, however, supports key-value based hstore.

Right, but hstore only allows single values for each key, if I
understand correctly?

Okay, so that leaves me with a jsonb "tags" column with a gin index, but
I still have a couple of choices..

So, in order to best answer these types of queries:

>> - show me a list of tag types and the count of the number of events of that
>> type
>>
>> - show me all events that have tag1=x, tag2=y and does not have tag3

...which of the following is going to be most performant:

# SELECT '{"tag1":["v1", "v2", "v3"]}'::jsonb @> '{"tag1": ["v1"]}'::jsonb;
  ?column?
----------
  t
(1 row)

# SELECT '[{"tag1":"v1"}, {"tag1": "v2"}, {"tag1": "v3"}]'::jsonb @>
'[{"tag1": "v1"}]'::jsonb;
  ?column?
----------
  t
(1 row)

So, should I go for a tag name that maps to a list of values for that
tag, or should I go for a sequence of one-entry mappings of tag name to
tag value?

cheers,

Chris