Discussion: Table Design question for gurus (without going to "NoSQL")...


Table Design question for gurus (without going to "NoSQL")...

From:
Phoenix Kiula
Date:
Hi.

Want to start another thread, loosely related to the performance
problems thread I have going.

Need some DB design guidance from the gurus here.

My big table now has about 70 million rows, with the following columns:


 alias           | character varying(35)
 url             | text
 modify_date     | timestamp without time zone
 ip              | bigint


For each IP address (user of my application) I want to have a unique
URL. So I used to have a UNIQUE constraint on (IP, URL). But the index
based on this became huge, as some URLs are gigantic, so I introduced
an md5 of the URL:


 url_md5             | varchar(32)


I now have two scenarios:

1. To have an index (unique?) on "(ip, url_md5)"

2. To have an index on just the "ip". This way a query that tries
to match   "...WHERE ip = 999 AND url_md5 = '<md5 here>'..." will
still look only at the ip part of the index, then refine the matches
with the url_md5.

The good thing about #2 is the size of index remains very small with
only a bigint field (ip) being indexed.

The bad thing about #2 is that each query of "...WHERE ip = 999 AND
url_md5 = '<md5 here>'..."  will have to refine the indexed IP. If one
IP address has put in a lot of URLs, then this becomes a bit slow. As
is now happening, where I have users who have over 1 million URLs
each!

Questions:

1. Instead of md5, is there any integer hashing algorithm that will
allow me to have a bigint column and hopefully save a lot in both
storage space and speed?  (Some very useful points mentioned here:
http://stackoverflow.com/questions/1422725/represent-md5-hash-as-an-integer
)

2. If I do go with the above scenario #1 of a joint index, is there
any way I can save space and maintain speed? Partitioning etc are out
of the question.
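On question 1, one option inside Postgres itself is to fold the first 8 bytes of the md5 into a bigint. A minimal sketch, assuming a table named "links" (the table and index names here are illustrative, and note that truncating the hash to 64 bits raises the collision risk versus the full 128-bit md5):

```sql
-- Hypothetical sketch: derive a bigint from the first 16 hex digits
-- (8 bytes) of the md5. The value may be negative; that is harmless
-- for indexing purposes.
ALTER TABLE links ADD COLUMN url_hash bigint;

UPDATE links
SET    url_hash = ('x' || substr(md5(url), 1, 16))::bit(64)::bigint;

-- A composite index on (ip, url_hash) then carries two 8-byte keys
-- per row instead of a bigint plus a 32-character varchar.
CREATE INDEX links_ip_urlhash_idx ON links (ip, url_hash);
```

Whether to make that index UNIQUE depends on how acceptable a rare 64-bit hash collision (two different URLs rejected as duplicates) would be for the application.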

With a growing web database, I am sure many people face this
situation. Are nosql type databases the only sane solution to such
massive volumes and throughput expectations (e.g., CouchDb's MemBase)?

Many thanks for any ideas or pointers!

Re: Table Design question for gurus (without going to "NoSQL")...

From:
Phoenix Kiula
Date:
On Sun, Nov 20, 2011 at 9:33 PM, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
> [... original message quoted in full; snipped ...]



I thought of adding a bigserial (serial8) column instead of
varchar(32) for the md5. But postgresql tells me that:

--
ERROR:  type "bigserial" does not exist
--

Why is this? Why can't I create a column with this "type"? What's the
current syntax?

Thanks.

Re: Table Design question for gurus (without going to "NoSQL")...

From:
Gregg Jaskiewicz
Date:
partition your table if it is too big.
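In the PostgreSQL versions current at the time of this thread, partitioning meant table inheritance plus CHECK constraints. A minimal sketch, with hypothetical table names and an arbitrary split point on ip:

```sql
-- Inheritance-based partitioning sketch: one child table per ip range.
CREATE TABLE links (
    alias       varchar(35),
    url         text,
    url_md5     varchar(32),
    modify_date timestamp,
    ip          bigint
);

CREATE TABLE links_ip_low  (CHECK (ip <  5000000000)) INHERITS (links);
CREATE TABLE links_ip_high (CHECK (ip >= 5000000000)) INHERITS (links);

-- With constraint exclusion enabled, a query such as
--   SELECT * FROM links WHERE ip = 999;
-- only scans links_ip_low.
SET constraint_exclusion = on;
```

Routing inserts to the right child (via a trigger or rule) and indexing each child separately are the additional costs of this approach.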

Re: Table Design question for gurus (without going to "NoSQL")...

From:
David Johnston
Date:
On Nov 20, 2011, at 8:33, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:

> Hi.
>
> Want to start another thread, loosely related to the performance
> problems thread I have going.
>
> Need some DB design guidance from the gurus here.
>
> My big table now has about 70 million rows, with the following columns:
>
>
> alias           | character varying(35)
> url             | text
> modify_date     | timestamp without time zone
> ip              | bigint
>
>

While the schema is useful, you need to provide HOW the data is being used if you want help finding ways to
improve performance.

> For each IP address (user of my application) I want to have a unique
> URL. So I used to have a UNIQUE constraint on IP, URL.

Given a base URL, can you encode an algorithm to generate the user-specific URL on demand, and then maybe cache that
result in the application?

> But the index
> based on this became huge, as some URLs are gigantic. so I introduced

What does this mean?  Are there any patterns to the URLs that you can leverage (like, say, grouping them by domain
name)? Is there a lot of overlap between users, so that having a URL table with a bigint PK would make a difference?


> an md5 of the URL:
>
>
> url_md5             | varchar(32)
>
>
> I now have two scenarios:
>
> 1. To have an index (unique?) on "(ip, url_md5)"
>
> 2. To have an index on just the "ip". This way a query that tries
> to match   "...WHERE ip = 999 AND url_md5 = '<md5 here>'..." will
> still look only at the ip part of the index, then refine it with the
> url_md5.
>
> The good thing about #2 is the size of index remains very small with
> only a bigint field (ip) being indexed.
>
> The bad thing about #2 is that each query of "...WHERE ip = 999 AND
> url_md5 = '<md5 here>'..."  will have to refine the indexed IP. If one
> IP address has put in a lot of URLs, then this becomes a bit slow. As
> is now happening, where I have users who have over 1 million URLs
> each!

Create an additional partial index on the URL for any IP address with more than X number of records?  Your smallish
users then only need to use the IP index, while the big ones use that PLUS their personal URL index.
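This suggestion can be sketched as follows, assuming a table named "links"; the specific IP values are purely illustrative and the list of "heavy" IPs would have to be maintained by the application:

```sql
-- Partial composite index covering only the heavy users
-- (IPs known to have more than X rows).
CREATE INDEX links_heavy_users_idx
    ON links (ip, url_md5)
    WHERE ip IN (4003201, 91228450);   -- hypothetical "big" IPs

-- Everyone else is served by the small single-column index:
CREATE INDEX links_ip_idx ON links (ip);
```

The trade-off is that the partial index's WHERE list must be kept in sync with reality, which usually means periodically rebuilding it.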

>
> Questions:
>
> 1. Instead of md5, is there any integer hashing algorithm that will
> allow me to have a bigint column and save a lot hopefully in both
> storage space and speed?  (Some very useful points mentioned here:
> http://stackoverflow.com/questions/1422725/represent-md5-hash-as-an-integer
> )
>
> 2. If I do go with the above scenario #1 of a joint index, is there
> any way I can save space and maintain speed? Partitioning etc are out
> of the question.
>

If you are going to discount the feature whose implementation solves this specific problem, then you are basically
asking the list to solve your specific problem and, per my comment above, to do so without providing sufficient details
as to how your application works.

Also, WTF do you mean by "etc."?  If you are going to discount something from consideration you should be able to
specify exactly what it is.

Furthermore, if you ask the question and exclude possible solutions, you should explain why you cannot use them so that
people will not propose other solutions that would have the same faults.

> With a growing web database, I am sure many people face this
> situation. Are nosql type databases the only sane solution to such
> massive volumes and throughput expectations (e.g., CouchDb's MemBase)?

You would implement these before you would partition?

There are likely multiple solutions to your problem but, again, simply giving a table schema doesn't help in
determining which ones are feasible.

>
> Many thanks for any ideas or pointers!
>

The only data ignorant, and thus generally useful, PostgreSQL solution is table partitioning.

Use It.

My other questions, while an interesting thought exercise, need intimate knowledge of the data to even evaluate whether
they make sense.

So, in short, use partitions.  If you cannot, provide reasons why, and then include more details about the application
and data so that meaningful solutions have a chance to be suggested.

David J.




Re: Table Design question for gurus (without going to "NoSQL")...

From:
Adrian Klaver
Date:
On Sunday, November 20, 2011 7:12:59 am Phoenix Kiula wrote:
> On Sun, Nov 20, 2011 at 9:33 PM, Phoenix Kiula <phoenix.kiula@gmail.com>
wrote:

>
> I thought of adding a bigserial (serial8) column instead of
> varchar(32) for the md5. But postgresql tells me that:
>
> --
> ERROR:  type "bigserial" does not exist
> --
>
> Why is this? Why can't I create a column with this "type"? Whats the
> current syntax?

bigserial is not a type so much as a macro that creates a bigint column with
attached sequence.

Example:

test(5432)aklaver=>\d pk_test
     Table "public.pk_test"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer | not null
 fld_1  | text    |
Indexes:
    "pk" PRIMARY KEY, btree (id)

test(5432)aklaver=>ALTER TABLE pk_test ADD column bg bigserial;
NOTICE:  ALTER TABLE will create implicit sequence "pk_test_bg_seq" for serial
column "pk_test.bg"
ALTER TABLE
test(5432)aklaver=>\d pk_test
                         Table "public.pk_test"
 Column |  Type   |                      Modifiers
--------+---------+------------------------------------------------------
 id     | integer | not null
 fld_1  | text    |
 bg     | bigint  | not null default nextval('pk_test_bg_seq'::regclass)
Indexes:
    "pk" PRIMARY KEY, btree (id)

>
> Thanks.

--
Adrian Klaver
adrian.klaver@gmail.com

Re: Table Design question for gurus (without going to "NoSQL")...

From:
Gavin Flower
Date:
On 21/11/11 02:33, Phoenix Kiula wrote:
> [... original message quoted in full; snipped ...]
How about having 2 indexes: one on each of ip & url_md5? Pg will combine
the indexes as required, or will just use one if that is best.
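Gavin's two-index approach can be sketched as follows (table and index names are illustrative); since 8.1, Postgres can combine the two via bitmap index scans:

```sql
CREATE INDEX links_ip_idx  ON links (ip);
CREATE INDEX links_md5_idx ON links (url_md5);

-- On the combined predicate the planner can BitmapAnd the two
-- bitmap index scans, or use just one index if that is cheaper:
EXPLAIN SELECT *
FROM   links
WHERE  ip = 999 AND url_md5 = 'abc123...';
```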

Why don't you have a time zone on your timestamp???


Regards,
Gavin

Re: Table Design question for gurus (without going to "NoSQL")...

From:
Phoenix Kiula
Date:
On Mon, Nov 21, 2011 at 7:26 AM, Gavin Flower
<GavinFlower@archidevsys.co.nz> wrote:

> How about having 2 indexes: one on each of ip & url_md5? Pg will combine the
> indexes as required, or will just use one if that is best.



Thanks Gavin. Question: what if I have a joined index? If from a
joined index I only use the first column (say, "ip"), will the joined
index still be used?

Or is it cleaner to create two separate indexes for the two columns? Which is recommended?

Re: Table Design question for gurus (without going to "NoSQL")...

From:
Ondrej Ivanič
Date:
Hi,

On 21 November 2011 00:33, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
> Hi.
>
> Want to start another thread, loosely related to the performance
> problems thread I have going.
>
> Need some DB design guidance from the gurus here.
>
> My big table now has about 70 million rows, with the following columns:

You can put different tables/indexes on different disks using
tablespaces. For example, one tablespace for all tables and another
one for all indexes (and change random_page_cost and seq_page_cost
appropriately, i.e. lower random_page_cost). It is a good idea to put
pg_xlog on a separate drive too.
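A minimal sketch of that layout; the directory paths and names are hypothetical, and the directories must already exist and be owned by the postgres user:

```sql
-- Separate tablespaces for table data and indexes.
CREATE TABLESPACE data_space  LOCATION '/mnt/disk1/pg_data';
CREATE TABLESPACE index_space LOCATION '/mnt/disk2/pg_index';

ALTER TABLE links SET TABLESPACE data_space;
CREATE INDEX links_ip_idx ON links (ip) TABLESPACE index_space;

-- If the index disk handles random I/O well (e.g. an SSD),
-- reflect that in the planner costs:
SET random_page_cost = 2.0;
```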

Sometimes Postgres just can't utilise all available resources properly,
and you can get results faster by running the query over multiple
connections. It could be worth investigating pgpool-II's parallel
query mode. You don't need multiple servers - just set up multiple PG
instances on the same physical machine (up to one PG instance per
core, with lower shared_mem, ...). An alternative could be a parallel DB
like Postgres-XC (http://wiki.postgresql.org/wiki/Postgres-XC), ...

--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)

Re: Table Design question for gurus (without going to "NoSQL")...

From:
David Johnston
Date:
On Nov 20, 2011, at 20:50, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:

> On Mon, Nov 21, 2011 at 7:26 AM, Gavin Flower
> <GavinFlower@archidevsys.co.nz> wrote:
>
>> How about having 2 indexes: one on each of ip & url_md5? Pg will combine the
>> indexes as required, or will just use one if that is best.
>
>
>
> Thanks Gavin. Question: what if I have a joined index? If from a
> joined index I only use the first column (say, "ip") will a joined
> index still be used?
>
> It is cleaner to create two indexes for the two columns. Which is recommended?
>

An index on (a, b) can be used for queries involving only a but not for those involving only b.

Neither is recommended - both have their benefits and limitations.

David J.





Re: Table Design question for gurus (without going to "NoSQL")...

From:
Gavin Flower
Date:
On 21/11/11 14:50, Phoenix Kiula wrote:
> On Mon, Nov 21, 2011 at 7:26 AM, Gavin Flower
> <GavinFlower@archidevsys.co.nz>  wrote:
>
>> How about having 2 indexes: one on each of ip & url_md5? Pg will combine the
>> indexes as required, or will just use one if that is best.
>
>
> Thanks Gavin. Question: what if I have a joined index? If from a
> joined index I only use the first column (say, "ip") will a joined
> index still be used?
It will use the joined index if the planner thinks it is worth it.
However, it is harder for the planner to justify using the index for the
second field when the query does not restrict on the first field (I am
assuming it can, if required).

>
> It is cleaner to create two indexes for the two columns. Which is recommended?
If you are frequently just using one or the other of the indexes, and/or
could effectively use a joined index in both directions - then separate
indexes will probably be better.

If your predominant query mode can just use the one joined index
effectively, then that would be better.

Consider the amount of RAM the indexes and table data will take up.

The advantages of indexing 2 fields separately, compared to one joined
index, are: if you only need either single-field index, it will take
up less RAM and also be quicker to read from disk.  Plus the 2 single-field
indexes can be used together for queries that use both fields.
The costs are that when both indexes need to be used, there is a little
bit more processing involved, and 2 single-field indexes take up more
RAM than a single joined index.

So the answer is 'it depends...'!




Re: Table Design question for gurus (without going to "NoSQL")...

From:
"Tomas Vondra"
Date:
On 21 Listopad 2011, 4:17, David Johnston wrote:
> On Nov 20, 2011, at 20:50, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
>
>> On Mon, Nov 21, 2011 at 7:26 AM, Gavin Flower
>> <GavinFlower@archidevsys.co.nz> wrote:
>>
>>> How about having 2 indexes: one on each of ip & url_md5? Pg will
>>> combine the
>>> indexes as required, or will just use one if that is best.
>>
>>
>>
>> Thanks Gavin. Question: what if I have a joined index? If from a
>> joined index I only use the first column (say, "ip") will a joined
>> index still be used?
>>
>> It is cleaner to create two indexes for the two columns. Which is
>> recommended?
>>
>
> An index on (a, b) can be used for queries involving only a but not for
> those involving only b.

That is not true since 8.2 - a multi-column index may be used even for
queries without conditions on leading columns. It won't be as effective as
with conditions on leading columns, because the whole index must be
scanned, but it's usually much cheaper than keeping two indexes (memory
requirements, overhead when inserting data etc.)

Check this:
http://www.postgresql.org/docs/8.2/interactive/indexes-multicolumn.html
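The behaviour described above can be observed directly with EXPLAIN; a sketch, assuming an illustrative "links" table:

```sql
CREATE INDEX links_ip_md5_idx ON links (ip, url_md5);

-- Efficient: the condition restricts the leading column,
-- so only a small part of the index is visited.
EXPLAIN SELECT * FROM links
WHERE  ip = 999 AND url_md5 = 'abc123...';

-- Since 8.2 this can still use the same index, but the whole
-- index must be scanned, so it is slower than the case above:
EXPLAIN SELECT * FROM links
WHERE  url_md5 = 'abc123...';
```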



Tomas


Re: Table Design question for gurus (without going to "NoSQL")...

From:
"Tomas Vondra"
Date:
> On Nov 20, 2011, at 21:33, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
>
> My big table now has about 70 million rows, with the following columns:
>
>  alias           | character varying(35)
>  url             | text
>  modify_date     | timestamp without time zone
>  ip              | bigint
>
>
> For each IP address (user of my application) I want to have a unique
> URL. So I used to have a UNIQUE constraint on IP, URL. But the index
> based on this became huge, as some URLs are gigantic. so I introduced
> an md5 of the URL:
>
>  url_md5             | varchar(32)

I don't understand this. So you want to allow exactly one URL for an IP
address? Or do you want to allow only one row with the same (IP, URL)
values? Because that's exactly what a UNIQUE index on (IP, URL) does.

If you want to allow just a single URL for an IP, you should create an
index on IP only.

> I now have two scenarios:
>
> 1. To have an index (unique?) on "(ip, url_md5)"
>
> 2. To have an index on just the "ip". This way a query that tries
> to match   "...WHERE ip = 999 AND url_md5 = '<md5 here>'..." will
> still look only at the ip part of the index, then refine it with the
> url_md5.

Have you actually performed any benchmarks with this? Just create the
indexes, run several (say 100) queries and that should give you an idea
which of the options is better.

We can speculate on the pros/cons of those options, but both are viable
under certain conditions. For example if there's just a very small number
of URLs for an IP, then #2 is probably going to be better. But if there's
enormous number of URLs per IP, then a multi-column index is probably
going to perform better.

But those are only guesses - try both options, run a few queries (not on
the overloaded system you're struggling with - that'd skew the results)
and show us EXPLAIN ANALYZE of the queries.

> Questions:
>
> 1. Instead of md5, is there any integer hashing algorithm that will
> allow me to have a bigint column and save a lot hopefully in both
> storage space and speed?  (Some very useful points mentioned here:
> http://stackoverflow.com/questions/1422725/represent-md5-hash-as-an-integer
> )

You can obviously take an MD5 hash, which is just a 16-byte value, and use
the first 4 bytes as an integer. The problem with this is that it
significantly increases the chance of collisions, i.e. it may indicate two
URLs are the same even though the URLs are different.

> 2. If I do go with the above scenario #1 of a joint index, is there
> any way I can save space and maintain speed? Partitioning etc are out
> of the question.

What is the index size, anyway? You've mentioned it's huge, but what is
the actual size?

> With a growing web database, I am sure many people face this
> situation. Are nosql type databases the only sane solution to such
> massive volumes and throughput expectations (e.g., CouchDb's MemBase)?

Well, partitioning is usually a good approach to problems like these.

Have you thought about moving the URLs into a separate table? I.e. instead
of this table

 alias           | character varying(35)
 url             | text
 url_md5         | character varying(32)
 modify_date     | timestamp without time zone
 ip              | bigint

you'd have two tables - one for keeping the URLs:

 id              | integer
 url             | text
 url_md5         | character varying(32)

and the current one, referencing the URLs

 alias           | character varying(35)
 url_id          | integer
 modify_date     | timestamp without time zone
 ip              | bigint

That'd allow you to create a UNIQUE index on (ip, url_id), which should be
much smaller than the current one. But handling the inserts would be
significantly more complex (you'd have to check existence of the URL,
insert it etc.).
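The split described above can be sketched as follows; all names are illustrative, and the insert path shown is deliberately simplified (under concurrent inserts it needs locking or retry handling):

```sql
-- URL dictionary: each distinct URL stored once.
CREATE TABLE urls (
    id      serial PRIMARY KEY,
    url     text,
    url_md5 varchar(32) UNIQUE
);

-- Main table now references URLs by a 4-byte id.
CREATE TABLE links (
    alias       varchar(35),
    url_id      integer REFERENCES urls (id),
    modify_date timestamp,
    ip          bigint,
    UNIQUE (ip, url_id)
);

-- Insert path: add the URL only if it is not already present,
-- then look up its id for the links row.
INSERT INTO urls (url, url_md5)
SELECT 'http://example.com/x', md5('http://example.com/x')
WHERE NOT EXISTS (
    SELECT 1 FROM urls WHERE url_md5 = md5('http://example.com/x')
);
```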

Tomas


Re: Table Design question for gurus (without going to "NoSQL")...

From:
Phoenix Kiula
Date:
On Mon, Nov 21, 2011 at 6:14 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
.....
>>
>> An index on (a, b) can be used for queries involving only a but not for
>> those involving only b.
>
> That is not true since 8.2 - a multi-column index may be used even for
> queries without conditions on leading columns. It won't be as effective as
> with conditions on leading columns, because the whole index must be
> scanned, but it's usually much cheaper than keeping two indexes (memory
> requirements, overhead when inserting data etc.)
>
> Check this:
> http://www.postgresql.org/docs/8.2/interactive/indexes-multicolumn.html




Thanks Tomas. VERY useful information.

I've decided to go with a unique multicolumn index for now.

Will ask the experts here if I see some issues.

Thanks!