Discussion: Table Design question for gurus (without going to "NoSQL")...
Hi.

Want to start another thread, loosely related to the performance problems thread I have going.

Need some DB design guidance from the gurus here.

My big table now has about 70 million rows, with the following columns:

alias        | character varying(35)
url          | text
modify_date  | timestamp without time zone
ip           | bigint

For each IP address (user of my application) I want to have a unique URL. So I used to have a UNIQUE constraint on IP, URL. But the index based on this became huge, as some URLs are gigantic, so I introduced an md5 of the URL:

url_md5      | varchar(32)

I now have two scenarios:

1. Have an index (unique?) on "(ip, url_md5)".

2. Have an index on just the "ip". This way a query that tries to match "...WHERE ip = 999 AND url_md5 = '<md5 here>'..." will still look only at the ip part via the index, then refine the result with the url_md5.

The good thing about #2 is that the index remains very small, with only a bigint field (ip) being indexed.

The bad thing about #2 is that each query of "...WHERE ip = 999 AND url_md5 = '<md5 here>'..." will have to refine all the rows found for that IP. If one IP address has put in a lot of URLs, this becomes a bit slow. As is now happening, where I have users who have over 1 million URLs each!

Questions:

1. Instead of md5, is there any integer hashing algorithm that will allow me to have a bigint column and hopefully save a lot in both storage space and speed? (Some very useful points are mentioned here: http://stackoverflow.com/questions/1422725/represent-md5-hash-as-an-integer )

2. If I do go with the above scenario #1 of a joint index, is there any way I can save space and maintain speed? Partitioning etc. are out of the question.

With a growing web database, I am sure many people face this situation. Are NoSQL-type databases the only sane solution to such massive volumes and throughput expectations (e.g. CouchDB's Membase)?

Many thanks for any ideas or pointers!
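For reference, the two scenarios can be sketched roughly as follows, assuming a hypothetical table name "links" (the thread never names the table); the last statement shows one commonly used trick for folding an md5 into a bigint, relevant to question 1:

-- Scenario 1: a composite (optionally UNIQUE) index on both columns.
CREATE UNIQUE INDEX links_ip_urlmd5_idx ON links (ip, url_md5);

-- Scenario 2: a small index on "ip" only; the url_md5 condition is then
-- applied as a filter to the rows found for that IP.
CREATE INDEX links_ip_idx ON links (ip);

-- Question 1: the first 8 bytes of an md5 folded into a bigint. Values may be
-- negative (bigint is signed), and truncating the hash increases the chance
-- of collisions, so it should not back a UNIQUE constraint on its own.
SELECT ('x' || substr(md5(url), 1, 16))::bit(64)::bigint AS url_hash
FROM links;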
On Sun, Nov 20, 2011 at 9:33 PM, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
.....

I thought of adding a bigserial (serial8) column instead of the varchar(32) for the md5. But PostgreSQL tells me:

--
ERROR:  type "bigserial" does not exist
--

Why is this? Why can't I create a column with this "type"? What's the current syntax?

Thanks.
partition your table if it is too big.
On Nov 20, 2011, at 8:33, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:

> Hi.
>
> Want to start another thread, loosely related to the performance problems thread I have going.
>
> Need some DB design guidance from the gurus here.
>
> My big table now has about 70 million rows, with the following columns:
>
> alias | character varying(35)
> url | text
> modify_date | timestamp without time zone
> ip | bigint

While the schema is useful, you need to provide HOW the data is being used if you want help finding ways to improve performance.

> For each IP address (user of my application) I want to have a unique URL. So I used to have a UNIQUE constraint on IP, URL.

Given a base URL, can you encode an algorithm to generate the user-specific URL on demand, and then maybe cache that result in the application?

> But the index based on this became huge, as some URLs are gigantic, so I introduced

What does this mean? Are there any patterns to the URLs that you can leverage (like, say, grouping them by domain name)? Is there a lot of overlap between users, so that having a URL table with a bigint PK would make a difference?

> an md5 of the URL:
>
> url_md5 | varchar(32)
>
> I now have two scenarios:
>
> 1. Have an index (unique?) on "(ip, url_md5)".
>
> 2. Have an index on just the "ip". This way a query that tries to match "...WHERE ip = 999 AND url_md5 = '<md5 here>'..." will still look only at the ip part via the index, then refine the result with the url_md5.
>
> The good thing about #2 is that the index remains very small, with only a bigint field (ip) being indexed.
>
> The bad thing about #2 is that each query of "...WHERE ip = 999 AND url_md5 = '<md5 here>'..." will have to refine all the rows found for that IP. If one IP address has put in a lot of URLs, this becomes a bit slow. As is now happening, where I have users who have over 1 million URLs each!

Create an additional partial index on the URL for any IP address with more than X number of records? Your smallish users then only need to use the IP index, while the big ones use that PLUS their personal URL index. A minimal sketch follows below.

> Questions:
>
> 1. Instead of md5, is there any integer hashing algorithm that will allow me to have a bigint column and hopefully save a lot in both storage space and speed? (Some very useful points are mentioned here: http://stackoverflow.com/questions/1422725/represent-md5-hash-as-an-integer )
>
> 2. If I do go with the above scenario #1 of a joint index, is there any way I can save space and maintain speed? Partitioning etc. are out of the question.

If you are going to discount the feature whose implementation solves this specific problem, then you are basically asking the list to solve your specific problem and, from my comment above, to do so without providing sufficient details as to how your application works.

Also, WTF do you mean by "etc."? If you are going to discount something from consideration you should be able to specify exactly what it is.

Furthermore, if you ask the question and exclude possible solutions, you should explain why you cannot use them, so that people will not propose other solutions that would have the same faults.

> With a growing web database, I am sure many people face this situation. Are NoSQL-type databases the only sane solution to such massive volumes and throughput expectations (e.g. CouchDB's Membase)?

You would implement these before you would partition?

There are likely multiple solutions to your problem but, again, simply giving a table schema doesn't help in determining which ones are feasible.

> Many thanks for any ideas or pointers!

The only data-ignorant, and thus generally useful, PostgreSQL solution is table partitioning. Use it.

My other questions, while an interesting thought exercise, need intimate knowledge of the data to even evaluate whether they make sense.

So, in short, use partitions. If you cannot, provide reasons why, and then include more details about the application and data so that meaningful solutions have a chance to be suggested.

David J.
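A sketch of the partial-index idea above, again using the hypothetical "links" table; the IP value 999 and the index name are purely illustrative:

-- One partial index per "heavy" IP, alongside the plain index on (ip).
CREATE INDEX links_ip999_urlmd5_idx ON links (url_md5) WHERE ip = 999;

-- A query for that IP can then be answered from the partial index, because
-- its WHERE clause implies the index predicate:
SELECT * FROM links WHERE ip = 999 AND url_md5 = '<md5 here>';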
On Sunday, November 20, 2011 7:12:59 am Phoenix Kiula wrote:
> On Sun, Nov 20, 2011 at 9:33 PM, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
>
> I thought of adding a bigserial (serial8) column instead of the varchar(32) for the md5. But PostgreSQL tells me:
>
> --
> ERROR:  type "bigserial" does not exist
> --
>
> Why is this? Why can't I create a column with this "type"? What's the current syntax?

bigserial is not a type so much as a macro that creates a bigint column with an attached sequence.

Example:

test(5432)aklaver=>\d pk_test
     Table "public.pk_test"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer | not null
 fld_1  | text    |
Indexes:
    "pk" PRIMARY KEY, btree (id)

test(5432)aklaver=>ALTER TABLE pk_test ADD COLUMN bg bigserial;
NOTICE:  ALTER TABLE will create implicit sequence "pk_test_bg_seq" for serial column "pk_test.bg"
ALTER TABLE

test(5432)aklaver=>\d pk_test
                           Table "public.pk_test"
 Column |  Type   |                      Modifiers
--------+---------+------------------------------------------------------
 id     | integer | not null
 fld_1  | text    |
 bg     | bigint  | not null default nextval('pk_test_bg_seq'::regclass)
Indexes:
    "pk" PRIMARY KEY, btree (id)

> Thanks.

--
Adrian Klaver
adrian.klaver@gmail.com
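For reference, the bigserial shorthand demonstrated above is roughly equivalent to the following explicit steps on the same example table (this mirrors the expansion described in the PostgreSQL documentation):

CREATE SEQUENCE pk_test_bg_seq;
ALTER TABLE pk_test ADD COLUMN bg bigint NOT NULL DEFAULT nextval('pk_test_bg_seq');
ALTER SEQUENCE pk_test_bg_seq OWNED BY pk_test.bg;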
On 21/11/11 02:33, Phoenix Kiula wrote:
> Hi.
>
> Want to start another thread, loosely related to the performance problems thread I have going.
>
> Need some DB design guidance from the gurus here.
>
> My big table now has about 70 million rows, with the following columns:
>
> alias | character varying(35)
> url | text
> modify_date | timestamp without time zone
> ip | bigint
>
> For each IP address (user of my application) I want to have a unique URL. So I used to have a UNIQUE constraint on IP, URL. But the index based on this became huge, as some URLs are gigantic, so I introduced an md5 of the URL:
>
> url_md5 | varchar(32)
>
> I now have two scenarios:
>
> 1. Have an index (unique?) on "(ip, url_md5)".
>
> 2. Have an index on just the "ip". This way a query that tries to match "...WHERE ip = 999 AND url_md5 = '<md5 here>'..." will still look only at the ip part via the index, then refine the result with the url_md5.
>
> The good thing about #2 is that the index remains very small, with only a bigint field (ip) being indexed.
>
> The bad thing about #2 is that each query of "...WHERE ip = 999 AND url_md5 = '<md5 here>'..." will have to refine all the rows found for that IP. If one IP address has put in a lot of URLs, this becomes a bit slow. As is now happening, where I have users who have over 1 million URLs each!
>
> Questions:
>
> 1. Instead of md5, is there any integer hashing algorithm that will allow me to have a bigint column and hopefully save a lot in both storage space and speed? (Some very useful points are mentioned here: http://stackoverflow.com/questions/1422725/represent-md5-hash-as-an-integer )
>
> 2. If I do go with the above scenario #1 of a joint index, is there any way I can save space and maintain speed? Partitioning etc. are out of the question.
>
> With a growing web database, I am sure many people face this situation. Are NoSQL-type databases the only sane solution to such massive volumes and throughput expectations (e.g. CouchDB's Membase)?
>
> Many thanks for any ideas or pointers!

How about having 2 indexes: one on each of ip & url_md5? Pg will combine the indexes as required, or will just use one if that is best.

Why don't you have a time zone on your timestamp???

Regards,
Gavin
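A sketch of this two-index suggestion on the hypothetical "links" table; with both single-column indexes present, the planner can combine them or use either one alone:

-- In addition to the index on (ip) from the earlier sketch:
CREATE INDEX links_urlmd5_idx ON links (url_md5);

-- For a query with both conditions, the planner can scan both indexes and
-- combine them with a bitmap AND, or use just one of them, whichever it
-- estimates to be cheaper:
SELECT * FROM links WHERE ip = 999 AND url_md5 = '<md5 here>';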
On Mon, Nov 21, 2011 at 7:26 AM, Gavin Flower <GavinFlower@archidevsys.co.nz> wrote:

> How about having 2 indexes: one on each of ip & url_md5? Pg will combine the indexes as required, or will just use one if that is best.

Thanks Gavin. Question: what if I have a joined index? If, from a joined index, I only use the first column (say, "ip"), will the joined index still be used?

Or is it cleaner to create two indexes for the two columns? Which is recommended?
Hi,

On 21 November 2011 00:33, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
> Hi.
>
> Want to start another thread, loosely related to the performance problems thread I have going.
>
> Need some DB design guidance from the gurus here.
>
> My big table now has about 70 million rows, with the following columns:

You can put different tables/indexes on different disks using tablespaces. For example, one tablespace for all tables and another one for all indexes (and change random_page_cost and seq_page_cost appropriately, i.e. lower random_page_cost). It is a good idea to put pg_xlog on a separate drive too.

Sometimes Postgres just can't utilise all available resources properly, and you can get results faster by running a query over multiple connections. It could be worth investigating pgpool-II's parallel query mode. You don't need multiple servers - just set up multiple PG instances on the same physical machine (up to one PG instance per core, with lower shared_mem, ...).

An alternative could be a parallel DB like Postgres-XC (http://wiki.postgresql.org/wiki/Postgres-XC), ...

--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)
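A rough sketch of the tablespace idea, with hypothetical paths and names (the per-tablespace random_page_cost setting requires PostgreSQL 9.0 or later):

-- The directory must already exist and be writable by the postgres OS user.
CREATE TABLESPACE fast_indexes LOCATION '/mnt/ssd/pg_indexes';

-- Move an existing index there; new indexes can also be created there
-- directly with CREATE INDEX ... TABLESPACE fast_indexes.
ALTER INDEX links_ip_idx SET TABLESPACE fast_indexes;

-- Tell the planner that random I/O on this tablespace is relatively cheap.
ALTER TABLESPACE fast_indexes SET (random_page_cost = 2.0);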
On Nov 20, 2011, at 20:50, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:

> On Mon, Nov 21, 2011 at 7:26 AM, Gavin Flower <GavinFlower@archidevsys.co.nz> wrote:
>
>> How about having 2 indexes: one on each of ip & url_md5? Pg will combine the indexes as required, or will just use one if that is best.
>
> Thanks Gavin. Question: what if I have a joined index? If, from a joined index, I only use the first column (say, "ip"), will the joined index still be used?
>
> Or is it cleaner to create two indexes for the two columns? Which is recommended?

An index on (a, b) can be used for queries involving only a, but not for those involving only b.

Neither is recommended - both have their benefits and limitations.

David J.
On 21/11/11 14:50, Phoenix Kiula wrote:
> On Mon, Nov 21, 2011 at 7:26 AM, Gavin Flower <GavinFlower@archidevsys.co.nz> wrote:
>
>> How about having 2 indexes: one on each of ip & url_md5? Pg will combine the indexes as required, or will just use one if that is best.
>
> Thanks Gavin. Question: what if I have a joined index? If, from a joined index, I only use the first column (say, "ip"), will the joined index still be used?

It will use the joined index if the planner thinks it is worth it. However, it is harder for the planner to justify using the index for the second field when the query does not restrict on the first field (I am assuming it can, if required).

> Or is it cleaner to create two indexes for the two columns? Which is recommended?

If you are frequently just using one or other of the indexes, and/or could effectively use a joined index in both directions, then separate indexes will probably be better. If your predominant query mode can effectively use just the one joined index, then that would be better.

Consider the amount of RAM the indexes and table data will take up.

The advantages of indexing the 2 fields separately, compared to one joined index, are that if you only need either single-field index, it will take up less RAM and also be quicker to read from disk; plus, the 2 single-field indexes can be used together for queries that use both fields. The costs are that when both indexes need to be used there is a little more processing involved, and 2 single-field indexes take up more RAM than a single joined index.

So the answer is 'it depends...'!
On 21 November 2011, 4:17, David Johnston wrote:
> On Nov 20, 2011, at 20:50, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
>
>> On Mon, Nov 21, 2011 at 7:26 AM, Gavin Flower <GavinFlower@archidevsys.co.nz> wrote:
>>
>>> How about having 2 indexes: one on each of ip & url_md5? Pg will combine the indexes as required, or will just use one if that is best.
>>
>> Thanks Gavin. Question: what if I have a joined index? If, from a joined index, I only use the first column (say, "ip"), will the joined index still be used?
>>
>> Or is it cleaner to create two indexes for the two columns? Which is recommended?
>
> An index on (a, b) can be used for queries involving only a, but not for those involving only b.

That has not been true since 8.2 - a multi-column index may be used even for queries without conditions on the leading columns. It won't be as effective as with conditions on the leading columns, because the whole index must be scanned, but it's usually much cheaper than keeping two indexes (memory requirements, overhead when inserting data, etc.).

Check this:
http://www.postgresql.org/docs/8.2/interactive/indexes-multicolumn.html

Tomas
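To illustrate this on the hypothetical "links" table: with only the composite index from scenario 1 and no single-column index on url_md5, a query that omits the leading column can still be answered from that index, at the cost of scanning the whole index:

-- Assuming only links_ip_urlmd5_idx ON links (ip, url_md5) exists.
EXPLAIN SELECT * FROM links WHERE url_md5 = '<md5 here>';
-- Since 8.2 the planner may choose a full scan of the composite index here,
-- even though "ip" (the leading column) is not constrained.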
On Nov 20, 2011, at 21:33, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:

> My big table now has about 70 million rows, with the following columns:
>
> alias | character varying(35)
> url | text
> modify_date | timestamp without time zone
> ip | bigint
>
> For each IP address (user of my application) I want to have a unique URL. So I used to have a UNIQUE constraint on IP, URL. But the index based on this became huge, as some URLs are gigantic, so I introduced an md5 of the URL:
>
> url_md5 | varchar(32)

I don't understand this. So you want to allow exactly one URL for an IP address? Or do you want to allow only one row with the same (IP, URL) values? Because that's exactly what a UNIQUE index on (IP, URL) does. If you want to allow just a single URL for an IP, you should create an index on IP only.

> I now have two scenarios:
>
> 1. Have an index (unique?) on "(ip, url_md5)".
>
> 2. Have an index on just the "ip". This way a query that tries to match "...WHERE ip = 999 AND url_md5 = '<md5 here>'..." will still look only at the ip part via the index, then refine the result with the url_md5.

Have you actually performed any benchmarks with this? Just create the indexes, run several (say 100) queries, and that should give you an idea which of the options is better. We can speculate on the pros/cons of those options, but both are viable under certain conditions. For example, if there's just a very small number of URLs per IP, then #2 is probably going to be better. But if there's an enormous number of URLs per IP, then a multi-column index is probably going to perform better.

But those are only guesses - try both options, run a few queries (not on the overloaded system you're struggling with - that'd skew the results) and show us the EXPLAIN ANALYZE of the queries.

> Questions:
>
> 1. Instead of md5, is there any integer hashing algorithm that will allow me to have a bigint column and hopefully save a lot in both storage space and speed? (Some very useful points are mentioned here: http://stackoverflow.com/questions/1422725/represent-md5-hash-as-an-integer )

You can obviously take an MD5 hash, which is just a 16-byte value, and use the first 4 bytes as an integer. The problem with this is that it probably significantly increases the chance of collisions, i.e. it will indicate two URLs to be the same although the URLs are different.

> 2. If I do go with the above scenario #1 of a joint index, is there any way I can save space and maintain speed? Partitioning etc. are out of the question.

What is the index size, anyway? You've mentioned it's huge, but what is the actual size?

> With a growing web database, I am sure many people face this situation. Are NoSQL-type databases the only sane solution to such massive volumes and throughput expectations (e.g. CouchDB's Membase)?

Well, partitioning is usually a good approach to problems like these.

Have you thought about moving the URLs into a separate table? I.e. instead of this table

alias       | character varying(35)
url         | text
url_md5     | character varying(32)
modify_date | timestamp without time zone
ip          | bigint

you'd have two tables - one for keeping the URLs:

id      | integer
url     | text
url_md5 | character varying(32)

and the current one, referencing the URLs:

alias       | character varying(35)
url_id      | integer
modify_date | timestamp without time zone
ip          | bigint

That'd allow you to create a UNIQUE index on (ip, url_id), which should be much smaller than the current one.
But handling the inserts would be significantly more complex (you'd have to check existence of the URL, insert it etc.). Tomas
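A sketch of the split described above, with hypothetical table names ("urls" for the URL dictionary, "links" for the main table); the insert pattern at the end is shown in its simplest form and ignores the concurrency issues just mentioned:

CREATE TABLE urls (
    id       serial PRIMARY KEY,
    url      text NOT NULL,
    url_md5  varchar(32) NOT NULL UNIQUE
);

CREATE TABLE links (
    alias        varchar(35),
    url_id       integer NOT NULL REFERENCES urls (id),
    modify_date  timestamp without time zone,
    ip           bigint NOT NULL,
    UNIQUE (ip, url_id)   -- the much smaller unique index
);

-- Naive insert pattern (race-prone under concurrent writers):
--   1. SELECT id FROM urls WHERE url_md5 = md5('<the url>');
--   2. If no row, INSERT INTO urls (url, url_md5)
--      VALUES ('<the url>', md5('<the url>')) RETURNING id;
--   3. INSERT INTO links (alias, url_id, modify_date, ip)
--      VALUES ('<alias>', <id>, now(), <ip>);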
On Mon, Nov 21, 2011 at 6:14 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
.....
>> An index on (a, b) can be used for queries involving only a, but not for those involving only b.
>
> That has not been true since 8.2 - a multi-column index may be used even for queries without conditions on the leading columns. It won't be as effective as with conditions on the leading columns, because the whole index must be scanned, but it's usually much cheaper than keeping two indexes (memory requirements, overhead when inserting data, etc.).
>
> Check this:
> http://www.postgresql.org/docs/8.2/interactive/indexes-multicolumn.html

Thanks Tomas. VERY useful information.

I've decided to go with a unique multicolumn index for now. Will ask the experts here if I see some issues.

Thanks!