Netflix Prize data

From: Mark Woodward
Subject: Netflix Prize data
Date:
Msg-id: 18350.24.91.171.78.1159994622.squirrel@mail.mohawksoft.com
Responses: Re: Netflix Prize data  ("Luke Lonergan" <llonergan@greenplum.com>)
           Re: Netflix Prize data  (Tom Lane <tgl@sss.pgh.pa.us>)
           Re: Netflix Prize data  (Gregory Stark <stark@enterprisedb.com>)
           Re: Netflix Prize data  ("Greg Sabino Mullane" <greg@turnstep.com>)
           Re: Netflix Prize data  (Heikki Linnakangas <heikki@enterprisedb.com>)
List: pgsql-hackers
I signed up for the Netflix Prize (www.netflixprize.com), downloaded their
data, and imported it into PostgreSQL. Here is how I created the table:

          Table "public.ratings"
 Column |  Type   | Modifiers
--------+---------+-----------
 item   | integer |
 client | integer |
 rating | integer |
 rdate  | text    |
Indexes:
    "ratings_client" btree (client)
    "ratings_item" btree (item)
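For reference, the \d output above corresponds to roughly this DDL (a reconstruction from the displayed columns and indexes; the original load script is not shown in this message):

```sql
-- Sketch reconstructed from the psql \d output above.
CREATE TABLE public.ratings (
    item   integer,
    client integer,
    rating integer,
    rdate  text
);
CREATE INDEX ratings_item   ON ratings (item);
CREATE INDEX ratings_client ON ratings (client);
```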

markw@snoopy:~/netflix$ time psql netflix -c "select count(*) from ratings"
   count
-----------
 100480507
(1 row)


real    2m6.270s
user    0m0.004s
sys     0m0.005s
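As an aside, when an exact count is not required, the planner's stored estimate avoids the full scan entirely. It is only as fresh as the last VACUUM or ANALYZE, so treat it as approximate:

```sql
-- Approximate row count from the system catalog; no table scan.
-- Accuracy depends on how recently the table was vacuumed/analyzed.
SELECT reltuples::bigint AS approx_rows
FROM pg_class
WHERE relname = 'ratings';
```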


The one thing I notice is that it is REAL slow. I know it is, in fact, 100
million records, but I don't think PostgreSQL is usually slow like this.
I'm going to check with some other machines to see if there is a problem
with my test machine or if something is weird about PostgreSQL and large
numbers of rows.

I tried to cluster the data along a particular index but had to cancel it
after 3 hours.

I'm using 8.1.4. The "rdate" field looks something like "2005-09-06". So
the raw data is 23 bytes; the date string will probably be rounded up to
12 bytes, which makes 24 bytes per row of data. What is the overhead per
variable? Per row?
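As a back-of-envelope check (my assumed numbers, not authoritative for 8.1: roughly a 28-byte heap tuple header, a 4-byte varlena length word on the text column, and a 4-byte line pointer per row), the per-row footprint looks dominated by overhead rather than payload:

```sql
-- Assumed overheads: ~28-byte tuple header, 4-byte varlena header on
-- rdate, 4-byte line pointer; payload is 3 integers + a 10-byte date.
SELECT 28 + (3 * 4) + (4 + 10) + 4 AS approx_bytes_per_row;  -- 58, before alignment padding
```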

Is there any advantage to using "varchar(10)" over "text"?
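For what it's worth, varchar(10) and text use the same variable-length storage in PostgreSQL, so varchar(n) only adds a length check; actually shrinking the column would take a narrower type such as date (4 bytes). A sketch of that conversion, assuming every existing rdate value parses as a date:

```sql
-- Sketch: convert rdate from text to the fixed 4-byte date type.
-- Assumes all existing values are valid ISO dates like '2005-09-06'.
ALTER TABLE ratings ALTER COLUMN rdate TYPE date USING rdate::date;
```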




In the pgsql-hackers list, by date:

Previous
From: Benny Amorsen
Date:
Message: Re: Faster StrNCpy
Next
From: Bruce Momjian
Date:
Message: Re: pgindent has been run