Re: blobs

From: Chad Wagner
Subject: Re: blobs
Date:
Msg-id: 81961ff50702012004m3aeff4d9k2193917682314fb@mail.gmail.com
In reply to: Re: blobs  (Steve Holdoway <steve.holdoway@firetrust.com>)
Responses: Re: blobs  (Steve Holdoway <steve.holdoway@firetrust.com>)
List: pgsql-admin
On 2/1/07, Steve Holdoway <steve.holdoway@firetrust.com> wrote:
>> Why would it be a "bottleneck"?  If you are updating or deleting 300K a
>> week, definitely make sure you take a look at autovacuum and turn it ON.
> It is a potential bottleneck, as I said. I am replacing between 5 and 10
> percent of the data in the table every week. On top of that, there is a
> daily total of about a million hits ( read/update/delete ).
>
> This is the busiest table in the database, and therefore the most likely
> candidate to cause performance problems. That's what I call a bottleneck.
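For what it's worth, on a recent release (I am assuming 8.1 or 8.2 here) turning autovacuum on is just a postgresql.conf change; note that it depends on the row-level statistics collector:

# postgresql.conf (a sketch -- defaults vary by release)
stats_start_collector = on   # prerequisite for the stats collector
stats_row_level = on         # required for autovacuum
autovacuum = on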

The reason I am asking why you think it is a bottleneck is that many MySQL users see "reading" a table as a bottleneck, and that is usually because they are using MyISAM tables.  With MyISAM's table-level locks, writers block readers and readers block writers.  This is NOT the case with PostgreSQL: readers never block other readers, writers never block readers, and readers never block writers.  PostgreSQL does this using multi-versioning (MVCC), similar to how Oracle works.
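A quick two-session sketch of what that means in practice (hypothetical table t):

-- session 1: leave a transaction open that has updated a row
begin;
update t set val = 'new' where id = 1;

-- session 2, concurrently: the reader is not blocked, it simply sees
-- the last committed version of the row
select val from t where id = 1;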


>> Depends on how you are querying the table.  This is really a database
>> modeling question, and leads into many many more questions.  I would say if
>> you're frequently range scanning the table (selecting several rows) and in
>> those cases you rarely need the "blob", then I would fork it off into a
>> child table.  If the "blob" is rarely accessed, and only accessed directly,
>> then definitely a child table in my book.
>>
>> The reason is if you're frequently fetching rows from this table and rarely
>> using the blob, all you are doing is consuming memory that could be better
>> used for other things, and spinning I/O when it is not necessary.
> Sorry, you're completely wrong. If you consider that the only way of getting
> info is select *, then this is true. Personally, I think that anyone who does
> this in code is plain lazy and should find a job more suited to them (:

Okay, I would love to see your test case.  My statement was based on the assumption that you were embedding your data in the same row as a bytea column.  Here is a test case that I ran:

NOTE: The fill factor is set to 10% to exaggerate a wide row; this also demonstrates that EMPTY space in a table affects performance.

drop table foo;
create table foo (x integer not null primary key, y text not null) with (fillfactor = 10);
insert into foo (x, y) select generate_series(1,100000), repeat('1234567890', 10240);
checkpoint;
analyze foo;

select relpages,reltuples,pg_size_pretty(pg_relation_size(relname)) from pg_class where relname = 'foo';
 relpages | reltuples | pg_size_pretty
----------+-----------+----------------
   100000 |    100000 | 781 MB
(1 row)


-- fetch just the integer column
explain analyze select x from foo;
                                                   QUERY PLAN                                                   
-----------------------------------------------------------------------------------------------------------------
 Seq Scan on foo  (cost=0.00..101000.00 rows=100000 width=4) (actual time=10.389..19288.848 rows=100000 loops=1)
 Total runtime: 19700.804 ms
(2 rows)

-- fetch just the text column
explain analyze select y from foo;
                                                   QUERY PLAN                                                   
-----------------------------------------------------------------------------------------------------------------
 Seq Scan on foo  (cost=0.00..101000.00 rows=100000 width=32) (actual time=9.234..19863.485 rows=100000 loops=1)
 Total runtime: 20290.618 ms
(2 rows)

-- fetch both the integer and text column
explain analyze select * from foo;
                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Seq Scan on foo  (cost=0.00..101000.00 rows=100000 width=36) (actual time=40.478..20470.648 rows=100000 loops=1)
 Total runtime: 20904.611 ms
(2 rows)


-- Runtime is essentially the same for all three cases against the "wide" table.  That disproves the theory that which columns you select affects the cost of the scan: a sequential scan has to read whole pages regardless.  Logically the column list matters when you are sorting and merging data sets, not when you are talking about physical reads.
-- Now let's separate the "large" text column from the rest of the data.

drop table bar;
create table bar (x integer not null primary key);
insert into bar (x) select generate_series(1,100000);
checkpoint;
analyze bar;

select relpages,reltuples,pg_size_pretty(pg_relation_size(relname)) from pg_class where relname = 'bar';
 relpages | reltuples | pg_size_pretty
----------+-----------+----------------
      441 |    100000 | 3528 kB
(1 row)


-- fetch from the "skinny" table
explain analyze select * from bar;
                                                 QUERY PLAN                                                 
-------------------------------------------------------------------------------------------------------------
 Seq Scan on bar  (cost=0.00..1441.00 rows=100000 width=4) (actual time=19.552..416.575 rows=100000 loops=1)
 Total runtime: 790.901 ms
(2 rows)

-- Notice how much faster it is; a table with narrow rows scans far faster.

I think this case proves the point that a "wide" blob that is not accessed frequently shouldn't be included in the core table.  Naturally this makes no sense if you are going to fetch and use the blob with every single fetch against the table.
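Concretely, the kind of split I have in mind looks something like this (table and column names are made up):

-- hot, frequently accessed columns stay in the core table
create table item (
    id   integer not null primary key,
    name text not null
);

-- the rarely used blob lives in its own table, joined only on demand
create table item_blob (
    id      integer not null primary key references item (id),
    payload bytea not null
);

-- the common case never drags the wide rows through the buffer cache
select id, name from item where id = 42;
-- the blob is fetched only when it is actually needed
select payload from item_blob where id = 42;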

> I am asking for input from those who have been in this situation before, and
> have experience of the tradeoff of running a separate table for the big stuff
> as against the extra ( 8 byte? ) column that would be added to the master
> table. Why am I asking this? Because Postgres has a unique way of handling
> this kind of data, different even from the last time I used postgres in
> anger - 7.4. It's different from every other rdbms ( and MySQL ), and I have
> no practical experience of it in the wild.

I think you are referring to large objects, which are something completely different from your original question.  Your original message seemed to ask about including binary data in the table versus outside of it.  My understanding of large objects is that they are stored outside your tables (in the pg_largeobject system catalog) and you store the oid of the object in a column of your own table.  So basically you have bytea (byte arrays), which are stored inline with the rest of the row, and large objects, where you store a reference to a separately stored object.
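To make the distinction concrete, here is a sketch (hypothetical tables; note that the server-side lo_import() reads a path on the database server and requires superuser):

-- bytea: the binary data is stored inline with the row
create table docs_bytea (
    id   integer not null primary key,
    body bytea not null
);

-- large object: the row stores only an oid referencing the object
create table docs_lo (
    id   integer not null primary key,
    body oid not null
);
insert into docs_lo (id, body) values (1, lo_import('/tmp/file.bin'));
-- lo_export(body, '/tmp/out.bin') writes it back out; lo_unlink(body) drops it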


>> It really has to do with how you design your schema.
> No. It really has to do with the gathering of relevant information so that I
> can design my schema properly. Academically, 3NF may be the solution to every
> design, but in the real world there are many other things to take into
> account. I've monitored the data flows, sized the hardware to handle the IO,
> and can either spend a week or two benchmarking different solutions from
> cold, or I can take the advice of those with relevant experience ( whom I
> expect to find on this list ) to point me in the right direction first.

That is my point here: the schema should be designed properly.  A poor schema design can yield awful I/O performance, which was essentially the point of my previous message.

Anyway, if you are typically this sarcastic then I would suggest you either learn to phrase your responses better or not respond at all.  Frankly, no one has to help you, and statements like the one above are rude to everyone on this list.

I have plenty of experience building large distributed OLTP systems and multi-terabyte data warehouses, so you can take the advice or leave it.
