Reducing the size of BufferTag & remodeling forks

From: Andres Freund
Subject: Reducing the size of BufferTag & remodeling forks
Msg-id: 20150702133619.GB16267@alap3.anarazel.de
Replies: Re: Reducing the size of BufferTag & remodeling forks (Tom Lane <tgl@sss.pgh.pa.us>)
         Re: Reducing the size of BufferTag & remodeling forks (Alvaro Herrera <alvherre@2ndquadrant.com>)
         Re: Reducing the size of BufferTag & remodeling forks (Simon Riggs <simon@2ndQuadrant.com>)
List: pgsql-hackers

Hi,

I've complained a number of times that our BufferTag is ridiculously
large:

typedef struct buftag
{
    RelFileNode rnode;          /* physical relation identifier */
    ForkNumber  forkNum;
    BlockNumber blockNum;       /* blknum relative to begin of reln */
} BufferTag;

typedef struct RelFileNode
{
    Oid         spcNode;        /* tablespace */
    Oid         dbNode;         /* database */
    Oid         relNode;        /* relation */
} RelFileNode;

That amounts to 20 bytes. That's problematic because we frequently have
to compare or hash the entire buffer tag. Comparing 20 bytes is rather
branch-intensive, and shows up noticeably in profiles. It's also a
stumbling block on the way to a smarter buffer mapping data structure,
because it makes e.g. trees rather deep.
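
To make the 20 bytes concrete, here is a minimal standalone sketch. The typedefs are simplified stand-ins for the real PostgreSQL definitions (in particular ForkNumber is an enum in the tree and is assumed to be 4 bytes here), and buffer_tags_equal() is a hypothetical helper written for illustration, not the comparison code in the tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the PostgreSQL types (assumptions). */
typedef uint32_t Oid;
typedef uint32_t BlockNumber;
typedef int32_t  ForkNumber;    /* an enum in PostgreSQL; assumed 4 bytes */

typedef struct RelFileNode
{
    Oid         spcNode;        /* tablespace */
    Oid         dbNode;         /* database */
    Oid         relNode;        /* relation */
} RelFileNode;

typedef struct buftag
{
    RelFileNode rnode;          /* physical relation identifier */
    ForkNumber  forkNum;
    BlockNumber blockNum;       /* blknum relative to begin of reln */
} BufferTag;

/* Five 4-byte fields, no padding: 20 bytes. */
static_assert(sizeof(BufferTag) == 20, "BufferTag is expected to be 20 bytes");

/*
 * Hypothetical field-by-field comparison. Every buffer mapping probe has
 * to check all five fields before it can declare a match.
 */
static bool
buffer_tags_equal(const BufferTag *a, const BufferTag *b)
{
    return a->rnode.spcNode == b->rnode.spcNode &&
           a->rnode.dbNode == b->rnode.dbNode &&
           a->rnode.relNode == b->rnode.relNode &&
           a->forkNum == b->forkNum &&
           a->blockNum == b->blockNum;
}
```

With short-circuit evaluation that is up to five compare-and-branch steps per probe, which is the cost being complained about above.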

The buffer tag is currently used in two situations:

1) Dealing with the buffer mapping, we need to identify the underlying
   file uniquely and we need the block number (8 bytes).

2) When writing out a block we need, in addition to 1), to have
   information about where to store the file. That requires the
   tablespace and database.

You may know that a filenode (RelFileNode->relNode) is currently *not*
unique across databases and tablespaces.

Additionally you might have noticed that the above description also
disregards relation forks.

I think we should work towards 1) being sufficient for its purpose. My
suggestion to get there is twofold:

1) Introduce a shared pg_relfilenode table. Every table, even
   shared/nailed ones, gets an entry therein. It's there to make it
   possible to uniquely allocate relfilenodes across databases &
   tablespaces.

2) Replace relation forks, with the exception of the init fork which is
   special anyway, with separate relfilenodes, stored in separate
   columns in pg_class.

This scheme has a number of advantages: We don't need to look at the
filesystem anymore to find out whether a relfilenode exists. The buffer
tags are 8 bytes. The number of stats doesn't scale O(#forks *
#relations) anymore, allowing us to add additional forks more easily.
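
As a rough sketch of what an 8 byte tag could look like under this scheme: the struct and helper names below (SmallBufferTag, small_tag_as_u64, small_tags_equal) are hypothetical and only illustrate the idea. It assumes relNode is unique across the whole cluster, so tablespace, database and fork no longer need to be part of the tag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-ins for the PostgreSQL types (assumptions). */
typedef uint32_t Oid;
typedef uint32_t BlockNumber;

/*
 * Hypothetical compact tag. Assumes relNode is allocated uniquely across
 * databases and tablespaces, and that each former fork has its own
 * relfilenode.
 */
typedef struct SmallBufferTag
{
    Oid         relNode;        /* cluster-wide unique relfilenode */
    BlockNumber blockNum;       /* blknum relative to begin of reln */
} SmallBufferTag;

static_assert(sizeof(SmallBufferTag) == 8, "tag is expected to be 8 bytes");

/* Copy the tag into one 64-bit integer; memcpy avoids aliasing problems. */
static uint64_t
small_tag_as_u64(const SmallBufferTag *tag)
{
    uint64_t    v;

    memcpy(&v, tag, sizeof(v));
    return v;
}

/* Equality becomes a single 64-bit comparison. */
static bool
small_tags_equal(const SmallBufferTag *a, const SmallBufferTag *b)
{
    return small_tag_as_u64(a) == small_tag_as_u64(b);
}
```

The same 64-bit value could also be fed directly to a hash function or used as a tree key. The tablespace and database needed at write-out time would then have to come from somewhere other than the tag, e.g. from the proposed pg_relfilenode entry.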

I think something akin to init forks is going to survive because they
have to be copied without access to the catalogs - but that's fine, they
just aren't allowed to go through shared buffers. Afaics that's not a
problem.

Obviously this is a rather high-level description, but right now this
sounds doable to me.

Thoughts?


- Andres


