Re: Pluggable Storage - Andres's take

From: Haribabu Kommi
Subject: Re: Pluggable Storage - Andres's take
Msg-id: CAJrrPGfH7awxE5jDwJ8_gU2J9LE56=jGjEEkXRTG37Js2EC7LA@mail.gmail.com
In response to: Re: Pluggable Storage - Andres's take  (Haribabu Kommi <kommi.haribabu@gmail.com>)
Responses: Re: Pluggable Storage - Andres's take  (Amit Kapila <amit.kapila16@gmail.com>)
           Re: Pluggable Storage - Andres's take  (Haribabu Kommi <kommi.haribabu@gmail.com>)
List: pgsql-hackers

On Wed, Sep 5, 2018 at 2:04 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

On Tue, Sep 4, 2018 at 10:33 AM Andres Freund <andres@anarazel.de> wrote:
Hi,

Thanks for the patches!

On 2018-09-03 19:06:27 +1000, Haribabu Kommi wrote:
> I found a couple of places where zheap uses some extra logic to verify
> whether it is the zheap AM or not, and based on that it takes some extra
> decisions. I am analyzing all of that extra code to see whether any
> callbacks can handle it or not, and how. I can come back with more
> details later.

Yea, I think some of them will need to stay (particularly around
integrating undo) and some other ones we'll need to abstract.
 
OK. I will list all the areas I found, with my observations on how to
abstract them (or leave them as is), and then implement around that.

The following are the changes where the code specifically checks whether
it is a zheap relation or not.

Overall, I found that it needs three new APIs at the following locations,
plus a facility for handler options; a rough sketch of what they might look
like follows the list.
1. RelationSetNewRelfilenode
2. heap_create_init_fork
3. estimate_rel_size
4. A facility to provide handler options (such as skip WAL, etc.).
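
As a rough illustration only (the callback names below are placeholders I
made up, not necessarily what the patch would use), the table AM routine
struct could grow entries along these lines:

    /* hypothetical additions to the table AM routine struct */

    /* called from RelationSetNewRelfilenode, e.g. to set up a metapage */
    void        (*relation_set_new_filenode) (Relation rel, ForkNumber forknum);

    /* called from heap_create_init_fork for unlogged relations */
    void        (*relation_create_init_fork) (Relation rel);

    /* AM-specific size estimation, called from estimate_rel_size */
    void        (*relation_estimate_size) (Relation rel, BlockNumber *pages,
                                           double *tuples, double *allvisfrac);

    /* insert option flags, e.g. whether HEAP_INSERT_SKIP_WAL is usable */
    uint32      (*relation_get_insert_options) (Relation rel);
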
_hash_vacuum_one_page:
    xlrec.flags = RelationStorageIsZHeap(heapRel) ?
        XLOG_HASH_VACUUM_RELATION_STORAGE_ZHEAP : 0;

_bt_delitems_delete:
    xlrec_delete.flags = RelationStorageIsZHeap(heapRel) ?
        XLOG_BTREE_DELETE_RELATION_STORAGE_ZHEAP : 0;


Storing the type of the handler, and adding a new API for special handling
when these new types are detected, can remove the need for the above code.
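
For example (just a sketch; the callback, wrapper, and field names are
invented for illustration), the index AMs could ask the table AM for the
extra xlog flag instead of testing for zheap directly:

    /* hypothetical wrapper around a new table AM callback */
    static inline uint8
    table_index_delete_xlog_flags(Relation rel)
    {
        if (rel->rd_tableamroutine->index_delete_xlog_flags != NULL)
            return rel->rd_tableamroutine->index_delete_xlog_flags(rel);
        return 0;
    }

    /* in _hash_vacuum_one_page / _bt_delitems_delete */
    xlrec.flags = table_index_delete_xlog_flags(heapRel);
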
RelationAddExtraBlocks:
    if (RelationStorageIsZHeap(relation))
    {
        ZheapInitPage(page, BufferGetPageSize(buffer));
        freespace = PageGetZHeapFreeSpace(page);
    }

Adding a new API for PageInit and PageGetHeapFreeSpace to redirect the calls
to the specific table AM handlers.
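
A minimal sketch of that redirection (the callback and wrapper names here
are only placeholders):

    /* in RelationAddExtraBlocks, instead of checking for zheap explicitly */
    table_page_init(relation, page, BufferGetPageSize(buffer));
    freespace = table_page_get_freespace(relation, page);

    /* where the wrappers simply dispatch to the table AM, e.g. */
    static inline void
    table_page_init(Relation rel, Page page, Size pageSize)
    {
        rel->rd_tableamroutine->page_init(page, pageSize);
    }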

visibilitymap_set:
    if (RelationStorageIsZHeap(rel))
    {
        recptr = log_zheap_visible(rel->rd_node, heapBuf, vmBuf,
                                   cutoff_xid, flags);

        /*
         * We do not have a page wise visibility flag in zheap.
         * So no need to set LSN on zheap page.
         */
    }

Handler options may remove the need for the above code.

validate_index_heapscan:
    /* Set up for predicate or expression evaluation */
    /* For zheap relations, the tuple is locally allocated, so free it. */
    ExecStoreHeapTuple(heapTuple, slot, RelationStorageIsZHeap(heapRelation));


This will be solved after changing the validate_index_heapscan function to
use slots.

RelationTruncate:
    /* Create the meta page for zheap */
    if (RelationStorageIsZHeap(rel))
        RelationSetNewRelfilenode(rel, rel->rd_rel->relpersistence,
                                  InvalidTransactionId,
                                  InvalidMultiXactId);

    if (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
        rel->rd_rel->relkind != 'p')
    {
        heap_create_init_fork(rel);

        if (RelationStorageIsZHeap(rel))
            ZheapInitMetaPage(rel, INIT_FORKNUM);
    }

A new API in RelationSetNewRelfilenode and heap_create_init_fork can solve it.
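
Roughly (again with hypothetical callback and field names), the hook points
could look like this, and zheap would call ZheapInitMetaPage() from inside
its callbacks:

    /* at the end of RelationSetNewRelfilenode */
    if (relation->rd_tableamroutine->relation_set_new_filenode != NULL)
        relation->rd_tableamroutine->relation_set_new_filenode(relation,
                                                               MAIN_FORKNUM);

    /* at the end of heap_create_init_fork */
    if (rel->rd_tableamroutine->relation_create_init_fork != NULL)
        rel->rd_tableamroutine->relation_create_init_fork(rel);
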
cluster:
    if (RelationStorageIsZHeap(rel))
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("cannot cluster a zheap table")));
 
No change required.
 
copyFrom:
    /*
     * In zheap, we don't support the optimization for HEAP_INSERT_SKIP_WAL.
     * See zheap_prepare_insert for details.
     * PBORKED / ZBORKED: abstract
     */
    if (!RelationStorageIsZHeap(cstate->rel) && !XLogIsNeeded())
        hi_options |= HEAP_INSERT_SKIP_WAL;

How about requesting the table AM handler to provide options and using them
here?
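
For example (the function and flag names are invented, just to show the
idea), CopyFrom could consult an AM-provided option mask instead of the
relation's storage type:

    /* hypothetical: ask the table AM which insert optimizations it supports */
    uint32      am_options = table_get_insert_options(cstate->rel);

    if ((am_options & TABLE_INSERT_CAN_SKIP_WAL) != 0 && !XLogIsNeeded())
        hi_options |= HEAP_INSERT_SKIP_WAL;
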
ExecuteTruncateGuts:

    // PBORKED: Need to abstract this
    minmulti = GetOldestMultiXactId();

    /*
     * Need the full transaction-safe pushups.
     *
     * Create a new empty storage file for the relation, and assign it
     * as the relfilenode value. The old storage file is scheduled for
     * deletion at commit.
     *
     * PBORKED: needs to be a callback
     */
    if (RelationStorageIsZHeap(rel))
        RelationSetNewRelfilenode(rel, rel->rd_rel->relpersistence,
                                  InvalidTransactionId, InvalidMultiXactId);
    else
        RelationSetNewRelfilenode(rel, rel->rd_rel->relpersistence,
                                  RecentXmin, minmulti);

    if (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED)
    {
        heap_create_init_fork(rel);

        if (RelationStorageIsZHeap(rel))
            ZheapInitMetaPage(rel, INIT_FORKNUM);
    }

A new API inside RelationSetNewRelfilenode can handle it.

ATRewriteCatalogs:
    /* Inherit the storage_engine reloption from the parent table. */
    if (RelationStorageIsZHeap(rel))
    {
        static char *validnsps[] = HEAP_RELOPT_NAMESPACES;
        DefElem    *storage_engine;

        storage_engine = makeDefElemExtended("toast", "storage_engine",
                                             (Node *) makeString("zheap"),
                                             DEFELEM_UNSPEC, -1);
        reloptions = transformRelOptions((Datum) 0,
                                         list_make1(storage_engine),
                                         "toast",
                                         validnsps, true, false);
    }

I don't think anything can be done about this in the API.

ATRewriteTable:
    /*
     * In zheap, we don't support the optimization for HEAP_INSERT_SKIP_WAL.
     * See zheap_prepare_insert for details.
     *
     * ZFIXME / PFIXME: We probably need a different abstraction for this.
     */
    if (!RelationStorageIsZHeap(newrel) && !XLogIsNeeded())
        hi_options |= HEAP_INSERT_SKIP_WAL;

Handler options can solve this as well.

estimate_rel_size:
    if (curpages < 10 &&
        (rel->rd_rel->relpages == 0 ||
         (RelationStorageIsZHeap(rel) &&
          rel->rd_rel->relpages == ZHEAP_METAPAGE + 1)) &&
        !rel->rd_rel->relhassubclass &&
        rel->rd_rel->relkind != RELKIND_INDEX)
        curpages = 10;

    /* report estimated # pages */
    *pages = curpages;

    /* quick exit if rel is clearly empty */
    if (curpages == 0 || (RelationStorageIsZHeap(rel) &&
                          curpages == ZHEAP_METAPAGE + 1))
    {
        *tuples = 0;
        *allvisfrac = 0;
        break;
    }

    /* coerce values in pg_class to more desirable types */
    relpages = (BlockNumber) rel->rd_rel->relpages;
    reltuples = (double) rel->rd_rel->reltuples;
    relallvisible = (BlockNumber) rel->rd_rel->relallvisible;

    /*
     * If it's a zheap relation, then subtract the pages
     * to account for the metapage.
     */
    if (relpages > 0 && RelationStorageIsZHeap(rel))
    {
        curpages--;
        relpages--;
    }

An API may be needed to estimate the relation size based on the handler type.
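
The callback could take over the whole estimation, so that the metapage
adjustment stays inside the zheap code; estimate_rel_size would then just
delegate (sketch only, matching the placeholder signature above):

    /* hypothetical: let the table AM compute the estimates */
    rel->rd_tableamroutine->relation_estimate_size(rel, pages,
                                                   tuples, allvisfrac);
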
pg_stat_get_tuples_hot_updated and others:
    /*
     * Counter tuples_hot_updated stores number of hot updates for heap table
     * and the number of inplace updates for zheap table.
     */
    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL ||
        RelationStorageIsZHeap(rel))
        result = 0;
    else
        result = (int64) (tabentry->tuples_hot_updated);


Is the special condition really needed? The value should be 0 for zheap
anyway, right?

RelationSetNewRelfilenode:
    /* Initialize the metapage for zheap relation. */
    if (RelationStorageIsZHeap(relation))
        ZheapInitMetaPage(relation, MAIN_FORKNUM);

A new API in RelationSetNewRelfilenode can solve this problem.
 

> >> And then:
> >> - lotsa cleanups
> >> - rebasing onto a newer version of the abstract slot patchset
> >> - splitting out smaller patches
> >>
> >>
> >> You'd moved the bulk insert into tableam callbacks - I don't quite get
> >> why? There's not really anything AM specific in that code?
> >>
> >
> > The main reason of adding them to AM is just to provide a control to
> > the specific AM to decide whether they can support the bulk insert or
> > not?
> >
> > The current framework doesn't support an AM-specific bulk insert state
> > being passed from one function to another, and its structure is fixed.
> > This needs to be enhanced to add AM-specific private members as well.
> >
>
> Do you want me to work on it to make it generic to AM methods to extend
> the structure?

I think the best thing here would be to *remove* all AM abstraction for
bulk insert, until it's actually needed. The likelihood of us getting
the interface right and useful without an actual user seems low. Also,
this already is a huge patch...

OK. Will remove them and share the patch.

Bulk insert API changes are removed.
 
 

> @@ -308,7 +308,7 @@ static void CopyFromInsertBatch(CopyState cstate, EState *estate,
>                                       CommandId mycid, int hi_options,
>                                       ResultRelInfo *resultRelInfo,
>                                       BulkInsertState bistate,
> -                                     int nBufferedTuples, TupleTableSlot **bufferedSlots,
> +                                     int nBufferedSlots, TupleTableSlot **bufferedSlots,
>                                       uint64 firstBufferedLineNo);
>  static bool CopyReadLine(CopyState cstate);
>  static bool CopyReadLineText(CopyState cstate);
> @@ -2309,11 +2309,12 @@ CopyFrom(CopyState cstate)
>       void       *bistate;
>       uint64          processed = 0;
>       bool            useHeapMultiInsert;
> -     int                     nBufferedTuples = 0;
> +     int                     nBufferedSlots = 0;
>       int                     prev_leaf_part_index = -1;

> -#define MAX_BUFFERED_TUPLES 1000
> +#define MAX_BUFFERED_SLOTS 1000

What's the point of these renames? We're still dealing in tuples. Just
seems to make the patch larger.

OK. I will correct it.
 

>                               if (useHeapMultiInsert)
>                               {
> +                                     int tup_size;
> +
>                                       /* Add this tuple to the tuple buffer */
> -                                     if (nBufferedTuples == 0)
> +                                     if (nBufferedSlots == 0)
> +                                     {
>                                               firstBufferedLineNo = cstate->cur_lineno;
> -                                     Assert(bufferedSlots[nBufferedTuples] == myslot);
> -                                     nBufferedTuples++;
> +
> +                                             /*
> +                                              * Find out the Tuple size of the first tuple in a batch and
> +                                              * use it for the rest tuples in a batch. There may be scenarios
> +                                              * where the first tuple is very small and rest can be large, but
> +                                              * that's rare and this should work for majority of the scenarios.
> +                                              */
> +                                             tup_size = heap_compute_data_size(myslot->tts_tupleDescriptor,
> +                                                                                                               myslot->tts_values,
> +                                                                                                               myslot->tts_isnull);
> +                                     }

This seems too expensive to me.  I think it'd be better if we instead
used the amount of input data consumed for the tuple as a proxy. Does that
sound reasonable?

Yes, the cstate structure contains the line_buf member that holds the length
of the input line for the row; this can be used as the tuple length to limit
the memory usage. Comments?
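
A rough sketch of what I have in mind (bufferedSize would be a new local
variable, and the byte-limit constant is made up here):

    /* accumulate the raw input line size as a proxy for the tuple size */
    bufferedSize += cstate->line_buf.len;

    /* flush the batch once either limit is reached */
    if (nBufferedTuples == MAX_BUFFERED_TUPLES ||
        bufferedSize >= MAX_BUFFERED_BYTES)
    {
        CopyFromInsertBatch(cstate, estate, mycid, hi_options,
                            resultRelInfo, bistate, nBufferedTuples,
                            bufferedSlots, firstBufferedLineNo);
        nBufferedTuples = 0;
        bufferedSize = 0;
    }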

The attached patches fix the COPY FROM batch insert memory usage limit and
also add grammar support for the USING method to CREATE TABLE AS.

Regards,
Haribabu Kommi
Fujitsu Australia