Обсуждение: REINDEX filtering in the backend
Hi, I had some spare time tonight so I started a prototype to allow filtering the indexes that are processed using the REINDEX command, as Peter suggested in the parallel reindexdb thread [1]. I didn't want to spend too much time enjoying bison and adding new unreserved keywords, so for now I just implemented this syntax to start a discussion for this feature in the next commitfest: REINDEX ( FILTER = COLLATION ) [...] The FILTER clause can be used multiple times, each one is OR-ed with the ReindexStmt's option, so we could easily add a LIBC, ICU and other filters, also making COLLATION (or more realistically a better new keyword) an alias for (LIBC | ICU) for instance. The filtering is done at table level (with and without the concurrently option), so SCHEMA, DATABASE and SYSTEM automatically benefit from it. If this clause is used with a REINDEX INDEX, the statement errors out, as I don't see a valid use case for providing a single index name and asking to possibly filter it at the same time. Under the hood, the filtering is for now done in a single function by appending elements, not removing them. An empty oid list is created, all indexes belonging to the underlying relation are processed by the specific filter(s), and any index that fails to be discarded by at least one filter, even partially, is added to the final list. I also added some minimal documentation and regression tests. I'll add this patch to the next commitfest. [1] https://www.postgresql.org/message-id/7140716c-679e-a0b9-a273-b201329d8891%402ndquadrant.com
Вложения
On Thu, Jul 11, 2019 at 11:14:20PM +0200, Julien Rouhaud wrote: > I didn't want to spend too much time enjoying bison and adding new > unreserved keywords, so for now I just implemented this syntax to > start a discussion for this feature in the next commitfest: > > REINDEX ( FILTER = COLLATION ) [...] > > The FILTER clause can be used multiple times, each one is OR-ed with > the ReindexStmt's option, so we could easily add a LIBC, ICU and other > filters, also making COLLATION (or more realistically a better new > keyword) an alias for (LIBC | ICU) for instance. I would prefer keeping the interface simple with only COLLATION, so as only collation sensitive indexes should be updated, including icu and libc ones. Or would it be useful to have the filtering for both as libicu can break things similarly to glibc in an update still a breakage on one or the other would not happen at the same time? I don't know enough of libicu regarding that, eager to learn. In which case, wouldn't it be better to support that from the start? > The filtering is done at table level (with and without the > concurrently option), so SCHEMA, DATABASE and SYSTEM automatically > benefit from it. If this clause is used with a REINDEX INDEX, the > statement errors out, as I don't see a valid use case for providing a > single index name and asking to possibly filter it at the same time. Supporting that case would not be that much amount of work, no? > I also added some minimal documentation and regression tests. I'll > add this patch to the next commitfest. > > [1] https://www.postgresql.org/message-id/7140716c-679e-a0b9-a273-b201329d8891%402ndquadrant.com + if ((stmt->options & REINDEXOPT_ALL_FILTERS) != 0) + elog(ERROR, "FILTER clause is not compatible with REINDEX INDEX"); [...] + discard indexes whose ordering does not depend on a collation. Note that + the FILTER option is not compatible with <command>REINDEX + SCHEMA</command>. Why do you have both limitations? I think that it would be nice to be able to do both, generating an error for REINDEX INDEX if the index specified is not compatible, and a LOG if the index is not filtered out when a list is processed. Please note that elog() cannot be used for user-facing failures, only for internal ones. +REINDEX (VERBOSE, FILTER = COLLATION) TABLE reindex_verbose; +-- One column, not depending on a collation In order to make sure that a reindex has been done for a given entry with the filtering, an idea is to save the relfilenode before the REINDEX and check it afterwards. That would be nice to make sure that only the wanted indexes are processed, but it is not possible to be sure of that based on your tests, and some tests should be done on relations which have collation-sensitive as well as collation-insensitive indexes. + index = index_open(indexOid, AccessShareLock); + numAtts = index->rd_index->indnatts; + index_close(index, AccessShareLock); Wouldn't it be better to close that after doing the scan? Nit: I am pretty sure that this should be indented. Could you add tests with REINDEX CONCURRENTLY? -- Michael
Вложения
On Wed, Aug 28, 2019 at 02:02:08PM +0900, Michael Paquier wrote: > + index = index_open(indexOid, AccessShareLock); > + numAtts = index->rd_index->indnatts; > + index_close(index, AccessShareLock); > Wouldn't it be better to close that after doing the scan? > > Nit: I am pretty sure that this should be indented. > > Could you add tests with REINDEX CONCURRENTLY? Bonus: support for reindexdb should be added. Let's not forget about it. -- Michael
Вложения
On Wed, Aug 28, 2019 at 7:03 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Thu, Jul 11, 2019 at 11:14:20PM +0200, Julien Rouhaud wrote: > > I didn't want to spend too much time enjoying bison and adding new > > unreserved keywords, so for now I just implemented this syntax to > > start a discussion for this feature in the next commitfest: > > > > REINDEX ( FILTER = COLLATION ) [...] > > > > The FILTER clause can be used multiple times, each one is OR-ed with > > the ReindexStmt's option, so we could easily add a LIBC, ICU and other > > filters, also making COLLATION (or more realistically a better new > > keyword) an alias for (LIBC | ICU) for instance. > > I would prefer keeping the interface simple with only COLLATION, so as > only collation sensitive indexes should be updated, including icu and > libc ones. Or would it be useful to have the filtering for both as > libicu can break things similarly to glibc in an update still a > breakage on one or the other would not happen at the same time? I > don't know enough of libicu regarding that, eager to learn. In which > case, wouldn't it be better to support that from the start? I'm not sure either. Another thing would be to add extra syntax to be able to discard even more indexes. For instance we could store the version of the underlying lib used when the index is (re)created, and do something like REINDEX (FILTER = LIBC!=2.28) or REINDEX (FILTER = LIBC==2.27) or something similar. > > The filtering is done at table level (with and without the > > concurrently option), so SCHEMA, DATABASE and SYSTEM automatically > > benefit from it. If this clause is used with a REINDEX INDEX, the > > statement errors out, as I don't see a valid use case for providing a > > single index name and asking to possibly filter it at the same time. > > Supporting that case would not be that much amount of work, no? Probably not, but I'm dubious about the use case. Adding the lib version in the catalog would be more useful for people who want to write their own rules to reindex specific set of indexes. > > I also added some minimal documentation and regression tests. I'll > > add this patch to the next commitfest. > > > > [1] https://www.postgresql.org/message-id/7140716c-679e-a0b9-a273-b201329d8891%402ndquadrant.com > > + if ((stmt->options & REINDEXOPT_ALL_FILTERS) != 0) > + elog(ERROR, "FILTER clause is not compatible with REINDEX INDEX"); > [...] > + discard indexes whose ordering does not depend on a collation. Note that > + the FILTER option is not compatible with <command>REINDEX > + SCHEMA</command>. > > Why do you have both limitations? That's actually a typo, the documentation should have specified that it's not compatible with REINDEX INDEX, not REINDEX SCHEMA, I'll fix. > I think that it would be nice to be > able to do both, generating an error for REINDEX INDEX if the index > specified is not compatible, and a LOG if the index is not filtered > out when a list is processed. Do you mean having an error if the index does not contain any collation based type? Also, REINDEX only accept a single name, so there shouldn't be any list processing for REINDEX INDEX? I'm not really in favour of adding extra code the filtering when user asks for a specific index name to be reindexed. > Please note that elog() cannot be used > for user-facing failures, only for internal ones. Indeed, I'll change with an ereport and ERRCODE_FEATURE_NOT_SUPPORTED. > > +REINDEX (VERBOSE, FILTER = COLLATION) TABLE reindex_verbose; > +-- One column, not depending on a collation > In order to make sure that a reindex has been done for a given entry > with the filtering, an idea is to save the relfilenode before the > REINDEX and check it afterwards. That would be nice to make sure that > only the wanted indexes are processed, but it is not possible to be > sure of that based on your tests, and some tests should be done on > relations which have collation-sensitive as well as > collation-insensitive indexes. That's what I did when I first submitted the feature in reindexdb. I didn't use them because it means switching to TAP tests. I can drop the simple regression test (especially since I now realize than one is quite broken) and use the TAP one if, or should I keep both? > > + index = index_open(indexOid, AccessShareLock); > + numAtts = index->rd_index->indnatts; > + index_close(index, AccessShareLock); > Wouldn't it be better to close that after doing the scan? Yes indeed. > Could you add tests with REINDEX CONCURRENTLY? Sure! > Bonus: support for reindexdb should be added. Let's not forget about > it. Yep. That was a first prototype to see if this approach is ok. I'll add more tests, run pgindent and reindexdb support if this approach is sensible.
On Wed, Aug 28, 2019 at 10:22:07AM +0200, Julien Rouhaud wrote: >>> The filtering is done at table level (with and without the >>> concurrently option), so SCHEMA, DATABASE and SYSTEM automatically >>> benefit from it. If this clause is used with a REINDEX INDEX, the >>> statement errors out, as I don't see a valid use case for providing a >>> single index name and asking to possibly filter it at the same time. >> >> Supporting that case would not be that much amount of work, no? > > Probably not, but I'm dubious about the use case. Adding the lib > version in the catalog would be more useful for people who want to > write their own rules to reindex specific set of indexes. Hearing from others here would be helpful. My take is that having a simple option doing the filtering, without the need to store anything in the catalogs, would be helpful enough for users mixing both index types on a single table. Others may not agree. >> I think that it would be nice to be >> able to do both, generating an error for REINDEX INDEX if the index >> specified is not compatible, and a LOG if the index is not filtered >> out when a list is processed. > > Do you mean having an error if the index does not contain any > collation based type? Also, REINDEX only accept a single name, so > there shouldn't be any list processing for REINDEX INDEX? I'm not > really in favour of adding extra code the filtering when user asks for > a specific index name to be reindexed. I was referring to adding an error if trying to reindex an index with the filtering enabled. That's useful to inform the user that what he intends to do is not compatible with the options provided. > That's what I did when I first submitted the feature in reindexdb. I > didn't use them because it means switching to TAP tests. I can drop > the simple regression test (especially since I now realize than one is > quite broken) and use the TAP one if, or should I keep both? There is no need for TAP I think. You could for example store the relid and its relfilenode in a temporary table before running the reindex, run the REINDEX and then compare with what pg_class stores. And that's much cheaper than setting a new instance for a TAP test. -- Michael
Вложения
On Thu, Aug 29, 2019 at 2:09 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Wed, Aug 28, 2019 at 10:22:07AM +0200, Julien Rouhaud wrote: > >>> The filtering is done at table level (with and without the > >>> concurrently option), so SCHEMA, DATABASE and SYSTEM automatically > >>> benefit from it. If this clause is used with a REINDEX INDEX, the > >>> statement errors out, as I don't see a valid use case for providing a > >>> single index name and asking to possibly filter it at the same time. > >> > >> Supporting that case would not be that much amount of work, no? > > > > Probably not, but I'm dubious about the use case. Adding the lib > > version in the catalog would be more useful for people who want to > > write their own rules to reindex specific set of indexes. > > Hearing from others here would be helpful. My take is that having a > simple option doing the filtering, without the need to store anything > in the catalogs, would be helpful enough for users mixing both index > types on a single table. Others may not agree. That was already suggested by Thomas and seconded by Peter E., see https://www.postgresql.org/message-id/2b1504ac-3d6c-11ec-e1ce-3daf132b3d37%402ndquadrant.com. I personally think that it's a sensible approach, and I'll be happy to propose a patch for that too if no one worked on this yet. > >> I think that it would be nice to be > >> able to do both, generating an error for REINDEX INDEX if the index > >> specified is not compatible, and a LOG if the index is not filtered > >> out when a list is processed. > > > > Do you mean having an error if the index does not contain any > > collation based type? Also, REINDEX only accept a single name, so > > there shouldn't be any list processing for REINDEX INDEX? I'm not > > really in favour of adding extra code the filtering when user asks for > > a specific index name to be reindexed. > > I was referring to adding an error if trying to reindex an index with > the filtering enabled. That's useful to inform the user that what he > intends to do is not compatible with the options provided. Ok, I can add it if needed. > > That's what I did when I first submitted the feature in reindexdb. I > > didn't use them because it means switching to TAP tests. I can drop > > the simple regression test (especially since I now realize than one is > > quite broken) and use the TAP one if, or should I keep both? > > There is no need for TAP I think. You could for example store the > relid and its relfilenode in a temporary table before running the > reindex, run the REINDEX and then compare with what pg_class stores. > And that's much cheaper than setting a new instance for a TAP test. Oh indeed, good point! I'll work on better tests using this approach then.
On Thu, Aug 29, 2019 at 10:52:55AM +0200, Julien Rouhaud wrote: > That was already suggested by Thomas and seconded by Peter E., see > https://www.postgresql.org/message-id/2b1504ac-3d6c-11ec-e1ce-3daf132b3d37%402ndquadrant.com. > > I personally think that it's a sensible approach, and I'll be happy to > propose a patch for that too if no one worked on this yet. That may be interesting to sort out first then because we'd likely want to know what is first in the catalogs before designing the filtering processing looking at it, no? -- Michael
Вложения
On 2019-08-30 02:10, Michael Paquier wrote: > On Thu, Aug 29, 2019 at 10:52:55AM +0200, Julien Rouhaud wrote: >> That was already suggested by Thomas and seconded by Peter E., see >> https://www.postgresql.org/message-id/2b1504ac-3d6c-11ec-e1ce-3daf132b3d37%402ndquadrant.com. >> >> I personally think that it's a sensible approach, and I'll be happy to >> propose a patch for that too if no one worked on this yet. > > That may be interesting to sort out first then because we'd likely > want to know what is first in the catalogs before designing the > filtering processing looking at it, no? Right. We should aim to get per-object collation version tracking done. And then we might want to have a REINDEX variant that processes exactly those indexes with an out-of-date version number -- and then updates that version number once the reindexing is done. I think that project is achievable for PG13. I think we can keep this present patch in our back pocket for, say, the last commit fest if we don't make sufficient progress on those other things. Right now, it's hard to review because the target is moving, and it's unclear what guidance to give users. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Sep 04, 2019 at 01:54:53PM +0200, Peter Eisentraut wrote: > Right. We should aim to get per-object collation version tracking done. > And then we might want to have a REINDEX variant that processes exactly > those indexes with an out-of-date version number -- and then updates > that version number once the reindexing is done. I think that project > is achievable for PG13. > > I think we can keep this present patch in our back pocket for, say, the > last commit fest if we don't make sufficient progress on those other > things. Right now, it's hard to review because the target is moving, > and it's unclear what guidance to give users. Okay, agreed. I have marked the patch as returned with feedback then. -- Michael