Обсуждение: pageinspect: add tuple_data_record()
Background:
In my usage of pageinspect for debugging or other purposes I often find it frustrating that I don't have a way to easily view a heap page's tuple data as actual records rather than binary data. After encountering this again last week while doing some data recovery after an accidental delete (and manually parsing a few int and timestamp fields with get_byte magic) I decided to write up a patch to add a function to handle this case.Summary:
The new function tuple_data_record() parallels the existing tuple_data_split() function, but instead of returning a bytea array of raw attribute heap values, it returns a row type of the relation being examined.
Status:
The attached patch applies to master and builds cleanly and passes tests locally. It is still WIP (see TODOs below), but is (I believe) functionally complete.
This is my first patch submission, so I encountered a few questions that the coding style docs didn't seem to address, like whether it's preferable to line up variable declarations by name.
After my refactor of tuple_data_split_internal() I'm considering inlining that function (and the similar tuple_data_record_internal() function) into tuple_data_{split,record}(). I'd appreciate feedback on this possibility.
TODO:
- Add documentation.
- Consider inlining _internal functions.
Thanks,
James Coleman
Вложения
Hi, On 2018-10-16 21:39:02 -0400, James Coleman wrote: > Background: > In my usage of pageinspect for debugging or other purposes I often find it > frustrating that I don't have a way to easily view a heap page's tuple data > as actual records rather than binary data. After encountering this again > last week while doing some data recovery after an accidental delete (and > manually parsing a few int and timestamp fields with get_byte magic) I > decided to write up a patch to add a function to handle this case. > Summary: > The new function tuple_data_record() parallels the existing > tuple_data_split() function, but instead of returning a bytea array of raw > attribute heap values, it returns a row type of the relation being examined. I don't think it's entirely safe to do so for invisible rows. The toast references could be reused, constraints be violated, etc. So while I could have used this several times before, it's also very likely a good way to cause trouble. It'd probably be ok to just fetch the binary value of the columns, but detoasting sure ain't ok. > This is my first patch submission, so I encountered a few questions that > the coding style docs didn't seem to address, like whether it's preferable > to line up variable declarations by name. There's ./src/tools/pgindent/pgindent (which needs an external dep), to do that kind of thing... Greetings, Andres Freund
I don't think it's entirely safe to do so for invisible rows. The toast
references could be reused, constraints be violated, etc. So while I
could have used this several times before, it's also very likely a good
way to cause trouble. It'd probably be ok to just fetch the binary
value of the columns, but detoasting sure ain't ok.
Wouldn’t that be equally true for the already existing toast option to tuple_data_split()?
Is “unsafe” here something that would lead to a crash? Or just returning invalid data? Given that pageinspect is already clearly an internal viewing of data, I think allowing invalid or limited data is a reasonable result. 
Alternatively, would returning only inline values be safe (and say, returning null for any toasted data)?
> This is my first patch submission, so I encountered a few questions that
> the coding style docs didn't seem to address, like whether it's preferable
> to line up variable declarations by name.
There's ./src/tools/pgindent/pgindent (which needs an external dep), to
do that kind of thing...
Yes, though seems like maybe the style guide could be a bit more descriptive in some of these areas to be more friendly to newcomers. In contrast the wiki page for submitting a patch is extremely detailed.  
Thanks,
James Coleman
			
		On Tue, Oct 16, 2018 at 6:39 PM James Coleman <jtc331@gmail.com> wrote: > Summary: > The new function tuple_data_record() parallels the existing tuple_data_split() function, but instead of returning a byteaarray of raw attribute heap values, it returns a row type of the relation being examined. I've been doing something similar with a modified btreefuncs.c/bt_page_items() recently. The approach I've taken is pretty grotty. It would be nice to have a reasonable way of doing this, for both heap tuples and index tuples. (Relatedly, bt_page_items() should probably have a bytea "data" field instead of a text "data" field.) -- Peter Geoghegan
Hi, On 2018-10-16 22:05:28 -0400, James Coleman wrote: > > I don't think it's entirely safe to do so for invisible rows. The toast > > references could be reused, constraints be violated, etc. So while I > > could have used this several times before, it's also very likely a good > > way to cause trouble. It'd probably be ok to just fetch the binary > > value of the columns, but detoasting sure ain't ok. > > > > Wouldn’t that be equally true for the already existing toast option to > tuple_data_split()? Yes. Whoever added that didn't think far enough. Teodor, Nikolay? > Is “unsafe” here something that would lead to a crash? Or just returning > invalid data? Given that pageinspect is already clearly an internal viewing > of data, I think allowing invalid or limited data is a reasonable result. Both. > Alternatively, would returning only inline values be safe (and say, > returning null for any toasted data)? I think that'd be somewhat confusing. I don't think it's necessarily just toast that's a problem here, that was just an obvious point. > Yes, though seems like maybe the style guide could be a bit more > descriptive in some of these areas to be more friendly to newcomers. In > contrast the wiki page for submitting a patch is extremely detailed. I think there's a lot in that area we should improve. Greetings, Andres Freund
В письме от 16 октября 2018 19:17:20 пользователь Andres Freund написал: > Hi, > > On 2018-10-16 22:05:28 -0400, James Coleman wrote: > > > I don't think it's entirely safe to do so for invisible rows. The toast > > > references could be reused, constraints be violated, etc. So while I > > > could have used this several times before, it's also very likely a good > > > way to cause trouble. It'd probably be ok to just fetch the binary > > > value of the columns, but detoasting sure ain't ok. > > Wouldn’t that be equally true for the already existing toast option to > > tuple_data_split()? > Yes. Whoever added that didn't think far enough. Teodor, Nikolay? I did compleatly got the question... The question is it safe to split tuple record into array of raw bytea? It is quite safe from my point of view. We use only data that is inside the tuple, and info from pg_catalog that describes the tuple structure. So we are not affected if for example toast table were cleaned by vacuum. If you try to deTOAST data when TOAST table were already overwritten by new data, you can get some trouble... > > Is “unsafe” here something that would lead to a crash? Or just returning > > invalid data? I think that you should try and see. Find a way to make postgres vacuum toast table, and overwrite it with some new data, but keep tuple in heap page untouched. Then try to deTOAST old data and see what happened. (Hint: it is good to turn off compression for your data to be TOASTed, so you can just look into toast-file in binary mode and get an idea what in it now) ====== >> Summary: >> The new function tuple_data_record() parallels the existing >> tuple_data_split() function, but instead of returning a bytea array of raw >> attribute heap values, it returns a row type of the relation being >> examined. I find this approach a bit wrong. pageinspect is about viewing data in raw mode, not about looking at the tuples that were deleted. It can be used for this task, but it has other main purpose. I would suggest instead of automatically detoast and uncompress actual data, I would add functions that: 1. Checks if TOAST id is valid, and data can be fetched 2. Fetches that data and print it in raw mode 3. Says if binary data is compressed varlen data 4. Uncompress binary data and return it as another binary data 5. Convert binary data into string. or something like that. This would be more according to the pageinspect spirit, as I see it. Than using these functions you can write pure sql wrapper that do what you really want. -- Do code for fun.
Вложения
I did compleatly got the question... The question is it safe to split tuple
record into array of raw bytea? It is quite safe from my point of view. We
use only data that is inside the tuple, and info from pg_catalog that
describes the tuple structure. So we are not affected if for example toast
table were cleaned by vacuum. If you try to deTOAST data when TOAST table were
already overwritten by new data, you can get some trouble...
The existing tuple_data_split() method already explicitly allows deTOASTing data,
so if this is a problem, the problem already exists in pageinspect.
I find this approach a bit wrong. pageinspect is about viewing data in raw
mode, not about looking at the tuples that were deleted. It can be used for
this task, but it has other main purpose.
...
Than using these functions you can write pure sql wrapper that do what you
really want.
It's actually extremely difficult to extract real data types from the raw disk data in pure
SQL, which was the reason for this patch.
When using pageinspect to investigate the raw data behind a potential bug or other issue
(for example I've used it in the past pre-9.6 to check the xmin/xmax information to help
track down replication bugs (determining if the tuple information was there but just not
visible after a vacuum or if the tuple data itself was missing). Even in this kind of "under
the hood" inspection it's quite useful to be able to see actual data types rather than binary
data.
In these kinds of use cases I believe that just ignoring TOASTed data is even acceptable
because once you're already down the rabbit hole of looking at non-visible information
you've already accepted that the data may no longer be complete or valid.
В письме от 17 октября 2018 12:36:54 пользователь James Coleman написал: > > I did compleatly got the question... The question is it safe to split > > tuple > > record into array of raw bytea? It is quite safe from my point of view. > > We > > use only data that is inside the tuple, and info from pg_catalog that > > describes the tuple structure. So we are not affected if for example toast > > table were cleaned by vacuum. If you try to deTOAST data when TOAST table > > were > > already overwritten by new data, you can get some trouble... > The existing tuple_data_split() method already explicitly allows deTOASTing > data, > so if this is a problem, the problem already exists in pageinspect. Oh. that's true... I do not remember writing it, but it seems I did it. After Andreas notion, I am sure this should be fixed by removing de_toast option for tuple_data_split... O_o -- Do code for fun.
Вложения
On 2018-10-17 12:36:54 -0400, James Coleman wrote: > > > > > > I did compleatly got the question... The question is it safe to split > > tuple > > record into array of raw bytea? It is quite safe from my point of view. > > We > > use only data that is inside the tuple, and info from pg_catalog that > > describes the tuple structure. So we are not affected if for example toast > > table were cleaned by vacuum. If you try to deTOAST data when TOAST table > > were > > already overwritten by new data, you can get some trouble... > > > > > The existing tuple_data_split() method already explicitly allows deTOASTing > data, > so if this is a problem, the problem already exists in pageinspect. Indeed. But I do think your approach - which means that the binary data is actually interpreded as a datum of a specific type, drastically increases the risk. Greetings, Andres Freund
Indeed. But I do think your approach - which means that the binary data is
actually interpreded as a datum of a specific type, drastically
increases the risk.
Agreed.
As I noted earlier, I don't at all think deTOASTing is a must for this function to be
valuable, just as tuple_data_split() is also valuable without deTOASTINGing.
I believe "best effort" is very reasonable in the case of a what is an investigatory
method to begin with. 
Hi, On 2018-10-17 13:04:33 -0400, James Coleman wrote: > Indeed. But I do think your approach - which means that the binary data is > > actually interpreded as a datum of a specific type, drastically > > increases the risk. > > > > > Agreed. > > As I noted earlier, I don't at all think deTOASTing is a must for this > function to be > valuable, just as tuple_data_split() is also valuable without deTOASTINGing. > > I believe "best effort" is very reasonable in the case of a what is an > investigatory > method to begin with. It's far from only toast that could be problematic here. Greetings, Andres Freund
It's far from only toast that could be problematic here.
Do you have an example in mind? Wouldn’t the tuples have to be corrupted in some way to have problems with being interpreted as a datum? Or are you thinking very old tuples with a radically different structure to be causing the problem?
			
		Hi, On 2018-10-17 13:14:17 -0400, James Coleman wrote: > > It's far from only toast that could be problematic here. > > > > Do you have an example in mind? Wouldn’t the tuples have to be corrupted in > some way to have problems with being interpreted as a datum? Or are you > thinking very old tuples with a radically different structure to be causing > the problem? There's plenty ways it can go horribly wrong. Let's start with something simple: BEGIN; ALTER TABLE ... ADD COLUMN blarg INT; INSERT ... (blag) VALUES (132467890); ROLLBACK; ALTER TABLE ... ADD COLUMN blarg TEXT; If you now read the table with your function you'll see a dead row that will re-interpret a int datum as a text datum. Which in all likelyhood will crash the server. And yes, it's more likely to break with your patch, but I think the current *split* function is dangerous too, and shouldn't have been added without *A LOT* more attention to issues like this. Greetings, Andres Freund
There's plenty ways it can go horribly wrong. Let's start with something
simple:
BEGIN;
ALTER TABLE ... ADD COLUMN blarg INT;
INSERT ... (blag) VALUES (132467890);
ROLLBACK;
ALTER TABLE ... ADD COLUMN blarg TEXT;
If you now read the table with your function you'll see a dead row that
will re-interpret a int datum as a text datum. Which in all likelyhood
will crash the server.
That particular case gives this result:
ERROR:  number of attributes in tuple header is greater than number of attributes in tuple descriptor
Some more extended monkeying with adding/dropping columns repeatedly
gave this result:
ERROR:  unexpected end of tuple data
That error (unexpected end of tuple data) should (at least in the non-TOAST case)
prevent the bug of reading beyond the raw tuple data in memory, which would be
the easiest way I could imagine to cause a serious problem.
Is there a case that could crash outside of a non-primitive type that has unsafe
data reading code?
On 2018-10-17 17:02:20 -0400, James Coleman wrote: > > There's plenty ways it can go horribly wrong. Let's start with something > > simple: > > > > BEGIN; > > ALTER TABLE ... ADD COLUMN blarg INT; > > INSERT ... (blag) VALUES (132467890); > > ROLLBACK; > > > > ALTER TABLE ... ADD COLUMN blarg TEXT; > > > > If you now read the table with your function you'll see a dead row that > > will re-interpret a int datum as a text datum. Which in all likelyhood > > will crash the server. > > > > That particular case gives this result: > ERROR: number of attributes in tuple header is greater than number of > attributes in tuple descriptor I don't see why you'd get that error, if you re-add a column (with a different type), as I did above? Obviously the "replacement" column addition would need to be committed. > Some more extended monkeying with adding/dropping columns repeatedly > gave this result: > ERROR: unexpected end of tuple data > > That error (unexpected end of tuple data) should (at least in the non-TOAST > case) > prevent the bug of reading beyond the raw tuple data in memory, which would > be > the easiest way I could imagine to cause a serious problem. You don't need to read beyond the end of the data. You just need to do something like reinterpret types, where the original type looks enough like a toast header (e.g.) to cause problems. > Is there a case that could crash outside of a non-primitive type that has > unsafe data reading code? Just about anything that's not a fixed length type would be unsafe. Greetings, Andres Freund
I don't see why you'd get that error, if you re-add a column (with a
different type), as I did above? Obviously the "replacement" column
addition would need to be committed.
Here's my test case:
CREATE TABLE t3(i INTEGER);
BEGIN;
  ALTER TABLE t3 ADD COLUMN blarg INT;
  INSERT INTO t3(blarg) VALUES (132467890);
ROLLBACK;
ALTER TABLE t3 ADD COLUMN blarg TEXT;
SELECT *, tuple_data_split(
  't3'::regclass, t_data, t_infomask, t_infomask2, t_bits
) FROM heap_page_items(get_raw_page('t3', 0));
SELECT tdr.*, blarg IS NULL
FROM heap_page_items(get_raw_page('t3', 0))
LEFT JOIN LATERAL (
  SELECT *
  FROM tuple_data_record(
    't3'::regclass, t_data, t_infomask, t_infomask2, t_bits
  ) AS n(i INTEGER, blarg TEXT)
) tdr ON true;
DROP TABLE t3;
and here's the output:
CREATE TABLE
BEGIN
ALTER TABLE
INSERT 0 1
ROLLBACK
ALTER TABLE
psql:test.sql:12: ERROR:  unexpected end of tuple data
psql:test.sql:21: ERROR:  unexpected end of tuple data
DROP TABLE
> That error (unexpected end of tuple data) should (at least in the non-TOAST
> case)
> prevent the bug of reading beyond the raw tuple data in memory, which would
> be
> the easiest way I could imagine to cause a serious problem.
You don't need to read beyond the end of the data. You just need to do
something like reinterpret types, where the original type looks enough
like a toast header (e.g.) to cause problems.
> Is there a case that could crash outside of a non-primitive type that has
> unsafe data reading code?
Just about anything that's not a fixed length type would be unsafe.
Let's use the example of TEXT (that isn't TOASTed). If some bytes
are incorrectly interpreted as variable length text, then you have two
possibilities that I can see:
1. The determined length is less than the fixed type's space, in which
case we get bogus interpretation but it is not unsafe.
2. The determined length is greater than the fixed type's space. In this
case we get either bogus interpretation (of data beyond the fixed type's
space) or we get an error (like above).
Assuming bounds checking disallows reading beyond the raw tuple
data, then I still do not understand how a variable length field like
text (outside of TOAST at least) could be actually unsafe. It obviously
could give bizarre data.