Обсуждение: (Updated) Table File Format
I need a little help on the format of the postgres tables. I've got this wonderfully corrupted database where just about everything isfubar. I've tried a number of things to get itback using postgres andrelated tools with no success. It looks like most of the data is there, butthere may be a smallamount of corruption that's causing all kinds ofproblems. I've broken down and begin development of a tool to allow examination of thedata within the table files. This could actually be useful for recoveringand undoing changes (or at least until the row-reusecode goes intoproduction). I've been hacking the file format and trying to find stuff in the source anddocs as much as possible, but here goes... a) tuples cannot span multiple pages (yet).b) the data is not platform independant??? Ie the data from a sun looksdifferentfrom an intel? For every page, I see that the first 2 words are for the end of the tuplepointers and the beginning of the tuple data. What are the next 2 words used for? In all my cases they appear to be set to0x2000. Following that I find the 2 word tuple pointers. The first word appears to be the offset in the page where the tuple can be found but the MSB has to be stripped off (haven't found it's function in the source yet). The second is the transactionid that, if comitted gives this tuple visibility??? Are these tuple pointers always stored in order of last to first? Or should I be loading and sorting them according to offset? Now on to the tuple data... I have my tool to the point where it extracts all the tuple data from the table, but I haven't been able to find the place in the postgres source that explains the format. I assume a tuple contains a number of attributes (referencing pg_attribute). Those not found in the tuple would be assumed to be NULL. Since I'm ignoring transaction ids right now, I'm planning on extracting all the tuple and ordering them by oid so you can see all the comitted and uncomitted changes. I may even make it look good once I've recovered my data... -Michael
"Michael Richards" <miker@interchange.ca> writes:
> Following that I find the 2 word tuple pointers.
> The first word appears to be the offset in the page where the tuple can be
> found but the MSB has to be stripped off (haven't found it's function in the
> source yet).
> The second is the transactionid that, if comitted gives this tuple
> visibility???
No, offset and length --- there is also a 2-bit flags field. Look at
the page and item declarations in src/include/storage/
Someone else was recently working on a bit-level dump tool, but I've
forgotten who.
regards, tom lane
Michael Richards wrote:
>
> I need a little help on the format of the postgres tables.
>
> I've got this wonderfully corrupted database where just about everything is
> fubar. I've tried a number of things to get it back using postgres and
> related tools with no success. It looks like most of the data is there, but
> there may be a small amount of corruption that's causing all kinds of
> problems.
Find attached a python script that I used to get deleted (actually all
;)
records from a database table.
It was not corrupted, just a simple programming error in client software
had deleted more than needed.
Fortunately it was not vacuumed so the data itself (info for web-based
paper
postcard sending system) was there
It works as-is only for my table as the field extraction code is
hard-coded, but
it should be quite easy to modify for your needs
It worked 1 year ago probably on 6.4.x . I hope that the structure had
not
changed since.
sendcard.py is the actual script used, pgtabdump.py is a somewhat
cleaned-up version
---------------
Hannu#!/usr/bin/python
import sys,os,struct,string
site_base = "/var/lib/pgsql/base"
db_name = "betest"
table_name = "test"
db_path = os.path.join(site_base,db_name)
table_path = os.path.join(db_path,table_name)
page_size = 8*1024
def strbits(s,len): bits = [] while s: c = s[0] s = s[1:] b = struct.unpack('B',c)[0] for i
inrange(8): if b & (1<<i): bits.append(1) else: bits.append(0) return
string.join(map(str,bits),'')[:len]
class table_page: "class to represent a database table page" def __init__(self,fd,page_nr):
fd.seek(page_nr*page_size) self.rawdata = fd.read(page_size) self.lower,\ self.upper,\
self.special,\ self.opaque = struct.unpack('HHHH',self.rawdata[:8]) self.item_pointers=[]
self.items=[] for i in range(8,self.lower,4): rawItemIdData = self.rawdata[i:i+4]
ItemIdData_I32= struct.unpack('L',rawItemIdData)[0] if not ItemIdData_I32: break lp_len =
int(ItemIdData_I32>> 17) lp_flags = int((ItemIdData_I32 >> 15) & 3) lp_off = int(ItemIdData_I32 &
0x7fff)
self.item_pointers.append(hex(ItemIdData_I32),lp_off,lp_flags,lp_len) rawItemData =
self.rawdata[lp_off:lp_off+lp_len] t_oid = struct.unpack('L', rawItemData[ 0: 4])[0] t_001 =
struct.unpack('L',rawItemData[ 4: 8])[0] t_002 = struct.unpack('L', rawItemData[ 8:12])[0] t_xmin
=struct.unpack('L', rawItemData[12:16])[0] t_xmax = struct.unpack('L', rawItemData[16:20])[0]
t_ctid= struct.unpack('LH', rawItemData[20:26]) t_fcnt = struct.unpack('H', rawItemData[26:28])[0]
t_xxx = struct.unpack('2B', rawItemData[28:30]) t_doff = struct.unpack('B', rawItemData[30:31])[0]
t_mask = strbits(rawItemData[31:t_doff],t_fcnt)
# t_mask = rawItemData[31:t_doff] t_data = rawItemData[t_doff:]
self.items.append(t_oid,t_001,t_002,t_xmin,t_xmax,t_ctid,t_fcnt,t_xxx,t_doff,t_mask,t_data) def __str__(self):
strval= [str((self.lower, self.upper, self.special, self.opaque))]
strval.append(string.join(map(str,self.item_pointers),'\n')) strval.append(string.join(map(repr,self.items),'\n'))
return string.join(strval,'\n------------------\n')
if __name__=="__main__": print '# dumping %s' % table_path fd = open(table_name) page = table_page(fd,0) print
page
#!/usr/bin/python
import sys,os,struct,string
table_name = "sendcard"
page_size = 8*1024
def strbits(s,len): bits = [] while s: c = s[0] s = s[1:] b = struct.unpack('B',c)[0] for i
inrange(8): if b & (1<<i): bits.append(1) else: bits.append(0) bits =
string.join(map(str,bits[:len]),'') return bits
"""
rapos=> \d sendcard
Table = sendcard
+----------------------------------+----------------------------------+-------+
| Field | Type | Length|
+----------------------------------+----------------------------------+-------+
| card_id | int4 default nextval ( 'card_id_ | 4 |
| card_name | text | var |
| recipient_name | text | var |
| recipient_street_name | text | var |
| recipient_house_nr | text | var |
| recipient_apartment_nr | text | var |
| recipient_city | text | var |
| recipient_village | text | var |
| recipient_state | text | var |
| recipient_zip_code | text | var |
| sender_name | text | var |
| sender_email | text | var |
| message | text | var |
| bank_ref_id | text | var |
| delivery_nr | text | var |
| delivery_route | text | var |
| sender_remote_addr | text | var |
| card_cost | float8 | 8 |
| cookie_id | int4 default nextval ( 'cookie_i | 4 |
| is_payed | text | var |
| printing_date | datetime | 8 |
| delivery_date | date | 4 |
| payment_date | date | 4 |
| entered_at | timestamp default now ( ) | 4 |
| send_bill_to | text | var |
+----------------------------------+----------------------------------+-------+
"""
create_statement = """\
create table savecard( u_oid int, u_xmin int, u_xmax int,
card_id int4, card_name text, recipient_name text, recipient_street_name
text, recipient_house_nr text, recipient_apartment_nr text, recipient_city text, recipient_village
text, recipient_state text, recipient_zip_code text, sender_name text, sender_email
text, message text, bank_ref_id text, delivery_nr text,
delivery_route text, sender_remote_addr text
);
copy savecard from stdin;
"""
def bin2text(buf,offset): offset = ((offset + 3)/4)*4 len = int(struct.unpack('L',buf[offset:offset+4])[0]) str =
buf[offset+4:offset+len] newoff = offset+len return str,newoff
class table_page: "class to represent a database table page" def __init__(self,fd,page_nr):
fd.seek(page_nr*page_size) self.rawdata = fd.read(page_size) self.lower,\ self.upper,\
self.special,\ self.opaque = struct.unpack('HHHH',self.rawdata[:8]) self.item_pointers=[]
self.items=[] for i in range(8,self.lower,4): rawItemIdData = self.rawdata[i:i+4]
ItemIdData_I32= struct.unpack('L',rawItemIdData)[0] if not ItemIdData_I32: break lp_len =
int(ItemIdData_I32>> 17) lp_flags = int((ItemIdData_I32 >> 15) & 3) lp_off = int(ItemIdData_I32 &
0x7fff)
self.item_pointers.append(hex(ItemIdData_I32),lp_off,lp_flags,lp_len) rawItemData =
self.rawdata[lp_off:lp_off+lp_len] t_oid = struct.unpack('L', rawItemData[ 0: 4])[0] t_001 =
struct.unpack('L',rawItemData[ 4: 8])[0] t_002 = struct.unpack('L', rawItemData[ 8:12])[0] t_xmin
=struct.unpack('L', rawItemData[12:16])[0] t_xmax = struct.unpack('L', rawItemData[16:20])[0]
t_ctid= struct.unpack('LH', rawItemData[20:26]) t_fcnt = struct.unpack('H', rawItemData[26:28])[0]
t_xxx = struct.unpack('2B', rawItemData[28:30]) t_doff = struct.unpack('B', rawItemData[30:31])[0]
t_mask = strbits(rawItemData[31:t_doff],t_fcnt) t_data = rawItemData[t_doff:] id =
int(struct.unpack('L',t_data[:4])[0]) values = [int(t_oid),int(t_xmin),int(t_xmax),id] texts = []
offset = 4 for c in t_mask[1:17]: if not int(c): values.append('')
continue text,offset = bin2text(t_data,offset) values.append(repr(text))
values=values+texts self.items.append(values)
# self.items.append(t_oid,t_001,t_002,t_xmin,t_xmax,t_ctid,t_fcnt,t_xxx,t_doff,t_mask,t_data) def
__str__(self):
# strval = [str((self.lower, self.upper, self.special, self.opaque))]
# strval.append(string.join(map(str,self.item_pointers),'\n'))
# strval.append(string.join(map(repr,self.items),'\n'))
# return string.join(strval,'\n------------------\n') rows = [] for record in self.items:
rows.append(string.join(map(str,record),'\t')) return string.join(rows,'\n')
if __name__=="__main__":
# print '# dumping %s' % table_name print create_statement fd = open(table_name) page_nr = 0 while 1:
try: page = table_page(fd,page_nr) print page except: break page_nr =
page_nr+ 1 print '\\.\n'
Okay, Where would I find a definition of the tuple data? I didn't see anything promising in include/storage? I've found a definition for the page inside pagebuf.h That clears up all the page stuff. I'm still having a little trouble decoding the tuple data within. Hannu Krosing sent me a python script to do the extract, but having never seen a line of Python before in my life, I'm having a little trouble with the actual tuple data. I can see where the actual transaction visibility info is in the tuple data, but the actual data... nope. My program (c++) is at the point where it will create tuple objects for every block of "tuple" data within the page. thanks -Michael ----- Original Message ----- From: "Tom Lane" <tgl@sss.pgh.pa.us> To: "Michael Richards" <miker@interchange.ca> Cc: <pgsql-hackers@postgresql.org> Sent: Thursday, December 14, 2000 12:44 AM Subject: Re: [HACKERS] (Updated) Table File Format > "Michael Richards" <miker@interchange.ca> writes: > > Following that I find the 2 word tuple pointers. > > The first word appears to be the offset in the page where the tuple can be > > found but the MSB has to be stripped off (haven't found it's function in the > > source yet). > > The second is the transactionid that, if comitted gives this tuple > > visibility??? > > No, offset and length --- there is also a 2-bit flags field. Look at > the page and item declarations in src/include/storage/ > > Someone else was recently working on a bit-level dump tool, but I've > forgotten who. > > regards, tom lane
Michael Richards wrote: > > Okay, > > Where would I find a definition of the tuple data? I didn't see anything > promising in include/storage? > > I've found a definition for the page inside pagebuf.h That clears up all the > page stuff. I'm still having a little trouble decoding the tuple data > within. Hannu Krosing sent me a python script to do the extract, but having > never seen a line of Python before in my life, I'm having a little trouble > with the actual tuple data. I can see where the actual transaction > visibility info is in the tuple data, but the actual data... nope. My > program (c++) is at the point where it will create tuple objects for every > block of "tuple" data within the page. IIRC, the data field format for individual fields is the same as defined in the back-end/front-end protocol for binary cursors. if there are any NULL fields in the record then there is a flag somewhere in the tuple header and a bitmap of N*32 bits (N=no_of_fields/32) for missing . It is possible that there is no flag and you must deduce the presence of bitmap from the tuple-header length, im not sure which way it was. The actual fields in a table and their order must be extracted from pg_class and pg_attribute tables. ------------ Hannu