Discussion: empty text fields
On a tangent to the never-ending NULL debate, I've got a table:

CREATE TABLE events (
    event_id INTEGER PRIMARY KEY,
    tag_fk INTEGER REFERENCES tags (tag_id),
    place_fk INTEGER REFERENCES places (place_id),
    event_date CHAR(18) NOT NULL DEFAULT '000000003000000001',
    sort_date DATE NOT NULL DEFAULT '40041024BC',
    event_note TEXT NOT NULL DEFAULT ''
);

The event_note will contain text in roughly 1 out of 5 rows:

pgslekt=> select count(*) from events;
 count
-------
 29473
(1 row)

pgslekt=> select count(*) from events where event_note <> '';
 count
-------
  5572
(1 row)

I wonder if this is sane design, in theory and in practice, or should I
break out the event_note field in a separate table?
-- 
Leif Biberg Kristensen | Registered Linux User #338009
http://solumslekt.org/ | Cruising with Gentoo/KDE
On Wed, Jun 28, 2006 at 05:14:42PM +0200, Leif B. Kristensen wrote:
> On a tangent to the never-ending NULL debate, I've got a table:
>
> CREATE TABLE events (
>     event_id INTEGER PRIMARY KEY,
>     tag_fk INTEGER REFERENCES tags (tag_id),
>     place_fk INTEGER REFERENCES places (place_id),
>     event_date CHAR(18) NOT NULL DEFAULT '000000003000000001',
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is Bad™. What's wrong with TIMESTAMPTZ?

>     sort_date DATE NOT NULL DEFAULT '40041024BC',
>     event_note TEXT NOT NULL DEFAULT ''

This is generally bad, too. It's got MySQL goo all over it. Do you
want NOT NULL, or do you want a DEFAULT that's meaningful?

> );
>
> The event_note will contain text in roughly 1 out of 5 rows:
>
> pgslekt=> select count(*) from events;
>  count
> -------
>  29473
> (1 row)
>
> pgslekt=> select count(*) from events where event_note <> '';
>  count
> -------
>   5572
> (1 row)
>
> I wonder if this is sane design, in theory and in practice, or should I
> break out the event_note field in a separate table?

Only if it's a 1:N relationship. In this case, I'd say scrap the NOT
NULL requirement and replace the empty strings with NULLs.

Cheers,
D
-- 
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778 AIM: dfetter666 Skype: davidfetter
Remember to vote!
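[For the archive: the migration David suggests could be written along these lines against the schema as posted. A sketch only, not tested against the original database:]

```sql
-- Allow NULLs in event_note, drop the empty-string default,
-- and convert existing empty strings to NULL.
ALTER TABLE events ALTER COLUMN event_note DROP NOT NULL;
ALTER TABLE events ALTER COLUMN event_note DROP DEFAULT;
UPDATE events SET event_note = NULL WHERE event_note = '';
```

After this, `count(*) ... where event_note is not null` replaces the `<> ''` test.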
On Wednesday 28. June 2006 17:37, David Fetter wrote:
> On Wed, Jun 28, 2006 at 05:14:42PM +0200, Leif B. Kristensen wrote:
>> On a tangent to the never-ending NULL debate, I've got a table:
>>
>> CREATE TABLE events (
>>     event_id INTEGER PRIMARY KEY,
>>     tag_fk INTEGER REFERENCES tags (tag_id),
>>     place_fk INTEGER REFERENCES places (place_id),
>>     event_date CHAR(18) NOT NULL DEFAULT '000000003000000001',
>
>       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> This is Bad™. What's wrong with TIMESTAMPTZ?

What's Bad™? The event_date field is a "fuzzy date" construct. It will
allow the storage of such dates as "1784", "ca. 1810", "May 1852",
"1798 or 1799", "between 1820 and 1830" and so on. It's very useful in
historical research to handle such dates meaningfully.

By the way, I was also going to ask sometime if there's a better way
to handle such a construct than an unspecified CHAR(18) column.

>>     sort_date DATE NOT NULL DEFAULT '40041024BC',
>>     event_note TEXT NOT NULL DEFAULT ''
>
> This is generally bad, too. It's got MySQL goo all over it. Do you
> want NOT NULL, or do you want a DEFAULT that's meaningful?

Sure. I came to PostgreSQL about a year ago. I've been using MySQL for
read-only data since 2002, but when I started writing my own
data-entry routines, I found the "quiet truncation" misfeature of
MySQL to render it all but useless.

>> I wonder if this is sane design, in theory and in practice, or
>> should I break out the event_note field in a separate table?
>
> Only if it's a 1:N relationship. In this case, I'd say scrap the NOT
> NULL requirement and replace the empty strings with NULLs.

Thanks. I'll think it over.
-- 
Leif Biberg Kristensen | Registered Linux User #338009
http://solumslekt.org/ | Cruising with Gentoo/KDE
On Wed, Jun 28, 2006 at 06:25:22PM +0200, Leif B. Kristensen wrote:
>>> event_date CHAR(18) NOT NULL DEFAULT
> The event_date field is a "fuzzy date" construct. It will allow the
> storage of such dates as "1784", "ca. 1810", "May 1852", "1798 or
> 1799", "between 1820 and 1830" and so on. It's very useful in
> historical research to handle such dates meaningfully.

How and where do you handle the fuzziness of it? In the application?
We have to deal with the same thing in medical history data and
haven't yet satisfactorily solved it.

> By the way, I was also going to ask sometime if there's a better way
> to handle such a construct than an unspecified CHAR(18) column.

A composite type comes to mind. Tagged types (google for
"tagged_type") would probably help, too.

A "full-blown" implementation of a fuzzy timestamp type which

a) preserves the input/update timestamp
b) allows setting the accuracy of the value per row
c) allows for known "modifiers" and "terms" ("mid-summer", "second half of ...")
d) allows for an arbitrary textual addition ("ca.", "probably")

would be great. I know I should be helping to write one instead of
hoping someone does it for me. But I lack the expertise to do it
properly. I am willing to help, at any rate, though.

Karsten
-- 
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
self-replying with additional information

On Wed, Jun 28, 2006 at 09:33:18PM +0200, Karsten Hilbert wrote:
> A "full-blown" implementation of a fuzzy timestamp type which
>
> a) preserves the input/update timestamp

which tagged_types is able to handle

> b) allows setting the accuracy of the value per row

which I have a Python (application) class for wrapping
mx.DateTime.DateTime

> c) allows for known "modifiers" and "terms" ("mid-summer", "second half of ...")
> d) allows for an arbitrary textual addition ("ca.", "probably")

which tagged_types should be able to handle as well

Karsten
-- 
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
On Wednesday 28. June 2006 21:33, Karsten Hilbert wrote:
> On Wed, Jun 28, 2006 at 06:25:22PM +0200, Leif B. Kristensen wrote:
>>>> event_date CHAR(18) NOT NULL DEFAULT
>>
>> The event_date field is a "fuzzy date" construct. It will allow the
>> storage of such dates as "1784", "ca. 1810", "May 1852", "1798 or
>> 1799", "between 1820 and 1830" and so on. It's very useful in
>> historical research to handle such dates meaningfully.
>
> How and where do you handle the fuzziness of it? In the
> application? We have to deal with the same thing in medical
> history data and haven't yet satisfactorily solved it.

For now, I'm handling it in the application. But as I'm very much into
moving as much as possible into views, functions and triggers of the
db, I'm investigating how to do it in a more efficient way.

The construct has been lifted pretty much unmodified from "The Master
Genealogist" (TMG), the genealogy program I was using until I decided
to write my own. The format of the string is YYYYMMDDAYYYYMMDDB. The
YYYYMMDD is a "normal" date, where the day may be set to 00 if only
the month is known, and the month likewise set to 00 if only the year
is known. The A is a qualifier, which can take the following values:

0 = before (date1),
1 = say (date1),
2 = ca. (date1),
3 = exact (date1),
4 = after (date1),
5 = between (date1) and (date2),
6 = (date1) or (date2),
7 = from (date1) to (date2).

The B can be either 0 for BC or 1 for AD.

>> By the way, I was also going to ask sometime if there's a better way
>> to handle such a construct than an unspecified CHAR(18) column.
>
> A composite type comes to mind. Tagged types (google for
> "tagged_type") would probably help, too.
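[For the archive: the 18-character format described above unpacks mechanically. A minimal Python sketch; the function and field names are illustrative, not anything from TMG or pgslekt:]

```python
# Parse the TMG-style fuzzy date string YYYYMMDDAYYYYMMDDB described
# above: two 8-digit dates, a one-digit qualifier A, a BC/AD flag B.
QUALIFIERS = {
    '0': 'before', '1': 'say', '2': 'ca.', '3': 'exact',
    '4': 'after', '5': 'between', '6': 'or', '7': 'from-to',
}

def parse_fuzzy_date(s):
    if len(s) != 18:
        raise ValueError('expected 18 characters, got %d' % len(s))
    # A month or day of 00 means "unknown", per the description above.
    date1 = (int(s[0:4]), int(s[4:6]), int(s[6:8]))     # year, month, day
    qualifier = QUALIFIERS[s[8]]
    date2 = (int(s[9:13]), int(s[13:15]), int(s[15:17]))
    era = 'AD' if s[17] == '1' else 'BC'
    return {'date1': date1, 'qualifier': qualifier,
            'date2': date2, 'era': era}
```

The table's default '000000003000000001' then reads as "exactly date zero, AD", i.e. an all-unknown exact date.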
> A "full-blown" implementation of a fuzzy timestamp type which
>
> a) preserves the input/update timestamp
> b) allows setting the accuracy of the value per row
> c) allows for known "modifiers" and "terms" ("mid-summer", "second
>    half of ...")
> d) allows for an arbitrary textual addition ("ca.", "probably")
>
> would be great. I know I should be helping to write one
> instead of hoping someone does it for me. But I lack the
> expertise to do it properly. I am willing to help, at any
> rate, though.

I haven't felt any need to enter more irregular dates than those
described above. I think that if a date is that much undetermined, it
would be better to render it textually in a free-text note.
-- 
Leif Biberg Kristensen | Registered Linux User #338009
http://solumslekt.org/ | Cruising with Gentoo/KDE
Leif B. Kristensen wrote:
> 0 = before (date1),
> 1 = say (date1),
> 2 = ca. (date1),
> 3 = exact (date1),
> 4 = after (date1),
> 5 = between (date1) and (date2),
> 6 = (date1) or (date2),
> 7 = from (date1) to (date2).

This kind of reeks like a begin/end date and an accuracy quantifier,
though that wouldn't account for option 6.

Your cases 0 to 5 and 7 would be transformed into something like:

   date A  date B  accuracy
0: NULL    date1   5
1: date1   date1   3
2: date1   date1   2  (is 'ca.' less accurate than 'say'?)
3: date1   date1   5
4: date1   NULL    5
5: date1   date2   2
7: date1   date2   5

Where I defined '5' as being accurate, and lower values less accurate.
You may want to use values with a wider spread; it'll allow more
fuzziness about how sure you are about a certain date.

Just my 2 centims.
-- 
Alban Hertroys
alban@magproductions.nl

magproductions b.v.
T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //
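[For the archive: the transformation table above can be sketched as a lookup. Function name and tuple layout are illustrative only; case 6 ('or') is excluded, as noted in the mail:]

```python
# Map a TMG qualifier code to Alban's (date A, date B, accuracy)
# triple, where 5 means accurate and lower values less accurate.
def to_range(qualifier, date1, date2=None):
    mapping = {
        0: (None,  date1, 5),   # before date1
        1: (date1, date1, 3),   # say date1
        2: (date1, date1, 2),   # ca. date1
        3: (date1, date1, 5),   # exactly date1
        4: (date1, None,  5),   # after date1
        5: (date1, date2, 2),   # between date1 and date2
        7: (date1, date2, 5),   # from date1 to date2
    }
    if qualifier == 6:
        raise ValueError("'(date1) or (date2)' doesn't fit a range model")
    return mapping[qualifier]
```

An open-ended "before 1784" then becomes a half-open range with a NULL begin date, which a DATE pair plus an accuracy column can store directly.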
On Thu, Jun 29, 2006 at 12:02:40PM +0200, Alban Hertroys wrote:
> This kind of reeks like a begin/end date and an accuracy quantifier,
> though that wouldn't account for option 6.
>
> Your cases 0 to 5 and 7 would be transformed into something like:
...
> Where I defined '5' as being accurate, and lower values less
> accurate. You may want to use values with a wider spread, it'll
> allow more fuzziness about how sure you are about a certain date.

Just for your information:

In our Python implementation of a fuzzy timestamp type we used
accuracy values ranging from 1 to 7, denoting the "precision" of a
complete timestamp definition:

7 - full subsecond accuracy (7 "digits" precision)
6 - seconds
5 - minutes
4 - hours
3 - days
2 - months
1 - years

Karsten
-- 
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
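[For the archive: one way to interpret such an accuracy value is to truncate a timestamp to the stated precision. A Python sketch under that assumption; this is not the actual class mentioned in the mail:]

```python
from datetime import datetime

# Truncate a datetime to accuracy levels 1..7 as listed above
# (1 = years only, ... 7 = keep full subsecond precision).
def truncate(ts, accuracy):
    fields = [ts.year, ts.month, ts.day, ts.hour,
              ts.minute, ts.second, ts.microsecond]
    # Fields beyond the stated accuracy fall back to their minimum
    # value: 1 for month/day, 0 for the time components.
    minima = [1, 1, 1, 0, 0, 0, 0]
    return datetime(*(fields[:accuracy] + minima[accuracy:]))
```

So accuracy 3 keeps the calendar date and zeroes the time; accuracy 1 keeps only the year.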
Karsten Hilbert wrote:
> Just for your information:
>
> In our Python implementation of a fuzzy timestamp type we
> used accuracy values ranging from 1 to 7 denoting the
> "precision" of a complete timestamp definition:
>
> 7 - full subsecond accuracy (7 "digits" precision)
> 6 - seconds
> 5 - minutes
> 4 - hours
> 3 - days
> 2 - months
> 1 - years

This is getting off topic, but let me add that IMO you chose the wrong
direction. Your timestamps won't get any more accurate than subseconds
(your highest value), but they could get less accurate. What are you
going to use for ranges larger than a year? 0? -1? -2? You locked
yourself out...
-- 
Alban Hertroys
alban@magproductions.nl

magproductions b.v.
T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //