Discussion: Large Rows


Large Rows

From: Lee Hachadoorian
Date:
I need some advice on storing/retrieving data in large rows. Invariably
someone points out that very long rows are probably poorly normalized,
but I have to deal with how to store a dataset which cannot be changed,
specifically the ~23,000 column US Census American Community Survey.

The Census releases these data in 117 "sequences" of < 256 columns (in
order to be read by spreadsheet applications with a 256 column limit). I
have previously stored each sequence in its own table, which is pretty
straightforward.
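
For concreteness, a stripped-down sketch of what one of those per-sequence tables looks like here (hypothetical schema, table, and column names, truncated to three data columns; the real tables have roughly 250 each):

CREATE SCHEMA acs;

CREATE TABLE acs.seq0001 (
    geoid      text PRIMARY KEY,  -- geographic identifier
    b08519_001 bigint,            -- estimate columns; the real table has ~250 more
    b08519_002 bigint,
    b08519_003 bigint
);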

My problem is that some of the demographic researchers I work with want
a one-table dump of the entire dataset. This would primarily be for data
transfer. This is of limited actual use in analysis, but nonetheless,
that's what we want to be able to do.

Now, I can't join all the sequences in one SQL query for export because
of the 1600 column limit. So based on previous list activity (Tom Lane:
Perhaps you could collapse multiple similar columns into an array
column?
http://archives.postgresql.org/pgsql-admin/2008-05/msg00211.php), I
decided to try to combine all the sequences into one table using array
columns. (This would actually make querying easier since the users
wouldn't have to constantly JOIN the sequences in their queries.) Next
problem: I run into the 8k row size limit once about half the columns
are populated. As far as I can understand, even though a row
theoretically supports a 1.6TB (!) row size, this only works for
TOASTable data types (primarily text?). The vast majority of the 23k
columns I'm storing are bigint.
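
A minimal sketch of the array version, building on the hypothetical sequence table above (one bigint[] column per sequence, indexed by position):

-- collapse one sequence's bigint columns into a single bigint[] column;
-- repeat for each sequence and join on geoid to get one wide table
CREATE TABLE acs.acs_arrays AS
SELECT geoid,
       ARRAY[b08519_001, b08519_002, b08519_003] AS seq0001
FROM acs.seq0001;

-- individual variables are then addressed by position within the sequence
SELECT geoid, seq0001[3] AS b08519_003 FROM acs.acs_arrays;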

Questions:

1) Is there any way to solve problem 1, which is to export the 23k
columns from the database as it is, with 117 linked tables?
2) Is there any way to store the data all in one row? If numeric types
are un-TOASTable, 23k columns will necessarily break the 8k limit even
if they were all smallint, correct?

Regards,
--Lee

--
Lee Hachadoorian
PhD, Earth & Environmental Sciences (Geography)
Research Associate, CUNY Center for Urban Research
http://freecity.commons.gc.cuny.edu


Re: Large Rows

From: David Johnston
Date:
On Oct 25, 2011, at 22:17, Lee Hachadoorian <lee.hachadoorian@gmail.com> wrote:

> [...]

You can brute-force a 23k-column CSV output file using a programming language, but if you need to keep it in a
database, the fact that we are talking about being over the column limit by a factor of twenty means you are basically
SOL with PostgreSQL.
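
For what it's worth, here is a sketch of one brute-force route that stays inside psql (hypothetical table names, one \copy per sequence; the 117 per-sequence files then get stitched together column-wise outside the database, since a single SELECT across all the sequences runs into the same 1600-column limit):

\copy (SELECT * FROM acs.seq0001 ORDER BY geoid) TO 'seq0001.csv' CSV HEADER
\copy (SELECT * FROM acs.seq0002 ORDER BY geoid) TO 'seq0002.csv' CSV HEADER
-- ...one line per sequence; export them all in the same row order, then
-- combine the files side by side with a script or command-line tool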

Even if such a table were possible, how it, in its entirety, would be useful is beyond me.

There are few things that cannot be changed, and this requirement is unlikely to be one of those things. Your problems
are more political than technical, and those are hard to give advice on in an e-mail.

If you need a technical solution, there may be another tool out there that can get you what you want, but stock
PostgreSQL isn't going to cut it.

Not having any idea what those 23k columns are doesn't help either; the census questionnaire isn't that big...

Instead of giving them what they think they want, talk to them and then try to provide what they actually need
given the limitations of your current toolset, or resolve to find a more suitable tool if the needs are valid but
cannot be met with the existing tools.

David J.

Re: Large Rows

From: Royce Ausburn
Date:
On 26/10/2011, at 1:17 PM, Lee Hachadoorian wrote:

> [...]


Perhaps hstore would help? http://www.postgresql.org/docs/9.0/static/hstore.html

I'm not sure if they're TOASTable, though.
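
Something like this, perhaps (untested sketch, hypothetical table and column names; hstore keys and values are text, so the counts would come back as strings and need a cast):

CREATE EXTENSION hstore;  -- 9.1+; on 9.0 run the contrib install script instead

-- fold a sequence's columns into one hstore keyed by variable name
CREATE TABLE acs.hs_seq0001 AS
SELECT geoid,
       hstore(ARRAY['b08519_001', 'b08519_002', 'b08519_003'],
              ARRAY[b08519_001::text, b08519_002::text, b08519_003::text]) AS seq0001
FROM acs.seq0001;

-- look one variable up by name and cast it back
SELECT geoid, (seq0001 -> 'b08519_001')::bigint AS b08519_001
FROM acs.hs_seq0001;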

--Royce


Re: Large Rows

From: Lee Hachadoorian
Date:

On 10/26/2011 12:31 AM, David Johnston wrote:
> On Oct 25, 2011, at 22:17, Lee Hachadoorian <lee.hachadoorian@gmail.com> wrote:
>
>> [...]
> You can brute-force a 23k-column CSV output file using a programming language, but if you need to keep it in a
> database, the fact that we are talking about being over the column limit by a factor of twenty means you are
> basically SOL with PostgreSQL.
>
> Even if such a table were possible, how it, in its entirety, would be useful is beyond me.
It's not, as no one would ever analyze all the variables at once. Doing
this with a programming language is probably the way to go. But am I
correct that using arrays to reduce the number of columns won't work
because numeric data types aren't TOASTable?
> There are few things that cannot be changed, and this requirement is unlikely to be one of those things. Your
> problems are more political than technical, and those are hard to give advice on in an e-mail.
>
> If you need a technical solution, there may be another tool out there that can get you what you want, but stock
> PostgreSQL isn't going to cut it.
>
> Not having any idea what those 23k columns are doesn't help either; the census questionnaire isn't that big...
The vast majority of the columns represent population counts. Sometimes
it might represent a dollar amount (income or contract rent, for
example). While a sample of individual questionnaires is released (the
microdata), this question concerns the summary files, where the
individual answers are categorized/bucketed and aggregated by various
geographies. So a cell might represent number of people in a county
(row) who commuted to work by bicycle (column). The number of columns grows
when various categories are crossed with each other. Table B08519 -
"MEANS OF TRANSPORTATION TO WORK BY WORKERS' EARNINGS IN THE PAST 12
MONTHS" - contains 6 transportation modes crossed by 8 income classes,
which comes to 63 columns once the subtotals are added in
((6 + 1) x (8 + 1) = 63). The complete list of variables is
available at
http://www2.census.gov/acs2009_5yr/summaryfile/Sequence_Number_and_Table_Number_Lookup.xls.

> Instead of giving them what they think they want, talk to them and then try to provide what they actually need
> given the limitations of your current toolset, or resolve to find a more suitable tool if the needs are valid but
> cannot be met with the existing tools.
>
> David J.
Regards,
--Lee

--
Lee Hachadoorian
PhD, Earth & Environmental Sciences (Geography)
Research Associate, CUNY Center for Urban Research
http://freecity.commons.gc.cuny.edu


Re: Large Rows

From: Simon Riggs
Date:
On Wed, Oct 26, 2011 at 3:17 AM, Lee Hachadoorian
<lee.hachadoorian@gmail.com> wrote:

> I decided
> to try to combine all the sequences into one table using array columns.
> (This would actually make querying easier since the users wouldn't have to
> constantly JOIN the sequences in their queries.) Next problem: I run into
> the 8k row size limit once about half the columns are populated. As far as I
> can understand, even though a row theoretically supports a 1.6TB (!) row
> size, this only works for TOASTable data types (primarily text?). The vast
> majority of the 23k columns I'm storing are bigint.

Arrays are toastable, so you are getting an error from another source.

create table array_example as
select array_fill(1010110101010101, ARRAY[100000], ARRAY[1])::bigint[]
as arraycol;
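
A quick way to see where that datum ends up (pg_column_size reports the stored, possibly compressed, size of the value; pg_total_relation_size includes the table's TOAST relation and indexes):

SELECT pg_column_size(arraycol)                AS stored_datum_bytes,
       pg_relation_size('array_example')       AS main_heap_bytes,
       pg_total_relation_size('array_example') AS heap_plus_toast_bytes
FROM array_example;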

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Large Rows

From: Simon Riggs
Date:
On Wed, Oct 26, 2011 at 2:57 PM, Lee Hachadoorian
<lee.hachadoorian@gmail.com> wrote:

> Interesting. Although your example of one, 100000-dimension array works,
> five hundred 2-dimension arrays does not work. I can do the SELECT, but the
> CREATE TABLE fails:
>
> ERROR: row is too big: size 9024, maximum size 8160
> SQL state: 54000
>
> David has already hit the nail on the head in terms of this being a
> "political" problem rather than a technology problem. I'm open to ideas, but
> I realize there might be no other answer than "No one in their right mind
> should do this."

No, this is a technology problem.

Toast pointers are 20 bytes per column, so with 500 columns that is
10000 bytes - which will not fit in one block.

If you wish to fit this in, then you should use a 2-dimensional array,
which will then be just 1 column and your data will fit.
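
A sketch of that layout, with hypothetical names: one rectangular bigint[][] per geography, sequences padded to a common width so the dimensions stay regular (multidimensional arrays must be rectangular, and NULL elements are allowed as padding):

CREATE TABLE acs_all (
    geoid text PRIMARY KEY,
    vals  bigint[][]          -- vals[sequence][position], e.g. 117 x 256
);

-- fetch sequence 17, variable 42 for one geography (geoid value is made up)
SELECT vals[17][42]
FROM acs_all
WHERE geoid = '36061';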

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Large Rows

From: Lee Hachadoorian
Date:


On Wed, Oct 26, 2011 at 10:15 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> [...]

Very useful to know. Thank you.

--Lee

--
Lee Hachadoorian
PhD, Earth & Environmental Sciences (Geography)
Research Associate, CUNY Center for Urban Research
http://freecity.commons.gc.cuny.edu/