Discussion: queries with lots of UNIONed relations

queries with lots of UNIONed relations

From:
Jon Nelson
Date:
I was recently asked to look into why one particular set of queries
was taking a long time. The queries are all of the same form.  They
select the UNION of a few
columns from around 100 tables.

The query in particular was taking some 7-8 minutes to run.

On a whim, I changed the query from this form:

SELECT a, b FROM FOO_a WHERE <conditions>
UNION
SELECT a,b FROM FOO_b WHERE <conditions>
....

to:

SELECT DISTINCT a,b FROM FOO_a WHERE <conditions>
UNION
SELECT DISTINCT a,b FROM FOO_b WHERE <conditions>
...

and the query time dropped to under a minute.

In the former case, the query plan was a bitmap heap scan for each
table. Then those results were Appended, Sorted, Uniqued, Sorted
again, and then returned.

In the latter, before Appending, each table's results were run through
HashAggregate.
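
For anyone who wants to reproduce the two plan shapes, here is a small
self-contained sketch; the table names, columns, and data below are
invented for illustration and are not the real schema:

-- Hypothetical miniature of the real setup (only two of the ~100 tables):
CREATE TABLE foo_a (a integer, b integer);
CREATE TABLE foo_b (a integer, b integer);
INSERT INTO foo_a SELECT g % 1000, g % 50 FROM generate_series(1, 150000) AS g;
INSERT INTO foo_b SELECT g % 1000, g % 50 FROM generate_series(1, 150000) AS g;
ANALYZE foo_a;
ANALYZE foo_b;

-- Original form: a single de-duplication over all of the appended rows.
EXPLAIN ANALYZE
SELECT a, b FROM foo_a
UNION
SELECT a, b FROM foo_b;

-- Rewritten form: each branch is de-duplicated first, so the final
-- de-duplication sees far fewer rows.
EXPLAIN ANALYZE
SELECT DISTINCT a, b FROM foo_a
UNION
SELECT DISTINCT a, b FROM foo_b;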

The total number of result rows is in the 500K range. Each table holds
approximately 150K matching rows (but this can vary a bit).

What I'm asking is this: since adding DISTINCT to each participating
member of the UNION query reduced the total number of appended rows,
is there some sort of heuristic that postgresql could use to do this
automatically?  The 12x speedup was quite nice.

--
Jon

Re: queries with lots of UNIONed relations

From:
Tom Lane
Date:
Jon Nelson <jnelson+pgsql@jamponi.net> writes:
> In the former case, the query plan was a bitmap heap scan for each
> table. Then those results were Appended, Sorted, Uniqued, Sorted
> again, and then returned.

> In the latter, before Appending, each table's results were run through
> HashAggregate.

Probably the reason it did that is that each individual de-duplication
looked like it would fit in work_mem, but a single de-duplication
didn't.  Consider raising work_mem, at least for this one query.

            regards, tom lane

Re: queries with lots of UNIONed relations

From:
Jon Nelson
Date:
On Thu, Jan 13, 2011 at 11:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jon Nelson <jnelson+pgsql@jamponi.net> writes:
>> In the former case, the query plan was a bitmap heap scan for each
>> table. Then those results were Appended, Sorted, Uniqued, Sorted
>> again, and then returned.
>
>> In the latter, before Appending, each table's results were run through
>> HashAggregate.
>
> Probably the reason it did that is that each individual de-duplication
> looked like it would fit in work_mem, but a single de-duplication
> didn't.  Consider raising work_mem, at least for this one query.

I raised work_mem to as high as 512MB (SET LOCAL work_mem = '512MB',
within the transaction).  Nice. Instead of 7-10 minutes the result is
now about a minute (the same as with individual de-duplication).
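
For reference, the pattern looks roughly like this (the WHERE clauses
below are placeholders for the real conditions):

BEGIN;
-- SET LOCAL affects only this transaction; the server-wide work_mem
-- setting is left alone.
SET LOCAL work_mem = '512MB';
SELECT a, b FROM foo_a WHERE b < 10   -- placeholder condition
UNION
SELECT a, b FROM foo_b WHERE b < 10;  -- placeholder condition
COMMIT;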

Your comment regarding "each individual de-duplication looked like it
would fit in work_mem" doesn't really make sense, exactly. Maybe I'm
misunderstanding you.

What I'm asking is this: can postgresql apply a de-duplication to each
member of a UNION (as I did with SELECT DISTINCT) in order to reduce
the total number of rows that need to be de-duplicated when all of the
rows have been Appended?

The results of the various plans/tweaks are:

Initial state: (work_mem = 16MB, no DISTINCT, run time of 7-10 minutes):
Unique (Sort (Append ( Lots of Bitmap Heap Scans Here ) ) )

and (work_mem = 16MB, with DISTINCT, run time of ~ 1 minute):
HashAggregate ( Append ( Lots Of HashAggregate( Bitmap Heap Scan ) ) )

and (work_mem = 64kB, DISTINCT, run time of *15+ minutes*):
Unique (Sort ( Append ( Lots Of HashAggregate( Bitmap Heap Scan ) ) ) )

So I take from this the following:

1. if the result set fits in work_mem, hash aggregate is wicked fast.
About 1 jillion times faster than Unique+Sort.

2. it would be nifty if postgresql could be taught, in a UNION, to
de-duplicate each contributory relation so as to reduce the total
set of rows that need to be re-de-duplicated. It's extra work, true,
and maybe there are some tricks here, but it seems to make a big
difference. This is useful because it keeps the total result set small
enough that hash aggregation might apply.

NOTE:

I have to have work_mem really low as a global on this machine because
other queries involving the same tables (such as those that involve
UNION ALL for SUM() or GROUP BY operations) cause the machine to run
out of memory. Indeed, even with work_mem at 1MB I run the machine out
of memory if I don't explicitly disable hashagg for some queries. Can
anything be done about that?
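
(For context, the per-query workaround I use today looks roughly like
this; the aggregate query itself is only a placeholder:)

BEGIN;
-- Keep the memory budget small and force the sort-based plan for this
-- one query, so hash aggregation cannot run the machine out of memory.
SET LOCAL work_mem = '1MB';
SET LOCAL enable_hashagg = off;
SELECT b, sum(a)
  FROM (SELECT a, b FROM foo_a
        UNION ALL
        SELECT a, b FROM foo_b) AS u
 GROUP BY b;
COMMIT;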


--
Jon

Re: queries with lots of UNIONed relations

From:
Tom Lane
Date:
Jon Nelson <jnelson+pgsql@jamponi.net> writes:
> Your comment regarding "each individual de-duplication looked like it
> would fit in work_mem" doesn't really make sense, exactly. Maybe I'm
> misunderstanding you.

Yeah.  What I was suggesting was to NOT add the DISTINCT's, but instead
raise work_mem high enough so you get just one HashAggregation step at
the top level.  (Although I think this only works in 8.4 and up.)
That should be faster than two levels of de-duplication.

            regards, tom lane

Re: queries with lots of UNIONed relations

From:
Jon Nelson
Date:
On Thu, Jan 13, 2011 at 2:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jon Nelson <jnelson+pgsql@jamponi.net> writes:
>> Your comment regarding "each individual de-duplication looked like it
>> would fit in work_mem" doesn't really make sense, exactly. Maybe I'm
>> misunderstanding you.
>
> Yeah.  What I was suggesting was to NOT add the DISTINCT's, but instead
> raise work_mem high enough so you get just one HashAggregation step at
> the top level.  (Although I think this only works in 8.4 and up.)
> That should be faster than two levels of de-duplication.

Gave it a try -- performance either way doesn't seem to change,
although the final set that has to undergo de-duplication is rather
larger (WITHOUT DISTINCT), so I still run the risk of not getting Hash
Aggregation.

Since having the DISTINCT doesn't seem to hurt, and it avoids
potentially significant pain, I'll keep it.

I still think that having UNION do de-duplication of each contributory
relation is a beneficial thing to consider -- especially if postgresql
thinks the uniqueness is not very high.

Thanks!

--
Jon

Re: queries with lots of UNIONed relations

From:
Robert Haas
Date:
On Thu, Jan 13, 2011 at 3:12 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
> I still think that having UNION do de-duplication of each contributory
> relation is a beneficial thing to consider -- especially if postgresql
> thinks the uniqueness is not very high.

This might be worth a TODO.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: queries with lots of UNIONed relations

From:
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jan 13, 2011 at 3:12 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
>> I still think that having UNION do de-duplication of each contributory
>> relation is a beneficial thing to consider -- especially if postgresql
>> thinks the uniqueness is not very high.

> This might be worth a TODO.

I don't believe there is any case where hashing each individual relation
is a win compared to hashing them all together.  If the optimizer were
smart enough to be considering the situation as a whole, it would always
do the latter.

            regards, tom lane

Re: queries with lots of UNIONed relations

From:
Robert Haas
Date:
On Thu, Jan 13, 2011 at 5:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Jan 13, 2011 at 3:12 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
>>> I still think that having UNION do de-duplication of each contributory
>>> relation is a beneficial thing to consider -- especially if postgresql
>>> thinks the uniqueness is not very high.
>
>> This might be worth a TODO.
>
> I don't believe there is any case where hashing each individual relation
> is a win compared to hashing them all together.  If the optimizer were
> smart enough to be considering the situation as a whole, it would always
> do the latter.

You might be right, but I'm not sure.  Suppose that there are 100
inheritance children, and each has 10,000 distinct values, but none of
them are common between the tables.  In that situation, de-duplicating
each individual table requires a hash table that can hold 10,000
entries.  But deduplicating everything at once requires a hash table
that can hold 1,000,000 entries.

Or am I all wet?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: queries with lots of UNIONed relations

From:
Robert Haas
Date:
On Thu, Jan 13, 2011 at 5:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jan 13, 2011 at 5:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> On Thu, Jan 13, 2011 at 3:12 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
>>>> I still think that having UNION do de-duplication of each contributory
>>>> relation is a beneficial thing to consider -- especially if postgresql
>>>> thinks the uniqueness is not very high.
>>
>>> This might be worth a TODO.
>>
>> I don't believe there is any case where hashing each individual relation
>> is a win compared to hashing them all together.  If the optimizer were
>> smart enough to be considering the situation as a whole, it would always
>> do the latter.
>
> You might be right, but I'm not sure.  Suppose that there are 100
> inheritance children, and each has 10,000 distinct values, but none of
> them are common between the tables.  In that situation, de-duplicating
> each individual table requires a hash table that can hold 10,000
> entries.  But deduplicating everything at once requires a hash table
> that can hold 1,000,000 entries.
>
> Or am I all wet?

Yeah, I'm all wet, because you'd still have to re-de-duplicate at the
end.  But then why did the OP get a speedup?  *scratches head*

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: queries with lots of UNIONed relations

From:
Andy Colson
Date:
On 1/13/2011 4:42 PM, Robert Haas wrote:
> On Thu, Jan 13, 2011 at 5:41 PM, Robert Haas<robertmhaas@gmail.com>  wrote:
>> On Thu, Jan 13, 2011 at 5:26 PM, Tom Lane<tgl@sss.pgh.pa.us>  wrote:
>>> Robert Haas<robertmhaas@gmail.com>  writes:
>>>> On Thu, Jan 13, 2011 at 3:12 PM, Jon Nelson<jnelson+pgsql@jamponi.net>  wrote:
>>>>> I still think that having UNION do de-duplication of each contributory
>>>>> relation is a beneficial thing to consider -- especially if postgresql
>>>>> thinks the uniqueness is not very high.
>>>
>>>> This might be worth a TODO.
>>>
>>> I don't believe there is any case where hashing each individual relation
>>> is a win compared to hashing them all together.  If the optimizer were
>>> smart enough to be considering the situation as a whole, it would always
>>> do the latter.
>>
>> You might be right, but I'm not sure.  Suppose that there are 100
>> inheritance children, and each has 10,000 distinct values, but none of
>> them are common between the tables.  In that situation, de-duplicating
>> each individual table requires a hash table that can hold 10,000
>> entries.  But deduplicating everything at once requires a hash table
>> that can hold 1,000,000 entries.
>>
>> Or am I all wet?
>
> Yeah, I'm all wet, because you'd still have to re-de-duplicate at the
> end.  But then why did the OP get a speedup?  *scratches head*
>

Because it all fit in memory and didn't swap to disk?

-Andy

Re: queries with lots of UNIONed relations

From:
Robert Haas
Date:
On Thu, Jan 13, 2011 at 5:47 PM, Andy Colson <andy@squeakycode.net> wrote:
>>>> I don't believe there is any case where hashing each individual relation
>>>> is a win compared to hashing them all together.  If the optimizer were
>>>> smart enough to be considering the situation as a whole, it would always
>>>> do the latter.
>>>
>>> You might be right, but I'm not sure.  Suppose that there are 100
>>> inheritance children, and each has 10,000 distinct values, but none of
>>> them are common between the tables.  In that situation, de-duplicating
>>> each individual table requires a hash table that can hold 10,000
>>> entries.  But deduplicating everything at once requires a hash table
>>> that can hold 1,000,000 entries.
>>>
>>> Or am I all wet?
>>
>> Yeah, I'm all wet, because you'd still have to re-de-duplicate at the
>> end.  But then why did the OP get a speedup?  *scratches head*
>
> Because it all fit in memory and didn't swap to disk?

Doesn't make sense.  The re-de-duplication at the end should use the
same amount of memory regardless of whether the individual relations
have already been de-duplicated.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: queries with lots of UNIONed relations

From:
Andy Colson
Date:
On 1/13/2011 4:49 PM, Robert Haas wrote:
> On Thu, Jan 13, 2011 at 5:47 PM, Andy Colson<andy@squeakycode.net>  wrote:
>>>>> I don't believe there is any case where hashing each individual relation
>>>>> is a win compared to hashing them all together.  If the optimizer were
>>>>> smart enough to be considering the situation as a whole, it would always
>>>>> do the latter.
>>>>
>>>> You might be right, but I'm not sure.  Suppose that there are 100
>>>> inheritance children, and each has 10,000 distinct values, but none of
>>>> them are common between the tables.  In that situation, de-duplicating
>>>> each individual table requires a hash table that can hold 10,000
>>>> entries.  But deduplicating everything at once requires a hash table
>>>> that can hold 1,000,000 entries.
>>>>
>>>> Or am I all wet?
>>>
>>> Yeah, I'm all wet, because you'd still have to re-de-duplicate at the
>>> end.  But then why did the OP get a speedup?  *scratches head*
>>
>> Because it all fit in memory and didn't swap to disk?
>
> Doesn't make sense.  The re-de-duplication at the end should use the
> same amount of memory regardless of whether the individual relations
> have already been de-duplicated.
>

Unless I missed something in the thread:

distinctList + distinctList + ... -> [fit in mem] -> last distinct -> [fit in mem]

vs.

fullList + fullList + ... -> [swapped to disk] -> last distinct -> [fit in mem]


-Andy

Re: queries with lots of UNIONed relations

From:
Jon Nelson
Date:
On Thu, Jan 13, 2011 at 4:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jan 13, 2011 at 5:47 PM, Andy Colson <andy@squeakycode.net> wrote:
>>>>> I don't believe there is any case where hashing each individual relation
>>>>> is a win compared to hashing them all together.  If the optimizer were
>>>>> smart enough to be considering the situation as a whole, it would always
>>>>> do the latter.
>>>>
>>>> You might be right, but I'm not sure.  Suppose that there are 100
>>>> inheritance children, and each has 10,000 distinct values, but none of
>>>> them are common between the tables.  In that situation, de-duplicating
>>>> each individual table requires a hash table that can hold 10,000
>>>> entries.  But deduplicating everything at once requires a hash table
>>>> that can hold 1,000,000 entries.
>>>>
>>>> Or am I all wet?
>>>
>>> Yeah, I'm all wet, because you'd still have to re-de-duplicate at the
>>> end.  But then why did the OP get a speedup?  *scratches head*
>>
>> Because it all fit in memory and didn't swap to disk?
>
> Doesn't make sense.  The re-de-duplication at the end should use the
> same amount of memory regardless of whether the individual relations
> have already been de-duplicated.

I don't believe that to be true.
Assume 100 tables each of which produces 10,000 rows from this query.
Furthermore, let's assume that there are 3,000 duplicates per table.

Without DISTINCT:
uniqify (100 * 10,000 = 1,000,000 rows)

With DISTINCT:
uniqify (100 * (10,000 - 3,000) = 700,000 rows)

300,000 rows times (say, 64 bytes/row) = 18.75MB.
Not a lot, but more than the work_mem of 16MB.

Or maybe *I'm* all wet?

--
Jon

Re: queries with lots of UNIONed relations

From:
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jan 13, 2011 at 5:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I don't believe there is any case where hashing each individual relation
>> is a win compared to hashing them all together.  If the optimizer were
>> smart enough to be considering the situation as a whole, it would always
>> do the latter.

> You might be right, but I'm not sure.  Suppose that there are 100
> inheritance children, and each has 10,000 distinct values, but none of
> them are common between the tables.  In that situation, de-duplicating
> each individual table requires a hash table that can hold 10,000
> entries.  But deduplicating everything at once requires a hash table
> that can hold 1,000,000 entries.

> Or am I all wet?

If you have enough memory to de-dup them individually, you surely have
enough to de-dup all at once.  It is not possible for a single hashtable
to have worse memory consumption than N hashtables followed by a union
hashtable, and in fact if there are no common values then the latter eats
twice as much space because every value appears twice in two different
hashtables.

            regards, tom lane

Re: queries with lots of UNIONed relations

From:
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Yeah, I'm all wet, because you'd still have to re-de-duplicate at the
> end.  But then why did the OP get a speedup?  *scratches head*

He was reporting that 2 levels of hashing was faster than sort+uniq
(with the sorts swapping to disk, no doubt).  One level of hashing
should be faster yet, but maybe not by enough to be obvious as long
as you don't start to swap.

            regards, tom lane

Re: queries with lots of UNIONed relations

From:
Jon Nelson
Date:
On Thu, Jan 13, 2011 at 5:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Jan 13, 2011 at 5:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I don't believe there is any case where hashing each individual relation
>>> is a win compared to hashing them all together.  If the optimizer were
>>> smart enough to be considering the situation as a whole, it would always
>>> do the latter.
>
>> You might be right, but I'm not sure.  Suppose that there are 100
>> inheritance children, and each has 10,000 distinct values, but none of
>> them are common between the tables.  In that situation, de-duplicating
>> each individual table requires a hash table that can hold 10,000
>> entries.  But deduplicating everything at once requires a hash table
>> that can hold 1,000,000 entries.
>
>> Or am I all wet?
>
> If you have enough memory to de-dup them individually, you surely have
> enough to de-dup all at once.  It is not possible for a single hashtable
> to have worse memory consumption than N hashtables followed by a union
> hashtable, and in fact if there are no common values then the latter eats
> twice as much space because every value appears twice in two different
> hashtables.

If everything were available up-front, sure.
However (and please correct me if I'm wrong), doesn't postgresql
work in a fairly linear fashion, moving from table to table and
performing a series of operations on each? That seems to be what
the plan indicates:

Compare:

for each table LOOP
  scan table for result rows, append to results
END LOOP
hash / sort + unique results

versus:

for each table LOOP
  scan table for result rows, append to table-results
  hash / sort+unique table-results, append to results
END LOOP
hash / sort + unique results

In the former case, all of the result rows from all tables are
appended together before the de-duplication process can start.

In the latter case, only enough memory for each table's result set is
necessary for de-duplication, and it would only need to be allocated
for that table.

Is that not how this works?

--
Jon

Re: queries with lots of UNIONed relations

From:
Tom Lane
Date:
Jon Nelson <jnelson+pgsql@jamponi.net> writes:
> On Thu, Jan 13, 2011 at 5:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> If you have enough memory to de-dup them individually, you surely have
>> enough to de-dup all at once.

> If everything were available up-front, sure.
> However, and please correct me if I'm wrong, but doesn't postgresql
> work in a fairly linear fashion, moving from table to table performing
> a series of operations on each?

Doing a single sort+uniq works like that.  But the alternate plan you
are proposing we should consider involves building all the lower
hashtables, and then reading from them to fill the upper hashtable.
Max memory consumption *is* worst case here.  Remember HashAggregate
is incapable of swapping to disk (and if it did, you wouldn't be nearly
as pleased with its performance).

            regards, tom lane

Re: queries with lots of UNIONed relations

From:
Mladen Gogala
Date:
On 1/13/2011 5:41 PM, Robert Haas wrote:
> You might be right, but I'm not sure.  Suppose that there are 100
> inheritance children, and each has 10,000 distinct values, but none of
> them are common between the tables.  In that situation, de-duplicating
> each individual table requires a hash table that can hold 10,000
> entries.  But deduplicating everything at once requires a hash table
> that can hold 1,000,000 entries.
>
> Or am I all wet?
>

Have you considered using Google's map-reduce framework for things like
that? Union and group functions look like  ideal candidates for such a
thing.  I am not sure whether map-reduce can be married to a relational
database, but I must say that I was impressed with the speed of MongoDB.
I am not suggesting that PostgreSQL should sacrifice its ACID compliance
for speed, but Mongo sure does look like a speeding bullet.
On the other hand, the algorithms that have been parallelized for a long
time are precisely the sort/merge and hash algorithms used for union and
group-by operations. This is what I have in mind:
http://labs.google.com/papers/mapreduce.html

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com


Re: queries with lots of UNIONed relations

From:
Vitalii Tymchyshyn
Date:
On 14.01.11 00:26, Tom Lane wrote:
> Robert Haas<robertmhaas@gmail.com>  writes:
>> On Thu, Jan 13, 2011 at 3:12 PM, Jon Nelson<jnelson+pgsql@jamponi.net>  wrote:
>>> I still think that having UNION do de-duplication of each contributory
>>> relation is a beneficial thing to consider -- especially if postgresql
>>> thinks the uniqueness is not very high.
>> This might be worth a TODO.
> I don't believe there is any case where hashing each individual relation
> is a win compared to hashing them all together.  If the optimizer were
> smart enough to be considering the situation as a whole, it would always
> do the latter.
>
>
How about cases where the individual relations are already sorted? That
would mean they can be deduplicated quickly and in a streaming manner.
Even a partial sort order may help, because only groups with equal values
in the sorted fields need to be deduplicated; that takes much less memory
and is much more streaming-friendly. And if all the individual
deduplications are streaming and sorted the same way, you can simply do a
merge on top.
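
(A user-level approximation of the idea, reusing the hypothetical
foo_a/foo_b tables sketched earlier in the thread: with an index on the
selected columns the planner at least has the option of producing each
branch's DISTINCT output in sorted order, so the top-level
de-duplication works on small, ordered inputs.)

-- Sketch only; whether the planner actually chooses sorted per-branch
-- de-duplication depends on costs and server version.
CREATE INDEX foo_a_ab_idx ON foo_a (a, b);
CREATE INDEX foo_b_ab_idx ON foo_b (a, b);
EXPLAIN
SELECT DISTINCT a, b FROM foo_a
UNION
SELECT DISTINCT a, b FROM foo_b;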

Best regards, Vitalii Tymchyshyn.


Re: queries with lots of UNIONed relations

From:
Jon Nelson
Date:
On Thu, Jan 13, 2011 at 6:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jon Nelson <jnelson+pgsql@jamponi.net> writes:
>> On Thu, Jan 13, 2011 at 5:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> If you have enough memory to de-dup them individually, you surely have
>>> enough to de-dup all at once.
>
>> If everything were available up-front, sure.
>> However, and please correct me if I'm wrong, but doesn't postgresql
>> work in a fairly linear fashion, moving from table to table performing
>> a series of operations on each?
>
> Doing a single sort+uniq works like that.  But the alternate plan you
> are proposing we should consider involves building all the lower
> hashtables, and then reading from them to fill the upper hashtable.
> Max memory consumption *is* worst case here.  Remember HashAggregate
> is incapable of swapping to disk (and if it did, you wouldn't be nearly
> as pleased with its performance).

That's not exactly what I'm proposing -- but that is probably due to a
lack of understanding of some of the underlying details of how
postgresql works. I guess I had assumed that the result of a
HashAggregate or any other de-duplication process was a table-like
structure.

Regarding being pleased with hash aggregate - I am! - except when it
goes crazy and eats all of the memory in the machine. I'd trade a bit
of performance loss for not using up all of the memory and crashing.

However, maybe I'm misunderstanding how SELECT DISTINCT works internally.
In the case where a hashtable is used, does postgresql utilize
table-like structure or does it remain a hashtable in memory?

If it's a hashtable, couldn't the hashtable be built on-the-go rather
than only after all of the underlying tuples are available?

I'd love a bit more explanation as to how this works.

Another example of where this might be useful: I'm currently running
a SELECT DISTINCT query over some 500 million rows (120 contributory
tables). I expect a de-duplicated row count of well under 10% of that
500 million, probably below 1%. The plan as it stands is to execute a
series of sequential scans, appending the rows from each contributory
table and then aggregating them. If the expected distinctness of each
contributory subquery is, say, 5%, then instead of aggregating over
500 million tuples the aggregation would take place over 25 million.
In this case, that is a savings of approximately 10 gigabytes.

Yes, it's true, the same amount of data has to be scanned. However,
the amount of data that needs to be stored (in memory or on disk) in
order to provide a final de-duplication is much smaller.

--
Jon

Re: queries with lots of UNIONed relations

From:
Jon Nelson
Date:
On Fri, Jan 14, 2011 at 2:11 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
> On Thu, Jan 13, 2011 at 6:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Jon Nelson <jnelson+pgsql@jamponi.net> writes:
>>> On Thu, Jan 13, 2011 at 5:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> If you have enough memory to de-dup them individually, you surely have
>>>> enough to de-dup all at once.
>>
>>> If everything were available up-front, sure.
>>> However, and please correct me if I'm wrong, but doesn't postgresql
>>> work in a fairly linear fashion, moving from table to table performing
>>> a series of operations on each?
>>
>> Doing a single sort+uniq works like that.  But the alternate plan you
>> are proposing we should consider involves building all the lower
>> hashtables, and then reading from them to fill the upper hashtable.
>> Max memory consumption *is* worst case here.  Remember HashAggregate
>> is incapable of swapping to disk (and if it did, you wouldn't be nearly
>> as pleased with its performance).
>
> That's not exactly what I'm proposing - but it is probably due to a
> lack of understanding some of the underlying details of how postgresql
> works. I guess I had assumed that the result of a HashAggregate or any
> other de-duplication process was a table-like structure.

And I assumed wrong, I think. I dug into the code (nodeHash.c and
others) and I think I understand now why HashAggregate works only in
certain circumstances, and I think I understand your comments a bit
better now.  Basically, HashAggregate doesn't stream unique Nodes the
way nodeUnique.c does. nodeUnique basically emits Nodes and elides
subsequent, identical Nodes, which is why it relies on the input being
sorted.  HashAggregate works only on entire input sets at once, and
nodeHash.c doesn't emit Nodes at all, really.

This makes me wonder if nodeHash.c and nodeHashjoin.c couldn't be
modified to output Nodes in a streaming fashion. The memory
requirements would not be any worse than now.

Does postgresql support any sort of merge sort?  If it did, then if
the hashtable started consuming too much memory, it could be cleared
and the nodes output from the new hashtable could be directed to
another temporary file, and then a merge sort could be performed on
all of the temporary files (and thus Unique could be used to effect
the UNION operation).

--
Jon