Обсуждение: [GENERAL] is (not) distinct from

Поиск
Список
Период
Сортировка

[GENERAL] is (not) distinct from

От
Johann Spies
Дата:
When I query table a I get 18 rows.
The same query on table b results in 28 rows.

Both tables have the same structure.

When I export the results to csv-files and do a diff it confirms that all 18 rows from a are also in b. Table b has 10 new rows.

When I combine these queries and use "is (not) distinct from"
I get strange results:

with a as (select citing_article, cited_article, pubyear, year_cited,  cited_author, cited_title, cited_work, doi
from wos_2017_1.citation
where citing_article='abcdefg'
order by 3,4,5,6,8),
b as (
select citing_article, cited_article, pubyear, year_cited, cited_author, cited_title, cited_work, doi
from wos_2017_1.citationbackup
where citing_article='abcdefg'
order by 3,4,5,6,8)
select distinct b.* from b , a
where
( B.citing_article,
             B.cited_article,
             B.pubyear,
             B.year_cited,
             B.cited_author,
             B.cited_title,
             B.cited_work,
B.doi)
 is distinct from
        (A.citing_article,
             A.cited_article,
             A.pubyear,
             A.year_cited,
             A.cited_author,
             A.cited_title,
             A.cited_work, A.doi)

The result of this query is 28 rows - thus that of b. 
I expected this to be 10.

If I change the query to "is not distinct from" it results in 18 rows
which is what I would have expected.

Regards
Johann.
--
Because experiencing your loyal love is better than life itself,
my lips will praise you.  (Psalm 63:3)

Re: [GENERAL] is (not) distinct from

От
Adrian Klaver
Дата:
On 02/28/2017 12:08 AM, Johann Spies wrote:
> When I query table a I get 18 rows.
> The same query on table b results in 28 rows.
>
> Both tables have the same structure.
>
> When I export the results to csv-files and do a diff it confirms that
> all 18 rows from a are also in b. Table b has 10 new rows.
>
> When I combine these queries and use "is (not) distinct from"
> I get strange results:
>
> with a as (select citing_article, cited_article, pubyear, year_cited,
> cited_author, cited_title, cited_work, doi
> from wos_2017_1.citation
> where citing_article='abcdefg'
> order by 3,4,5,6,8),
> b as (
> select citing_article, cited_article, pubyear, year_cited, cited_author,
> cited_title, cited_work, doi
> from wos_2017_1.citationbackup
> where citing_article='abcdefg'
> order by 3,4,5,6,8)
> select distinct b.* from b , a
> where
> ( B.citing_article,
>              B.cited_article,
>              B.pubyear,
>              B.year_cited,
>              B.cited_author,
>              B.cited_title,
>              B.cited_work,
> B.doi)
>  is distinct from
>         (A.citing_article,
>              A.cited_article,
>              A.pubyear,
>              A.year_cited,
>              A.cited_author,
>              A.cited_title,
>              A.cited_work, A.doi)
>
> The result of this query is 28 rows - thus that of b.
> I expected this to be 10.
>
> If I change the query to "is not distinct from" it results in 18 rows
> which is what I would have expected.

I have not worked through all this but at first glance I suspect:

select distinct b.* from b ...

is distinct from ...

constitutes a double negative.

What happens if you eliminate the first distinct?


>
> Regards
> Johann.
> --
> Because experiencing your loyal love is better than life itself,
> my lips will praise you.  (Psalm 63:3)


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [GENERAL] is (not) distinct from

От
Johann Spies
Дата:
On 28 February 2017 at 17:06, Adrian Klaver <adrian.klaver@aklaver.com> wrote:

I have not worked through all this but at first glance I suspect:

select distinct b.* from b ...

is distinct from ...

constitutes a double negative.

What happens if you eliminate the first distinct?



Thanks Adrian,

The dynamics of the data has changed because of data updates so an exact comparison is not possible.

Other tests now confirm that the 28 records are identical in both tables.
The results then become more confusing:

If I remove the first distinct
and use "is distinct from" 

I get 756 rows

and when I use "is not distinct from"

I get 28.

In the first (756) case when I use "group by" the result of the first query is exactly the same as the second one.

Regards
Johann
--
Because experiencing your loyal love is better than life itself,
my lips will praise you.  (Psalm 63:3)

Re: [GENERAL] is (not) distinct from

От
Adrian Klaver
Дата:
On 03/01/2017 12:15 AM, Johann Spies wrote:
> On 28 February 2017 at 17:06, Adrian Klaver <adrian.klaver@aklaver.com
> <mailto:adrian.klaver@aklaver.com>> wrote:
>
>
>     I have not worked through all this but at first glance I suspect:
>
>     select distinct b.* from b ...
>
>     is distinct from ...
>
>     constitutes a double negative.
>
>     What happens if you eliminate the first distinct?
>
>
>
> Thanks Adrian,
>
> The dynamics of the data has changed because of data updates so an exact
> comparison is not possible.
>
> Other tests now confirm that the 28 records are identical in both tables.
> The results then become more confusing:
>
> If I remove the first distinct
> and use "is distinct from"
>
> I get 756 rows
>
> and when I use "is not distinct from"
>
> I get 28.
>
> In the first (756) case when I use "group by" the result of the first
> query is exactly the same as the second one.

To be clear you are looking for records in citation that are different
from citationbackup over a subset(Are there more fields?) of 8 fields,
correct?

What do those 8 fields represent?

Is citationbackup really a backup of citation?

Is there a Primary Key on either/both tables?

What are you grouping by?

Where I am going with this, is that it is not clear to me how you are
matching the two sets of records to determine whether they are different
or not. Your result that yields 756 rows indicates that the comparison
is not an apples to apples comparison, but a comparison of two
'shuffled' sets. Adding the group by seems to sort that out. So some
idea of what constitutes a difference and how you determine which
records from each table you want to match would be helpful. If you could
show the table schema and some sample data it would be even better.

>
> Regards
> Johann
> --
> Because experiencing your loyal love is better than life itself,
> my lips will praise you.  (Psalm 63:3)


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [GENERAL] is (not) distinct from

От
Tom Lane
Дата:
Adrian Klaver <adrian.klaver@aklaver.com> writes:
> Where I am going with this, is that it is not clear to me how you are
> matching the two sets of records to determine whether they are different
> or not.

He's not.  The query is forming the cartesian product of the two tables
and then dropping join rows where the tables match ... but every B row is
going to have multiple A rows where it doesn't match, and those join rows
will all survive the WHERE.  Then "select distinct" gets rid of the
duplicates, and since nothing from A is presented in the result, it's not
very obvious what's happening.

This is a great example of "select distinct" being used as a band-aid
over a fundamental misunderstanding of SQL.  It's good advice to never use
"distinct" unless you know exactly why your query is generating duplicate
rows in the first place.

            regards, tom lane


[GENERAL] DISTINCT vs GROUP BY - was Re: is (not) distinct from

От
George Neuner
Дата:
On Wed, 01 Mar 2017 11:12:29 -0500, Tom Lane <tgl@sss.pgh.pa.us>
wrote:

>This is a great example of "select distinct" being used as a band-aid
>over a fundamental misunderstanding of SQL.  It's good advice to never use
>"distinct" unless you know exactly why your query is generating duplicate
>rows in the first place.

On that note:

I know most people here don't pay much - or any - attention to
SQLServer, however there was an interesting article recently regarding
significant performance differences between DISTINCT and GROUP BY as
used to remove duplicates.

https://sqlperformance.com/2017/01/t-sql-queries/surprises-assumptions-group-by-distinct


Now I'm wondering if something similar might be lurking in Postgresql?

[Yeah, I know - test it and find out!

Thing is, the queries used in the article are not simple.  Although
not explicitly stated, it hints that - at least for SQLServer - a
simple case involving a string column is probably insufficient, and
complex scenarios are required to produce significant differences.
]


I'll get around to doing some testing soon.  For now, I am just asking
if anyone has ever run into something like this?

George

Re: [GENERAL] DISTINCT vs GROUP BY - was Re: is (not) distinct from

От
"Sven R. Kunze"
Дата:
On 03.03.2017 06:26, George Neuner wrote:
I know most people here don't pay much - or any - attention to
SQLServer, however there was an interesting article recently regarding
significant performance differences between DISTINCT and GROUP BY as
used to remove duplicates.

https://sqlperformance.com/2017/01/t-sql-queries/surprises-assumptions-group-by-distinct

On a similar note, this is also an interesting read about the topic of distinct vs group by:

https://blogs.oracle.com/developer/entry/counting_with_oracle_is_faster

Interesting is the performance difference between integers and strings for PostgreSQL which doesn't exist for Oracle.

I also tried rewriting "select distinct" to "select group by" using PostgreSQL. It didn't help; it was even worse (see appendix).

I'll get around to doing some testing soon.  For now, I am just asking
if anyone has ever run into something like this?

Yes, my team did. We use Django on a daily basis to generate SQL queries. In case of model-spanning queries, a lot of joining and duplications are involved. Distinct is the "generally" accepted way to remedy the situation but it's actually more like Tom said: distinct is a band-aid here. UNIONS and SUBSELECTs would be better I guess.


Sven


** Appendix **


>>>># \d docs
                         Table "public.docs"
 Column |  Type   |                     Modifiers                    
--------+---------+---------------------------------------------------
 id     | integer | not null default nextval('docs_id_seq'::regclass)
 meta   | jsonb   |
Indexes:
    "docs_pkey" PRIMARY KEY, btree (id)


>>>># explain analyze select count(distinct meta->>'blood_group') from docs;
                                                         QUERY PLAN                                                        
----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=760497.00..760497.01 rows=1 width=449) (actual time=37631.727..37631.727 rows=1 loops=1)
   ->  Seq Scan on docs  (cost=0.00..710497.00 rows=10000000 width=449) (actual time=0.500..3999.417 rows=10000000 loops=1)
 Planning time: 0.211 ms
 Execution time: 37631.829 ms
(4 rows)


>>>># explain analyze select count(*) from (select meta->>'blood_group' from docs group by meta->>'blood_group') as x;
                                                               QUERY PLAN                                                              
----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4441923.83..4441923.84 rows=1 width=0) (actual time=41189.472..41189.472 rows=1 loops=1)
   ->  Group  (cost=4241923.83..4316923.83 rows=10000000 width=449) (actual time=31303.690..41189.455 rows=8 loops=1)
         Group Key: ((docs.meta ->> 'blood_group'::text))
         ->  Sort  (cost=4241923.83..4266923.83 rows=10000000 width=449) (actual time=31303.686..40475.227 rows=10000000 loops=1)
               Sort Key: ((docs.meta ->> 'blood_group'::text))
               Sort Method: external merge  Disk: 129328kB
               ->  Seq Scan on docs  (cost=0.00..735497.00 rows=10000000 width=449) (actual time=0.349..6433.691 rows=10000000 loops=1)
 Planning time: 2.189 ms
 Execution time: 41203.669 ms
(9 rows)

Re: [GENERAL] DISTINCT vs GROUP BY - was Re: is (not) distinct from

От
David Rowley
Дата:
On 3 March 2017 at 18:26, George Neuner <gneuner2@comcast.net> wrote:
> I know most people here don't pay much - or any - attention to
> SQLServer, however there was an interesting article recently regarding
> significant performance differences between DISTINCT and GROUP BY as
> used to remove duplicates.
>
> https://sqlperformance.com/2017/01/t-sql-queries/surprises-assumptions-group-by-distinct
>
>
> Now I'm wondering if something similar might be lurking in Postgresql?

Yes things lurk there in PostgreSQL too. But to be honest I find the
examples in the URL you included a bit strange. There's almost
certainly going to be a table called "orders" that you'd use for the
outer part of the query. In that case the orderid would already be
unique.  To do the same in PostgreSQL you'd just use: select orderid,
string_agg(description,'|') from orderitems group by orderid; assuming
all orders had at least one line, you'd get the same result.

In more general terms, PostgreSQL will allow you to GROUP BY and
non-aggregated columns which are functionally dependent on the GROUP
BY clause, for example:

SELECT parts.partcode,parts.description,sum(sales.quantity) from sales
inner join parts on sales.partcode = parts.partcode GROUP BY
parts.partcode;

Assuming that parts.partcode is the PRIMARY KEY of parts, this query
is legal in PostgreSQL. In some other databases, and I believe SQL
Server might be one of them, you would have been forced to include
part.description in the GROUP BY clause. Since PostgreSQL 9.6, if
you'd have done the same with that, internally the database would
ignore the parts.description in the GROUP BY clause, as its smart
enough to know that including parts.description in the clause is not
going to change anything as the description is always the same for
each parts.partcode, and no two records can share the same partcode.

There's no such optimisation when it comes to DISTINCT. In PostgreSQL
as of today DISTINCT is a bit naive, and will just uniquify the
results on each column in the select clause.  Although quite possibly
the same optimisation could apply to DISTINCT too, just nobody has
thought to add it yet.

In short, the main difference is going to be the fewer columns you're
using to identify the groups the better. If you included all of the
columns in the GROUP BY clause as you put in the select list with the
DISTINCT query then in most cases the performance would be the same. I
believe the only exception to this is in regards to parallel query, as
currently only GROUP BYs may be parallelised, not DISTINCT.

Historically with older now unsupported versions of PostgreSQL (pre
8.4) you may have also preferred to use GROUP BY over DISTINCT as
GROUP BY could be implemented internally by sorting or hashing the
results, whereas DISTINCT used to only be implemented by Sorting the
results. Although this has long since been the case.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [GENERAL] DISTINCT vs GROUP BY - was Re: is (not) distinct from

От
George Neuner
Дата:
Hi David,

On Sat, 4 Mar 2017 02:32:48 +1300, David Rowley
<david.rowley@2ndquadrant.com> wrote:

>On 3 March 2017 at 18:26, George Neuner <gneuner2@comcast.net> wrote:
>> I know most people here don't pay much - or any - attention to
>> SQLServer, however there was an interesting article recently regarding
>> significant performance differences between DISTINCT and GROUP BY as
>> used to remove duplicates.
>>
>> https://sqlperformance.com/2017/01/t-sql-queries/surprises-assumptions-group-by-distinct
>>
>>
>> Now I'm wondering if something similar might be lurking in Postgresql?
>
>Yes things lurk there in PostgreSQL too. But to be honest I find the
>examples in the URL you included a bit strange. There's almost
>certainly going to be a table called "orders" that you'd use for the
>outer part of the query. In that case the orderid would already be
>unique.  To do the same in PostgreSQL you'd just use: select orderid,
>string_agg(description,'|') from orderitems group by orderid; assuming
>all orders had at least one line, you'd get the same result.


The author mentioned at the beginning that the simple queries:

  SELECT DISTINCT Description FROM Sales.OrderLines
  SELECT Description FROM Sales.OrderLines GROUP BY Description;

wouldn't display the subject behavior.

Of course, analyzing the much more complex queries is much more
difficult.  It begs the question: what actually is going on there?


But I don't use SQL Server ... my interest is in how Postgresql deals
with a similar situation.


>Assuming that parts.partcode is the PRIMARY KEY of parts, this query
>is legal in PostgreSQL. In some other databases, and I believe SQL
>Server might be one of them, you would have been forced to include
>part.description in the GROUP BY clause. Since PostgreSQL 9.6, if
>you'd have done the same with that, internally the database would
>ignore the parts.description in the GROUP BY clause, as its smart
>enough to know that including parts.description in the clause is not
>going to change anything as the description is always the same for
>each parts.partcode, and no two records can share the same partcode.
>
>There's no such optimisation when it comes to DISTINCT. In PostgreSQL
>as of today DISTINCT is a bit naive, and will just uniquify the
>results on each column in the select clause.  Although quite possibly
>the same optimisation could apply to DISTINCT too, just nobody has
>thought to add it yet.
>
>In short, the main difference is going to be the fewer columns you're
>using to identify the groups the better. If you included all of the
>columns in the GROUP BY clause as you put in the select list with the
>DISTINCT query then in most cases the performance would be the same. I
>believe the only exception to this is in regards to parallel query, as
>currently only GROUP BYs may be parallelised, not DISTINCT.
>
>Historically with older now unsupported versions of PostgreSQL (pre
>8.4) you may have also preferred to use GROUP BY over DISTINCT as
>GROUP BY could be implemented internally by sorting or hashing the
>results, whereas DISTINCT used to only be implemented by Sorting the
>results. Although this has long since been the case.


I often have occasion to use multiple mapping relations: e.g.,
  A{1}->B{N}
  C{1}->B{N}
together in a query where C is provided and I need to find the
corresponding A(s).  Frequently these queries result in the same A
being found multiple times.

Although the mapping tuples are small [usually just a pair of keys],
the number of rows in the mapping tables may be very large, and a
given query may need to join/work its way through several such
mappings.

Typically in such situations, I divide the query using CTEs and (try
to) minimize the volume of data at each step by filtering duplicates
from any results that might include them.

I have always used DISTINCT to filter duplication, reserving GROUP BY
for aggregations (counting, etc.).  But if I understand correctly, you
are saying that GROUP BY should be preferred even for the simpler use.


George

Re: [GENERAL] is (not) distinct from

От
Johann Spies
Дата:
Thanks (again Adrian) and Tom.

The situation was that I had a table with 731million records which I wanted to copy into a partitioned one where there was a unique constraint on the fields used in my query.

The "backup" table was the single one.

While inserting into the partitioned table from the backup one, several (about 120000) records failed to insert.

I wanted to find out which records were involved and found that some had "'" characters in the values which broke some of the functions used to do some calculations.. 

As there were fields that might have null values I have tried the "is not distinct from".

Both sides of the query had primary keys and I did not use group by.  That was why I used "distinct".

Anyhow in the end, I made some progress with a modified query:

where
s.citing_article = A.citing_article
   and
          s.cited_article !=  A.cited_article
   and
   s.pubyear is  distinct from A.pubyear
   and
   s.year_cited is distinct from A.year_cited
   and
   s.cited_author is distinct from A.cited_author
   and 
        regexp_replace(s.cited_title,  $$'$$, $$''$$,'g') is distinct from
        regexp_replace(A.cited_title,  $$'$$, $$''$$,'g')
   and 
        regexp_replace(s.cited_work,  $$'$$, $$''$$,'g') is distinct from regexp_replace(A.cited_work,   $$'$$, $$''$$,'g')
   and
        s.doi is distinct from A.doi

Regards.
Johann

--
Because experiencing your loyal love is better than life itself,
my lips will praise you.  (Psalm 63:3)

Re: [GENERAL] is (not) distinct from

От
Adrian Klaver
Дата:
On 03/07/2017 01:22 AM, Johann Spies wrote:
> Thanks (again Adrian) and Tom.
>
> The situation was that I had a table with 731million records which I
> wanted to copy into a partitioned one where there was a unique
> constraint on the fields used in my query.
>
> The "backup" table was the single one.
>
> While inserting into the partitioned table from the backup one, several
> (about 120000) records failed to insert.
>
> I wanted to find out which records were involved and found that some had
> "'" characters in the values which broke some of the functions used to
> do some calculations..
>
> As there were fields that might have null values I have tried the "is
> not distinct from".
>
> Both sides of the query had primary keys and I did not use group by.
> That was why I used "distinct".

Would it not be easier to use a LEFT JOIN between the original
table(backup) and the new table:

https://www.postgresql.org/docs/9.6/static/sql-select.html

join_type

"LEFT OUTER JOIN returns all rows in the qualified Cartesian product
(i.e., all combined rows that pass its join condition), plus one copy of
each row in the left-hand table for which there was no right-hand row
that passed the join condition. This left-hand row is extended to the
full width of the joined table by inserting null values for the
right-hand columns. Note that only the JOIN clause's own condition is
considered while deciding which rows have matches. Outer conditions are
applied afterwards."

So something like:

SELECT
    a.pk
FROM
    original_table AS a
LEFT JOIN                    --The OUTER is not required
    new_table AS b
ON
    a.pk = b.pk
WHERE
    b.pk IS NULL

That would show all the rows in the original table that where not
transferred over.

>
> Anyhow in the end, I made some progress with a modified query:
>
> where
> s.citing_article = A.citing_article
>    and
>           s.cited_article !=  A.cited_article
>    and
>    s.pubyear is  distinct from A.pubyear
>    and
>    s.year_cited is distinct from A.year_cited
>    and
>    s.cited_author is distinct from A.cited_author
>    and
>         regexp_replace(s.cited_title,  $$'$$, $$''$$,'g') is distinct from
>         regexp_replace(A.cited_title,  $$'$$, $$''$$,'g')
>    and
>         regexp_replace(s.cited_work,  $$'$$, $$''$$,'g') is distinct
> from regexp_replace(A.cited_work,   $$'$$, $$''$$,'g')
>    and
>         s.doi is distinct from A.doi
>
> Regards.
> Johann
>
> --
> Because experiencing your loyal love is better than life itself,
> my lips will praise you.  (Psalm 63:3)


--
Adrian Klaver
adrian.klaver@aklaver.com