Обсуждение: STRING_AGG and GROUP BY

Поиск
Список
Период
Сортировка

STRING_AGG and GROUP BY

От
Alexander Farber
Дата:
Good afternoon,

I have prepared an SQL Fiddle for my question: http://sqlfiddle.com/#!17/4ef8b/2

Here are my 4 test tables:

CREATE TABLE players (
    uid SERIAL PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE games (
    gid SERIAL PRIMARY KEY,
    player1 integer NOT NULL REFERENCES players ON DELETE CASCADE,
    player2 integer NOT NULL REFERENCES players ON DELETE CASCADE
);

CREATE TABLE moves (
    mid BIGSERIAL PRIMARY KEY,
    uid integer NOT NULL REFERENCES players ON DELETE CASCADE,
    gid integer NOT NULL REFERENCES games ON DELETE CASCADE,
    played timestamptz NOT NULL,
    tiles jsonb NOT NULL
);

CREATE TABLE scores (
    mid     bigint  NOT NULL REFERENCES moves ON DELETE CASCADE,
    uid     integer NOT NULL REFERENCES players ON DELETE CASCADE,
    gid     integer NOT NULL REFERENCES games ON DELETE CASCADE,
    word    text    NOT NULL CHECK(word ~ '^[A-Z]{2,}$'),
    score   integer NOT NULL CHECK(score >= 0)
);

Here they are filled with test data (two players Alice and Bob interchangeably performing moves in game #1):

INSERT INTO players (name) VALUES ('Alice'), ('Bob');
INSERT INTO games (player1, player2) VALUES (1, 2);

INSERT INTO moves (uid, gid, played, tiles) VALUES
(1, 1, now() + interval '1 min', '[{"col": 7, "row": 12, "value": 3, "letter": "A"}, {"col": 8, "row": 12, "value": 10, "letter": "A"}, {"col": 9, "row": 12, "value": 1, "letter": "C"}, {"col": 10, "row": 12, "value": 2, "letter": "D"}]
'::jsonb),
(2, 1, now() + interval '2 min', '[{"col": 7, "row": 12, "value": 3, "letter": "X"}, {"col": 8, "row": 12, "value": 10, "letter": "X"}, {"col": 9, "row": 12, "value": 1, "letter": "Z"}]
'::jsonb),
(1, 1, now() + interval '3 min', '[{"col": 7, "row": 12, "value": 3, "letter": "K"}, {"col": 8, "row": 12, "value": 10, "letter": "K"}, {"col": 9, "row": 12, "value": 1, "letter": "M"}, {"col": 10, "row": 12, "value": 2, "letter": "N"}]
'::jsonb),
(2, 1, now() + interval '4 min', '[]'::jsonb),
(1, 1, now() + interval '5 min', '[{"col": 7, "row": 12, "value": 3, "letter": "A"}, {"col": 8, "row": 12, "value": 10, "letter": "B"}, {"col": 9, "row": 12, "value": 1, "letter": "C"}, {"col": 10, "row": 12, "value": 2, "letter": "D"}]
'::jsonb),
(2, 1, now() + interval '6 min', '[{"col": 7, "row": 12, "value": 3, "letter": "P"}, {"col": 8, "row": 12, "value": 10, "letter": "P"}]
'::jsonb);

INSERT INTO scores (mid, uid, gid, word, score) VALUES
(1, 1, 1, 'AACD', 40),
(2, 2, 1, 'XXZ', 30),
(2, 2, 1, 'XAB', 30),
(3, 1, 1, 'KKMN', 40),
(3, 1, 1, 'KYZ', 30),
(5, 1, 1, 'ABCD', 40),
(6, 2, 1, 'PP', 20),
(6, 2, 1, 'PABCD', 50);

For a PHP-script which would display all moves+words+scores played in a certain game I am trying:

SELECT
    mid,
    STRING_AGG(x->>'letter', '') AS tiles,
    STRING_AGG(DISTINCT y, ', ') AS words
FROM (
    SELECT
        mid,
        JSONB_ARRAY_ELEMENTS(m.tiles) AS x,
        FORMAT('%s (%s)', s.word, s.score) AS y
    FROM moves m
    LEFT JOIN scores s
    USING (mid)
    WHERE m.gid = 1
) AS z
GROUP BY mid;

Which produces a slightly wrong result (the played letters are duplicated):

mid   tiles                    words
1     AACD                 AACD (40)
2     XXZXXZ             XAB (30), XXZ (30)
3     KKMNKKMN      KKMN (40), KYZ (30)
5     ABCD                 ABCD (40)
6     PPPP                 PABCD (50), PP (20)

My expected result would actually be:

mid   tiles                    words
1     AACD                 AACD (40)
2     XXZ                    XAB (30), XXZ (30)
3     KKMN                KKMN (40), KYZ (30)
5     ABCD                 ABCD (40)
6     PP                      PABCD (50), PP (20)

Is that actually possible by the means of SQL or should I do it in the PHP script?

And I can not change the query to: STRING_AGG(DISTINCT x->>'letter', '') AS tiles,

Because for example in the last move with mid=6 the player Bob had played 2 tiles, both with letter-value "P" and has formed 2 words (PP and PABCD), but adding distinct would suggest he played a single tile "P".

Thank you
Alex





Re: STRING_AGG and GROUP BY

От
"David G. Johnston"
Дата:
On Fri, Mar 16, 2018 at 7:17 AM, Alexander Farber <alexander.farber@gmail.com> wrote:
And I can not change the query to: STRING_AGG(DISTINCT x->>'letter', '') AS tiles,

Because for example in the last move with mid=6 the player Bob had played 2 tiles, both with letter-value "P" and has formed 2 words (PP and PABCD), but adding distinct would suggest he played a single tile "P".

​First reaction is to ARRAY_AGG(DISTINCT x) and then write a function that converts ​that array into a string by extracting 'letter' from each cell in the array.

Thinking it over a bit you have two columns that both are aggregates but that are otherwise independent of each other.  Since they are independent they cannot be aggregated at the same time.  You need to write a two subqueries, either in the target list or as separate from/join items, and then join the already aggregated queries together on their common group by column.

The presence of DISTINCT here (and, IMO, generally), even if it worked, would be an indicator that something is not quite right.

David J.


Re: STRING_AGG and GROUP BY

От
Alexander Farber
Дата:
Hi David -

On Fri, Mar 16, 2018 at 4:40 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:
​First reaction is to ARRAY_AGG(DISTINCT x) and then write a function that converts ​that array into a string by extracting 'letter' from each cell in the array.

Thinking it over a bit you have two columns that both are aggregates but that are otherwise independent of each other.  Since they are independent they cannot be aggregated at the same time.  You need to write a two subqueries, either in the target list or as separate from/join items, and then join the already aggregated queries together on their common group by column.

The presence of DISTINCT here (and, IMO, generally), even if it worked, would be an indicator that something is not quite right.


thank you for confirming my feeling that DISTINCT is a bad indicator here...

But you say that "tiles" and  "word (score)" are unrelated and this does not seem true to me:

For each move id aka "mid" there is a JSON value, describing how the player played the letter tiles.
And for the same "mid" there is a list of one or more "word (score)"s achieved...

Maybe it is possible to GROUP BY tiles somehow or some kind of special SQL JOIN?

Regards
Alex

Re: STRING_AGG and GROUP BY

От
"David G. Johnston"
Дата:
On Fri, Mar 16, 2018 at 9:10 AM, Alexander Farber <alexander.farber@gmail.com> wrote:

But you say that "tiles" and  "word (score)" are unrelated and this does not seem true to me:

For each move id aka "mid" there is a JSON value, describing how the player played the letter tiles.
And for the same "mid" there is a list of one or more "word (score)"s achieved...

​For each mid you want to know all tiles played and all word scores achieved - but you want to forget/ignore that a given tile achieved a given word score.  IOW, you are intentionally forgetting/ignoring the fact that the tiles and the corresponding word scores are related to each other, beyond the simple/incomplete relationship that both share the same mid.  You thus need to write a query that only relates tiles and word scores to mid and not to each other.


Maybe it is possible to GROUP BY tiles somehow or some kind of special SQL JOIN?


​I do not know if the tables or columns below match your model but the concept should still come across intact.​

​SELECT mid,
(SELECT string_agg(tiles, '; ') FROM tiles WHERE ta.mid = moves.mid) AS mid_tiles,
(SELECT string_agg(words, '; ') FROM words WHERE words.mid = moves.mid) AS mid_words
FROM moves​

There are other ways to write that that could perform better but the idea holds.

David J.

Re: STRING_AGG and GROUP BY

От
Alexander Farber
Дата:
Thank you, David -

On Fri, Mar 16, 2018 at 5:40 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:
On Fri, Mar 16, 2018 at 9:10 AM, Alexander Farber <alexander.farber@gmail.com> wrote:
​SELECT mid,
(SELECT string_agg(tiles, '; ') FROM tiles WHERE ta.mid = moves.mid) AS mid_tiles,
(SELECT string_agg(words, '; ') FROM words WHERE words.mid = moves.mid) AS mid_words
FROM moves​

There are other ways to write that that could perform better but the idea holds.


I've come up with the following query, wonder if you meant something similar -


WITH cte1 AS (
SELECT 
    mid,
    STRING_AGG(x->>'letter', '') AS tiles
FROM (
        SELECT 
            mid,
            JSONB_ARRAY_ELEMENTS(m.tiles) AS x
        FROM moves m
        WHERE m.gid = 1
) AS z
GROUP BY mid),
cte2 AS (
        SELECT 
        mid,
        STRING_AGG(y, ', ') AS words
    FROM (
        SELECT 
            mid,
            FORMAT('%s (%s)', s.word, s.score) AS y
        FROM scores s
        WHERE s.gid = 1
) AS z
GROUP BY mid)
SELECT mid, tiles, words 
FROM cte1 JOIN cte2 using (mid) ORDER BY mid ASC;

Regards
Alex

P.S. Below is the complete test data in case SQL Fiddle link stops working:

CREATE TABLE players (
    uid SERIAL PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE games (
    gid SERIAL PRIMARY KEY,
    player1 integer NOT NULL REFERENCES players ON DELETE CASCADE,
    player2 integer NOT NULL REFERENCES players ON DELETE CASCADE
);

CREATE TABLE moves (
    mid BIGSERIAL PRIMARY KEY,
    uid integer NOT NULL REFERENCES players ON DELETE CASCADE,
    gid integer NOT NULL REFERENCES games ON DELETE CASCADE,
    played timestamptz NOT NULL,
    tiles jsonb NOT NULL
);

CREATE TABLE scores (
    mid     bigint  NOT NULL REFERENCES moves ON DELETE CASCADE,
    uid     integer NOT NULL REFERENCES players ON DELETE CASCADE,
    gid     integer NOT NULL REFERENCES games ON DELETE CASCADE,
    word    text    NOT NULL CHECK(word ~ '^[A-Z]{2,}$'),
    score   integer NOT NULL CHECK(score >= 0)
);

INSERT INTO players (name) VALUES ('Alice'), ('Bob');
INSERT INTO games (player1, player2) VALUES (1, 2);

INSERT INTO moves (uid, gid, played, tiles) VALUES
(1, 1, now() + interval '1 min', '[{"col": 7, "row": 12, "value": 3, "letter": "A"}, {"col": 8, "row": 12, "value": 10, "letter": "A"}, {"col": 9, "row": 12, "value": 1, "letter": "C"}, {"col": 10, "row": 12, "value": 2, "letter": "D"}]
'::jsonb), 
(2, 1, now() + interval '2 min', '[{"col": 7, "row": 12, "value": 3, "letter": "X"}, {"col": 8, "row": 12, "value": 10, "letter": "X"}, {"col": 9, "row": 12, "value": 1, "letter": "Z"}]
'::jsonb), 
(1, 1, now() + interval '3 min', '[{"col": 7, "row": 12, "value": 3, "letter": "K"}, {"col": 8, "row": 12, "value": 10, "letter": "K"}, {"col": 9, "row": 12, "value": 1, "letter": "M"}, {"col": 10, "row": 12, "value": 2, "letter": "N"}]
'::jsonb), 
(2, 1, now() + interval '4 min', '[]'::jsonb), 
(1, 1, now() + interval '5 min', '[{"col": 7, "row": 12, "value": 3, "letter": "A"}, {"col": 8, "row": 12, "value": 10, "letter": "B"}, {"col": 9, "row": 12, "value": 1, "letter": "C"}, {"col": 10, "row": 12, "value": 2, "letter": "D"}]
'::jsonb), 
(2, 1, now() + interval '6 min', '[{"col": 7, "row": 12, "value": 3, "letter": "P"}, {"col": 8, "row": 12, "value": 10, "letter": "Q"}]
'::jsonb);

INSERT INTO scores (mid, uid, gid, word, score) VALUES
(1, 1, 1, 'AACD', 40),
(2, 2, 1, 'XXZ', 30),
(2, 2, 1, 'XAB', 30),
(3, 1, 1, 'KKMN', 40),
(3, 1, 1, 'KYZ', 30),
(5, 1, 1, 'ABCD', 40),
(6, 2, 1, 'PQ', 20),
(6, 2, 1, 'PABCD', 50);
 

Re: STRING_AGG and GROUP BY

От
"David G. Johnston"
Дата:
On Mon, Mar 19, 2018 at 1:54 PM, Alexander Farber <alexander.farber@gmail.com> wrote:

I've come up with the following query, wonder if you meant something similar -


WITH cte1 AS (
SELECT 
    mid,
    STRING_AGG(x->>'letter', '') AS tiles
FROM (
        SELECT 
            mid,
            JSONB_ARRAY_ELEMENTS(m.tiles) AS x
        FROM moves m
        WHERE m.gid = 1
) AS z
GROUP BY mid),
cte2 AS (
        SELECT 
        mid,
        STRING_AGG(y, ', ') AS words
    FROM (
        SELECT 
            mid,
            FORMAT('%s (%s)', s.word, s.score) AS y
        FROM scores s
        WHERE s.gid = 1
) AS z
GROUP BY mid)
SELECT mid, tiles, words 
FROM cte1 JOIN cte2 using (mid) ORDER BY mid ASC;


​​Yes.  It does end up presuming that the sets moves.mid and scores.mid ​are identical but that is probably a safe assumption.  Repetition of m.gid = 1 is worth avoiding in theory though depending on how its done the solution can be worse than the problem (if the planner ends up unable to push the predicate down).

​David J.