Обсуждение: joining tables
I'm having problems joining 3 tables to provide a 4th table in the correct format. I believe I need a FULL OUTER JOIN but does not give the result that I require.See below: Example 1 Table: Names Table: Friend Name Tel email PostCode Name friend email Jim 1 aa I Dave yes cc Ted 2 bb J Will yes dd Dave 3 cc K Zac yes ff Will 4 dd L Byron yes gg Table: Details Name email Town PostCode John ee A M Zac ff B N Byron gg C O Will dd D L Gary hh E P Ted bb F J Combined table: Result Name Tel email Town PostCode friend Jim 1 aa I Ted 2 bb F J Dave 3 cc K yes Will 4 dd D L yes John ee A M Zac ff B N yes Byron gg C O yes Gary hh E P Example 2 Table: Names Table: Friend Name Tel email PostCode Name friend email Jim 1 aa I Gareth yes ii Ted 2 bb J Tony yes jj Dave 3 cc K Ken yes kk Will 4 dd L Lloyd yes ll Table: Details Name email Town PostCode John ee A M Zac ff B N Byron gg C O Gary hh E P Combined table: Result Name Tel email Town PostCode friend Jim 1 aa I Ted 2 bb J Dave 3 cc K Will 4 dd L John ee A M Zac ff B N Byron gg C O Gary hh E P Gareth ii yes Tony jj yes Ken kk yes Lloyd ll yes -- Sent from: https://www.postgresql-archive.org/PostgreSQL-pgadmin-support-f2191615.html
The first thing I noticed is that there is no primary key field on Names. If Friend is a child table then you will use a LEFT OUTER JOIN. Friend will need a foreign key field that has values matching the primary key field in Names. Friend should also have it's own primary key field.
If you are storing names and friends in the same table then you should join them using a junction table. A junction table will have the unique ID value of the name record and the unique ID value of the friend record within the same record. A junction table allows for many-to-many relationships. In other words, you could build a tree of friend relationships.
Where you want to display yes/no values, if you use a LEFT OUTER JOIN and there is no related friend you will get the Name record back and an empty Friend record. If there is a friend record you will get the Name and the Friend record back. If there is more than one friend you will get one Name record per Friend record.
If you really just want yes/no values, you can use a sub query instead of a JOIN within the main query that selects the count of Friend records related to each Name record. If the count = 0, there are no Friend records. If the count is greater than 0, there are Friends. A CASE statement can convert the count to yes/no.
The primary key field on each table should be unique to that table and have no other meaning. For example, you shouldn't use email or telephone as primary key fields because of the chance that someone else my eventually have the same email address or telephone number as the person you've already stored in your database. Another good reason for primary keys is if you want to create updatable views. Without primary keys the database engine can't determine which records you are working on.
A primary key field can be an auto incrementing number (sequence or identity). It can also be a timestamp or UUID. The problem with timestamp and UUID is the slight possibility of a duplicate value. Additionally, numeric keys process faster than text keys.
The other thing about your sample is why Details is separate from Names. Things that could have multiple records per name record should be in separate tables. Examples would be emails, addresses and phone numbers (work, home, etc.). Additional details for a person should be stored with the person unless the same field could have more than one value.
If you are storing names and friends in the same table then you should join them using a junction table. A junction table will have the unique ID value of the name record and the unique ID value of the friend record within the same record. A junction table allows for many-to-many relationships. In other words, you could build a tree of friend relationships.
Where you want to display yes/no values, if you use a LEFT OUTER JOIN and there is no related friend you will get the Name record back and an empty Friend record. If there is a friend record you will get the Name and the Friend record back. If there is more than one friend you will get one Name record per Friend record.
If you really just want yes/no values, you can use a sub query instead of a JOIN within the main query that selects the count of Friend records related to each Name record. If the count = 0, there are no Friend records. If the count is greater than 0, there are Friends. A CASE statement can convert the count to yes/no.
The primary key field on each table should be unique to that table and have no other meaning. For example, you shouldn't use email or telephone as primary key fields because of the chance that someone else my eventually have the same email address or telephone number as the person you've already stored in your database. Another good reason for primary keys is if you want to create updatable views. Without primary keys the database engine can't determine which records you are working on.
A primary key field can be an auto incrementing number (sequence or identity). It can also be a timestamp or UUID. The problem with timestamp and UUID is the slight possibility of a duplicate value. Additionally, numeric keys process faster than text keys.
The other thing about your sample is why Details is separate from Names. Things that could have multiple records per name record should be in separate tables. Examples would be emails, addresses and phone numbers (work, home, etc.). Additional details for a person should be stored with the person unless the same field could have more than one value.
On 9/2/19 11:24 AM, TedJones wrote:
I'm having problems joining 3 tables to provide a 4th table in the correct format. I believe I need a FULL OUTER JOIN but does not give the result that I require.See below: Example 1 Table: Names Table: Friend Name Tel email PostCode Name friend email Jim 1 aa I Dave yes cc Ted 2 bb J Will yes dd Dave 3 cc K Zac yes ff Will 4 dd L Byron yes gg Table: Details Name email Town PostCode John ee A M Zac ff B N Byron gg C O Will dd D L Gary hh E P Ted bb F J Combined table: Result Name Tel email Town PostCode friend Jim 1 aa I Ted 2 bb F J Dave 3 cc K yes Will 4 dd D L yes John ee A M Zac ff B N yes Byron gg C O yes Gary hh E P Example 2 Table: Names Table: Friend Name Tel email PostCode Name friend email Jim 1 aa I Gareth yes ii Ted 2 bb J Tony yes jj Dave 3 cc K Ken yes kk Will 4 dd L Lloyd yes ll Table: Details Name email Town PostCode John ee A M Zac ff B N Byron gg C O Gary hh E P Combined table: Result Name Tel email Town PostCode friend Jim 1 aa I Ted 2 bb J Dave 3 cc K Will 4 dd L John ee A M Zac ff B N Byron gg C O Gary hh E P Gareth ii yes Tony jj yes Ken kk yes Lloyd ll yes -- Sent from: https://www.postgresql-archive.org/PostgreSQL-pgadmin-support-f2191615.html
Hi Patrick Thanks for your response. The tables that I gave were just an example and there is no relevance in the friend/yes or names/contact details in different tables. To clarity, another example would be three tables - shop1, shop2 and shop3. Each would have products so that would be a common column but product price may also be a common column. However, there would be also different column names for each shop. (columns with same names would be of the same data type). I have no control over the three tables that I get as csv files. It should be easy enough, if needed to add a primary key column (to each?) after reading in the csv files into tables. So, generally, what I am trying to achieve is to combine three tables into one table where some of the columns are the same and some are not. e.g table 1 with 12 columns; table 2 with 10 columns where 4 are the same as in table 1; table 3 with 4 columns where 1 is the same as in table 1 and 1 the same as in table 2. The number of columns in the result table would then be 12 (table 1) + 6 (new ones from table 2) + 2 (new ones from table 3) = 20 columns. Usually the data rows from each table will be different but not always. If data in a column common to all three tables e.g product name, was the same from all tables i.e. same product name then all the information about that product from the three tables would be in all the columns for that data row. I hope that’s clear. i.e. the difference between example1 and example2 in my original question. In this was in reverse it would start with the large result table and use three SELECT statements to create the three tables. Thanks Ted Jones -- Sent from: https://www.postgresql-archive.org/PostgreSQL-pgadmin-support-f2191615.html
Hi Ted, In essence, you want to merge the three tables, removing records that are duplicated by another table, right? Here are two approaches: 1) SELECT DISTINCT (field list) FROM (SELECT * from table1 UNION SELECT * from table2 UNION SELECT * from table 3). This will remove all duplicates, regardless of their source (e.g. if table1 as two “Ted” fields, one of the rows will beeliminated. Overlapping fields will be combined (e.g. only one field in the output), and non-overlapping fields will be merged. For example if t1 has f1,f2, and f3, and t2 has f1, f3, and f4, and table 3 has f1, f2, and f5, the resulting output will have f1, f2, f3, f4, and f5; fields that did notexist in the source table will be NULL in the output. 2) SELECT CASE WHEN t1.field1 IS NOT NULL THEN t1.field1 WHEN t2.field1 IS NOT NULL THEN t2.field1 ELSE t3.field1 END, … FROM t1 LEFT OUTER JOIN t2 ON (key field) LEFT OUTER JOIN t3 ON (key field) Using LEFT OUTER JOIN ensures that all records from all three tables will be included and that duplicates among the threetables will result in only one record. Duplicates within a table create a problem. For example, if t1 has “Ted” twice and t2 has “Ted” twice,there will be four “Ted” records in the output. If you can live with the elimination of duplicate rows within a table, option 1 is the easiest as you don’t have to specificallycode each field to be selected. > On Sep 3, 2019, at 4:01 AM, TedJones <ted@mentra.co.uk> wrote: > > Hi Patrick > > Thanks for your response. The tables that I gave were just an example and > there is no relevance in the friend/yes or names/contact details in > different tables. > > To clarity, another example would be three tables - shop1, shop2 and shop3. > Each would have products so that would be a common column but product price > may also be a common column. However, there would be also different column > names for each shop. (columns with same names would be of the same data > type). > > I have no control over the three tables that I get as csv files. It should > be easy enough, if needed to add a primary key column (to each?) after > reading in the csv files into tables. > > So, generally, what I am trying to achieve is to combine three tables into > one table where some of the columns are the same and some are not. e.g table > 1 with 12 columns; table 2 with 10 columns where 4 are the same as in table > 1; table 3 with 4 columns where 1 is the same as in table 1 and 1 the same > as in table 2. The number of columns in the result table would then be 12 > (table 1) + 6 (new ones from table 2) + 2 (new ones from table 3) = 20 > columns. > > Usually the data rows from each table will be different but not always. If > data in a column common to all three tables e.g product name, was the same > from all tables i.e. same product name then all the information about that > product from the three tables would be in all the columns for that data row. > I hope that’s clear. i.e. the difference between example1 and example2 in my > original question. > > In this was in reverse it would start with the large result table and use > three SELECT statements to create the three tables. > > Thanks > Ted Jones > > > > > > -- > Sent from: https://www.postgresql-archive.org/PostgreSQL-pgadmin-support-f2191615.html > >
Hi Jack I'm not sure if I understand your comment about two 'Ted' fields and duplication of rows. In the example below there is 'Ted' twice in Table3 1 and 3 and must appear twice as I've shown in the result as the data in the rest of the row is different. I agree with no duplication of rows if all of the row is the same. Ted Table: 1 Author Title Sales Publication Date Jim A aa I Ted B bb J Dave C cc K Ted D dd L Table: 2 Author Publisher Jim him Ted me Dave me Will you Gary him Table:3 Author Title Country Ted B UK Ted D US Jim A UK Dave C UK Combined table: Result Author Title Publication Date Publisher Sales Country Ted B J me bb UK Ted D L me dd US Jim A I him aa UK Dave C K me cc UK Gary null null him null null Will null null you null null -- Sent from: https://www.postgresql-archive.org/PostgreSQL-pgadmin-support-f2191615.html
Not having unique IDs from the source tables will present a problem in that you could have more than one record in one of the tables with the same data.
It looks like you will need to determine the fields that make each row unique but common in all three tables. Then, you can match up records from the three tables.
The next issue is if there is similar data in similar columns, which data takes prescience. That will be something you have to decide. I think the best way to do that is to import the most accurate table first. If a record from the second table matches a record from the first table, only import column values that weren't already populated by the first import. Then do the same thing with the 3rd table and so on.
I think the overall process will be to import each CSV file into a Postgres work table of it's own. Then, import the first table into a master table that contains every field. Do the same with the second import with the additional logic to only set a column value if it wasn't already set by the first (prior) import. Then, do the same with the third table and so on. By getting the CSV files into Postgres first, it will be easier to manage cases where a column between the two tables is the same data but the names are different. I think it will also perform better.
The import of the first table into the main table will be a simple INSERT query. For each of the rest of the tables I think I would first run an UPDATE query with an INNER JOIN that updates column values of matching records but only if the column isn't already populated. Then, run an INSERT query using the second table on the left side of a LEFT OUTER JOIN and the main table on the right side and only return records from the left hand table that don't match any records on the right hand table (i.e. fields in right table are null). Do the same for the rest of the tables.
A different way to perform the INSERT query would be to use a sub query in the WHERE clause of the second table that only returns records that don't have matches in the main table. This will be easier if each of the source tables has it's own unique ID across all source tables. If you don't have unique IDs, you can use a concatenation of key field values.
SELECT columns
FROM table2
WHERE [concatenation of ] key_column(s) in table2 NOT IN (SELECT [contenation of] key_column(s) FROM table1);
It looks like you will need to determine the fields that make each row unique but common in all three tables. Then, you can match up records from the three tables.
The next issue is if there is similar data in similar columns, which data takes prescience. That will be something you have to decide. I think the best way to do that is to import the most accurate table first. If a record from the second table matches a record from the first table, only import column values that weren't already populated by the first import. Then do the same thing with the 3rd table and so on.
I think the overall process will be to import each CSV file into a Postgres work table of it's own. Then, import the first table into a master table that contains every field. Do the same with the second import with the additional logic to only set a column value if it wasn't already set by the first (prior) import. Then, do the same with the third table and so on. By getting the CSV files into Postgres first, it will be easier to manage cases where a column between the two tables is the same data but the names are different. I think it will also perform better.
The import of the first table into the main table will be a simple INSERT query. For each of the rest of the tables I think I would first run an UPDATE query with an INNER JOIN that updates column values of matching records but only if the column isn't already populated. Then, run an INSERT query using the second table on the left side of a LEFT OUTER JOIN and the main table on the right side and only return records from the left hand table that don't match any records on the right hand table (i.e. fields in right table are null). Do the same for the rest of the tables.
A different way to perform the INSERT query would be to use a sub query in the WHERE clause of the second table that only returns records that don't have matches in the main table. This will be easier if each of the source tables has it's own unique ID across all source tables. If you don't have unique IDs, you can use a concatenation of key field values.
SELECT columns
FROM table2
WHERE [concatenation of ] key_column(s) in table2 NOT IN (SELECT [contenation of] key_column(s) FROM table1);
On 9/3/19 5:01 AM, TedJones wrote:
Hi Patrick Thanks for your response. The tables that I gave were just an example and there is no relevance in the friend/yes or names/contact details in different tables. To clarity, another example would be three tables - shop1, shop2 and shop3. Each would have products so that would be a common column but product price may also be a common column. However, there would be also different column names for each shop. (columns with same names would be of the same data type). I have no control over the three tables that I get as csv files. It should be easy enough, if needed to add a primary key column (to each?) after reading in the csv files into tables. So, generally, what I am trying to achieve is to combine three tables into one table where some of the columns are the same and some are not. e.g table 1 with 12 columns; table 2 with 10 columns where 4 are the same as in table 1; table 3 with 4 columns where 1 is the same as in table 1 and 1 the same as in table 2. The number of columns in the result table would then be 12 (table 1) + 6 (new ones from table 2) + 2 (new ones from table 3) = 20 columns. Usually the data rows from each table will be different but not always. If data in a column common to all three tables e.g product name, was the same from all tables i.e. same product name then all the information about that product from the three tables would be in all the columns for that data row. I hope that’s clear. i.e. the difference between example1 and example2 in my original question. In this was in reverse it would start with the large result table and use three SELECT statements to create the three tables. Thanks Ted Jones -- Sent from: https://www.postgresql-archive.org/PostgreSQL-pgadmin-support-f2191615.html
Let me step back a bit, as I realize upon further reflection that the first method will not work. As far as the issue about two “Ted” records, read it as though I said two “Ted D” records. It refers to two records in thesame table with the same key value. If “Ted D” appeared twice in table 1 and twice in table 2, the join would give fourresulting records (all combinations of the records from table 1 and the records from table 2 (and the records from table3) — that’s fundamentally how a join works. If you don’t want that, then make sure that there are no duplicates withineach of the tables. > On Sep 3, 2019, at 9:35 AM, TedJones <ted@mentra.co.uk> wrote: > > Hi Jack > > I'm not sure if I understand your comment about two 'Ted' fields and > duplication of rows. In the example below there is 'Ted' twice in Table3 1 > and 3 and must appear twice as I've shown in the result as the data in the > rest of the row is different. I agree with no duplication of rows if all of > the row is the same. > > Ted > > Table: 1 > Author Title Sales Publication Date > Jim A aa I > Ted B bb J > Dave C cc K > Ted D dd L > > Table: 2 > Author Publisher > Jim him > Ted me > Dave me > Will you > Gary him > > Table:3 > Author Title Country > Ted B UK > Ted D US > Jim A UK > Dave C UK > > Combined table: Result > Author Title Publication Date Publisher Sales Country > Ted B J me bb UK > Ted D L me dd US > Jim A I him aa UK > Dave C K me cc UK > Gary null null him null null > Will null null you null null > > > > > -- > Sent from: https://www.postgresql-archive.org/PostgreSQL-pgadmin-support-f2191615.html > >
Also, be aware that this is NOT the postgresql email list, but the pgAdmin email list, you would get a lot more help about SQL there.
On Tue, Sep 3, 2019, 19:44 Jack Royal-Gordon <jackrg@pobox.com> wrote:
Let me step back a bit, as I realize upon further reflection that the first method will not work.
As far as the issue about two “Ted” records, read it as though I said two “Ted D” records. It refers to two records in the same table with the same key value. If “Ted D” appeared twice in table 1 and twice in table 2, the join would give four resulting records (all combinations of the records from table 1 and the records from table 2 (and the records from table 3) — that’s fundamentally how a join works. If you don’t want that, then make sure that there are no duplicates within each of the tables.
> On Sep 3, 2019, at 9:35 AM, TedJones <ted@mentra.co.uk> wrote:
>
> Hi Jack
>
> I'm not sure if I understand your comment about two 'Ted' fields and
> duplication of rows. In the example below there is 'Ted' twice in Table3 1
> and 3 and must appear twice as I've shown in the result as the data in the
> rest of the row is different. I agree with no duplication of rows if all of
> the row is the same.
>
> Ted
>
> Table: 1
> Author Title Sales Publication Date
> Jim A aa I
> Ted B bb J
> Dave C cc K
> Ted D dd L
>
> Table: 2
> Author Publisher
> Jim him
> Ted me
> Dave me
> Will you
> Gary him
>
> Table:3
> Author Title Country
> Ted B UK
> Ted D US
> Jim A UK
> Dave C UK
>
> Combined table: Result
> Author Title Publication Date Publisher Sales Country
> Ted B J me bb UK
> Ted D L me dd US
> Jim A I him aa UK
> Dave C K me cc UK
> Gary null null him null null
> Will null null you null null
>
>
>
>
> --
> Sent from: https://www.postgresql-archive.org/PostgreSQL-pgadmin-support-f2191615.html
>
>
yes, what you need is a full outer join. How are your actual results different from the expected ones?
On Wed, Sep 4, 2019 at 4:18 AM Michel Feinstein <michelfeinstein@gmail.com> wrote:
Also, be aware that this is NOT the postgresql email list, but the pgAdmin email list, you would get a lot more help about SQL there.On Tue, Sep 3, 2019, 19:44 Jack Royal-Gordon <jackrg@pobox.com> wrote:Let me step back a bit, as I realize upon further reflection that the first method will not work.
As far as the issue about two “Ted” records, read it as though I said two “Ted D” records. It refers to two records in the same table with the same key value. If “Ted D” appeared twice in table 1 and twice in table 2, the join would give four resulting records (all combinations of the records from table 1 and the records from table 2 (and the records from table 3) — that’s fundamentally how a join works. If you don’t want that, then make sure that there are no duplicates within each of the tables.
> On Sep 3, 2019, at 9:35 AM, TedJones <ted@mentra.co.uk> wrote:
>
> Hi Jack
>
> I'm not sure if I understand your comment about two 'Ted' fields and
> duplication of rows. In the example below there is 'Ted' twice in Table3 1
> and 3 and must appear twice as I've shown in the result as the data in the
> rest of the row is different. I agree with no duplication of rows if all of
> the row is the same.
>
> Ted
>
> Table: 1
> Author Title Sales Publication Date
> Jim A aa I
> Ted B bb J
> Dave C cc K
> Ted D dd L
>
> Table: 2
> Author Publisher
> Jim him
> Ted me
> Dave me
> Will you
> Gary him
>
> Table:3
> Author Title Country
> Ted B UK
> Ted D US
> Jim A UK
> Dave C UK
>
> Combined table: Result
> Author Title Publication Date Publisher Sales Country
> Ted B J me bb UK
> Ted D L me dd US
> Jim A I him aa UK
> Dave C K me cc UK
> Gary null null him null null
> Will null null you null null
>
>
>
>
> --
> Sent from: https://www.postgresql-archive.org/PostgreSQL-pgadmin-support-f2191615.html
>
>