Обсуждение: Merge rows based on Levenshtein distance
I am new to PostgreSQL and I have the following table: Name, City "Alex", "Washington" "Aleex1", "Washington" "Bob", "NYC" "Booob", "NYC" I want to "merge" similar rows based on levenshtein distance between names so that I have the following table: id, Name, City 1,"Alex", "Washington" 1,"Aleex1", "Washington" 2,"Bob", "NYC" 2,"Booob", "NYC" How could I do that on PostgreSQL? Is there an SQL command for this? Thnsls -- View this message in context: http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841.html Sent from the PostgreSQL - general mailing list archive at Nabble.com.
mongoose wrote > I am new to PostgreSQL and I have the following table: > > Name, City > "Alex", "Washington" > "Aleex1", "Washington" > "Bob", "NYC" > "Booob", "NYC" > > I want to "merge" similar rows based on levenshtein distance between names > so that I have the following table: > > id, Name, City > 1,"Alex", "Washington" > 1,"Aleex1", "Washington" > 2,"Bob", "NYC" > 2,"Booob", "NYC" > > How could I do that on PostgreSQL? Is there an SQL command for this? > Thnsls So you have a table of N names and you want to evaluate (N-1)^2 pairs and then use the output of the levenshtein calculation to group them together. SELECT l_names.name_value, r_names.name_value, leven[...](l_names.name_value, r_names.name_value) AS pair_group FROM table_of_names AS l_names CROSS JOIN table_of_names AS r_names WHERE l_names.name_value <> r_names.name_value ; Feel free to add "group by city" or "WHERE substring(l_names.name_value, 0, 1) = substring(r_names.name_value, 0, 1)" since it seems you need more than just a name-distance to generate the desired groups. You'd likely want to add the same "substring" call to the SELECT-list and "GROUP BY" clauses... David J. -- View this message in context: http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841p5828847.html Sent from the PostgreSQL - general mailing list archive at Nabble.com.
David, Thank you for your prompt reply. I believe your answer helped a lot but it seems I was not clear enough on my description. Basically I want a counter (id) to show if two or more names are similar (i.e. levenshtein distance less than 3) So in the previous example: From this table: Name, City "Booob", "NYC" "Alex", "Washington" "Alexj2", "Washington" "Bob", "NYC" "Aleex1", "Washington" to get this table: id, Name, City 1,"Alex", "Washington" 1,"Aleex1", "Washington" 1,"Alexj2", "Washington" 2,"Bob", "NYC" 2,"Booob", "NYC" So basically the id is a counter that starts from "1" and increments only when there is a different name. Please notice that the table has its names in a completely random order. -- View this message in context: http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841p5829030.html Sent from the PostgreSQL - general mailing list archive at Nabble.com.
On Tuesday, December 2, 2014, mongoose [via PostgreSQL] <[hidden email]> wrote:
View this message in context: Re: Merge rows based on Levenshtein distance
Sent from the PostgreSQL - general mailing list archive at Nabble.com.
David,
Thank you for your prompt reply. I believe your answer helped a lot but it seems I was not clear enough on my description. Basically I want a counter (id) to show if two or more names are similar (i.e. levenshtein distance less than 3) So in the previous example:
From this table:
Name, City
"Booob", "NYC"
"Alex", "Washington"
"Alexj2", "Washington"
"Bob", "NYC"
"Aleex1", "Washington"
to get this table:
id, Name, City
1,"Alex", "Washington"
1,"Aleex1", "Washington"
1,"Alexj2", "Washington"
2,"Bob", "NYC"
2,"Booob", "NYC"
So basically the id is a counter that starts from "1" and increments only when there is a different name. Please notice that the table has its names in a completely random order.
Write and combine a few subqueries that use window functions (namely lag and row_number) to identify groups, label them, and assign rows to each group (using a between condition on a join)
Pondering some (not tested) if you identify the boundary records in a subquery you can assign them a value of 1 while all others take on null. In the outer query you should be able to assign groups by simply applying the sum function over the entire result such that at each boundary value the presence of the 1 will increment the sum while the null rows will use the sum value from the prior row.
David J.
View this message in context: Re: Merge rows based on Levenshtein distance
Sent from the PostgreSQL - general mailing list archive at Nabble.com.
There is nice extension in postgres: fuzzystrmatch <http://www.postgresql.org/docs/9.3/static/fuzzystrmatch.html> I have used to calculate the distance. From documetation: SELECT levenshtein_less_equal('extensive', 'exhaustive',2); You can use it then with your group by query. -- View this message in context: http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841p5829111.html Sent from the PostgreSQL - general mailing list archive at Nabble.com.
David, Thanks for the useful feedback. Since I am not an experienced developer it is too complicated for me to come up with the queries. Besides I wonder if this is going to be efficient to do this processing on PostgreSQL. -- View this message in context: http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841p5829132.html Sent from the PostgreSQL - general mailing list archive at Nabble.com.
I play with it when I get a chance but you should at least try to code something.
Dave
On Wed, Dec 3, 2014 at 11:08 AM, mongoose [via PostgreSQL] <[hidden email]> wrote:
David,
Thanks for the useful feedback. Since I am not an experienced developer it is too complicated for me to come up with the queries. Besides I wonder if this is going to be efficient to do this processing on PostgreSQL.If you reply to this email, your message will be added to the discussion below:http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841p5829132.html
View this message in context: Re: Merge rows based on Levenshtein distance
Sent from the PostgreSQL - general mailing list archive at Nabble.com.
There is nice extension in postgres: fuzzystrmatch I have used to calculate the distance. From documetation:
SELECT levenshtein_less_equal('extensive', 'exhaustive',2);
You can use it then with your group by query.
Something like this - replace the substring(...) comparison with legenshtein_less_equal(...) or whatever comparison you find applicable.
In the case below new groups are started whenever the first letter of the value changes.
The first group would be NULL so I add a COALESCE() call to make it 0 - subsequent groups start with 1 and increment properly.
WITH src (val) AS (
VALUES ('A1'::varchar),('A2'),('B1'),('B2'),('B3'),('C1'),('D1')
)
, grp AS (
SELECT val
, CASE WHEN
substring(val,1,1) <> substring(lag(val) OVER (ORDER BY val),1,1)
THEN 1
ELSE NULL
END AS changed
, ROW_NUMBER() OVER (ORDER BY val) AS val_idx
FROM src
)
SELECT val, COALESCE(sum(changed) OVER (ORDER BY val_idx), 0) AS group_id
FROM grp
;
David J.
View this message in context: Re: Merge rows based on Levenshtein distance
Sent from the PostgreSQL - general mailing list archive at Nabble.com.
I don't think you've defined your problem very clearly.
Suppose you have 1000 names in your database. Are you planning to compare each name to the other 999 names to see which is closest? What if two names are equally close to a third name but not to each other, how do you decide which is better?
Suppose you have 1000 names in your database. Are you planning to compare each name to the other 999 names to see which is closest? What if two names are equally close to a third name but not to each other, how do you decide which is better?
--
Mike Nolan
Thanks for the help. I will give your code a try. Btw I know how to solve this in a different language but unfortunately I am a very rookie with databases. -- View this message in context: http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841p5829145.html Sent from the PostgreSQL - general mailing list archive at Nabble.com.
Hi Mike, I was planning to do something like David suggested: I would sort the rows based on name and then I would use a window (i.e. 100 rows) to compare each individual name to the previous 100. All I want to do is create groups of similar rows based on some criteria. -- View this message in context: http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841p5829146.html Sent from the PostgreSQL - general mailing list archive at Nabble.com.
Have you considered using a soundex function to sort names into similarity groups? In my experience it works fairly well with Western European names, not quite as well with names from other parts of the world. It also doesn't deal well with many nicknames (Mike instead of Michael, etc.)
--
Mike Nolan