Обсуждение: Combine Top-k with similarity search extensions
I am performing similarity check over a column in a table with about 3500 entries. Column is populated with text data from text file. Performing a check results in 3500 * 3500 rows and it takes forever to calculate for my virtual machine. Is there any way to calculate for top-k results, to decrease amount and time needed? What I mean is that, for example when checking two sentences, if first several words does not match, to stop checking that sentences and move on.
Shmagi,
Take the first 20 text characters and compute and store the CRC32 or MD5 of that value. That value acts as a signature. You can then find all distinct signatures, or all rows with duplicate signatures for further analysis You could event try building a signature on the full text string.
From: "Shmagi Kavtaradze" <kavtaradze.s@gmail.com>
To: pgsql-novice@postgresql.org
Sent: Friday, November 20, 2015 2:21:36 AM
Subject: [NOVICE] Combine Top-k with similarity search extensions
To: pgsql-novice@postgresql.org
Sent: Friday, November 20, 2015 2:21:36 AM
Subject: [NOVICE] Combine Top-k with similarity search extensions
I am performing similarity check over a column in a table with about 3500 entries. Column is populated with text data from text file. Performing a check results in 3500 * 3500 rows and it takes forever to calculate for my virtual machine. Is there any way to calculate for top-k results, to decrease amount and time needed? What I mean is that, for example when checking two sentences, if first several words does not match, to stop checking that sentences and move on.
It will add complexity and also no idea how to do it. Is there any alternative?
On Fri, Nov 20, 2015 at 5:00 PM, <tim.child@comcast.net> wrote:
Shmagi,Take the first 20 text characters and compute and store the CRC32 or MD5 of that value. That value acts as a signature. You can then find all distinct signatures, or all rows with duplicate signatures for further analysis You could event try building a signature on the full text string.From: "Shmagi Kavtaradze" <kavtaradze.s@gmail.com>
To: pgsql-novice@postgresql.org
Sent: Friday, November 20, 2015 2:21:36 AM
Subject: [NOVICE] Combine Top-k with similarity search extensionsI am performing similarity check over a column in a table with about 3500 entries. Column is populated with text data from text file. Performing a check results in 3500 * 3500 rows and it takes forever to calculate for my virtual machine. Is there any way to calculate for top-k results, to decrease amount and time needed? What I mean is that, for example when checking two sentences, if first several words does not match, to stop checking that sentences and move on.
OK its does add complexity,
Here is a functional md5 index on the whole string
drop table if exists text_table;
create table text_table
(
mystring text
);
create index text_md5 on text_table(md5(mystring));
insert into text_table (mystring) values
('John Smith'), ('John Smith'), ('John Smith'), ('John Smith'), ('Ian Smith'), ('Ian Smith'), ('Ian Smith'), ('Ian Smith'), ('Jim Smith'), ('J Smith');
select md5, count from
( select md5(mystring) md5 , count(*) count from text_table group by md5(mystring) ) subq
where count > 1 ;
create table text_table
(
mystring text
);
create index text_md5 on text_table(md5(mystring));
insert into text_table (mystring) values
('John Smith'), ('John Smith'), ('John Smith'), ('John Smith'), ('Ian Smith'), ('Ian Smith'), ('Ian Smith'), ('Ian Smith'), ('Jim Smith'), ('J Smith');
select md5, count from
( select md5(mystring) md5 , count(*) count from text_table group by md5(mystring) ) subq
where count > 1 ;
From: "Shmagi Kavtaradze" <kavtaradze.s@gmail.com>
To: "tim child" <tim.child@comcast.net>
Cc: pgsql-novice@postgresql.org
Sent: Friday, November 20, 2015 8:13:15 AM
Subject: Re: [NOVICE] Combine Top-k with similarity search extensions
To: "tim child" <tim.child@comcast.net>
Cc: pgsql-novice@postgresql.org
Sent: Friday, November 20, 2015 8:13:15 AM
Subject: Re: [NOVICE] Combine Top-k with similarity search extensions
It will add complexity and also no idea how to do it. Is there any alternative?
On Fri, Nov 20, 2015 at 5:00 PM, <tim.child@comcast.net> wrote:
Shmagi,Take the first 20 text characters and compute and store the CRC32 or MD5 of that value. That value acts as a signature. You can then find all distinct signatures, or all rows with duplicate signatures for further analysis You could event try building a signature on the full text string.From: "Shmagi Kavtaradze" <kavtaradze.s@gmail.com>
To: pgsql-novice@postgresql.org
Sent: Friday, November 20, 2015 2:21:36 AM
Subject: [NOVICE] Combine Top-k with similarity search extensionsI am performing similarity check over a column in a table with about 3500 entries. Column is populated with text data from text file. Performing a check results in 3500 * 3500 rows and it takes forever to calculate for my virtual machine. Is there any way to calculate for top-k results, to decrease amount and time needed? What I mean is that, for example when checking two sentences, if first several words does not match, to stop checking that sentences and move on.
Dump to file, run shell script or C program to sort (sort -u). Searches and
comparisons work much better on sorted sets.
Matt
-------- Original Message --------
Subject: Re: [NOVICE] Combine Top-k with similarity search extensions
From: Shmagi Kavtaradze <kavtaradze.s@gmail.com>
Date: Fri, November 20, 2015 8:13 am
To: tim.child@comcast.net
Cc: pgsql-novice@postgresql.orgIt will add complexity and also no idea how to do it. Is there any alternative?On Fri, Nov 20, 2015 at 5:00 PM, <tim.child@comcast.net> wrote:Shmagi,Take the first 20 text characters and compute and store the CRC32 or MD5 of that value. That value acts as a signature. You can then find all distinct signatures, or all rows with duplicate signatures for further analysis You could event try building a signature on the full text string.From: "Shmagi Kavtaradze" <kavtaradze.s@gmail.com>
To: pgsql-novice@postgresql.org
Sent: Friday, November 20, 2015 2:21:36 AM
Subject: [NOVICE] Combine Top-k with similarity search extensionsI am performing similarity check over a column in a table with about 3500 entries. Column is populated with text data from text file. Performing a check results in 3500 * 3500 rows and it takes forever to calculate for my virtual machine. Is there any way to calculate for top-k results, to decrease amount and time needed? What I mean is that, for example when checking two sentences, if first several words does not match, to stop checking that sentences and move on.