Обсуждение: Combine Top-k with similarity search extensions

Поиск

Список

Период

Сортировка

Combine Top-k with similarity search extensions

От

Shmagi Kavtaradze

Дата:

20 ноября 2015 г., 10:21:40

I am performing similarity check over a column in a table with about 3500 entries. Column is populated with text data from text file. Performing a check results in 3500 * 3500 rows and it takes forever to calculate for my virtual machine. Is there any way to calculate for top-k results, to decrease amount and time needed? What I mean is that, for example when checking two sentences, if first several words does not match, to stop checking that sentences and move on.

Re: Combine Top-k with similarity search extensions

От

tim.child@comcast.net

Дата:

20 ноября 2015 г., 16:00:40

Shmagi,

Take the first 20 text characters and compute and store the CRC32 or MD5 of that value. That value acts as a signature. You can then find all distinct signatures, or all rows with duplicate signatures for further analysis You could event try building a signature on the full text string.

From: "Shmagi Kavtaradze" <kavtaradze.s@gmail.com>
To: pgsql-novice@postgresql.org
Sent: Friday, November 20, 2015 2:21:36 AM
Subject: [NOVICE] Combine Top-k with similarity search extensions

Re: Combine Top-k with similarity search extensions

От

Shmagi Kavtaradze

Дата:

20 ноября 2015 г., 16:13:27

It will add complexity and also no idea how to do it. Is there any alternative?

On Fri, Nov 20, 2015 at 5:00 PM, <tim.child@comcast.net> wrote:

Shmagi,

Take the first 20 text characters and compute and store the CRC32 or MD5 of that value. That value acts as a signature. You can then find all distinct signatures, or all rows with duplicate signatures for further analysis You could event try building a signature on the full text string.

From: "Shmagi Kavtaradze" <kavtaradze.s@gmail.com>
To: pgsql-novice@postgresql.org
Sent: Friday, November 20, 2015 2:21:36 AM
Subject: [NOVICE] Combine Top-k with similarity search extensions

I am performing similarity check over a column in a table with about 3500 entries. Column is populated with text data from text file. Performing a check results in 3500 * 3500 rows and it takes forever to calculate for my virtual machine. Is there any way to calculate for top-k results, to decrease amount and time needed? What I mean is that, for example when checking two sentences, if first several words does not match, to stop checking that sentences and move on.

Re: Combine Top-k with similarity search extensions

От

tim.child@comcast.net

Дата:

20 ноября 2015 г., 16:42:51

OK its does add complexity,

Here is a functional md5 index on the whole string

drop table if exists text_table;
create table text_table
(
mystring text

);

create index text_md5 on text_table(md5(mystring));

insert into text_table (mystring) values
('John Smith'), ('John Smith'), ('John Smith'), ('John Smith'), ('Ian Smith'), ('Ian Smith'), ('Ian Smith'), ('Ian Smith'), ('Jim Smith'), ('J Smith');

select md5, count from
( select md5(mystring) md5 , count(*) count from text_table group by md5(mystring) ) subq
where count > 1 ;

From: "Shmagi Kavtaradze" <kavtaradze.s@gmail.com>
To: "tim child" <tim.child@comcast.net>
Cc: pgsql-novice@postgresql.org
Sent: Friday, November 20, 2015 8:13:15 AM
Subject: Re: [NOVICE] Combine Top-k with similarity search extensions

It will add complexity and also no idea how to do it. Is there any alternative?

On Fri, Nov 20, 2015 at 5:00 PM, <tim.child@comcast.net> wrote:

Shmagi,

Take the first 20 text characters and compute and store the CRC32 or MD5 of that value. That value acts as a signature. You can then find all distinct signatures, or all rows with duplicate signatures for further analysis You could event try building a signature on the full text string.

From: "Shmagi Kavtaradze" <kavtaradze.s@gmail.com>
To: pgsql-novice@postgresql.org
Sent: Friday, November 20, 2015 2:21:36 AM
Subject: [NOVICE] Combine Top-k with similarity search extensions

I am performing similarity check over a column in a table with about 3500 entries. Column is populated with text data from text file. Performing a check results in 3500 * 3500 rows and it takes forever to calculate for my virtual machine. Is there any way to calculate for top-k results, to decrease amount and time needed? What I mean is that, for example when checking two sentences, if first several words does not match, to stop checking that sentences and move on.

Re: Combine Top-k with similarity search extensions

От

"马修/Матвей/Mateo/Matt Buse"

Дата:

21 ноября 2015 г., 02:00:25

Dump to file, run shell script or C program to sort (sort -u). Searches and

comparisons work much better on sorted sets.

Matt

-------- Original Message --------
Subject: Re: [NOVICE] Combine Top-k with similarity search extensions
From: Shmagi Kavtaradze <kavtaradze.s@gmail.com>
Date: Fri, November 20, 2015 8:13 am
To: tim.child@comcast.net
Cc: pgsql-novice@postgresql.org

It will add complexity and also no idea how to do it. Is there any alternative?

On Fri, Nov 20, 2015 at 5:00 PM, <tim.child@comcast.net> wrote:
Shmagi,

Take the first 20 text characters and compute and store the CRC32 or MD5 of that value. That value acts as a signature. You can then find all distinct signatures, or all rows with duplicate signatures for further analysis You could event try building a signature on the full text string.

From: "Shmagi Kavtaradze" <kavtaradze.s@gmail.com>
To: pgsql-novice@postgresql.org
Sent: Friday, November 20, 2015 2:21:36 AM
Subject: [NOVICE] Combine Top-k with similarity search extensions

I am performing similarity check over a column in a table with about 3500 entries. Column is populated with text data from text file. Performing a check results in 3500 * 3500 rows and it takes forever to calculate for my virtual machine. Is there any way to calculate for top-k results, to decrease amount and time needed? What I mean is that, for example when checking two sentences, if first several words does not match, to stop checking that sentences and move on.

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Combine Top-k with similarity search extensions

Combine Top-k with similarity search extensions

Re: Combine Top-k with similarity search extensions

Re: Combine Top-k with similarity search extensions

Re: Combine Top-k with similarity search extensions

Re: Combine Top-k with similarity search extensions