Discussion: big data - slow select (speech search)

big data - slow select (speech search)

From: Michal Fapso
Date:
Hi,

I have quite a simple query but a lot of data, and the SELECT query is
too slow. I would be really grateful for any advice on this.

--------------------------------------------------
The background info:

I work on a speech search engine, which differs from text search in
that there can be multiple words (hypotheses) at the same position, and
each hypothesis has a weight (probability) of occurrence.

When the word 'hello' appears in a document 'lecture_1', there is a row
in the table hyps (see below) which contains an array of all positions
of the word 'hello' in the document 'lecture_1', and for each position
it also contains a weight.

I need the positions to be able to search for phrases. However, here I
simplified the query as much as I could without a significant
reduction in speed.
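
For illustration, a two-word phrase query over this schema might look
roughly like the following sketch (the two wordids are made up for the
example, and it assumes positions are consecutive word indices, so
adjacency means a difference of one):

SELECT DISTINCT w1.docid
FROM (SELECT docid, unnest(positions) AS pos
      FROM hyps WHERE wordid = 65658) AS w1   -- first word of the phrase
JOIN (SELECT docid, unnest(positions) AS pos
      FROM hyps WHERE wordid = 65659) AS w2   -- second word of the phrase
  ON w2.docid = w1.docid
 AND w2.pos   = w1.pos + 1;                   -- adjacent positions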

I know there is the tsearch extension, which could be more appropriate
for this, but I haven't tried it yet. The size of my data will stay the
same, which seems to be the issue in my case. But maybe I am wrong and
with tsearch it will be much faster. What do you think?
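
For reference, a minimal tsearch setup would look something like the
sketch below (untested here, and only a rough analogue: a tsvector keeps
word positions, but the per-position weights would still have to live in
a separate structure such as the hyps table):

CREATE TABLE docs (
    docid INT PRIMARY KEY,
    body  tsvector
);
CREATE INDEX docs_body_idx ON docs USING gin (body);

-- single-word lookup, analogous to the wordid query further below
SELECT docid FROM docs WHERE body @@ to_tsquery('english', 'hello');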

--------------------------------------------------
Preconditions:

First I cleared the disk cache:
  sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

Then I started the PostgreSQL daemon and connected to my database with
the psql client. The first thing I did was to execute the SELECT query
described below. It took about 4.5 seconds. If I rerun it, it takes
less than 2 milliseconds, but that is only because of the cache. I need
to optimize the first run.

--------------------------------------------------
Hardware:

ASUS laptop, dual-core T2300 CPU at 1.66 GHz, 1.5 GB RAM

--------------------------------------------------
Version:

PostgreSQL 8.4.4 on i686-pc-linux-gnu, compiled by GCC gcc (Ubuntu
4.4.1-4ubuntu9) 4.4.1, 32-bit

compiled from source; only the --prefix=... argument was given to ./configure

--------------------------------------------------
Schema:

CREATE TABLE hyps (
    docid INT,
    wordid INT,
    positions INT[],
    weights REAL[],
    length INT,
    total_weight REAL
);
COPY hyps FROM '/home/miso/exp/speech_search/postgresql/sqlcopy/all_weights_clustered.sqlcopy';
CREATE INDEX hyps_wordid_index ON hyps USING hash (wordid);
CREATE INDEX hyps_docid_index ON hyps USING hash (docid);

shared_buffers = 300MB (this is the only setting I changed in the config)

I also tried btree indexes instead of hash, and surprisingly the SELECT
query was a bit faster. I would have expected hash to be faster.
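
For comparison, the btree variants of those indexes would be (btree is
the default index type, so USING btree could also be omitted):

CREATE INDEX hyps_wordid_index ON hyps USING btree (wordid);
CREATE INDEX hyps_docid_index ON hyps USING btree (docid);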

The index on the 'docid' column is there because I also need to be able
to search within a particular document or a set of documents.
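
For example, a query restricted to a couple of documents would just add
a docid condition (the docids here are made up):

SELECT h1.docid
FROM hyps AS h1
WHERE h1.wordid = 65658
  AND h1.docid IN (12, 42);   -- hypothetical document ids
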
--------------------------------------------------
Table info:

- rows = 5490156
- average length of positions vectors = 19.5
- total number of items in positions vectors = 107444304
- positions and weights in a row have the same number of items, but the
number may differ from row to row
- table data are loaded only once (using COPY) and are never modified afterwards
- there are 369 distinct docid values and 161460 distinct wordid values
- VACUUM was executed after the COPY of data
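
For reference, numbers like these can be gathered with a query along
these lines (just a sketch of how they might be computed):

SELECT count(*)                        AS row_count,
       avg(array_length(positions, 1)) AS avg_positions_length,
       sum(array_length(positions, 1)) AS total_position_items,
       count(DISTINCT docid)           AS distinct_docids,
       count(DISTINCT wordid)          AS distinct_wordids
FROM hyps;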

--------------------------------------------------
Query:

EXPLAIN ANALYZE SELECT h1.docid
FROM hyps AS h1
WHERE h1.wordid=65658;

 Bitmap Heap Scan on hyps h1  (cost=10.97..677.09 rows=171 width=4) (actual time=62.106..4416.864 rows=343 loops=1)
   Recheck Cond: (wordid = 65658)
   ->  Bitmap Index Scan on hyps_wordid_index  (cost=0.00..10.92 rows=171 width=0) (actual time=42.969..42.969 rows=343 loops=1)
         Index Cond: (wordid = 65658)
 Total runtime: 4432.015 ms

The result has 343 rows and there are 9294 items in positions vectors in total.

--------------------------------------------------
Comparison with Lucene:

If I run the same query in the Lucene search engine, it takes 0.105
seconds on the same data, which is a huge difference.

--------------------------------------------------
Synthetic data set:

If you want to try it yourself, here is a script which generates the
data for the COPY command. I don't know whether it is possible to send
attachments here, so I put the script inline. Just save it as
create_synthetic_data.pl and run it with 'perl
create_synthetic_data.pl'. With this synthetic data the SELECT query
times are around 2.5 seconds. You can try the SELECT query with
'wordid' equal to 1, 2, 3, ... 10000.


#!/usr/bin/perl
# Create synthetic data for PostgreSQL COPY.
use strict;
use warnings;

my $rows  = 5490156;   # total number of rows to generate
my $docs  = 369;       # number of distinct documents
my $words = 161460;    # number of distinct words (informational)
my $docid  = 0;
my $wordid = 0;

for (my $row = 0; $row < $rows; $row++) {

    my $sep          = "";
    my $positions    = "";
    my $weights      = "";
    my $total_weight = 0;
    my $items        = int(rand(39)) + 1;   # 1..39 positions per row

    # Start a new document every $rows/$docs rows and restart the word ids.
    if ($row % int($rows / $docs) == 0) {
        $docid++;
        $wordid = 0;
    }
    $wordid++;

    # Build the positions and weights arrays for this (docid, wordid) row.
    for (my $i = 0; $i < $items; $i++) {
        my $position   = int(rand(20000));
        my $weight     = rand(1);
        $positions    .= $sep . $position;
        $weights      .= $sep . sprintf("%.3f", $weight);
        $total_weight += $weight;
        $sep           = ",";
    }

    # One tab-separated line in COPY format:
    # docid  wordid  {positions}  {weights}  length  total_weight
    print "$docid\t$wordid\t{$positions}\t{$weights}\t$items\t$total_weight\n";
}


If you need any other info, I will gladly provide it.

Thank you for your time.
Miso Fapso

Re: big data - slow select (speech search)

From: Michal Fapso
Date:
I forgot to mention one thing. If you want to generate the data using
the Perl script, do this:

   perl create_synthetic_data.pl > synthetic_data.sqlcopy

and then after you create the 'hyps' table, use the COPY command with
the generated file:

   COPY hyps FROM '/the/full/path/synthetic_data.sqlcopy';

Best regards,
Miso Fapso


Re: big data - slow select (speech search)

From: Robert Haas
Date:
On Thu, Jul 1, 2010 at 6:34 PM, Michal Fapso <michal.fapso@gmail.com> wrote:
> It took about 4.5 seconds. If I rerun it, it takes
> less than 2 milliseconds, but that is only because of the cache. I need
> to optimize the first run.
>
> laptop ASUS, CPU dual core T2300 1.66GHz, 1.5G RAM
>
> EXPLAIN ANALYZE SELECT h1.docid
> FROM hyps AS h1
> WHERE h1.wordid=65658;
>
>  Bitmap Heap Scan on hyps h1  (cost=10.97..677.09 rows=171 width=4) (actual time=62.106..4416.864 rows=343 loops=1)
>    Recheck Cond: (wordid = 65658)
>    ->  Bitmap Index Scan on hyps_wordid_index  (cost=0.00..10.92 rows=171 width=0) (actual time=42.969..42.969 rows=343 loops=1)
>          Index Cond: (wordid = 65658)
>  Total runtime: 4432.015 ms
>
> If I run the same query in Lucene search engine, it takes 0.105
> seconds on the same data which is quite a huge difference.

So PostgreSQL is reading 343 rows from disk in 4432 ms, or about 12
ms/row.  I'm not an expert on seek times, but that might not really be
that unreasonable, considering that those rows may be scattered all
over the index and thus it may be basically random I/O.  Have you
tried clustering hyps on hyps_wordid_index?  If you had a more
sophisticated disk subsystem you could try increasing
effective_io_concurrency but that's not going to help with only one
spindle.
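
In concrete terms that would be something like the following (assuming
the btree variant of hyps_wordid_index from your earlier test, since
CLUSTER cannot use a hash index):

CLUSTER hyps USING hyps_wordid_index;
ANALYZE hyps;   -- refresh planner statistics after the physical reorder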

If you run the same query in Lucene and it takes only 0.105 s, then
Lucene is obviously doing a lot less I/O.  I doubt that any amount of
tuning of your existing schema is going to produce that kind of result
on PostgreSQL.  Using the full-text search stuff, or a gin index of
some kind, might get you closer, but it's hard to beat a
special-purpose engine that implements exactly the right algorithm for
your use case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Re: big data - slow select (speech search)

From: Michal Fapso
Date:
Hi Robert,

thank you for your help. I tried clustering the table on
hyps_wordid_index and the query execution time dropped from 4.43 to
0.19 seconds, which is not that far from Lucene's performance of 0.10
seconds.

Thanks a lot!
Miso Fapso

Re: big data - slow select (speech search)

From: Robert Haas
Date:
On Wed, Jul 7, 2010 at 9:31 AM, Michal Fapso <michal.fapso@gmail.com> wrote:
> thank you for your help. I tried clustering the table on
> hyps_wordid_index and the query execution time dropped from 4.43 to
> 0.19 seconds, which is not that far from Lucene's performance of 0.10
> seconds.

Dang.  Nice!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company