big data - slow select (speech search)

From: Michal Fapso
Subject: big data - slow select (speech search)
Date:
Msg-id: AANLkTilAMfcbnEpk_7eAr-ikWzFptS4VtvIT0s5RvWmH@mail.gmail.com
Replies: Re: big data - slow select (speech search)  (Michal Fapso <michal.fapso@gmail.com>)
         Re: big data - slow select (speech search)  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-performance
Hi,

I have quite a simple query but a lot of data, and the SELECT query is
too slow. I would be really grateful for any advice on this.

--------------------------------------------------
The background info:

I work on a speech search engine, which differs from text search in
that there can be several words (hypotheses) at the same position, and
each hypothesis has a weight (probability of occurrence).

When the word 'hello' appears in the document 'lecture_1', there is a
row in the table hyps (see below) containing an array of all positions
of 'hello' in 'lecture_1', along with a weight for each position.
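
For illustration, such a row could look like this (all values here are
made up):

-- Hypothetical row: word 65658 ('hello') occurs in document 1
-- ('lecture_1') at positions 12, 340 and 1877:
INSERT INTO hyps VALUES (1, 65658, '{12,340,1877}', '{0.91,0.55,0.73}', 3, 2.19);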

I need the positions to be able to search for phrases. However, for
this post I simplified the query as much as I could without
significantly changing its speed.
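
To give an idea, a rough sketch of a two-word phrase search using the
positions arrays could look like this (the wordids here are made up,
and my real query is more involved):

-- Sketch: find documents where word 65659 directly follows word 65658
SELECT h1.docid
FROM hyps AS h1
JOIN hyps AS h2 ON h2.docid = h1.docid
WHERE h1.wordid = 65658
  AND h2.wordid = 65659
  AND EXISTS (SELECT 1
              FROM unnest(h1.positions) p1,
                   unnest(h2.positions) p2
              WHERE p2 = p1 + 1);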

I know there is the tsearch feature, which might be more appropriate
for this, but I haven't tried it yet. The size of my data would stay
the same, and size seems to be the issue in my case. But maybe I am
wrong and tsearch would be much faster. What do you think?
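
As a sketch only, I imagine the tsearch variant would look something
like this (as far as I know a tsvector stores word positions but not
per-position probabilities, so the weights would still need a separate
table; names here are made up):

CREATE TABLE docs (docid INT, terms tsvector);
CREATE INDEX docs_terms_index ON docs USING gin (terms);
SELECT docid FROM docs WHERE terms @@ to_tsquery('hello');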

--------------------------------------------------
Preconditions:

First I cleared the disk cache:
  sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

Then I started the PostgreSQL daemon and connected to my database with
the psql client. The first thing I did was execute the SELECT query
described below. It took about 4.5 seconds. If I rerun it, it takes
less than 2 milliseconds, but that is only because of the cache. I
need to optimize the first run.

--------------------------------------------------
Hardware:

ASUS laptop, dual-core T2300 CPU @ 1.66 GHz, 1.5 GB RAM

--------------------------------------------------
Version:

PostgreSQL 8.4.4 on i686-pc-linux-gnu, compiled by GCC gcc (Ubuntu
4.4.1-4ubuntu9) 4.4.1, 32-bit

compiled from source; only the --prefix=... argument was given to ./configure

--------------------------------------------------
Schema:

CREATE TABLE hyps (
    docid INT,          -- document identifier
    wordid INT,         -- word identifier
    positions INT[],    -- all positions of the word in the document
    weights REAL[],     -- weight (probability) for each position
    length INT,         -- number of items in positions/weights
    total_weight REAL   -- sum of the weights
);
COPY hyps FROM '/home/miso/exp/speech_search/postgresql/sqlcopy/all_weights_clustered.sqlcopy';
CREATE INDEX hyps_wordid_index ON hyps USING hash (wordid);
CREATE INDEX hyps_docid_index ON hyps USING hash (docid);

shared_buffers = 300MB    # the only setting I changed in the config

I also tried btree indexes instead of hash and, surprisingly, the
SELECT query was a bit faster; I would have expected hash to be faster.
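
The btree variant I compared against was simply the default index type:

CREATE INDEX hyps_wordid_index ON hyps (wordid);
CREATE INDEX hyps_docid_index ON hyps (docid);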

The index on the 'docid' column is there because I also need to be able
to search within a particular document or a set of documents.
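
For example, a hypothetical lookup restricted to one document (the
docid value is made up):

SELECT h1.docid, h1.positions, h1.weights
FROM hyps AS h1
WHERE h1.wordid = 65658 AND h1.docid = 42;
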
--------------------------------------------------
Table info:

- rows = 5490156
- average length of the positions vectors = 19.5
- total number of items in the positions vectors = 107444304
- positions and weights in a row always have the same number of items,
but the number may differ from row to row
- the table data are loaded only once (using COPY) and are never modified afterwards
- there are 369 distinct docid values and 161460 distinct wordid values
- VACUUM was executed after the COPY of the data
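
The numbers above were gathered with queries along these lines (a
sketch, assuming the 'length' column matches the array sizes):

SELECT count(*)               AS row_count,
       avg(length)            AS avg_positions_length,
       sum(length)            AS total_position_items,
       count(DISTINCT docid)  AS distinct_docids,
       count(DISTINCT wordid) AS distinct_wordids
FROM hyps;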

--------------------------------------------------
Query:

EXPLAIN ANALYZE SELECT h1.docid
FROM hyps AS h1
WHERE h1.wordid=65658;

 Bitmap Heap Scan on hyps h1  (cost=10.97..677.09 rows=171 width=4)
(actual time=62.106..4416.864 rows=343 loops=1)
   Recheck Cond: (wordid = 65658)
   ->  Bitmap Index Scan on hyps_wordid_index  (cost=0.00..10.92
rows=171 width=0) (actual time=42.969..42.969 rows=343 loops=1)
         Index Cond: (wordid = 65658)
 Total runtime: 4432.015 ms

The result has 343 rows, with 9294 items in their positions vectors in total.
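
These counts can be reproduced with a query along these lines (a sketch):

SELECT count(*) AS result_rows, sum(length) AS position_items
FROM hyps
WHERE wordid = 65658;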

--------------------------------------------------
Comparison with Lucene:

If I run the same query in the Lucene search engine, it takes 0.105
seconds on the same data, which is a huge difference.

--------------------------------------------------
Synthetic data set:

If you want to try it yourself, here is a script which generates the
data for the COPY command. I don't know whether it is possible to send
attachments here, so I put the script inline. Just save it as
create_synthetic_data.pl and run it with 'perl
create_synthetic_data.pl'. With this synthetic data the SELECT query
times are around 2.5 seconds. You can try the SELECT query with
'wordid' equal to 1, 2, 3, ... 10000.


#!/usr/bin/perl
# Create synthetic data for PostgreSQL COPY.
use strict;
use warnings;

my $rows   = 5490156;
my $docs   = 369;
my $words  = 161460;   # kept for reference; wordid restarts in each document
my $docid  = 0;
my $wordid = 0;

for (my $row = 0; $row < $rows; $row++) {

    my $sep          = "";
    my $positions    = "";
    my $weights      = "";
    my $total_weight = 0;
    my $items        = int(rand(39)) + 1;   # 1..39 items per row

    # Start a new document every $rows/$docs rows.
    if ($row % int($rows / $docs) == 0) {
        $docid++;
        $wordid = 0;
    }
    $wordid++;

    for (my $i = 0; $i < $items; $i++) {
        my $position   = int(rand(20000));
        my $weight     = rand(1);
        $positions    .= $sep . $position;
        $weights      .= $sep . sprintf("%.3f", $weight);
        $total_weight += $weight;
        $sep           = ",";
    }
    # One COPY row: docid, wordid, positions array, weights array, length, total_weight
    print "$docid\t$wordid\t{$positions}\t{$weights}\t$items\t$total_weight\n";
}
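
After redirecting the script's output to a file, the data can be
loaded with the same COPY command as above (the path here is just an
example):

COPY hyps FROM '/tmp/synthetic_data.sqlcopy';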


If you need any other info, I will gladly provide it.

Thank you for your time.
Miso Fapso
