Question about caching on full table scans

From: Markus Innerebner
Date:
Hello PG Performance group,

I am doing some runtime experiments in my implementation, which computes multi-modal range queries for a query
point (if you want to know the details, check the website: www.isochrones.inf.unibz.it).
The network is explored using Dijkstra's shortest path algorithm, which starts from the query point and expands
all connected vertices until the time limit is exceeded.
The network is stored in a Postgres (PostGIS) database consisting of vertices and edges.
relation: edge(id int, source int, target int, length double, segment geometry,..)

I have different implementations of how the network is loaded into main memory:
approach 1: initially loads the entire edge table (full table scan) into main memory and then starts to expand the
network and do some computation.
approach 2: loads only the adjacent edges of the currently expanded vertex
approach 3: loads chunks using the Euclidean distance upper bound

I have two datasets: 6,000 tuples (small) and 4,000,000 tuples (large)

I repeat each experiment at least 10 times.
When observing the runtimes I noticed the following:
- in the first iteration approach 1 takes a long time, and its runtime improves over subsequent iterations:
    e.g. with large dataset
    - iteration 1:   60.0s
    - iteration 2:   40.7s
    - iteration 3:   40,s
    - iteration 4:   39.7s
    - iteration 5:   39.5s
    - iteration 6:   39.3s
    - iteration 7:   40.0s
    - iteration 8:   34.8s
    - iteration 9:   39.1s
    - iteration 10: 38.0s

In the other approaches I do not see such a big difference.

I know that Postgres (and the OS) is caching that dataset. But is there a way to force the database to remove those
values from the cache?
I also tried to perform, after each iteration, a scan on a dummy table (executing it at least 10 times to force the
optimizer to keep that dummy data in main memory).
But I do not see any difference.

I think the comparison is not really fair if caching gives the main-memory approach that big an advantage.

What can you experts suggest?

Cheers Markus


****************************
My environment is:

OS: Linux (Ubuntu)

CPU: dual core
model name      : Intel(R) Xeon(R) CPU E7- 2850  @ 2.00GHz
stepping        : 1
cpu MHz         : 1997.386
cache size      : 24576 KB

RAM: 5GB

PostgreSQL version 8.4, settings:

shared_buffers = 650MB
work_mem = 512MB
maintenance_work_mem = 256MB
effective_cache_size = 500MB

--
Ph D. Student Markus Innerebner

DIS Research Group - Faculty of Computer Science
Free University Bozen-Bolzano

Dominikanerplatz 3 - Room 211
I - 39100 Bozen
Phone:  +39-0471-016143
Mobile: +39-333-9392929




Re: Question about caching on full table scans

From: "Albe Laurenz"
Date:
Markus Innerebner wrote:
> I am doing some runtime experiments in my implementation, which is
> computing multi-modal range queries for a query point (if you want to
> know details check the website: www.isochrones.inf.unibz.it).
> [...]
> I know that postgres (and OS) is caching that dataset. But is there a
> way to force the database to remove that values from the cache?
> [...]
> I thing the comparison is not right fair, if the caching in the main
> memory approach brings that big advantage.
>
> What can you as experts suggest me?

In your approaches 1 to 3, what do you mean by "load into main memory"?
Do you
a) make sure that the data you talk about are in the PostgreSQL buffer cache, or
b) retrieve the data from PostgreSQL and store it somewhere in your application?

To clear PostgreSQL's cache, restart the server.
That should be a fast operation.
Since version 8.3, PostgreSQL is smart enough not to evict the
whole cache for a large sequential scan.

To flush the filesystem cache (from Linux 2.6.16 on), use
sync; echo 3 > /proc/sys/vm/drop_caches

Yours,
Laurenz Albe


Re: Question about caching on full table scans

From: Markus Innerebner
Date:
Hi Laurenz,

> In your approach 1 to 3, what do you mean with "load into main memory"?

I forgot to say: I use Java and connect with JDBC.

In approach 1 I do an initial loading of the entire relation, by executing one SQL query to load all edges into main memory, where I create my main-memory structure as an adjacency list.

> Do you
> a) make sure that the data you talk about are in the PostgreSQL buffer
> cache
> or
> b) retrieve the data from PostgreSQL and store it somewhere in your
> application?

In approach 1 I do (b), as described before.

But after each experiment I start a new Java process.
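A minimal sketch of that approach-1 loading step. The table and column names (edge, source, target, length) come from the thread; the class and method names, and the separation into a pure construction step, are my own assumptions for illustration:

```java
import java.sql.*;
import java.util.*;

public class EdgeLoader {
    // One outgoing edge in the adjacency list.
    public record Edge(int target, double length) {}

    // Pure construction step, kept separate from JDBC so it can be
    // exercised without a database: each row is a (source, target, length) triple.
    public static Map<Integer, List<Edge>> buildAdjacency(List<Object[]> rows) {
        Map<Integer, List<Edge>> adj = new HashMap<>();
        for (Object[] r : rows) {
            adj.computeIfAbsent((Integer) r[0], k -> new ArrayList<>())
               .add(new Edge((Integer) r[1], (Double) r[2]));
        }
        return adj;
    }

    // Approach 1: a single full-table scan, materialized in the application.
    public static Map<Integer, List<Edge>> load(Connection conn) throws SQLException {
        List<Object[]> rows = new ArrayList<>();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT source, target, length FROM edge")) {
            while (rs.next()) {
                rows.add(new Object[]{rs.getInt(1), rs.getInt(2), rs.getDouble(3)});
            }
        }
        return buildAdjacency(rows);
    }
}
```

Because the whole relation is pulled through one sequential scan, the first run pays the full disk-read cost and later runs hit the OS page cache, which is exactly the effect seen in the timings above.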



> To clear PostgreSQL's cache, restart the server.
> That should be a fast operation.
> Since version 8.3, PostgreSQL is smart enough not to evict the
> whole cache for a large sequential scan.
>
> To flush the filesystem cache (from Linux 2.6.16 on), use
> sync; echo 3 > /proc/sys/vm/drop_caches

I started to do that, and yes, this solves my problem!!

I assume that deleting the file system cache implies that the Postgres cache is also deleted, isn't it?

So I will invoke this command after each experiment.

Thanks a lot!!

Markus

Re: Question about caching on full table scans

From: Jeff Janes
Date:
On Thu, Aug 30, 2012 at 10:34 AM, Markus Innerebner
<markus.innerebner@inf.unibz.it> wrote:
>
> > To flush the filesystem cache (from Linux 2.6.16 on), use
> > sync; echo 3 > /proc/sys/vm/drop_caches
>
>
> I started to do that , and
> yes, this solves my problem!!
>
> I assume that deleting file system cache implies that also postgres cache is
> deleted, isn't it ?


No, the postgres-managed cache is not cleared by doing that.  In order
to get rid of both layers of caching, you should restart the postgres
server and then do the drop_caches.
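Since each experiment already starts a fresh Java process, both reset steps can be driven from the benchmark harness itself. A sketch in the thread's implementation language; the pg_ctl invocation, the data directory path, and the use of passwordless sudo are assumptions about the local setup, not something stated in the thread:

```java
import java.io.IOException;
import java.util.List;

public class CacheReset {
    // Commands to clear both caching layers before a benchmark iteration.
    // Per Jeff's point: drop_caches alone leaves PostgreSQL's shared_buffers
    // intact, so the server must be restarted as well.
    public static List<List<String>> resetCommands(String pgDataDir) {
        return List.of(
            // 1. Restart the server to clear shared_buffers.
            List.of("pg_ctl", "-D", pgDataDir, "restart", "-m", "fast"),
            // 2. Flush dirty pages, then drop the OS page cache
            //    (needs root; Linux >= 2.6.16).
            List.of("sudo", "sh", "-c", "sync; echo 3 > /proc/sys/vm/drop_caches"));
    }

    public static void reset(String pgDataDir) throws IOException, InterruptedException {
        for (List<String> cmd : resetCommands(pgDataDir)) {
            int rc = new ProcessBuilder(cmd).inheritIO().start().waitFor();
            if (rc != 0) throw new IOException("command failed: " + cmd);
        }
    }
}
```

Calling `reset(...)` before each iteration makes every run a cold-cache run, which removes the advantage approach 1 gets from repeated measurements.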

Cheers,

Jeff


Re: Question about caching on full table scans

From: Scott Marlowe
Date:
On Thu, Aug 30, 2012 at 11:34 AM, Markus Innerebner
<markus.innerebner@inf.unibz.it> wrote:
> [...]
>> To flush the filesystem cache (from Linux 2.6.16 on), use
>> sync; echo 3 > /proc/sys/vm/drop_caches
>
> I started to do that , and
> yes, this solves my problem!!
>
> I assume that deleting file system cache implies that also postgres cache is
> deleted, isn't it ?

NO.  PostgreSQL maintains its own cache.  To flush it you need to
restart PostgreSQL.  However, in a previous post you stated this:

> I know that postgres (and OS) is caching that dataset. But is there a way to force the database
> to remove that values from the cache?

It is NOT guaranteed that PostgreSQL will be caching your data after a
full table scan.  To keep from blowing out the shared buffers it uses
as a cache, PostgreSQL uses a ring buffer for sequential scans, so it
is quite likely that a sequential scan leaves your data uncached.


Re: Question about caching on full table scans

From: Markus Innerebner
Date:
Thanks a lot for your feedback.
It helped me a lot and gave me a better overview, with very specific hints that I wasn't able to find in any
documentation.

Cheers Markus