Discussion: Question about caching on full table scans
Hello PG Performance group,

I am doing some runtime experiments in my implementation, which computes multi-modal range queries for a query point (if you want to know the details, check the website: www.isochrones.inf.unibz.it). The network is explored using Dijkstra's shortest path algorithm, which starts from the query point and expands all connected vertices until the time budget is exceeded. The network is stored in a Postgres (PostGIS) database consisting of vertices and edges:

relation: edge(id int, source int, target int, length double, segment geometry, ...)

I have different implementations of how the network is loaded into main memory:

- approach 1: initially loads the entire edge table (full table scan) into main memory and then starts to expand the network and do the computation.
- approach 2: loads only the adjacent edges of the currently expanded vertex.
- approach 3: loads chunks using the Euclidean distance upper bound.

I have different datasets: 6,000 tuples (small) and 4,000,000 tuples (large).

I repeat each experiment at least 10 times. When observing the runtime I noticed the following: in the first iteration approach 1 takes a long time, and its runtime improves after each iteration. E.g. with the large dataset:

- iteration 1: 60.0s
- iteration 2: 40.7s
- iteration 3: 40,s
- iteration 4: 39.7s
- iteration 5: 39.5s
- iteration 6: 39.3s
- iteration 7: 40.0s
- iteration 8: 34.8s
- iteration 9: 39.1s
- iteration 10: 38.0s

In the other approaches I do not see that big a difference.

I know that Postgres (and the OS) is caching that dataset. But is there a way to force the database to remove those values from the cache? I also tried to perform a scan on a dummy table after each iteration (executing it at least 10 times to force the optimizer to keep that dummy data in main memory), but I do not see any difference.

I think the comparison is not really fair if the caching gives the main-memory approach that big an advantage.

What can you as experts suggest to me?
Cheers,
Markus

****************************

My environment is:

OS: Linux (Ubuntu)
CPU: dual-core Intel(R) Xeon(R) CPU E7-2850 @ 2.00GHz (stepping 1, 1997.386 MHz, 24576 KB cache)
RAM: 5GB

Postgres settings (version 8.4):
shared_buffers = 650MB
work_mem = 512MB
maintenance_work_mem = 256MB
effective_cache_size = 500MB

--
Ph.D. Student Markus Innerebner
DIS Research Group - Faculty of Computer Science
Free University Bozen-Bolzano
Dominikanerplatz 3 - Room 211, I-39100 Bozen
Phone: +39-0471-016143  Mobile: +39-333-9392929
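For readers following along, the time-bounded expansion Markus describes can be sketched in Java roughly as below. This is an illustrative, simplified single-mode version with integer travel times, not the project's actual code; the class and method names are made up for the sketch.

```java
import java.util.*;

public class IsochroneSketch {
    // adjacency list: source vertex -> list of (target, travelTime) pairs
    static final Map<Integer, List<int[]>> graph = new HashMap<>();

    static void addEdge(int source, int target, int travelTime) {
        graph.computeIfAbsent(source, k -> new ArrayList<>())
             .add(new int[]{target, travelTime});
    }

    // Dijkstra from the query point; expansion stops once the accumulated
    // travel time exceeds maxTime. Returns reachable vertices with their
    // arrival times (the isochrone).
    static Map<Integer, Integer> isochrone(int queryPoint, int maxTime) {
        Map<Integer, Integer> arrival = new HashMap<>();
        PriorityQueue<int[]> queue =                        // entries: (vertex, time)
            new PriorityQueue<>(Comparator.comparingInt(a -> a[1]));
        queue.add(new int[]{queryPoint, 0});
        while (!queue.isEmpty()) {
            int[] cur = queue.poll();
            int v = cur[0], t = cur[1];
            if (t > maxTime || arrival.containsKey(v)) continue;
            arrival.put(v, t);
            for (int[] e : graph.getOrDefault(v, Collections.emptyList())) {
                queue.add(new int[]{e[0], t + e[1]});
            }
        }
        return arrival;
    }

    public static void main(String[] args) {
        addEdge(1, 2, 10);
        addEdge(2, 3, 10);
        addEdge(1, 4, 50);
        System.out.println(isochrone(1, 25)); // vertices reachable within 25 time units
    }
}
```

The key point for the benchmark discussion below is that only the graph traversal differs between the approaches; what varies is how the adjacency data gets from the database into memory.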
Markus Innerebner wrote:
> I am doing some runtime experiments in my implementation, which computes
> multi-modal range queries for a query point.
> [...]
> I know that Postgres (and the OS) is caching that dataset. But is there a way
> to force the database to remove those values from the cache?
> I also tried to perform a scan on a dummy table after each iteration
> (executing it at least 10 times to force the optimizer to keep that dummy
> data in main memory), but I do not see any difference.
>
> I think the comparison is not really fair if the caching gives the
> main-memory approach that big an advantage.
>
> What can you as experts suggest to me?

In your approaches 1 to 3, what do you mean by "load into main memory"?
Do you
a) make sure that the data you talk about are in the PostgreSQL buffer cache
or
b) retrieve the data from PostgreSQL and store it somewhere in your application?

To clear PostgreSQL's cache, restart the server. That should be a fast operation.
Since version 8.3, PostgreSQL is smart enough not to evict the whole cache for a
large sequential scan.

To flush the filesystem cache (from Linux 2.6.16 on), use

    sync; echo 3 > /proc/sys/vm/drop_caches

Yours,
Laurenz Albe
Hi Laurenz,

> In your approaches 1 to 3, what do you mean by "load into main memory"?

I forgot to say: I use Java and connect with JDBC.

In approach 1 I do an initial load of the entire relation by executing one SQL query to bring all edges into main memory, where I create my main-memory structure as an adjacency list.
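That initial load might look roughly like the sketch below. This is a hedged illustration, not Markus's actual code: the column subset and fetch size are choices made here, and note that with the PostgreSQL JDBC driver, setFetchSize only streams rows when autocommit is off.

```java
import java.sql.*;
import java.util.*;

public class EdgeLoader {
    // source vertex -> list of (target, length) pairs
    final Map<Integer, List<double[]>> adjacency = new HashMap<>();

    void addRow(int source, int target, double length) {
        adjacency.computeIfAbsent(source, k -> new ArrayList<>())
                 .add(new double[]{target, length});
    }

    // Approach 1: one full-table scan, materialized client-side.
    void loadAll(Connection conn) throws SQLException {
        conn.setAutoCommit(false);       // required for setFetchSize to stream
        String sql = "SELECT source, target, length FROM edge";
        try (Statement st = conn.createStatement()) {
            st.setFetchSize(10000);      // avoid buffering all 4M rows at once
            try (ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) {
                    addRow(rs.getInt(1), rs.getInt(2), rs.getDouble(3));
                }
            }
        }
    }
}
```

With this structure, the first iteration pays for reading the whole table from disk, while later iterations read it from cache, which is exactly the asymmetry visible in the timings above.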
> Do you
> a) make sure that the data you talk about are in the PostgreSQL buffer cache
> or
> b) retrieve the data from PostgreSQL and store it somewhere in your application?

In approach 1 I do that, as described before. But after each experiment I start a new Java process.
> To clear PostgreSQL's cache, restart the server.
> That should be a fast operation.
> Since version 8.3, PostgreSQL is smart enough not to evict the
> whole cache for a large sequential scan.
>
> To flush the filesystem cache (from Linux 2.6.16 on), use
>
>     sync; echo 3 > /proc/sys/vm/drop_caches

I started to do that, and yes, this solves my problem!

I assume that deleting the file system cache implies that the Postgres cache is also deleted, isn't it?

So I will invoke this command after each experiment.

Thanks a lot!
Markus
On Thu, Aug 30, 2012 at 10:34 AM, Markus Innerebner <markus.innerebner@inf.unibz.it> wrote:
>> To flush the filesystem cache (from Linux 2.6.16 on), use
>>
>>     sync; echo 3 > /proc/sys/vm/drop_caches
>
> I started to do that, and yes, this solves my problem!
>
> I assume that deleting the file system cache implies that the Postgres cache
> is also deleted, isn't it?

No, the postgres-managed cache is not cleared by doing that. In order to get rid of both layers of caching, you should restart the postgres server and then do the drop_caches.

Cheers,
Jeff
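Since the experiments are driven from Java, both reset steps can be scripted from the benchmark harness between iterations. A sketch, with loud caveats: the pg_ctl and data-directory paths below are assumptions for a typical Ubuntu install of 8.4 and must be adjusted, and both the restart and drop_caches require appropriate (root) privileges.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class CacheReset {
    // Assumed paths -- adjust for your installation.
    static final String PG_CTL  = "/usr/lib/postgresql/8.4/bin/pg_ctl";
    static final String PG_DATA = "/var/lib/postgresql/8.4/main";

    // The two cache layers, cleared in order: PostgreSQL's shared buffers
    // (by restarting the server), then the OS page cache.
    static List<List<String>> resetCommands() {
        return Arrays.asList(
            Arrays.asList(PG_CTL, "-D", PG_DATA, "restart", "-m", "fast"),
            Arrays.asList("sync"),                                    // flush dirty pages first
            Arrays.asList("sh", "-c", "echo 3 > /proc/sys/vm/drop_caches"));
    }

    // Run the commands between benchmark iterations.
    static void reset() throws IOException, InterruptedException {
        for (List<String> cmd : resetCommands()) {
            new ProcessBuilder(cmd).inheritIO().start().waitFor();
        }
    }
}
```

Calling CacheReset.reset() before each iteration should make every run a cold-cache run, which is the fair baseline Markus is after.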
On Thu, Aug 30, 2012 at 11:34 AM, Markus Innerebner <markus.innerebner@inf.unibz.it> wrote:
> [...]
> I assume that deleting the file system cache implies that the Postgres cache
> is also deleted, isn't it?

NO. PostgreSQL maintains its own cache. To flush it you need to restart postgresql.

However, in a previous post you stated this:

> I know that Postgres (and the OS) is caching that dataset. But is there a way
> to force the database to remove those values from the cache?

It is NOT guaranteed that PostgreSQL will be caching your data in a full table scan. To keep from blowing out the shared buffers it uses as a cache, PostgreSQL uses a small ring buffer for sequential scans, so it is quite likely that on a sequential scan PostgreSQL is not caching your data.
Thanks a lot for your feedback. It helped me a lot, and I now have a much better overview, with very specific hints that I wasn't able to find in any documentation.

Cheers,
Markus