Discussion: about multiprocessing mass data
Hi list!

I have a table with 8,500,000 rows. I wrote a Java program to update these records.
It uses 100 threads to update the records. For example, thread-1 updates records 1~85000, thread-2 updates records 85001~170000, and so on.
The update SQL simply removes the spaces in one column: update poi set py=replace(py,' ','') where id=?;
From the program log I can see the database is processing the data very slowly: each thread takes about 260 s to update 1000 rows.
BTW: The PG server is running on a PC. The PC has 2 GB of memory in total and the CPU is an "Intel(R) Core(TM)2 Duo E7500 2.93GHz".
While the program is running, CPU usage is only about 1% and 112 MB of memory is free.
Is the low PC configuration causing the problem?
Please help ~~
Thanks for any tips,
superman0920
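The per-thread id ranges described above can be sketched as follows (a hypothetical helper for illustration, not the OP's actual Java code):

```python
def split_ranges(total_rows, num_threads):
    """Split ids 1..total_rows into num_threads contiguous (start, end) ranges,
    one per worker thread, as the OP describes."""
    chunk = total_rows // num_threads
    ranges = []
    for i in range(num_threads):
        start = i * chunk + 1
        # the last range absorbs any remainder
        end = (i + 1) * chunk if i < num_threads - 1 else total_rows
        ranges.append((start, end))
    return ranges

# thread-1 gets 1..85000, thread-2 gets 85001..170000, and so on
ranges = split_ranges(8500000, 100)
```

Note that splitting by id like this still issues one `WHERE id=?` update per row, which is the part the replies below take issue with.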
On 4.4.2012 17:52, superman0920 wrote:
> Hi list!
>
> i have a table which has 8500000 rows records. i write a java program to
> update these records.
> i use 100 threads to update the records. For example, thread-1 update
> 1~85000 records; thread-2 update 85001~170000 and so on.
> The update sql's aim is remove the space in the column and it is simple:
> update poi set py=replace(py,' ','') where id=?;

That's a very naive approach. It's very likely each thread will do an
index scan for each update (to evaluate the 'id=?' condition), and that's
going to cost you much more than you gain, because index scans are quite
CPU and I/O intensive. Simply update the whole table:

UPDATE poi SET py = replace(py, ' ','');

Have you actually tried how this performs, or did you guess "it's
definitely going to be very slow so I'll use multiple threads to make it
faster"?

If you really need to parallelize this, you need to do it differently -
e.g. use 'ctid' to update a whole page at a time, like this:

UPDATE poi SET py = replace(py, ' ','')
 WHERE ctid >= '(n,0)'::tid AND ctid < '(n+1,0)'::tid;

where 'n' ranges between 0 and the number of pages in the table (see
relpages in pg_class).

But try the simple UPDATE first - my guess is it's going to be much
faster than you expect.

Tomas
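The ctid-range idea above can be sketched with a small generator of the per-batch statements (hypothetical helper and parameter names; pages_per_batch is an assumption, since updating literally one page per statement would mean millions of statements):

```python
def ctid_range_updates(table, set_clause, num_pages, pages_per_batch=1000):
    """Generate UPDATE statements that cover the table in ctid (heap page)
    ranges, as suggested above. num_pages comes from pg_class.relpages."""
    stmts = []
    for lo in range(0, num_pages, pages_per_batch):
        hi = min(lo + pages_per_batch, num_pages)
        stmts.append(
            f"UPDATE {table} SET {set_clause} "
            f"WHERE ctid >= '({lo},0)'::tid AND ctid < '({hi},0)'::tid;"
        )
    return stmts

stmts = ctid_range_updates("poi", "py = replace(py, ' ','')", 2500, 1000)
# three statements covering pages 0-999, 1000-1999 and 2000-2499
```

Each worker could then execute a disjoint subset of these statements, so no two workers touch the same heap page.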
On 4.4.2012 18:49, superman0920 wrote:
> Thank you for your reply
> I tried executing "UPDATE poi SET py = replace(py, ' ','');", that took
> long long time(about 20+ hours) and no error report. Just like locked.

OK, that's weird. So we need a bit more details - what PostgreSQL version
is this? How much space does the table actually occupy? Try this:

SELECT relname, relpages, reltuples FROM pg_class WHERE relname = 'poi';

And finally we need EXPLAIN output for both UPDATE commands. Don't post
them here directly - put them on explain.depesz.com and post just the
link.

Further, we need to see the actual table definition - especially whether
there are any triggers or foreign keys on the table.

Tomas
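The relpages value above translates directly into an on-disk size estimate, since PostgreSQL heap pages are 8 kB by default (a rough sketch; helper names are made up):

```python
def table_size_bytes(relpages, block_size=8192):
    """Estimate on-disk table size from pg_class.relpages
    (PostgreSQL's default block size is 8 kB)."""
    return relpages * block_size

def pretty_mb(nbytes):
    return f"{nbytes / (1024 * 1024):.0f} MB"

# e.g. a table occupying 166656 pages is about 1302 MB
size = pretty_mb(table_size_bytes(166656))
```

This is only an estimate: relpages is updated by VACUUM/ANALYZE, so it can lag behind the real file size.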
On 5.4.2012 15:44, superman0920 wrote:
> Sure, i will post that at tomorrow.
>
> Today I install PG and MySQL at a Server. I insert 850000 rows record
> to each db.
> I execute "select count(*) from poi_all_new" at two db.
> MySQL takes 0.9s
> PG takes 364s

First of all, keep the list (pgsql-performance@postgresql.org) on the CC.
You keep responding to me directly, so others can't respond to your
messages (and help you).

Are you sure the comparison was fair, i.e. both machines containing the
same amount of data (not number of rows - amount of data), configured
properly etc.? Have you used the same table structure (how did you
represent the geometry data type in MySQL)?

For example, I bet you're using MyISAM. In that case it's comparing
apples to oranges (or maybe cats, so different it is). MyISAM does not do
any MVCC stuff (visibility checking, ...) and simply reads the number of
rows from a catalogue. PostgreSQL actually has to scan the whole table -
that's a big difference. This is probably the only place where MySQL
(with MyISAM) beats PostgreSQL. But once you switch to a proper storage
manager (e.g. InnoDB) it'll have to scan the data just like PostgreSQL -
try that.

Anyway, this benchmark is rubbish, because you're not going to run this
query often - use queries that actually make sense for the application.

Nevertheless, it seems there's something seriously wrong with your
machine or the environment (OS), probably I/O.

I've done a quick test - I've created the table (without the 'geometry'
column, because I don't have PostGIS installed), filled it with one
million rows and executed 'select count(*)'. See this:

http://pastebin.com/42cAcCqu

This is what I get:

======================================================================
test=# SELECT pg_size_pretty(pg_relation_size('test_table'));
 pg_size_pretty
----------------
 1302 MB
(1 row)

test=# \timing on
Timing is on.
test=# SELECT count(*) from test_table;
  count
---------
 1000000
(1 row)

Time: 2026,695 ms
======================================================================

so it's running the 'count(*)' in two seconds. If I run it again, I get
this:

======================================================================
test=# SELECT count(*) from test_table;
  count
---------
 1000000
(1 row)

Time: 270,020 ms
======================================================================

Yes, that's 0.27 seconds. And this is *only* my workstation - Core i5
(4 cores), 8GB of RAM, nothing special.

These results obviously depend on the data being available in the page
cache. If that's not the case, PostgreSQL needs to read it from the drive
(and then it's basically I/O bound) - I can get about 250 MB/s from my
drives, so I get this:

======================================================================
test=# SELECT count(*) from test_table;
  count
---------
 1000000
(1 row)

Time: 5088,739 ms
======================================================================

If you have slower drives, the dependency is about linear (half the
speed -> twice the time). So either your drives are very slow, or there's
something rotten. I still haven't seen iostat / vmstat output ... that'd
tell us much more about the causes.

Tomas
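The cold-cache number above can be sanity-checked with trivial arithmetic (a sketch; the figures come from the psql transcript above):

```python
def scan_mb_per_sec(size_mb, elapsed_ms):
    """Effective sequential-scan throughput implied by a full-table count(*):
    table size divided by elapsed wall-clock time."""
    return size_mb / (elapsed_ms / 1000.0)

# 1302 MB read in 5088.739 ms works out to roughly 256 MB/s,
# consistent with the "about 250 MB/s from my drives" claim
rate = scan_mb_per_sec(1302, 5088.739)
```

The same arithmetic run against the OP's 364 s result is what makes the thread suspect an I/O problem rather than PostgreSQL itself.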
Tomas Vondra <tv@fuzzy.cz> wrote:
> On 5.4.2012 15:44, superman0920 wrote:
>> Today I install PG and MySQL at a Server. I insert 850000 rows
>> record to each db.
>> I execute "select count(*) from poi_all_new" at two db.
>> MySQL takes 0.9s
>> PG takes 364s

> Are you sure the comparison was fair, i.e. both machines
> containing the same amount of data (not number of rows, amount of
> data), configured properly etc.?

Don't forget the "hint bits" issue -- if the count(*) was run immediately
after the load (without a chance for autovacuum to get in there), all the
data was re-written in place to save hint information. I remember how
confusing that was for me the first time I saw it.

It's very easy to get a false impression of overall PostgreSQL
performance from that type of test, and it's the sort of test a lot of
people will do on an ad hoc basis.

-Kevin
On Thu, Apr 5, 2012 at 9:47 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
> On 5.4.2012 15:44, superman0920 wrote:
>> Today I install PG and MySQL at a Server. I insert 850000 rows record
>> to each db.
>> I execute "select count(*) from poi_all_new" at two db.
>> MySQL takes 0.9s
>> PG takes 364s
>
> [...]
>
> If you have slower drives, the dependency is about linear (half the
> speed -> twice the time). So either your drives are very slow, or
> there's something rotten.
>
> I still haven't seen iostat / vmstat output ... that'd tell us much
> more about the causes.

A geometry column can potentially be quite wide. One thing we need to
see is whether the table has any indexes -- in particular GiST/GIN on
the geometry.

merlin
On 10.4.2012 00:37, Merlin Moncure wrote:
> On Thu, Apr 5, 2012 at 9:47 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> If you have slower drives, the dependency is about linear (half the
>> speed -> twice the time). So either your drives are very slow, or
>> there's something rotten.
>>
>> I still haven't seen iostat / vmstat output ... that'd tell us much
>> more about the causes.
>
> geometry column can potentially quite wide. one thing we need to see
> is the table has any indexes -- in particular gist/gin on the
> geometry.

Yeah, but in one of the previous posts the OP posted this:

   relname   | relpages |  reltuples
-------------+----------+-------------
 poi_all_new |  2421133 | 6.53328e+06

which means the table takes ~19 GB for 6.5 million rows, i.e. ~2.8 GB per
1 million rows, or ~3 kB per row. I've been working with 1 million rows
and 1.3 GB of data, so that's about 50% of the expected amount.

But this does not explain why the SELECT COUNT(*) takes 364 seconds on
that machine. That'd mean ~8 MB/s.

Regarding the indexes, the OP already posted a description of the table,
and apparently there are these indexes:

Indexes:
    "poi_all_new_pk" PRIMARY KEY, btree (ogc_fid)
    "poi_all_new_flname_idx" btree (flname)
    "poi_all_new_geom_idx" btree (wkb_geometry)
    "poi_all_new_ogc_fid_idx" btree (ogc_fid)
    "poi_all_new_pinyin_idx" btree (pinyin)

So none of them is GIN/GiST, although one of them is on the geometry
column.

T.
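The ~3 kB/row and ~8 MB/s figures above follow from simple arithmetic (a sketch assuming 8 kB pages, and assuming the 364 s scan was over the slow server's 850,000-row copy rather than the full table):

```python
BLOCK = 8192  # default PostgreSQL page size in bytes

relpages, reltuples = 2421133, 6.53328e+06
table_bytes = relpages * BLOCK              # ~19.8e9 bytes, i.e. ~19 GB
bytes_per_row = table_bytes / reltuples     # ~3 kB per row

# The slow server held 850000 rows; at ~3 kB/row that's ~2.6e9 bytes,
# and reading that in 364 s gives only a handful of MB/s - the same
# order as the "~8 MB/s" estimate above.
est_bytes = 850000 * bytes_per_row
mb_per_sec = est_bytes / 364 / (1024 * 1024)
```

Either way the implied throughput is far below what even a single commodity disk should deliver, which supports the "something is wrong with I/O" conclusion.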
On Mon, Apr 9, 2012 at 6:50 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> On 10.4.2012 00:37, Merlin Moncure wrote:
>> geometry column can potentially quite wide. one thing we need to see
>> is the table has any indexes -- in particular gist/gin on the
>> geometry.
>
> Yeah, but in one of the previous posts the OP posted this:
>
>    relname   | relpages |  reltuples
> -------------+----------+-------------
>  poi_all_new |  2421133 | 6.53328e+06
>
> which means the table takes ~19 GB for 6.5 million rows, i.e. ~3 kB
> per row.
>
> [...]
>
> Regarding the indexes, the OP already posted a description of the
> table, and apparently there are these indexes:
>
> Indexes:
>     "poi_all_new_pk" PRIMARY KEY, btree (ogc_fid)
>     "poi_all_new_flname_idx" btree (flname)
>     "poi_all_new_geom_idx" btree (wkb_geometry)
>     "poi_all_new_ogc_fid_idx" btree (ogc_fid)
>     "poi_all_new_pinyin_idx" btree (pinyin)
>
> So none of them is GIN/GiST, although one of them is on the geometry
> column.

hm. well, there's a duplicate index in there: ogc_fid is indexed twice.
How much bloat is on the table (let's see an ANALYZE VERBOSE)? What's the
storage for this database?

merlin
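The duplicate-index observation above can be generalized: any two indexes covering the identical column list are redundant. A minimal sketch (hypothetical helper; real checks would query pg_index instead of a hard-coded list):

```python
from collections import defaultdict

def find_duplicate_indexes(indexes):
    """indexes: list of (index_name, (column, ...)) pairs.
    Returns a mapping of column-tuples that are indexed more than once
    to the names of the redundant indexes."""
    by_cols = defaultdict(list)
    for name, cols in indexes:
        by_cols[tuple(cols)].append(name)
    return {cols: names for cols, names in by_cols.items() if len(names) > 1}

# The index list posted in the thread:
idx = [
    ("poi_all_new_pk", ("ogc_fid",)),
    ("poi_all_new_flname_idx", ("flname",)),
    ("poi_all_new_geom_idx", ("wkb_geometry",)),
    ("poi_all_new_ogc_fid_idx", ("ogc_fid",)),
    ("poi_all_new_pinyin_idx", ("pinyin",)),
]
dups = find_duplicate_indexes(idx)
```

Dropping the redundant ogc_fid index would save disk space and, more importantly for this thread, one extra index maintenance on every UPDATE.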