Обсуждение: POSTGRES DB 3 800 000 rows table, speed up?
Hello! I've managed to import into postgre DB 3 800 000 rows of data (500 MB pure CSV ~ 2 GB SQL DB) It looks like this "69110784","69111807","US","UNITED STATES","ILLINOIS","BLOOMINGTON","40.4758","-88.9894","61701","LEVEL 3 COMMUNICATIONS INC","DSL-VERIZON.NET" "69111808","69112831","US","UNITED STATES","TEXAS","GRAPEVINE","32.9309","-97.0755","76051","LEVEL 3 COMMUNICATIONS INC","DSL-VERIZON.NET" "69112832","69113087","US","UNITED STATES","TEXAS","DENTON","33.2108","-97.1231","76201","LEVEL 3 COMMUNICATIONS INC","DSL-VERIZON.NET" CREATE TABLE ipdb2 ( ipFROM int4 NOT NULL, ipTO int4 NOT NULL , countrySHORT CHARACTER(2) NOT NULL, countryLONG VARCHAR(64) NOT NULL, ipREGION VARCHAR(128) NOT NULL, ipCITY VARCHAR(128) NOT NULL, ipLATITUDE DOUBLE PRECISION, ipLONGITUDE DOUBLE PRECISION, ipZIPCODE VARCHAR(5), ipISP VARCHAR(255) NOT NULL, ipDOMAIN VARCHAR(128) NOT NULL ); I've indexed first two columns they are IPfrom, IPto also table is btree version of postgre is 7.4.8, on hosting I ask db like this SELECT * FROM ipdb2 WHERE '3229285376' BETWEEN ipfrom AND ipto; and get answer after 3-10 seconds, is there a way to speed it up somehow? any tweaks and tuneups possible with it? thanks! ---------------- eugene
On Tue, Dec 27, 2005 at 11:25:37PM +0200, Eugene wrote: > I've indexed first two columns they are IPfrom, IPto also table is btree > version of postgre is 7.4.8, on hosting You should ask them to upgrade; 7.4 is getting pretty old. > I ask db like this SELECT * FROM ipdb2 WHERE '3229285376' BETWEEN ipfrom > AND ipto; I'm pretty sure PostgreSQL won't be able to use any indexes for this (EXPLAIN ANALYZE would verify that). Instead, expand the between out: WHERE ipfrom >= '...' AND ipto <= '...' Also, generally speaking, databases and CamelCase don't mix too well; you'll probably be happier doing something like ip_from and ip_to. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
I have a question about this, shouldn't it be the query should look like SELECT * FROM ipdb2 WHERE 3229285376 BETWEEN ipfrom AND ipto Note the query doesn't quote the filter, since the ipfrom and ipto are both integer types? -----Original Message----- From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Jim C. Nasby Sent: Wednesday, December 28, 2005 11:33 AM To: Eugene Cc: pgsql-general@postgresql.org Subject: Re: [GENERAL] POSTGRES DB 3 800 000 rows table, speed up? On Tue, Dec 27, 2005 at 11:25:37PM +0200, Eugene wrote: > I've indexed first two columns they are IPfrom, IPto also table is > btree version of postgre is 7.4.8, on hosting You should ask them to upgrade; 7.4 is getting pretty old. > I ask db like this SELECT * FROM ipdb2 WHERE '3229285376' BETWEEN > ipfrom AND ipto; I'm pretty sure PostgreSQL won't be able to use any indexes for this (EXPLAIN ANALYZE would verify that). Instead, expand the between out: WHERE ipfrom >= '...' AND ipto <= '...' Also, generally speaking, databases and CamelCase don't mix too well; you'll probably be happier doing something like ip_from and ip_to. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings
On 12/27/05, Eugene <evgenius@hot.ee> wrote:
1st. of all - change ipfrom and ipto column types to int8.
integer types in postgresql are signed, so their effective "max" is around 2000000000, which makes your example with over 3000000000 technically not working.
2nd. do vacuum analyze
3rd. show explain analyze. how can we tell you how to speed it up, when we dont know what/how postgres is doing with it.
depesz
Hello!
I've managed to import into postgre DB 3 800 000 rows of data (500 MB pure
CSV ~ 2 GB SQL DB)
It looks like this
"69110784","69111807","US","UNITED
STATES","ILLINOIS","BLOOMINGTON"," 40.4758","-88.9894","61701","LEVEL 3
COMMUNICATIONS INC","DSL-VERIZON.NET"
"69111808","69112831","US","UNITED
STATES","TEXAS","GRAPEVINE","32.9309","-97.0755","76051","LEVEL 3
COMMUNICATIONS INC","DSL-VERIZON.NET"
"69112832","69113087","US","UNITED
STATES","TEXAS","DENTON","33.2108","-97.1231","76201","LEVEL 3
COMMUNICATIONS INC"," DSL-VERIZON.NET"
CREATE TABLE ipdb2
(
ipFROM int4 NOT NULL,
ipTO int4 NOT NULL ,
countrySHORT CHARACTER(2) NOT NULL,
countryLONG VARCHAR(64) NOT NULL,
ipREGION VARCHAR(128) NOT NULL,
ipCITY VARCHAR(128) NOT NULL,
ipLATITUDE DOUBLE PRECISION,
ipLONGITUDE DOUBLE PRECISION,
ipZIPCODE VARCHAR(5),
ipISP VARCHAR(255) NOT NULL,
ipDOMAIN VARCHAR(128) NOT NULL
);
1st. of all - change ipfrom and ipto column types to int8.
integer types in postgresql are signed, so their effective "max" is around 2000000000, which makes your example with over 3000000000 technically not working.
2nd. do vacuum analyze
3rd. show explain analyze. how can we tell you how to speed it up, when we dont know what/how postgres is doing with it.
depesz
"Jim C. Nasby" <jnasby@pervasive.com> writes: > On Tue, Dec 27, 2005 at 11:25:37PM +0200, Eugene wrote: >> I ask db like this SELECT * FROM ipdb2 WHERE '3229285376' BETWEEN ipfrom >> AND ipto; > I'm pretty sure PostgreSQL won't be able to use any indexes for this > (EXPLAIN ANALYZE would verify that). Instead, expand the between out: > WHERE ipfrom >= '...' AND ipto <= '...' That won't help (it is in fact exactly the same query, because BETWEEN is just rewritten into that). The real problem is that btree indexes are ill-suited to this type of condition. If the typical row has only a small distance between ipfrom and ipto then the query is actually pretty selective, but there is no way to capture that selectivity in a btree search, because neither of the single-column comparisons are selective at all. The planner realizes this and doesn't bother with the index, instead it just does a seqscan. You could probably get somewhere by casting the problem as an rtree or GIST overlap/containment query, but with the currently available tools it would be a pretty unnatural-looking query ... probably something like box(point(ipfrom,ipfrom),point(ipto,ipto)) ~ box(point(3229285376,3229285376),point(3229285376,3229285376)) after creating an rtree or GIST index on box(point(ipfrom,ipfrom),point(ipto,ipto)) (haven't tried this but there is a solution lurking somewhere in this general vicinity). Is there a good reason why the data is stored this way, and not as say a single "cidr" column containing subnet addresses? Querying WHERE '192.122.252.0' << cidrcolumn would be a much more transparent way of expressing your problem. We don't currently have an easy indexing solution for that one either, but we might in the future. regards, tom lane
Could you explain this a little bit more? What are the conditions of this situation that makes b-tree ineffective? Thanks SWK -----Original Message----- From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Tom Lane Sent: 2005. december 28. 20:04 To: Jim C. Nasby Cc: Eugene; pgsql-general@postgresql.org Subject: Re: [GENERAL] POSTGRES DB 3 800 000 rows table, speed up? "Jim C. Nasby" <jnasby@pervasive.com> writes: > On Tue, Dec 27, 2005 at 11:25:37PM +0200, Eugene wrote: >> I ask db like this SELECT * FROM ipdb2 WHERE '3229285376' BETWEEN ipfrom >> AND ipto; > I'm pretty sure PostgreSQL won't be able to use any indexes for this > (EXPLAIN ANALYZE would verify that). Instead, expand the between out: > WHERE ipfrom >= '...' AND ipto <= '...' That won't help (it is in fact exactly the same query, because BETWEEN is just rewritten into that). The real problem is that btree indexes are ill-suited to this type of condition. If the typical row has only a small distance between ipfrom and ipto then the query is actually pretty selective, but there is no way to capture that selectivity in a btree search, because neither of the single-column comparisons are selective at all. The planner realizes this and doesn't bother with the index, instead it just does a seqscan. You could probably get somewhere by casting the problem as an rtree or GIST overlap/containment query, but with the currently available tools it would be a pretty unnatural-looking query ... probably something like box(point(ipfrom,ipfrom),point(ipto,ipto)) ~ box(point(3229285376,3229285376),point(3229285376,3229285376)) after creating an rtree or GIST index on box(point(ipfrom,ipfrom),point(ipto,ipto)) (haven't tried this but there is a solution lurking somewhere in this general vicinity). Is there a good reason why the data is stored this way, and not as say a single "cidr" column containing subnet addresses? Querying WHERE '192.122.252.0' << cidrcolumn would be a much more transparent way of expressing your problem. We don't currently have an easy indexing solution for that one either, but we might in the future. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly
=?iso-8859-1?Q?Klein_Bal=E1zs?= <Balazs.Klein@axelero.hu> writes: > Could you explain this a little bit more? > What are the conditions of this situation that makes b-tree ineffective? Well, what he's trying to do is (abstracting a little) WHERE low_bound_col <= probe_value AND probe_value <= high_bound_col Given a btree index on low_bound_col, the best you could do with this is scan all the index entries from the start of the index up to probe_value ... or about half the table, on average, which makes the index pretty much useless. On the assumption that low_bound_col and high_bound_col are usually close together, all of the useful hits will occur near the end of that scan, or the beginning if you scan backwards --- but there's no way to know when it's OK to stop looking. Making a double-column index on (low_bound_col, high_bound_col) does not improve the situation much, because the additional condition high_bound_col >= probe_value doesn't let you avoid scanning small values of low_bound_col. You might save some trips to the table proper but you're still scanning half the index. And of course indexing (high_bound_col, low_bound_col) isn't any better. If you are willing to impose a hard-wired assumption about the possible size of the low-bound-to-high-bound distance, you can extend the query to something like WHERE low_bound_col <= probe_value AND probe_value <= high_bound_col AND low_bound_col >= (probe_value - max_distance) which creates an efficiently indexable range limitation on low_bound_col. Of course this is a very sucky kluge. You can do a lot better with index types that are designed for two-dimensional data instead of one-dimensional data. Btree is a great data structure for one-dimensional searches, but that doesn't make it the answer to everything. regards, tom lane
THanks for quick reply this is what i get now Index Scan using ipt on ipdb2 (cost=0.00..74265.76 rows=989636 width=118) (actual time=0.216..2379.608 rows=1 loops=1) Index Cond: (3229285376::bigint <= ipto) Filter: (3229285376::bigint >= ipfrom) Total runtime: 2379.666 ms (4 rows) detectlo_db=> EXPLAIN ANALYZE SELECT * FROM ipdb2 WHERE '999998376' BETWEEN ipfrom AND ipto; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------- Index Scan using ipf on ipdb2 (cost=0.00..6796.64 rows=154129 width=118) (actual time=211.298..211.301 rows=1 loops=1) Index Cond: (999998376::bigint >= ipfrom) Filter: (999998376::bigint <= ipto) Total runtime: 211.371 ms (4 rows) detectlo_db=> drop index ipt detectlo_db-> ; DROP INDEX detectlo_db=> analyze ipdb2; ANALYZE detectlo_db=> EXPLAIN SELECT * FROM ipdb2 WHERE '3229285376' BETWEEN ipfrom AND ipto; QUERY PLAN ---------------------------------------------------------------------------- Index Scan using ipf on ipdb2 (cost=0.00..95179.56 rows=989181 width=117) Index Cond: (3229285376::bigint >= ipfrom) Filter: (3229285376::bigint <= ipto) (3 rows) detectlo_db=> EXPLAIN ANALYZE SELECT * FROM ipdb2 WHERE '3229285376' BETWEEN ipfrom AND ipto; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------- Index Scan using ipf on ipdb2 (cost=0.00..95179.56 rows=989181 width=117) (actual time=3223.344..3223.347 rows=1 loops=1) Index Cond: (3229285376::bigint >= ipfrom) Filter: (3229285376::bigint <= ipto) Total runtime: 3223.410 ms (4 rows) detectlo_db=> SELECT * FROM ipdb2 WHERE '3229285376' BETWEEN ipfrom AND ipto; ipfrom | ipto | countryshort | countrylong | ipregion | ipcity | iplatitude | iplongitude | ipzipcode | ipisp | ipdomain ------------+------------+--------------+---------------+----------+--------+------------+-------------+-----------+-----------------------------------+---------- 3229285376 | 3229285631 | US | UNITED STATES | TEXAS | TYLER | 32.3511 | -95.2922 | 75701 | HOWE-BAKER ENGINEERS INCORPORATED | - (1 row) detectlo_db=> On Wed, 28 Dec 2005 23:33:41 +0200, James Robinson <jlrobins@socialserve.com> wrote: > > On Dec 28, 2005, at 4:24 PM, Eugene wrote: > >> THIS is what I get > > [snip ] > > Sorry -- use 'explain select ...', not 'analyze select ...' my bad. > > But anyway, looks like your query is still being served by your > preexisting 'ipt' index. Try > > drop index ipt; > > then > > analyze ipdb2 > > then > explain analyze select * FROM ipdb2 WHERE '3229285376' BETWEEN ipfrom > AND ipto; > > [ Assuming this is not a currently running produciotn server ] > > ---- > James Robinson > Socialserve.com > -- ---------------- eugene
On Tue, 27 Dec 2005 23:25:37 +0200, Eugene <evgenius@hot.ee> wrote: >SELECT * FROM ipdb2 WHERE '3229285376' BETWEEN ipfrom AND ipto; If your [ipfrom, ipto] ranges are non-overlapping, you might be able to exploit that fact by adding something like ... ORDER BY ipfrom DESC LIMIT 1 Servus Manfred
On 2005-12-27, Eugene <evgenius@hot.ee> wrote: > I've indexed first two columns they are IPfrom, IPto also table is btree > version of postgre is 7.4.8, on hosting > I ask db like this SELECT * FROM ipdb2 WHERE '3229285376' BETWEEN ipfrom > AND ipto; > > and get answer after 3-10 seconds, is there a way to speed it up somehow? > any tweaks and tuneups possible with it? http://pgfoundry.org/projects/ip4r Designed specifically for storing IP ranges as in your example, it will (if correctly used) reduce that query time down to a few milliseconds. -- Andrew, Supernews http://www.supernews.com - individual and corporate NNTP services