Обсуждение: application of KNN code to US zipcode searches?

Поиск
Список
Период
Сортировка

application of KNN code to US zipcode searches?

От
Mark Stosberg
Дата:
We perform over 1,000,000 searches each day for "adoptable shelter pets
near your zipcode". We already have adequate performance for these
searches using the "cube" contrib, but the new KNN work in 9.1 seemed
like it might be a promising way to speed this up even further.

I installed PostgreSQL 9.1 on my laptop to evaluate it, using this post
as a reference:
http://www.depesz.com/index.php/2010/12/11/waiting-for-9-1-knngist/

The first task was to translate a geo-spatial search to use the new KNN
syntax.

I'm most familiar with two approaches to geo-spatial searching with
PostgreSQL. The first is the older "earthdistance" approach, using
"point" types and the "<@>" operator.

The second is the one we are using now, which uses a cube type, the
"cube_distance()" and "earth_box()" method and a GIST index on the cube
type.

Immediately there is a hurdle in that KNN only appears to work with
point types and the <-> operator, which does simple point-to-point
distance, instead of the distance-around-the-earth. Still, I thought
that could be enough of an approximation to test the waters.

I started with some "real world" queries that involved some table joins,
and when those failed to show improvement, I worked with some
reduced-test-case queries.

While I could confirm the new GIST index was being used on the point
type, I couldn't get a query to benchmark better when it was invoked.
I'm wondering if perhaps US zipcode searches aren't good use of this
technology, perhaps because the data set is too small ( About 40,000
zipcodes ).

Given that we can already do GIST-indexed searches with the cube type
that provide good reasonable approximations for zipcode-radius searches,
are others planning to eventually apply the KNN work to US zipcode
searches?

Sample EXPLAIN output and query times are below.

    Mark

EXPLAIN ANALYZE SELECT zipcode,
    lon_lat <-> '(-118.412426,34.096629)' AS radius
    FROM zipcodes ;
-------------------------------------------
 Seq Scan on zipcodes  (cost=0.00..1257.54 rows=41483 width=22) (actual
time=0.019..84.543 rows=41483 loops=1)
 Total runtime: 148.129 ms


EXPLAIN ANALYZE SELECT zipcode,
    lon_lat <-> '(-118.412426,34.096629)' As radius
    FROM zipcodes
    ORDER BY lon_lat <-> '(-118.412426,34.096629)';
--------------------------------------------------
 Index Scan using zipcodes_knn on zipcodes  (cost=0.00..5365.93
rows=41483 width=22) (actual time=0.451..141.590 rows=41483 loops=1)
   Order By: (lon_lat <-> '(-118.412426,34.096629)'::point)
 Total runtime: 206.392 ms







Re: application of KNN code to US zipcode searches?

От
"Kevin Grittner"
Дата:
Mark Stosberg <mark@summersault.com> wrote:

> Sample EXPLAIN output and query times are below.

>  Seq Scan on zipcodes  (cost=0.00..1257.54 rows=41483 width=22)
> (actual time=0.019..84.543 rows=41483 loops=1)

>  Index Scan using zipcodes_knn on zipcodes  (cost=0.00..5365.93
> rows=41483 width=22) (actual time=0.451..141.590 rows=41483
> loops=1)

I thought the benefit of KNN was that you could retrieve the rows in
distance order, so that a query for the closest 20 locations (for
example) would be very fast.  I wouldn't have expected it to be
helpful when you're selecting all the rows regardless of distance.

-Kevin

Re: application of KNN code to US zipcode searches?

От
Mark Stosberg
Дата:
> I thought the benefit of KNN was that you could retrieve the rows in
> distance order, so that a query for the closest 20 locations (for
> example) would be very fast.  I wouldn't have expected it to be
> helpful when you're selecting all the rows regardless of distance.

Kevin,

Thanks for the feedback. You are right that my "reduced test case"
wasn't a good approximation. I added a limit, to simulate finding the
100 zipcodes closest to 90210.

Below I compare 4 approaches to the same query:

1. Cube search
2. Earth Distance Search
3. Simple point distance (no index)
4. Simple point distance (KNN)

Now KNN benchmarks to be almost 100x faster! That's very promising.
Then there's only the issue that simple point distance is not expected
to be a good enough approximation of earth-distances. Perhaps that can
be solved by pre-computing coordinates based on the lat/long pairs....
much like the map projections used to present a curved surface on a flat
map? Given that's OK to be be a few miles off, it seems we have some
leeway here.

Recommendations?

    Mark

EXPLAIN ANALYZE
SELECT zipcode,
    cube_distance( '(-2513120.64361786, -4645511.0460328,
3575538.9507084)', zipcodes.earth_coords)/1609.344 AS radius
    FROM zipcodes ORDER BY radius LIMIT 100;

---------------------------------------------------------------
 Limit  (cost=2946.70..2946.95 rows=100 width=62) (actual
time=167.650..168.064 rows=100 loops=1)
   ->  Sort  (cost=2946.70..3050.40 rows=41483 width=62) (actual
time=167.644..167.829 rows=100 loops=1)
         Sort Key: ((cube_distance('(-2513120.64361786,
-4645511.0460328, 3575538.9507084)'::cube, earth_coords) /
1609.344::double precision))
         Sort Method: top-N heapsort  Memory: 20kB
         ->  Seq Scan on zipcodes  (cost=0.00..1361.24 rows=41483
width=62) (actual time=0.030..90.807 rows=41483 loops=1)
 Total runtime: 168.300 ms

############################################################3

-- Using Earthdistance
EXPLAIN ANALYZE SELECT zipcode,
    lon_lat <@> '(-118.412426,34.096629)' As radius
    FROM zipcodes
    ORDER BY lon_lat <@> '(-118.412426,34.096629)'
    LIMIT 100;

------------------------------------------------------------
 Limit  (cost=2842.99..2843.24 rows=100 width=22) (actual
time=187.995..188.451 rows=100 loops=1)
   ->  Sort  (cost=2842.99..2946.70 rows=41483 width=22) (actual
time=187.989..188.149 rows=100 loops=1)
         Sort Key: ((lon_lat <@> '(-118.412426,34.096629)'::point))
         Sort Method: top-N heapsort  Memory: 20kB
         ->  Seq Scan on zipcodes  (cost=0.00..1257.54 rows=41483
width=22) (actual time=0.033..108.203 rows=41483 loops=1)
 Total runtime: 188.660 ms

##########################################

Using simple point distance, but with no Gist Index:

EXPLAIN ANALYZE SELECT zipcode,
    lon_lat <-> '(-118.412426,34.096629)' As radius
    FROM zipcodes
    ORDER BY lon_lat <-> '(-118.412426,34.096629)'
    LIMIT 100;

--------------------------------------------------------
 Limit  (cost=2842.99..2843.24 rows=100 width=22) (actual
time=160.574..161.057 rows=100 loops=1)
   ->  Sort  (cost=2842.99..2946.70 rows=41483 width=22) (actual
time=160.568..160.691 rows=100 loops=1)
         Sort Key: ((lon_lat <-> '(-118.412426,34.096629)'::point))
         Sort Method: top-N heapsort  Memory: 20kB
         ->  Seq Scan on zipcodes  (cost=0.00..1257.54 rows=41483
width=22) (actual time=0.027..84.610 rows=41483 loops=1)
 Total runtime: 161.226 ms

#########################################

-- Using KNN-GIST index
EXPLAIN ANALYZE SELECT zipcode,
    lon_lat <-> '(-118.412426,34.096629)' As radius
    FROM zipcodes
    ORDER BY lon_lat <-> '(-118.412426,34.096629)'
    LIMIT 100;
------------------------------------------------------------------
 Limit  (cost=0.00..12.94 rows=100 width=22) (actual time=0.447..1.892
rows=100 loops=1)
   ->  Index Scan using zipcodes_knn on zipcodes  (cost=0.00..5365.93
rows=41483 width=22) (actual time=0.440..1.407 rows=100 loops=1)
         Order By: (lon_lat <-> '(-118.412426,34.096629)'::point)
 Total runtime: 2.198 ms

Re: application of KNN code to US zipcode searches?

От
Stephen Frost
Дата:
* Mark Stosberg (mark@summersault.com) wrote:
> Recommendations?

PostGIS, geometry columns, and UTM..  I'm not sure where they are wrt
adding KNN support, but it's something they've been anxious to have for
a while, so I expect support will come quickly.

    Thanks,

        Stephen

Вложения

Re: application of KNN code to US zipcode searches?

От
Mark Stosberg
Дата:
> PostGIS, geometry columns, and UTM..  I'm not sure where they are wrt
> adding KNN support, but it's something they've been anxious to have for
> a while, so I expect support will come quickly.

I've looked into this a little more.

One approach seems to be to project the lat/long pairs on to a flat
plane using the Albers projection (which would be a one-time
calculation), and then the current KNN point/distance calculations could
be used.

Here's a Perl module that references the Albers projection (although
it's not yet clear to me how to use it):

http://search.cpan.org/dist/PDL/

And a Wikipedia page on various calculation possibilities:
http://en.wikipedia.org/wiki/Geographical_distance#Flat-surface_formulae

Further suggestions welcome.

   Thanks,

    Mark

Re: application of KNN code to US zipcode searches?

От
Heikki Linnakangas
Дата:
On 17.02.2011 17:20, Mark Stosberg wrote:
>> I thought the benefit of KNN was that you could retrieve the rows in
>> distance order, so that a query for the closest 20 locations (for
>> example) would be very fast.  I wouldn't have expected it to be
>> helpful when you're selecting all the rows regardless of distance.
>
> Kevin,
>
> Thanks for the feedback. You are right that my "reduced test case"
> wasn't a good approximation. I added a limit, to simulate finding the
> 100 zipcodes closest to 90210.
>
> Below I compare 4 approaches to the same query:
>
> 1. Cube search
> 2. Earth Distance Search
> 3. Simple point distance (no index)
> 4. Simple point distance (KNN)
>
> Now KNN benchmarks to be almost 100x faster! That's very promising.
> Then there's only the issue that simple point distance is not expected
> to be a good enough approximation of earth-distances. Perhaps that can
> be solved by pre-computing coordinates based on the lat/long pairs....
> much like the map projections used to present a curved surface on a flat
> map? Given that's OK to be be a few miles off, it seems we have some
> leeway here.
>
> Recommendations?

The existing opclasses only support distance-to-a-point, but I believe
the KNN gist code is flexible enough that it could be used for distance
to the edge of a shape as well. Someone just needs to write the
operators and support functions.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: application of KNN code to US zipcode searches?

От
Mark Stosberg
Дата:
I tried again to use KNN for a real-world query, and I was able to get
it to add an approximately 6x speed-up vs the cube search or
earthdistance methods ( from 300 ms to 50ms ).

I had to make some notable changes for the KNN index to be considered.

- Of course, I had to switch to using basic point/distance calculation.
  As previously noted, this still needs more work to confirm the
  accuracy and get the "distance" reported in miles.

- The query planner didn't like it when the "ORDER BY" referred to a
  column value instead of a static value, even when I believe it should
  know that the column value never changes. See this pseudo-query where
  we look-up the coordinates for 90210 once:

  EXPLAIN ANALYZE
  SELECT pets.pet_id,
      zipcodes.lon_lat <-> center.lon_lat AS radius
      FROM (SELECT lon_lat FROM zipcodes WHERE zipcode = '90210') AS
center, pets
      JOIN shelters USING (shelter_id)
      JOIN zipcodes USING (zipcode)
       ORDER BY postal_codes.lon_lat <-> center.lon_lat limit 1000;

  This didn't use the KNN index until I changed the "center.lon_lat" in
  the ORDER BY to an explicit point value. I'm not sure if that's
  expected, or something I should take up with -hackers.

  This could be worked around by doing a initial query to look-up this
  value, and then feed a static value into this query. That's not ideal,
  but the combination would still be faster.

- I had to drop the part of the WHERE clause which restricted the
  results to shelters within 50 miles from the target zipcode. However,
  I could set the "LIMIT" so high that I could get back "enough" pets,
  and then the application could trim out the results. Or, perhaps
  I could push this query down into a sub-select, and let PostgreSQL
  do a second pass to throw out some of the results.

In any case, with a real-world speed-up of 6x, this looks like it will
be worth it to us to continue to investigate.


Re: application of KNN code to US zipcode searches?

От
Tom Lane
Дата:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> The existing opclasses only support distance-to-a-point, but I believe
> the KNN gist code is flexible enough that it could be used for distance
> to the edge of a shape as well. Someone just needs to write the
> operators and support functions.

The distance has to be exactly computable from the index entry, so you'd
need to store the whole shape in the index, not just a bounding box.
Not sure how practical that will be for complex shapes.

            regards, tom lane

Re: application of KNN code to US zipcode searches?

От
Tom Lane
Дата:
Mark Stosberg <mark@summersault.com> writes:
> - The query planner didn't like it when the "ORDER BY" referred to a
>   column value instead of a static value, even when I believe it should
>   know that the column value never changes. See this pseudo-query where
>   we look-up the coordinates for 90210 once:

>   EXPLAIN ANALYZE
>   SELECT pets.pet_id,
>       zipcodes.lon_lat <-> center.lon_lat AS radius
>       FROM (SELECT lon_lat FROM zipcodes WHERE zipcode = '90210') AS
> center, pets
>       JOIN shelters USING (shelter_id)
>       JOIN zipcodes USING (zipcode)
>        ORDER BY postal_codes.lon_lat <-> center.lon_lat limit 1000;

As phrased, that's a join condition, so there's no way that an index on
a single table can possibly satisfy it.  You could probably convert it
to a sub-select though:

       ORDER BY postal_codes.lon_lat <-> (SELECT lon_lat FROM zipcodes WHERE zipcode = '90210') limit 1000;

            regards, tom lane

Re: application of KNN code to US zipcode searches?

От
Oleg Bartunov
Дата:
Mark,

we investigating pgsphere http://pgsphere.projects.postgresql.org/, if we could add KNN support.


Oleg
On Thu, 17 Feb 2011, Mark Stosberg wrote:

>
>> I thought the benefit of KNN was that you could retrieve the rows in
>> distance order, so that a query for the closest 20 locations (for
>> example) would be very fast.  I wouldn't have expected it to be
>> helpful when you're selecting all the rows regardless of distance.
>
> Kevin,
>
> Thanks for the feedback. You are right that my "reduced test case"
> wasn't a good approximation. I added a limit, to simulate finding the
> 100 zipcodes closest to 90210.
>
> Below I compare 4 approaches to the same query:
>
> 1. Cube search
> 2. Earth Distance Search
> 3. Simple point distance (no index)
> 4. Simple point distance (KNN)
>
> Now KNN benchmarks to be almost 100x faster! That's very promising.
> Then there's only the issue that simple point distance is not expected
> to be a good enough approximation of earth-distances. Perhaps that can
> be solved by pre-computing coordinates based on the lat/long pairs....
> much like the map projections used to present a curved surface on a flat
> map? Given that's OK to be be a few miles off, it seems we have some
> leeway here.
>
> Recommendations?
>
>    Mark
>
> EXPLAIN ANALYZE
> SELECT zipcode,
>    cube_distance( '(-2513120.64361786, -4645511.0460328,
> 3575538.9507084)', zipcodes.earth_coords)/1609.344 AS radius
>    FROM zipcodes ORDER BY radius LIMIT 100;
>
> ---------------------------------------------------------------
> Limit  (cost=2946.70..2946.95 rows=100 width=62) (actual
> time=167.650..168.064 rows=100 loops=1)
>   ->  Sort  (cost=2946.70..3050.40 rows=41483 width=62) (actual
> time=167.644..167.829 rows=100 loops=1)
>         Sort Key: ((cube_distance('(-2513120.64361786,
> -4645511.0460328, 3575538.9507084)'::cube, earth_coords) /
> 1609.344::double precision))
>         Sort Method: top-N heapsort  Memory: 20kB
>         ->  Seq Scan on zipcodes  (cost=0.00..1361.24 rows=41483
> width=62) (actual time=0.030..90.807 rows=41483 loops=1)
> Total runtime: 168.300 ms
>
> ############################################################3
>
> -- Using Earthdistance
> EXPLAIN ANALYZE SELECT zipcode,
>    lon_lat <@> '(-118.412426,34.096629)' As radius
>    FROM zipcodes
>    ORDER BY lon_lat <@> '(-118.412426,34.096629)'
>    LIMIT 100;
>
> ------------------------------------------------------------
> Limit  (cost=2842.99..2843.24 rows=100 width=22) (actual
> time=187.995..188.451 rows=100 loops=1)
>   ->  Sort  (cost=2842.99..2946.70 rows=41483 width=22) (actual
> time=187.989..188.149 rows=100 loops=1)
>         Sort Key: ((lon_lat <@> '(-118.412426,34.096629)'::point))
>         Sort Method: top-N heapsort  Memory: 20kB
>         ->  Seq Scan on zipcodes  (cost=0.00..1257.54 rows=41483
> width=22) (actual time=0.033..108.203 rows=41483 loops=1)
> Total runtime: 188.660 ms
>
> ##########################################
>
> Using simple point distance, but with no Gist Index:
>
> EXPLAIN ANALYZE SELECT zipcode,
>    lon_lat <-> '(-118.412426,34.096629)' As radius
>    FROM zipcodes
>    ORDER BY lon_lat <-> '(-118.412426,34.096629)'
>    LIMIT 100;
>
> --------------------------------------------------------
> Limit  (cost=2842.99..2843.24 rows=100 width=22) (actual
> time=160.574..161.057 rows=100 loops=1)
>   ->  Sort  (cost=2842.99..2946.70 rows=41483 width=22) (actual
> time=160.568..160.691 rows=100 loops=1)
>         Sort Key: ((lon_lat <-> '(-118.412426,34.096629)'::point))
>         Sort Method: top-N heapsort  Memory: 20kB
>         ->  Seq Scan on zipcodes  (cost=0.00..1257.54 rows=41483
> width=22) (actual time=0.027..84.610 rows=41483 loops=1)
> Total runtime: 161.226 ms
>
> #########################################
>
> -- Using KNN-GIST index
> EXPLAIN ANALYZE SELECT zipcode,
>    lon_lat <-> '(-118.412426,34.096629)' As radius
>    FROM zipcodes
>    ORDER BY lon_lat <-> '(-118.412426,34.096629)'
>    LIMIT 100;
> ------------------------------------------------------------------
> Limit  (cost=0.00..12.94 rows=100 width=22) (actual time=0.447..1.892
> rows=100 loops=1)
>   ->  Index Scan using zipcodes_knn on zipcodes  (cost=0.00..5365.93
> rows=41483 width=22) (actual time=0.440..1.407 rows=100 loops=1)
>         Order By: (lon_lat <-> '(-118.412426,34.096629)'::point)
> Total runtime: 2.198 ms
>
>
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: application of KNN code to US zipcode searches?

От
Mark Stosberg
Дата:
On 02/17/2011 03:17 PM, Oleg Bartunov wrote:
> Mark,
>
> we investigating pgsphere http://pgsphere.projects.postgresql.org/, if
> we could add KNN support.

Great, thanks Oleg.

I'll be happy to test it when something is ready.

    Mark

Re: application of KNN code to US zipcode searches?

От
bricklen
Дата:
On Thu, Feb 17, 2011 at 11:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Mark Stosberg <mark@summersault.com> writes:
>> - The query planner didn't like it when the "ORDER BY" referred to a
>>   column value instead of a static value, even when I believe it should
>>   know that the column value never changes. See this pseudo-query where
>>   we look-up the coordinates for 90210 once:
>
>>   EXPLAIN ANALYZE
>>   SELECT pets.pet_id,
>>       zipcodes.lon_lat <-> center.lon_lat AS radius
>>       FROM (SELECT lon_lat FROM zipcodes WHERE zipcode = '90210') AS
>> center, pets
>>       JOIN shelters USING (shelter_id)
>>       JOIN zipcodes USING (zipcode)
>>        ORDER BY postal_codes.lon_lat <-> center.lon_lat limit 1000;
>
> As phrased, that's a join condition, so there's no way that an index on
> a single table can possibly satisfy it.  You could probably convert it
> to a sub-select though:
>
>       ORDER BY postal_codes.lon_lat <-> (SELECT lon_lat FROM zipcodes WHERE zipcode = '90210') limit 1000;
>
>                        regards, tom lane

Would pushing that subquery to a WITH clause be helpful at all?

Re: application of KNN code to US zipcode searches?

От
Mark Stosberg
Дата:
Hello,

I want to report that I have now solved the challenges I ran into using
KNN for US zipcode searching. I've found the new approach to not only be
viable, but to benchmark about 3x faster for our own real-world
application than the previous approach we used, involving
cube_distance() and earth_box().

Here's some details about my research so far.

To evaluate it, I installed PostgreSQL 9.1 and a current PostGIS 2.0
snapshot (not yet released as stable).

A primary challenge I had to solve was that KNN is designed for a
slightly different problem than what I needed to solve. I need to answer
the question:

 "What are all the objects that are in zipcodes with 50 miles of a given
zipcode?"

However, KNN only directly provides a performance boost to this
variation:

 "What are the N nearest objects to this point?"

Just adding a "WHERE clause" to check the 50 mile rule would erase the
benefits of KNN, which works through an "ORDER BY" clause.

I solved my issue by using a "WITH" clause that creates a pseudo-table
called "nearby_zipcodes". In this example, I select all the zipcodes
that are within 50 miles of the "47374" zipcode. The trick I've
employed is that I've set the LIMIT value to 286-- exactly the number of
zipcodes within 50 miles of 47374. My plan is to add another column to
my "zipcodes" table for each of the small number distances I need to
search. Then, when I load new zipcodes I can pre-compute how many
zipcodes would be found at this distance.

This have approach would not have worked without a "WITH" clause, or
some equivalent, because the number of objects within the radius is not
known, but the number of nearby zipcodes is fixed.

This approach allows me to get the performance benefits of KNN, while
also returning exactly those objects within 50 miles of my
target zipcode, by JOINing on the "nearby_zipcodes" table:

 WITH nearby_zipcodes AS (
     SELECT zipcode,
         st_distance_sphere(lonlat_point, (SELECT lonlat_point from
zipcodes WHERE zipcode = '47374')) / 1609.344 as radius
     FROM zipcodes
     ORDER BY lonlat_point  <-> (SELECT lonlat_point from zipcodes WHERE
zipcode = '47374')
     LIMIT 286
 )
 SELECT ...

You might also notice that "st_distance_sphere()" doesn't mean exactly
the same thing as the "<->" operator. That's something I could refine
going forward.

That's what I've got so far. How could I improve this further?

For reference, here are the key parts of the "zipcodes" table:

# \d zipcodes
             Table "public.zipcodes"
    Column    |         Type          | Modifiers
--------------+-----------------------+-----------
 zipcode      | character varying(5)  | not null
 lonlat_point | geometry(Point,4326)  |
Indexes:
    "zipcodes_pkey" PRIMARY KEY, btree (zipcode)
    "lonlat_point_idx" gist (lonlat_point)

Thanks for the peer review!

        Mark Stosberg