
How many views...

From:
"Uwe C. Schroeder"
Date:


Hi,

a (maybe/probably) stupid idea just popped into my head:
Problem:
I need to search a lot of locations based on distance (a simple zipcode match
based on longitude and latitude). However, I need to calculate the distance
between each of the nodes, so if you are in xxx I need to get the distance to
all others in the database. I'm currently doing this with a stored procedure
that takes the originating zipcode and a maximum distance in miles and then
selects all other nodes within that search radius. This is pretty unhandy,
but it works.

The idea:
I could create a view for every node in the system which calculates the
distance in the result set, making it easy to handle for the application:
select * from <view> where distance <= 50
The problem is that the data will possibly contain thousands of nodes. I'd
also need 2 or 3 views per node - which could lead to 50,000 or even 100,000
views.

The question:
1) does it make sense to do this performance-wise?
2) does this make sense at all?
3) can PostgreSQL handle that many views?

Thanks for any opinions (or better ideas than a stored proc or the views
concept)


    UC



Re: How many views...

From:
Michael Fuhr
Date:
On Sun, Nov 28, 2004 at 06:41:50PM -0800, Uwe C. Schroeder wrote:

> I need to search a lot of locations based on distance (simple zipcode match
> based on longitude and latitude). However I need to calculate the distance
> between each of the nodes, so if you are in xxx I need to get the distance to
> all others in the database. I'm currently doing this with a stored procedure
> that gets the originating zipcode and a maximum distance in miles which then
> selects all other nodes within that search radius. This is pretty unhandy,
> but it works.

What's unhandy about this approach?  I've written stored procedures
that do exactly what you're talking about; they work fine and are
easy to use.

Are you using a bounding box to limit the number of nodes that
you need to check?  For example, if the originating zipcode is
at 40.0N 90.0W and you want to find all other zipcodes within
50 miles, then you'd only need to check the distance to those
zipcodes with a latitude between about 39.27N - 40.73N and a
longitude between about 89.05W and 90.95W.  No zipcode outside
that box could possibly be within 50 miles of the origin, so
there's no need to calculate and check the distances to them.
If you have indexes on latitude and longitude then the search
should be fast.
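Michael's bounding-box numbers can be reproduced with a short sketch. The function name and the Earth-radius constant below are illustrative assumptions, not taken from any library:

```python
import math

EARTH_RADIUS_MI = 3959.0  # approximate mean Earth radius in miles

def bounding_box(lat, lon, radius_mi):
    """Return (lat_min, lat_max, lon_min, lon_max) enclosing the search circle."""
    # One degree of latitude covers roughly the same distance everywhere.
    lat_delta = math.degrees(radius_mi / EARTH_RADIUS_MI)
    # One degree of longitude shrinks by cos(latitude) away from the equator.
    lon_delta = lat_delta / math.cos(math.radians(lat))
    return (lat - lat_delta, lat + lat_delta,
            lon - lon_delta, lon + lon_delta)

# 50 miles around 40.0N 90.0W: roughly 39.28..40.72 N, 90.94..89.06 W
print(bounding_box(40.0, -90.0, 50.0))
```

Feeding the resulting bounds into indexed `BETWEEN` clauses is what makes the pre-filter cheap.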

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/

Re: How many views...

From:
Greg Stark
Date:
"Uwe C. Schroeder" <uwe@oss4u.com> writes:

> I could create a view for every node in the system which calculates the
> distance in the result set, making it easy to handle for the application:
> select * from <view> where distance <= 50
> The problem is, that the data will possibly contain thousands of nodes. I'd
> also need 2 or 3 views per node - which could lead to 50.000 or even 100.000
> views.

Normalize.

Make your view construct the distance for _every_ node. So your query looks
like:

  select * from <view> where node_id = 99 and distance <= 50

The danger here is you have to be absolutely sure that Postgres is pushing
that clause down into your view or the performance will be bad. But if your
view is at all sane then I think it will. And you'll certainly know if it
isn't.
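A sketch of such a view, assuming a hypothetical `node` table with `node_id`, `latitude`, and `longitude` columns, and with `distance()` standing in for whatever distance function is actually used:

```sql
CREATE VIEW node_distances AS
SELECT a.node_id,
       b.node_id AS other_node_id,
       distance(a.latitude, a.longitude,
                b.latitude, b.longitude) AS distance
FROM node a
CROSS JOIN node b
WHERE a.node_id <> b.node_id;

-- one view for all nodes; the planner should push the
-- node_id restriction down into the view's join
SELECT * FROM node_distances WHERE node_id = 99 AND distance <= 50;
```

Checking the `EXPLAIN` output of the final query is the easy way to confirm the clause really is pushed down.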

--
greg

Re: How many views...

From:
"Uwe C. Schroeder"
Date:

On Sunday 28 November 2004 10:43 pm, Michael Fuhr wrote:
> On Sun, Nov 28, 2004 at 06:41:50PM -0800, Uwe C. Schroeder wrote:
> > I need to search a lot of locations based on distance (simple zipcode
> > match based on longitude and latitude). However I need to calculate the
> > distance between each of the nodes, so if you are in xxx I need to get
> > the distance to all others in the database. I'm currently doing this with
> > a stored procedure that gets the originating zipcode and a maximum
> > distance in miles which then selects all other nodes within that search
> > radius. This is pretty unhandy, but it works.
>
> What's unhandy about this approach?  I've written stored procedures
> that do exactly what you're talking about; they work fine and are
> easy to use.
>
> Are you using a bounding box to limit the number of nodes that
> you need to check?  For example, if the originating zipcode is
> at 40.0N 90.0W and you want to find all other zipcodes within
> 50 miles, then you'd only need to check the distance to those
> zipcodes with a latitude between about 39.27N - 40.73N and a
> longitude between about 89.05W and 90.95W.  No zipcode outside
> that box could possibly be within 50 miles of the origin, so
> there's no need to calculate and check the distances to them.
> If you have indexes on latitude and longitude then the search
> should be fast.

The "unhandy" part is maybe a personal perception. I like stored procs, but in
this case the calculation is repeated over and over again (OK, it would be
the same with views). Basically the part I don't like is that the proc
currently calculates way more values than needed, because something like
.... where sqrt(pow((lat1 - lat2),2) + pow((long1 - long2),2)) >= 50
certainly calculates the distance for all the records and then compares the
result to the 50-mile radius. I'd rather have something that excludes most of
the records that aren't in question anyway. How do you arrive at the lat/long
values for the max difference? Is there a general formula for that? It looks
like I could omit records too far away from the calculation in the first
place. I know - maybe I should dig for those old geometry books that are
somewhere in a box, but if you have the basis for that handy I'd appreciate
it if you'd tell me (I hated math all my life ;-) )

    UC

--
Open Source Solutions 4U, LLC    2570 Fleetwood Drive
Phone:  +1 650 872 2425        San Bruno, CA 94066
Cell:   +1 650 302 2405        United States
Fax:    +1 650 872 2417


Re: How many views...

From:
"Uwe C. Schroeder"
Date:

On Sunday 28 November 2004 10:49 pm, Greg Stark wrote:
> "Uwe C. Schroeder" <uwe@oss4u.com> writes:
> > I could create a view for every node in the system which calculates the
> > distance in the result set, making it easy to handle for the application:
> > select * from <view> where distance <= 50
> > The problem is, that the data will possibly contain thousands of nodes.
> > I'd also need 2 or 3 views per node - which could lead to 50.000 or even
> > 100.000 views.
>
> Normalize.
>
> Make your view construct the distance for _every_ node. So your query looks
> like:
>
>   select * from <view> where node_id = 99 and distance <= 50

Well, in my case a "node_id" would refer to a zipcode (for the basic version;
later on maybe even pushing it down to street level using more detailed GPS
data). The problem I see is that the records the view sits on top of may, and
will, contain several similar records. Think of UPS: they would want to know
the distance to any recipient of a parcel, but a lot of those recipients have
the same zipcode. I just can't think of a view that retrieves a specific
person in that zipcode based on the zipcode alone. So there would have to be
a second parameter to it - or a view for each record.
To put it better: the application records customers.

CREATE TABLE customer (
    customer_id int4 PRIMARY KEY,
    zipcode     varchar(10)
    -- ... other columns about the customer
);

A normalized view just wouldn't return that specific customer plainly based on
the zipcode, because there could/will be a second or third customer in the
same zipcode. For the distance that wouldn't matter, but for the customer
info it would.
I'm just playing with options right now. Maybe (likely) I'll have to revise
the database model. The stored proc works fine; it just could become slow with
a lot of customer records. I think Michael's earlier post is the better
answer - limit the possible coordinates to a subset before starting to
calculate the actual distance.
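One way to keep the distance math per-zipcode while still returning specific customers is a join through a zipcode coordinate table. Table, column, and function names here are hypothetical, and `distance()` is again a placeholder:

```sql
SELECT c.customer_id, c.zipcode,
       distance(z.latitude, z.longitude,
                o.latitude, o.longitude) AS dist
FROM customer c
JOIN zipcode z ON z.zipcode = c.zipcode
CROSS JOIN zipcode o
WHERE o.zipcode = '94066'   -- originating zipcode (example value)
  AND distance(z.latitude, z.longitude,
               o.latitude, o.longitude) <= 50;
```

The distance is computed once per distinct zipcode pair the planner visits, and every customer in a qualifying zipcode comes back with it.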

> The danger here is you have to be absolutely sure that Postgres is pushing
> that clause down into your view or the performance will be bad. But if your
> view is at all sane then I think it will. And you'll certainly know if it
> isn't.

That's what I'm afraid of. The database will potentially contain 100,000
customer records once in production - in the US alone, leaving aside what has
to be done about the rest of the world. So, yeah - one will notice the drag of
an expensive calculation, particularly because the application has lists that
show ALL customers sorted by distance to one (changing) location.

    UC



Re: How many views...

From:
Holger Klawitter
Date:

Hi,

> .... where sqrt(pow((lat1 - lat2),2) + pow((long1 - long2),2)) >= 50

I assume that you don't want to deal with distances > 1000km?!

> of the records that aren't in question anyways. How do you come to the
> lat/long values for the max difference? Is there a general formula for
> that?

You can probably resort to simplified formulas like these:

latitude: 1 degree ~~ 111km
longitude: 1 degree ~~ 111km * cos( reference latitude )
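In code, those approximations might look like the following (constants rounded; only sensible well away from the poles, and the function name is illustrative):

```python
import math

KM_PER_DEG_LAT = 111.0  # roughly constant at any latitude

def km_per_deg_lon(ref_lat_deg):
    """Kilometers covered by one degree of longitude at the given latitude."""
    # longitude degrees shrink with the cosine of the reference latitude
    return KM_PER_DEG_LAT * math.cos(math.radians(ref_lat_deg))

# at 40N, one degree of longitude is about 85 km
print(round(km_per_deg_lon(40.0), 1))
```

Dividing a search radius in km by these values gives the degree deltas for the bounding box.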

> (I hated math all my life ;-) )

The whole thing is likely to get tough for you then ;-)

Mit freundlichem Gruß / With kind regards
    Holger Klawitter
--
lists <at> klawitter <dot> de

Re: How many views...

From:
Tino Wildenhain
Date:
On Sunday, 28 Nov 2004 at 23:06 -0800, Uwe C. Schroeder wrote:
> On Sunday 28 November 2004 10:43 pm, Michael Fuhr wrote:
> > On Sun, Nov 28, 2004 at 06:41:50PM -0800, Uwe C. Schroeder wrote:
> > > I need to search a lot of locations based on distance (simple zipcode
> > > match based on longitude and latitude). However I need to calculate the
> > > distance between each of the nodes, so if you are in xxx I need to get
> > > the distance to all others in the database. I'm currently doing this with
> > > a stored procedure that gets the originating zipcode and a maximum
> > > distance in miles which then selects all other nodes within that search
> > > radius. This is pretty unhandy, but it works.
> >
> > What's unhandy about this approach?  I've written stored procedures
> > that do exactly what you're talking about; they work fine and are
> > easy to use.
> >
> > Are you using a bounding box to limit the number of nodes that
> > you need to check?  For example, if the originating zipcode is
> > at 40.0N 90.0W and you want to find all other zipcodes within
> > 50 miles, then you'd only need to check the distance to those
> > zipcodes with a latitude between about 39.27N - 40.73N and a
> > longitude between about 89.05W and 90.95W.  No zipcode outside
> > that box could possibly be within 50 miles of the origin, so
> > there's no need to calculate and check the distances to them.
> > If you have indexes on latitude and longitude then the search
> > should be fast.
>
> The "unhandy" part is maybe a personal perception. I like stored procs, but in
> this case the calculation is repeated over and over again (ok, it would be
> the same with views).  Basically the part I don't like is that the proc
> currently calculates way more values than needed.  Because something like
> .... where sqrt(pow((lat1 - lat2),2) + pow((long1 - long2),2)) >= 50

Did you actually try PostgreSQL's geometric datatypes here? My guess would be
that you can optimize indexes that way - or at least use a (maybe overlapping)
grid where the grid size is 2*max (twice the maximum distance you ask for):

+-----Grid Cell ----------+
|                         |
|                         |
|+--max bb---+            |
||           |            |
||           |            |
||     *<max>|            |
||           |            |
|+-----------+            |
|                         |
+-------------------------+
       | 2*max      |

where max is the maximum distance. You could build
a functional index based on the grid cells
(maybe give an ID to each of them) and therefore
quickly filter out a whole lot of points which
will never be reached by the distance.

Btw, if you don't use the geometric types,
you can speed things up by skipping the sqrt:

if sqrt(x) >= y then x >= y²

where 50² = 2500 can be calculated
beforehand and saves a lot of calculating
later (sqrt is expensive!).
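As a sketch of that trick (the comparison on squared values is valid because distances and radii are both non-negative; function name is illustrative):

```python
def within_radius(dx, dy, radius):
    """True if the point offset (dx, dy) lies within the given radius."""
    # compare squared distance against squared radius: no sqrt needed
    return dx * dx + dy * dy <= radius * radius

# equivalent to sqrt(dx^2 + dy^2) <= 50, without the sqrt;
# a 30-40-50 triangle sits exactly on the circle
print(within_radius(30.0, 40.0, 50.0))
```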


> certainly calculates the distance of all the records and then compares the
> result to the 50 mile radius. I'd rather have something that excludes most of
> the records that aren't in question anyways. How do you come to the lat/long
> values for the max difference? Is there a general formula for that? This
> looks like I could omit records too far away from the calculation in the
> first place. I know - maybe I should dig for those old geometry books that
> are somewhere in a box, but if you have the base for that handy I'd
> appreciate if you tell me (I hated math all my life ;-) )
                             ^ ^ ^ ^ ^ ^

Why so? Math is just cool :-)

Regards
Tino


Re: How many views...

From:
Michael Fuhr
Date:
On Sun, Nov 28, 2004 at 11:06:38PM -0800, Uwe C. Schroeder wrote:

> The "unhandy" part is maybe a personal perception. I like stored procs, but in
> this case the calculation is repeated over and over again (ok, it would be
> the same with views).  Basically the part I don't like is that the proc
> currently calculates way more values than needed.  Because something like
> .... where sqrt(pow((lat1 - lat2),2) + pow((long1 - long2),2)) >= 50
> certainly calculates the distance of all the records and then compares the
> result to the 50 mile radius.

The formula you mention is for calculating distances on a plane, not on
the surface of a sphere.  Google for more appropriate formulae (e.g.,
the Haversine Formula) or use the functions in contrib/earthdistance.
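For reference, a haversine sketch in Python (the mean-radius constant is an approximation, and contrib/earthdistance provides ready-made SQL equivalents):

```python
import math

EARTH_RADIUS_MI = 3959.0  # mean radius; the Earth is not a perfect sphere

def haversine_mi(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/long points (in degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(a))

# one degree of longitude at 40N is about 53 miles
print(round(haversine_mi(40.0, -90.0, 40.0, -91.0)))
```

These are the same per-degree distances used in the bounds calculation below.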

> I'd rather have something that excludes most of the records that
> aren't in question anyways. How do you come to the lat/long values
> for the max difference? Is there a general formula for that?

For the longitude bounds, calculate how far away a point one degree
due east or west would be.  For example, the distance between (40, -90)
and (40, -91) is about 53mi, depending on what value you use for the
Earth's radius (the Earth isn't a perfect sphere).  If you want to
find points within 50 miles, limit your search to longitudes within
50/53 (0.94) degrees of -90, or -90.94 to -89.06.

Repeat for latitude.  The distance between (40, -90) and (41, -90)
is about 69mi, so limit your search to latitudes within 50/69 (0.72)
degrees of 40, or 39.28 to 40.72.

Note that one degree of longitude doesn't cover the same distance
as one degree of latitude: that's because longitude lines converge
as they approach the poles.  At the equator, one degree of longitude
covers about 69mi, while at 40N it's only 53mi.

> This looks like I could omit records too far away from the calculation
> in the first place.

That's the idea.  Using the above calculations, you'd make a query
like this:

SELECT ...
FROM ...
WHERE latitude BETWEEN 39.28 AND 40.72
  AND longitude BETWEEN -90.94 AND -89.06
  AND distance(latitude, longitude, 40, -90) <= 50;

Substitute an appropriate function for distance().

The latitude and longitude checks find the candidate points and the
distance check makes the final selection.  With indexes on latitude
and longitude (or a multicolumn index on both latitude and longitude),
this query should be reasonably fast.

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/