Обсуждение: Searching data across tables, some large

Поиск
Список
Период
Сортировка

Searching data across tables, some large

От
Robert Fitzpatrick
Дата:
Thanks to Richard for the help earlier this week with performance
questions, once I was able to get pgsql and mssql using the same
resources and doing some tuning, I was able to get comparable results.

The issue still though, I have this view that I designed with the
thought in mind to provide all fields the user will want to search in
the PHP web app. However, I have found when joining a couple of large
tables, the view count reaches the 40+ million records, this does not
seem to work in pgsql (or in mssql), especially if distinct is needed.
Should I even be attempting this approach?

I have never used TSearch and wonder if that is the solution to this
type of search? From a quick read of some TSearch info I see indexes are
setup on a column basis in a table? So, I could create an index column
for say the first name and last name fields in a contact table, but
cannot create a field with information from different tables? Also, not
all fields are text that I need to search, some boolean for instance,
can I search TSearch index fields and other fields at the same time. Not
sure how TSearch works, just wanted to get an opinion that that may be
what I need before delving into it too much.

--
Robert


Re: Searching data across tables, some large

От
"Ericson Smith"
Дата:
Yeah, we've used Tsearch with joins and searches on other fields on
the tsearch table no problem. Tsearches are  just another part of a
WHERE clause.

Regards
- Ericson Smith
Developer
http://www.funadvice.com

On 5/23/07, Robert Fitzpatrick <lists@webtent.net> wrote:
> Thanks to Richard for the help earlier this week with performance
> questions, once I was able to get pgsql and mssql using the same
> resources and doing some tuning, I was able to get comparable results.
>
> The issue still though, I have this view that I designed with the
> thought in mind to provide all fields the user will want to search in
> the PHP web app. However, I have found when joining a couple of large
> tables, the view count reaches the 40+ million records, this does not
> seem to work in pgsql (or in mssql), especially if distinct is needed.
> Should I even be attempting this approach?
>
> I have never used TSearch and wonder if that is the solution to this
> type of search? From a quick read of some TSearch info I see indexes are
> setup on a column basis in a table? So, I could create an index column
> for say the first name and last name fields in a contact table, but
> cannot create a field with information from different tables? Also, not
> all fields are text that I need to search, some boolean for instance,
> can I search TSearch index fields and other fields at the same time. Not
> sure how TSearch works, just wanted to get an opinion that that may be
> what I need before delving into it too much.
>
> --
> Robert
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq
>

Geographic data sources, queries and questions

От
"Chuck D."
Дата:
Greetings all,

I have a couple issues regarding geographic names databases.

1) The first is this.  I have 3 tables.  Country, state and city.  Country has
a country_id to identify a country, state has a state_id and country_id to
identify a state, and city has a city_id, state_id and country_id (for easy
reference) to identify it.  I then have a table for users that stores their
city, state and country ID's along with other info about them.

My problem came recently when I questioned the integrity of the data and
needed to make some changes.  I thought to myself that maybe storing the ID
wasn't as good as storing the ISO or FIPS 2 letter abbreviation.  The only
problem the abbreviation could changed at some point by the regulating bodies
and all rows in all tables would need to be updated.

The question is, for the purposes of querying or searching is it better to
store and search a 2 byte integer that is indexed for country or state ID's,
or is it better to store and search a 2 byte CHAR abbreviation?

2) I've spent an accumulated total of around a month and a half trying to
consolidate geographic name data from several free sources on the net and
realize this isn't the best use of my time and errors will be had.  Does
anyone know of a reliable source of geo data that isn't costly?  Most want to
charge a server license, annual rate, etc.  I'm not sure about the free
sources because one I used actually had mixed values in a column and drove me
nuts.  I primarily need:

country
state
county if applicable
city
latitude
longitude

This is primarily input from an HTML form to calculate distances between
users.

Anyone who has any experience with geo name data I would appreciate hearing
your solution.

Re: Searching data across tables, some large

От
Robert Fitzpatrick
Дата:
On Wed, 2007-05-23 at 18:53 -0400, Ericson Smith wrote:
> Yeah, we've used Tsearch with joins and searches on other fields on
> the tsearch table no problem. Tsearches are  just another part of a
> WHERE clause.
>

And can there be Tsearch fields built based on fields in different
tables? Where can I find the best docs for TSearch2? I'm looking for
information pertaining to how TSearch can help in my situation, instead
of building a view with a lot of joins causing absurd amount of rows
returned.

Or am I going to need to limit the tables a user can search at one time
regardless? For instance, seems if I join my table of clients with
contacts to return all the contacts with company info, no problem. I can
also join my clients with activity *or* comments table. But if I try to
left join all three of those tables in one view against clients, seems
to return way too many rows. Of course, I will not want all those rows,
but I can't get it to respond to queries. Meanwhile, I can create a view
with just the comments table joined to clients, while the count is over
1 million records, select statements execute quickly.

Is TSearch for me? Thanks for the advice!

--
Robert


Re: Geographic data sources, queries and questions

От
"George Pavlov"
Дата:
seems hard to enforce integrity in your model. how are you going to
ensure that the user's city-state-country combo a valid one? (well, you
can, but it is a pain). ask yourself: can a city be in more than one
country? probably not (even if the name is the same it is not the same
city!). can a state be in more than one country? etc., etc.

seems much cleaner to have cities have a key to states, states to
countries. otherwise might as well just have a big denormalized table
and skip the whole relational thing...

numeric ids vs chars, when properly indexed, should perform about the
same (even if there is a small difference this is not something one
should really worry about; hey, there aren't even that many cities in
the world!)

i would go with a unique internal id (in fact that IS what i do) you can
store the FIPS/ISO code in a neighboring field, but i am not sure it is
good enough for a primary key.



> -----Original Message-----
> From: pgsql-general-owner@postgresql.org
> [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Chuck D.
> Sent: Wednesday, May 23, 2007 4:22 PM
> To: pgsql-general@postgresql.org
> Subject: [GENERAL] Geographic data sources, queries and questions
>
> Greetings all,
>
> I have a couple issues regarding geographic names databases.
>
> 1) The first is this.  I have 3 tables.  Country, state and
> city.  Country has
> a country_id to identify a country, state has a state_id and
> country_id to
> identify a state, and city has a city_id, state_id and
> country_id (for easy
> reference) to identify it.  I then have a table for users
> that stores their
> city, state and country ID's along with other info about them.
>
> My problem came recently when I questioned the integrity of
> the data and
> needed to make some changes.  I thought to myself that maybe
> storing the ID
> wasn't as good as storing the ISO or FIPS 2 letter
> abbreviation.  The only
> problem the abbreviation could changed at some point by the
> regulating bodies
> and all rows in all tables would need to be updated.
>
> The question is, for the purposes of querying or searching is
> it better to
> store and search a 2 byte integer that is indexed for country
> or state ID's,
> or is it better to store and search a 2 byte CHAR abbreviation?
>
> 2) I've spent an accumulated total of around a month and a
> half trying to
> consolidate geographic name data from several free sources on
> the net and
> realize this isn't the best use of my time and errors will be
> had.  Does
> anyone know of a reliable source of geo data that isn't
> costly?  Most want to
> charge a server license, annual rate, etc.  I'm not sure
> about the free
> sources because one I used actually had mixed values in a
> column and drove me
> nuts.  I primarily need:
>
> country
> state
> county if applicable
> city
> latitude
> longitude
>
> This is primarily input from an HTML form to calculate
> distances between
> users.
>
> Anyone who has any experience with geo name data I would
> appreciate hearing
> your solution.

Re: Geographic data sources, queries and questions

От
Jorge Godoy
Дата:
"Chuck D." <pgsql-list@nullmx.com> writes:

> 1) The first is this.  I have 3 tables.  Country, state and city.  Country has
> a country_id to identify a country, state has a state_id and country_id to
> identify a state, and city has a city_id, state_id and country_id (for easy
> reference) to identify it.  I then have a table for users that stores their
> city, state and country ID's along with other info about them.

I don't believe this is good design.  You'll have to have a trigger or
something to verify that the country_id+state_id on the city table are
exactly equal to the country_id+state_id on the state table.  If you
don't, you might have something like (using US city names...) "country:
USA -> state: NY" -> "country: Zimbabwe -> state: NY -> city: New
York".

It isn't a problem of "any country and any state" on the city table, but
a problem of "this state inside that particular country".  I'd drop the
country column.

> My problem came recently when I questioned the integrity of the data and
> needed to make some changes.  I thought to myself that maybe storing the ID
> wasn't as good as storing the ISO or FIPS 2 letter abbreviation.  The only
> problem the abbreviation could changed at some point by the regulating bodies
> and all rows in all tables would need to be updated.

You have integrity problems because you denormalized your model too much
and tried to attach the same information on two different places without
requiring those to be equal.

> The question is, for the purposes of querying or searching is it better to
> store and search a 2 byte integer that is indexed for country or state ID's,
> or is it better to store and search a 2 byte CHAR abbreviation?

It all depends: surrogate primary keys or ... :-)  (old flame starter)


--
Jorge Godoy      <jgodoy@gmail.com>

Re: Searching data across tables, some large

От
Robert Fitzpatrick
Дата:
On Wed, 2007-05-23 at 19:48 -0400, Robert Fitzpatrick wrote:
> On Wed, 2007-05-23 at 18:53 -0400, Ericson Smith wrote:
> > Yeah, we've used Tsearch with joins and searches on other fields on
> > the tsearch table no problem. Tsearches are  just another part of a
> > WHERE clause.
> >
>
> And can there be Tsearch fields built based on fields in different
> tables? Where can I find the best docs for TSearch2? I'm looking for
> information pertaining to how TSearch can help in my situation, instead
> of building a view with a lot of joins causing absurd amount of rows
> returned.

Of course, what was I thinking! After reading through the TSearch stuff
I see the objective is to index multiple columns. There would be no way
to reference a unique record from an index of columns over multiple
tables.

So, it is looking like a build of the query string is going to have to
take place on whichever tables the fields are in while limiting what
fields the user can search (depending whether fields are in different
tables), that versus a view with everything is what I should be doing?

--
Robert


Re: Geographic data sources, queries and questions

От
"Chuck D."
Дата:
On Wednesday 23 May 2007 17:56, you wrote:
> seems hard to enforce integrity in your model. how are you going to
> ensure that the user's city-state-country combo a valid one? (well, you
> can, but it is a pain). ask yourself: can a city be in more than one
> country? probably not (even if the name is the same it is not the same
> city!). can a state be in more than one country? etc., etc.

No, a city state combination belongs to one country as far as I know.  I
really just left the country id in the city table as an easy means of getting
around while I was trying to understand the different data sources.  I was
doing a lot of typing at the console and it was easier than joins :)  It
would be dropped for sure as it adds another 4 MEGS to the table not
including indexes, and it violates BC normal form.

>
> numeric ids vs chars, when properly indexed, should perform about the
> same (even if there is a small difference this is not something one
> should really worry about; hey, there aren't even that many cities in
> the world!)

Ya, about 2 million city entries total.

>
> i would go with a unique internal id (in fact that IS what i do) you can
> store the FIPS/ISO code in a neighboring field, but i am not sure it is
> good enough for a primary key.

OK, cool.  I *prefer* the id as that is what I've been used to, but when I
used oracle in school we would have been taught to use the CHAR.  I just
don't like using the CHAR because some data sets use FIPS and some use ISO
and both are bound to change.  After all those are just abbreviations.  Seems
like something to be referenced.  I think I'll stick with the ID's instead of
CHARs then as they seem more comfortable.

Speaking of which, if you are using a similar database what source did you use
for geographic data?  I'm having troubles with a reliable set.

>
> > -----Original Message-----
> > From: pgsql-general-owner@postgresql.org
> > [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Chuck D.
> > Sent: Wednesday, May 23, 2007 4:22 PM
> > To: pgsql-general@postgresql.org
> > Subject: [GENERAL] Geographic data sources, queries and questions
> >
> > Greetings all,
> >
> > I have a couple issues regarding geographic names databases.
> >
> > 1) The first is this.  I have 3 tables.  Country, state and
> > city.  Country has
> > a country_id to identify a country, state has a state_id and
> > country_id to
> > identify a state, and city has a city_id, state_id and
> > country_id (for easy
> > reference) to identify it.  I then have a table for users
> > that stores their
> > city, state and country ID's along with other info about them.
> >
> > My problem came recently when I questioned the integrity of
> > the data and
> > needed to make some changes.  I thought to myself that maybe
> > storing the ID
> > wasn't as good as storing the ISO or FIPS 2 letter
> > abbreviation.  The only
> > problem the abbreviation could changed at some point by the
> > regulating bodies
> > and all rows in all tables would need to be updated.
> >
> > The question is, for the purposes of querying or searching is
> > it better to
> > store and search a 2 byte integer that is indexed for country
> > or state ID's,
> > or is it better to store and search a 2 byte CHAR abbreviation?
> >
> > 2) I've spent an accumulated total of around a month and a
> > half trying to
> > consolidate geographic name data from several free sources on
> > the net and
> > realize this isn't the best use of my time and errors will be
> > had.  Does
> > anyone know of a reliable source of geo data that isn't
> > costly?  Most want to
> > charge a server license, annual rate, etc.  I'm not sure
> > about the free
> > sources because one I used actually had mixed values in a
> > column and drove me
> > nuts.  I primarily need:
> >
> > country
> > state
> > county if applicable
> > city
> > latitude
> > longitude
> >
> > This is primarily input from an HTML form to calculate
> > distances between
> > users.
> >
> > Anyone who has any experience with geo name data I would
> > appreciate hearing
> > your solution.

Re: Geographic data sources, queries and questions

От
"Chuck D."
Дата:
On Wednesday 23 May 2007 18:59, you wrote:

>
> I don't believe this is good design.  You'll have to have a trigger or
> something to verify that the country_id+state_id on the city table are
> exactly equal to the country_id+state_id on the state table.  If you
> don't, you might have something like (using US city names...) "country:
> USA -> state: NY" -> "country: Zimbabwe -> state: NY -> city: New
> York".

>
> It isn't a problem of "any country and any state" on the city table, but
> a problem of "this state inside that particular country".  I'd drop the
> country column.

You are right, this is a bad design.  The country_id on the city table has to
go.

Re: Geographic data sources, queries and questions

От
Tilmann Singer
Дата:
* Chuck D. <pgsql-list@nullmx.com> [20070524 01:26]:
> 2) I've spent an accumulated total of around a month and a half trying to
> consolidate geographic name data from several free sources on the net and
> realize this isn't the best use of my time and errors will be had.  Does
> anyone know of a reliable source of geo data that isn't costly?  Most want to
> charge a server license, annual rate, etc.  I'm not sure about the free
> sources because one I used actually had mixed values in a column and drove me
> nuts.  I primarily need:
>
> country
> state
> county if applicable
> city
> latitude
> longitude
>
> This is primarily input from an HTML form to calculate distances between
> users.
>
> Anyone who has any experience with geo name data I would appreciate hearing
> your solution.

We are using this data which seems to be fairly extensive and
accurate, and is free:

http://earth-info.nga.mil/gns/html/gis_countryfiles.htm

I haven't fully understood the meaning of all the fiels in there
however, we're using it only to compute alternative spellings for city
and country names and came up with these conditions that seem to
return the desired results:

For city alternatives:

select lower(full_name) as full_name from geo_names gn1 where gn1.ufi in
      (select ufi from geo_names gn2 where lower(gn2.full_name)=:city
      and gn2.fc='P')

For country  alternatives:

select lower(full_name) as full_name from geo_names gn1 where
      gn1.ufi in
      (select ufi from geo_names gn2 where
      lower(gn2.full_name)=:country and gn2.fc='A' and gn2.dsg='PCLI')


Til

Re: Geographic data sources, queries and questions

От
"John D. Burger"
Дата:
Tilmann Singer wrote:

> We are using this data which seems to be fairly extensive and
> accurate, and is free:
>
> http://earth-info.nga.mil/gns/html/gis_countryfiles.htm

We use that, but it is only non-US, so we combine it with this:

   http://geonames.usgs.gov/domestic/download_data.htm

We also have a hodge-podge of other sources, but those are the main
ones.  (By the way, we have found USGS to very amenable to dumping
their data in arbitrary ways.  Those state files essentially try to
fit everything into a single CSV format, but they have given us other
custom dumps.)

Note that both of these main sources have multiple names for the same
location, so our schema is highly normalized - we have a separate
table for names (so the string "Springfield" occurs in only one
place :).  Because we are interested in all sorts of geographic
entities, not just city/state/country, we have only a single table
for these, with fields for type, lat/long, primary name, and a few
other things.  All other relationships are represented in separate
linking tables, using our internal IDs for locations and names, e.g.,
location_has_name, location_contained_in_location, etc.  As far as
FIPS and ISO codes are concerned, we have a separate table mapping
(locationID, standards body) to codes.

We are interested in sharing this stuff, so I'd be happy to pass
along the schema and/or the data, although all of it is kind of beta.

- John D. Burger
   MITRE



Re: Geographic data sources, queries and questions

От
"Chuck D."
Дата:
On Thursday 24 May 2007 13:02, John D. Burger wrote:
>
> We also have a hodge-podge of other sources, but those are the main
> ones.  (By the way, we have found USGS to very amenable to dumping
> their data in arbitrary ways.  Those state files essentially try to
> fit everything into a single CSV format, but they have given us other
> custom dumps.)
>
> Note that both of these main sources have multiple names for the same
> location, so our schema is highly normalized - we have a separate
> table for names (so the string "Springfield" occurs in only one
> place :).  Because we are interested in all sorts of geographic
> entities, not just city/state/country, we have only a single table
> for these, with fields for type, lat/long, primary name, and a few
> other things.  All other relationships are represented in separate
> linking tables, using our internal IDs for locations and names, e.g.,
> location_has_name, location_contained_in_location, etc.  As far as
> FIPS and ISO codes are concerned, we have a separate table mapping
> (locationID, standards body) to codes.
>
> We are interested in sharing this stuff, so I'd be happy to pass
> along the schema and/or the data, although all of it is kind of beta.

John,

I'd be happy to take a look at the schema and data.

I decided to put together the USGS stuff, the maxmind free stuff and the
GeoNames project files and in the end I had countries with no states, states
with no cities and cities with no states.  Some data sources said a country
had 40 states, another said it had 50.  It was difficult to try and figure
out because I don't know geo stuff enough to verify it.

How can I go about getting a peek at the schema and the sources at least?


Re: Geographic data sources, queries and questions

От
"John D. Burger"
Дата:
Chuck D. wrote:

> I decided to put together the USGS stuff, the maxmind free stuff
> and the
> GeoNames project files and in the end I had countries with no
> states, states
> with no cities and cities with no states.  Some data sources said a
> country
> had 40 states, another said it had 50.  It was difficult to try and
> figure
> out because I don't know geo stuff enough to verify it.

Yeah, all of our sources data has various degrees of noise.  There
were even locations mis-typed as =countries= in the official NGA
downloads - you'd think their validation would at least identify
spurious countries :).  We developed a set of heuristics for deciding
when two locations (usually but not always from two different
sources) were in fact the same entity.  This was an area that needed
more work, however, when the project ended.  In addition, different
sources had made different ontological decisions about what was
what.  For instance, does the US have 50 states - what about the US
Virgin Islands, etc?

This was a few years ago - if we were to start up again, I suspect we
would investigate working with whoever is behind geonames.org, as
they seem to have the same kind of goals we did.  Anyway, I will send
our schema under separate cover, and I will investigate sending you
the data as well.

- John D. Burger
   MITRE



Re: Geographic data sources, queries and questions

От
brian
Дата:
John D. Burger wrote:
> For instance, does the US have 50 states - what about the US  Virgin Islands, etc?

Off-topic, but the US Virgin Islands are an "unincorporated United
States insular area" ("territory"--note the lowercase t). The Dept. Of
the Interrior addresses this in their FAQ:

http://www.doi.gov/oia/FAQ/FAQindex.htm#4

You'll be surprised and amazed at the number of US "terms of
sovereignty" that exist.

b

Re: Geographic data sources, queries and questions

От
"John D. Burger"
Дата:
> Off-topic, but the US Virgin Islands are an "unincorporated United
> States insular area" ("territory"--note the lowercase t). The Dept.
> Of the Interrior addresses this in their FAQ:
>
> http://www.doi.gov/oia/FAQ/FAQindex.htm#4
>
> You'll be surprised and amazed at the number of US "terms of
> sovereignty" that exist.

Yah, that's my point - some data sources might lump all these
together as state/province level entities, and some might not.

- John D. Burger
   MITRE



Re: Geographic data sources, queries and questions

От
mkwong8888@hotmail.com
Дата:
You can try the free sample database from http://www.geodatasource.com


Re: Geographic data sources, queries and questions

От
"Chuck D."
Дата:
On May 24, 2007 01:02:42 pm John D. Burger wrote:
> Tilmann Singer wrote:
> > We are using this data which seems to be fairly extensive and
> > accurate, and is free:
> >
> > http://earth-info.nga.mil/gns/html/gis_countryfiles.htm
>
> We use that, but it is only non-US, so we combine it with this:
>
>    http://geonames.usgs.gov/domestic/download_data.htm
>
> We also have a hodge-podge of other sources, but those are the main
> ones.  (By the way, we have found USGS to very amenable to dumping
> their data in arbitrary ways.  Those state files essentially try to
> fit everything into a single CSV format, but they have given us other
> custom dumps.)
>
> Note that both of these main sources have multiple names for the same
> location, so our schema is highly normalized - we have a separate
> table for names (so the string "Springfield" occurs in only one
> place :).  Because we are interested in all sorts of geographic
> entities, not just city/state/country, we have only a single table
> for these, with fields for type, lat/long, primary name, and a few
> other things.  All other relationships are represented in separate
> linking tables, using our internal IDs for locations and names, e.g.,
> location_has_name, location_contained_in_location, etc.  As far as
> FIPS and ISO codes are concerned, we have a separate table mapping
> (locationID, standards body) to codes.
>
> We are interested in sharing this stuff, so I'd be happy to pass
> along the schema and/or the data, although all of it is kind of beta.
>
> - John D. Burger
>    MITRE

This has been a while since I've written discussing this but I am looking for
some help.

Has anyone successfully imported any of the newer
http://earth-info.nga.mil/gns/html/gis_countryfiles.htm world cities files?

I keep getting problems with the import using COPY.  Just when I thought I'd
solved the problem of the ^M$ in some fields (notably China's adm2's) I've
come across a problem with cc1=SG in which there appears to be extra ^I tabs.

Anyone have success with these?  What preprocessing needs to be done to these
files to get them to COPY correctly?