Re: Table Design for Hierarchical Data

Поиск
Список
Период
Сортировка
От Steve Crawford
Тема Re: Table Design for Hierarchical Data
Дата
Msg-id 4BBBD086.1000205@pinpointresearch.com
обсуждение исходный текст
Ответ на Table Design for Hierarchical Data  (Lee Hachadoorian <lee.hachadoorian@gmail.com>)
Список pgsql-sql
Lee Hachadoorian wrote:
>
> I am trying to come up with a structure to store employment data by 
> NAICS (North American Industrial Classification System). The data uses 
> a hierarchical encoding scheme ranging between 2 and 5 digits. That 
> is, each 2-digit code includes all industries beginning with the same 
> two digits. 61 includes 611 which includes 6111, 6112, 6113, etc. A 
> portion of the hierarchy is shown after the sig.From the http://www.census.gov/eos/www/naics/ website:

"NAICS is a two- through six-digit hierarchical classification system, 
offering five levels of detail. Each digit in the code is part of a 
series of progressively narrower categories, and the more digits in the 
code signify greater classification detail. The first two digits 
designate the economic sector, the third digit designates the subsector, 
the fourth digit designates the industry group, the fifth digit 
designates the NAICS industry, and the sixth digit designates the 
national industry. The five-digit NAICS code is the level at which there 
is comparability in code and definitions for most of the NAICS sectors 
across the three countries participating in NAICS (the United States, 
Canada, and Mexico). The six-digit level allows for the United States, 
Canada, and Mexico each to have country-specific detail. A complete and 
valid NAICS code contains six digits."

I think I'd be inclined to store it as defined above with tables for 
sector, subsector, industry-group and NAICS-industry. So the NAICS table 
might have a primary key of industry_code (11131, Orange Groves) and a 
industry_group column with a foreign-key constraint to the 
industry-group table (1113, Fruit and Tree Nut  Farming). You might add 
a constraint to ensure that the industry-group is the appropriate 
substring of the naics code and so on up the heirarchy. If you are 
dealing with importing a large amount of static source data for 
analysis, these tables will also be tailor-made places to do 
pre-aggregation.

Adjacency lists work well in certain cases where the depths of the trees 
are variable or indeterminate. For example, think of an employee->boss 
org-chart for a large company. The maintenance supervisor for an area 
might be a dozen levels below the CEO and be a several levels above the 
branch night janitor while the CEO's personal assistant is just one 
level down but with no direct reports. The CTE/recursive-query features 
in 8.4 are great for this. But in the case you have described, the 
number of levels is well defined as is the type of information 
associated with each level.

But this all depends on the nature of your source data, how often it is 
updated, how big it is and the questions you want answered. It might be 
perfectly acceptable to just have the 5-digit code on all your 
individual data records and do something like select ... group by 
substr(full_naics_code,1,3) where substr(full_naics_code,1,2)='61'). In 
this case you will still want to keep the NAICS definition table 
separate and link to it.

One question that might impact this is the coding of your source data. 
Is it all full 5-digit coding or are some records coded at a high level 
of detail and others only to the top-level?

>
> One way to store this data would be to store at the most granular 
> level (5-digit NAICS) and then aggregate up if I wanted employment at 
> the 4-, 3-, or 2-digit level. The problem is that because of 
> nondisclosure rules, the data is sometimes censored at the more 
> specific level. I might, for example, have data for 6114, but not 
> 61141, 61142, 61143. For a different branch of the tree, I might have 
> data at the 5-digit level while for yet another branch I might have 
> data only to the 3-digit level (not 4 or 5). I think that means I have 
> to store all data at multiple levels, even if some of the higher-level 
> data could be reconstructed from other, lower-level data.
>
What do you mean by censored? Is the data supplied to you pre-aggregated 
to some level and censored to preserve confidentiality or are do you 
have the record-level source data and the responsibility to suppress 
data in your reports? Is the data suppression ad-hoc (i.e. someone comes 
to you and says don't display these five aggregates), based on simple 
rules (don't display any aggregate with fewer than 15 records) or on 
more complex rules (don't display any data that would allow calculation 
of a group of fewer than 15)? My guess is that the multi-table scenario 
will be better suited to flagging aggregates for suppression.

Cheers,
Steve


В списке pgsql-sql по дате отправления:

Предыдущее
От: Richard Broersma
Дата:
Сообщение: Re: Table Design for Hierarchical Data
Следующее
От: silly sad
Дата:
Сообщение: Re: Table Design for Hierarchical Data