When we look at foreign keys in a database model (keeping in mind that we’re trying to get these into a geodatabase) we can see a few patterns:
- Implemented for integrity (this would be all of them!)
- Link lookup values to a base table
- Single column foreign keys
- Multi-column foreign keys
- Foreign keys implemented for history (self-joins)
These patterns can be handled in a few ways:
- Keep the relational foreign keys (boring)
- Translate the foreign keys for lookup values into a domain
- Use relationship classes
The self join concept is something that is only do-able via the relational database so not much can be done about this one, although for the number of times that this actually occurs it isn’t that bad of an option.
Before we go any further, let’s look at a query that will bring back the all the foreign keys that are a) not self-joins and b) are defined on a single column. Something like this will do the trick:
SELECT a.constraint_name,
a.r_constraint_name,
a.delete_rule,
b.table_name fk_table_name,
b.column_name fk_column_name,
c.table_name pk_table_name,
c.column_name pk_column_name
FROM (SELECT *
FROM sys.dba_constraints
WHERE dba_constraints.constraint_name IN (SELECT constraint_name
FROM sys.dba_cons_columns
WHERE dba_cons_columns.constraint_name IN (SELECT constraint_name
FROM sys.dba_constraints
WHERE owner = 'PPDM37' AND constraint_type = 'R')
GROUP BY dba_cons_columns.constraint_name
HAVING COUNT(constraint_name) = 1)) a,
sys.dba_cons_columns b,
sys.dba_cons_columns c
WHERE a.owner = b.owner
AND a.constraint_name = b.constraint_name
AND a.r_owner = c.owner
AND a.r_constraint_name = c.constraint_name
AND b.table_name != c.table_name
The bad thing here is that this query doesn’t distinguish between single-column foreign keys that are implemented for integrity only or those that are implemented for lookups. That said, since we’re limiting our scope to PODS and PPDM we can use the naming conventions of these data models to tease apart the results. In the PPDM model most of the lookup tables are prefixed with R_ whereas in PODS the lookup tables are suffixed with CL. Another thing we can bank on because these databases use standard naming approaches is that in the PPDM lookup tables we’ll have a LONG_NAME column and in PODS we’ll have DESCRIPTION column.
Given these assumptions we can separate the results fairly easily. From this point I’ll assume that you know how to query a database using Python (if not then refer to the Constraints and Comprehensions post) and I’ll begin with a snippet that filters the results to return just the records that represent foreign keys implemented for lookup values:
resultsFilter = [r for r in results if r[5][:2] == 'R_']
and the other set would be returned by
resultsFilter = [r for r in results if r[5][:2] != 'R_']
Pretty simple hey? Not everything has to be difficult…but then again, we really haven’t accomplished much yet. The next steps are where this get cool, where we translate this into domains, simple relationship classes, and composite relational classes. That’s enough for one night, more later.
When designing a spatial data model we have many more concerns that those of just pure relational models. Sure we can inherit some of the base modeling approaches and in most cases this makes sense, but what concepts do we need to understand and write rules around for migrating from a relational model into a spatial model. Or more specifically, how do we map concepts and objects from the relational realm into that of the spatial world?
Let’s begin by laying down some base assumptions rooted in relational database modeling:
- Normalization is your friend, the order is to be determined
- Super classes and classes can be useful when behavior or attribution differs
- Check constraints and restrictive vocabulary (look-up tables) are important for data integrity
- Primary and Foreign keys are an absolute must for referential integrity and data maintenance
- Naming conventions for objects and attributes simplify the implementation
- More…
Hard to argue many, if any, of those — just good common sense when it comes to data modeling — and, for the most part, these are going to be the assumptions we use for spatial database modeling as well.
Instead of building an exhaustive list of assumptions (there’s a pile of great books available on these topics) our time is probably better invested in building up a set of assumptions that challenge the relational rules and/or patterns used in well developed relational database models such as PPDM and PODS.
Over the years we’ve “spatialized” many databases so we’ve seen a few patterns emerge and, remarkably, the more databases we see the fewer patterns emerge. Here’s a collection of what we normally encounter and need to handle when moving from relational into spatial, for now we’ll just identify the key items and elaborate mildly, just enough to give you a flavor, the details of each item will be handled in its own upcoming blog entry.
Musical interlude….If I haven’t already mentioned it, the goal here is to develop a series of processes or Python modules that can magically crawl through the metadata of your database (Oracle in my case) and spit out a Geodatabase structure with minimal intervention….elevator music concludes…
Relational Patterns
- Coordinate pairs stored as numbers with associated projection/datum code – normally used in cases where point geometry would represent the object location
- Listing of coordinate pairs as with an associate grouping identifier — normally used for lines or polygons that have their full geometry defined as an ordered set of coordinates, could be in WKT or a 1:M relationship.
- No coordinate pairs, locations are referenced by a legally defined survey system, address, linear reference system, and even linear reference systems within linear reference system (events of events)
- Minimal coordinates (e.g. bounding box) with offset definitions from a local orthogonal coordinate system
- Coordinate Collections can have different spatial references defined or may have unknown spatial reference (how is this information even remotely useful — direct linkage with quality)
- Multi-column primary keys, very common in PPDM, not so common in PODS. In either case, this is a bit of a limitation in the Geodatabase world but thankfully the use of GUID’s can help.
- Super classes — a legacy approach to spatializing PPDM suggested that all Polygons be managed in one feature class and then joined to the necessary business table using the key values, makes sense in theory but performance is a pain (same was suggested for points and lines). Things have improved…tied in directly with this item is the natural ask, what about sub-classes…yup, that’s where subtypes come into the mix
- Versioning — ’nuff said
- Quality — capturing this at the element level versus the classic GIS approach of generalizing across the whole dataset. Another concept here is returning preferred data only, multiple sources and hierarchical business rules on which elements are the most valuable for use
- Constraints — when should these be migrated and how? As relational database constraints only? As relationship classes? Or maybe as domains?
- Lookup tables — classic approach in databases to ensure efficient storage and high fidelity data — do domains provide everything needed in the geodatabase world?
- Business objects with variable spatial location depending on aggregation level, the roll-up of information may make the underlying data applicable to a more well defined location or a more general location — think summary statistics for census data as an example.
- More…
Alright, that should be enough to keep us busy for the next few months. Tune in for updates as we hack through these ideas and present our findings.
In this installation we’ll explore how it might be possible to choose a study area for use in a decision support system that is founded on GIS capabilities.
Supply and Demand
The extents of supply and demand (stop and pause for a moment to reflect on the oddness of giving spatial dimension to economic factors) are quite easily definable since we are looking at the use of decision support systems and GIS with respect to siting of linear assets. In this case the bounding box of the start and end points is in effect the supply and demand extent. The supply is usually obvious but the demand could in fact be a couple of markets away. But let’s not wax philosophical too much on this point since the asset to be sited will not be all things to all markets, but rather will meet a need and then its inertia will take hold after that.
Using the supply and demand extent for the study area does require a few caveats, for instance, if the supply and demand points line-up in a north-south or east-west direction then the bounding box actually becomes a bounding line. Huh? So clearly there has to be a little more science applied in these cases to properly define the extent of supply and demand.
The interesting thing (at least interesting to me) is that in the limit of extents we begin to observe a classic case of edge effects and their influence on the study area, corridor definition, and any analysis that occurs within the study area. We’ll cover some of the edge effects typically found in linear asset siting algorithms in a future babble-gram.
“Fish aren’t the only things with scales”
Working in the world of decision support means that many stakeholders have many concerns about many phenomena. The unfortunate part of this is that we can have competing interests at a variety of locations, a variety of timelines, and , what we’ll focus on in this section, a variety of scales.
Imagine a town hall session that starts with three people stepping up to the microphone to express their concerns. The first mentions their worry about loss of property value to their home, the next is concerned about losing a sense of community, and the last has more haughty reservations about the future expansion of the town into an industrialized area. All are valid “keep me awake at night” issues and to address these issues it would require, or at least traditionally have required, data to support each question individually. The data used would be a variety of qualities and/or resolutions representing extents to a suitable address at a detailed analysis level (my house), local level (my neighborhood), and regional level (my town).
What does this tell us? Or how is this useful? Well if we had a myopic view of the task at hand, that is, seeking to develop the linear asset, then we might address each of these concerns as separate entities, gathering data for each on their own and then trying (with much flailing) to amalgamate the datasets after the fact and hoping the planets were aligned enough to ensure outputs from the analyses produced consistent results. Ha! Anyone who has run the same spatial analysis/process at different scales or with different resolutions of data will know immediately that each level of resolution produces a unique set of results. It takes a great level of planning and preparation to ensure consistency of results across multiple scales.
But we do not want to take a myopic view! We need to address all questions and concerns with the same amount of effort and ideally the same level of fidelity. So the danger here is that when you work with many scales of data and then try to aggregate at the results level you’ll often end up trying to talk your way (aka wave your hands) into convincing the stakeholders. Bewildering stakeholders with endless charts and graphs and maps does not build consensus — it builds confusion and apprehension.
How does this affect study area? Quite simply, the study area could be “arbitrarily” chosen based on the availability of good data, but good data is not available for all extents areas of the world. We can either spend millions in collecting data to address a ten cent question or we can learn to mix and match and manage the transition between high and low quality datasets. A wonderful balance to be struck between dollars and decisions, doing enough to show that a particular concern has been addressed in detail and not just in general.
“You Can’t Handle the Data”
Somehow the drama of that court room scene just wouldn’t have been so Oscar-like if spoken that way. The real phrase, of course, was “You Can’t Handle the Truth” and we pick on this because even today there seems to be a fear (mostly from the uninformed who also happen to be in positions of power – thanks Peter for your principle) that using high fidelity data in regional level studies is too expensive (time, data costs, computational).
This argument has been raised several times with us over the years and mostly from the standpoint of air and water modeling (sorry guys, not trying to pick on you) where high quality data was seen as a crippling issue, why? because their modeling software has ridiculous limitations like no raster inputs larger than 1024×1024. So really, honestly now, is this really an issue of high quality data or is this an issue of poor quality models that don’t scale nicely into the real world.
We’ve run a couple of analyses over the past few years that could be considered pushing the envelope with respect to data quality and extents. Yes it took time to prepare the data, write the code, manage memory and disk space (what! imagine caring about this) and run the analyses but so what? Wouldn’t you expect the quality of the answer to be higher as well? Wouldn’t you want the higher quality answer? Wouldn’t you want a known quality answer?
Unity with Bias
The classic approach to defining the study area in practice has a lot to do with unity. Not trying to imply that solidarity and a sit-in are required to choose the study area, although consensus will be a key component in the decision making process, but rather that the value of one is used as a defining metric. For instance, if you’re a believer in the imperial measurement system then it makes sense to study within 1 mile of the proposed corridor (hmmm, does someone already have the answer?) or if you’re more of a metric kind of person then why not study within 1 kilometer of the proposed corridor (again, does someone already have the solution in hand?). Unity!
Before moving on to address bias it is worth mentioning that the measure of distance from the proposed corridor does not need to be in units of length and could be measured in units of deviation cost (e.g. dollars or effort). This will be another upcoming topic.
What should we say about bias? Well the main thing we want to say about it at this point in this series is that measuring anything within proximity to the solution (i.e. the corridor) sure seems backwards. From a practical standpoint, there has to be some rudimentary starting point (think of it as seeding an iterative loop) but, and here’s the conundrum, if your study area is defined by a set distance from the assumed solution, and then the underlying assumption changes, then do you have adequate coverage in your study area any longer? Does your study area now become a dynamic creature, ever expanding and contracting throughout the project life? How do you propose to manage data throughout the lifecycle of a project whose study area is continually in flux? These are all good and valid questions, but just think for a moment of the implementation issues that subtle changes in study areas can cause.
That’s it for study areas for now until we begin working some examples in our upcoming installments, now it’s time for us to focus on the concept of costs in the context of decision support.
Ouch. Bright lights, too much caffeine, and a head cold. What a great way to travel from coast to coast. Red-eye flights are a great convenience but the pain on the other end of the time zone change? Ouch. REM sleep, that’s the time of the night where in-seat entertainment monitors flash and burn into your retinas?
Enough complaining for now..somewhere between insomnia and head bobs I had a chance to look into the next part of our data modeling adventure! That being how to handle translation of a primary and foreign keys into a Geodatabase model. The happiest path is we just take them as-is and re-apply these constraints to the database tables…but that would be the easy way out.
More to come…must acclimatize…
No more denial! I am a Python junkie — one of my favorite languages for rapid development and for performing data/database integration. Recently I re-read several chapters of the Dive Into Python book (www.diveintopython.org) and came across a section I had no recollection of reading before, and (spoken in a Chris Farley voice) “It was awesome”.
“Data-centric Programming” (section 16.5 of Dive Into Python) is a great example of an embedded rant, a brief manifesto on the merits of Python over other languages with a slight slam on VB (nice!). More than just a slam on VB, this part of the book is a strong reminder to those who have chosen Python to embrace its elegance and, what may seem to be counter to everyday practice, harness its capabilities and form new habits.
The Python language has some way-cool features such as map, filter, reduce that have been around in the language for quite some time. Map and filter (assuming I’ve read the Python 3.0 changes correctly) are being deprecated and the recommended path forward, and the Python docs have included this for some time now, is to use List Comprehensions. Depending on what you’re planning to accomplish, list comprehensions have the benefit of working at almost-C speed and are considered to be be way more Pythonic than for loops.
Why that preamble? Well partly to ask for forgiveness on what comes next.
One of my pet projects is to build some automated tools for translating relational database models into ESRI Geodatabase models. In another post (one that is in the works but still needs polish) I’m identifying the components that I plan to address but for now I’m putting this one out there as a taster.
Databases used check constraints for data integrity. Check constraints in an ESRI Geodatabase could be implemented as RDBMS check constraints but this would not allow for ease of use of the new feature class or table within the ESRI suite of tools. So instead of leaving the check constraint as a pure RDBMS implementation we can translated that into an ESRI Coded Domain or ESRI Range Domain.
Looking at the PPDM 3.7 database model we can easily gain access to all of the check constraints by querying, in this particular case, the Oracle metadata tables/DBA views. The query would look like:
SELECT TABLE_NAME, SEARCH_CONDITION
FROM SYS.DBA_CONSTRAINTS
WHERE OWNER = 'PPDM37'
AND CONSTRAINT_TYPE = 'C'
Without the CONSTRAINT_TYPE set to ‘C’ this will return all of the constraints in the database which is fine but for now we’ll just handle check constraints.
This brings back results that look something like:
and gives us a lot of useful information like the table and field name (although it is embedded in the constraint text) as well as the type of check that is being performed. In the output above most of what show are IS NOT NULL checks — just so happens that the best way to handle these in a Geodatabase is by using the exact same approach, that is, set the property of the underlying field so that is cannot accept NULL values.
However, the types of check constraints that we’re seeking are those that use range or coded values. After trolling through the results for a while the patterns of the returned results became pretty obvious, only coded domains were found and within these coded domains we saw that many of the checks used the same values (e.g. value must be Y or N). From a Geodatabase perspective we can create a coded domain and apply it to many fields, a little bit of a deviation from how it works in a pure relational sense.
Let’s build a a mini framework for what we plan to accomplish:
- Find tables that have constraints
- Identify fields that have constraints
- Extract the constraints
- Collect common constraints together as a coded domain
- Keep track of table/field combinations that require the coded domain
- Implement as ESRI Geodatabase objects**
**this item we’ll keep for a later date since it relies heavily on implementation of other concepts and objects.
Now for some code…where to begin? For this particular example we will be accessing an Oracle database so cx_Oracle is a great module use in Python. Another thing we want to do is group the coded domains based on their values, sure we can write code for this (a dictionary is a great structure to use) but we can also use the itertools module. We also need to perform some string manipulation so let’s include the string module as well.
import cx_Oracle, string, itertools
connection = cx_Oracle.Connection('UNAME/PWORD@ORA')
cursor = connection.cursor()
#This is the SQL statement above
cursor.execute(sqlConstraint)
results = cursor.fetchall()
What this snippet does is import the modules we need, connects to the database, opens a cursor and then uses that cursor to execute a query (the same one presented above). Last step is to retrieve, in this case, all of the results of the query. These results will include the “IS NOT NULL” constraints, they also include extra spaces and hard-returns which are great for readability but useless for the computer.
So what do we do? The results represent (as per the Python DB spec) a list of tuples where each tuple is a record and in our case holds the (TABLE_NAME, SEARCH_CONDITION). The approach we used is a list comprehension, it allows us to filter the data and also transform it at the same time.
resultsFilter = [(r[0], string.split(r[1], ' ')[0], eval(string.replace(r[1], '\n', '')[string.find(r[1], ' IN ') + 4:])) for r in results if 'IS NOT NULL' not in r[1]]
This one line of code does a wild amount of work. The if statement part of the list comprehension ensures that only those entries are not ‘IS NOT NULL’ get processed — this means we’ll only get items that are eligible for coded domains. The r[0] term is simply the table name and it gets returned as the first element of a tuple. The second item of the tuple is the field name and the third item in the tuple is magic — okay, it isn’t magic, just code — and it strips hard-returns from the search condition string, slices the string based on the location of the IN operator, and then uses an eval function to translate from string into a tuple. Sample output:
('ECOZONE_XREF', 'ACTIVE_IND', ('Y', 'N')),
('ENTITLEMENT', 'ACTIVE_IND', ('Y', 'N')),
('ENT_COMPONENT', 'ACTIVE_IND', ('Y', 'N')),
('ENT_COMPONENT', 'INFO_ITEM_TYPE', ('RM_MAP', 'RM_WELL_LOG', 'RM_DOCUMENT', 'RM_SEIS_TRACE', 'RM_FOSSIL', 'RM_COMPOSITE', 'RM_LITH_SAMPLE', 'RM_EQUIPMENT')),
('ENT_COMPONENT', 'LAND_RIGHT_TYPE', ('LAND_UNIT', 'LAND_UNIT_TRACT', 'LAND_TITLE', 'LAND_AGREEMENT', 'LAND_AGREE_PART'))
Sure, at first glance the readability may not be that high, but once you’re familiar with the syntax it is remarkably elegant.
This is really close to what we need but the last step will be to group this information together so that we have a list of unique coded domains and all of the associated table/field combinations. It is easy enough to write a loop to do this for us, pushing the results into a dictionary, but why write what is already written? The itertools module has a slick groupBy method that can be used for performing the next step, the output of an itertools method is an iterator so we wrap the result in a list comprehension to pull the data back into a ‘more useable’ form i.e. lists and tuples.
groupValues = [[key, [g[:2] for g in group]] for key, group in itertools.groupby(resultsFilter, lambda r: r[2])]
Sample output:
[[('Y', 'N'), [('APPLICATION', 'ACTIVE_IND'), ('APPLICATION', 'FEES_PAID_IND'), ('APPLICATION', 'RESUBMISSION_IND'), ('APPLICATION', 'SUBMISSION_COMPLETE_IND'), ... , ('AREA_DESCRIPTION', 'ACTIVE_IND'), ('BA_ADDRESS', 'ACTIVE_IND'), ('BA_ADDRESS', 'WITHHOLDING_TAX_IND'), ('BA_ALIAS', 'ACTIVE_IND'), ('BA_ALIAS', 'PREFERRED_IND'), ('BA_AUTHORITY', 'ACTIVE_IND'), ('BA_AUTHORITY_COMP', 'ACTIVE_IND')]],
[('LAND_UNIT', 'LAND_UNIT_TRACT', 'LAND_TITLE', 'LAND_AGREEMENT', 'LAND_AGREE_PART'), [('BA_AUTHORITY_COMP', 'LAND_RIGHT_TYPE')]]]
And that’s about it! Between some basic SQL and two list comprehensions we have all the information needed to setup ESRI Geodatabase coded domains and to assign the domains to a field in a table.
Filed under: Funny (or at least Funny to me)
Wow, it has been way too long between posts. I’ve got a few in the hopper but between going on 3 business trips and going through just as many laptops in the past month it has been, might I say, a tad lean on extra cycles! But stay tuned, the veil of silence will be lifted shortly.