(As published in the Proceedings of the Spatial Sciences 2003 Conference 22-26 September Canberra Australia)

Metadata and Timestamping in RIME

Rodney J. Thompson

Department of Natural Resources and Mines (NR&M),

Vulture and Main Sts, Woolloongabba Brisbane 4169 Australia.

Tel: +61 7 38963286 Fax: +61 7 34062361 Email: Rod.Thompson@nrm.qld.gov.au

ABSTRACT

RIME stands for the Resource Information Management Environment. It includes a database of topographic and other resource information in a spatial repository – an Informix database using the ESRI-developed "spatial datablade".

Spatial data within RIME is at a wide range of scales and accuracies.

Two issues in the design of RIME are worthy of note:

1. The use of ISO/TC211 (ISO 19115) compatible metadata, built into the structure of the database itself.

This metadata is an active part of the control mechanism of the database itself, providing security checking, audit logging, lock control, and history of the data, in addition to the quality and lineage statements. All access to the data is mediated by the metadata.

The metadata is also used to control the choice of data to be presented to a user, limiting access to data that would be inappropriate, for reasons of accuracy, timeliness, adherence to standards etc. 

2. The timestamping of features, allowing a view of the database “as at” any point of time (subsequent to the loading of the database). This has interesting ramifications when taken in conjunction with the notion of active metadata, and allows strong and flexible control of the spatial data.

Keywords: natural resource information, spatial information, geographic information system, topographic data, metadata, temporal data

Introduction

It is coming to be recognised that, in spatial databases, metadata is a vital component, and so it was an early decision in the development of RIME to construct the database based on the metadata. This was to be an active connection, in that the metadata is used to control the data itself.

The approach in RIME is that the metadata is recorded as part of the database. Further, the design and nomenclature of the metadata component of RIME are closely based on the International Standard IS19115 (ISO-TC211 2002) Metadata model.

In this approach, the timestamping of features for the purpose of update management becomes an integral part of the metadata collection process, and a complete record of all updates is automatically kept.

Metadata

On printed maps it has been recognised for many years that the metadata is vital to the usefulness of the map. No navigator would ever consider using a map without a title block containing an indication of the currency of the map, the scale, and the accuracy. This was not necessarily the case in the early days of GIS (Geographic Information Systems), when data were often used inappropriately. Fortunately the situation is improving; however, even today the metadata is often seen as an “afterthought”, to be added to the database after the data is loaded.

It was an early decision in the development of RIME to construct the database based on the metadata, with the metadata to be stored within the database. This concept is known (at least in this paper) as “active metadata”.

Codd's "Rule 4" (Date 1990) for relational DBMSs requires the data dictionary to be part of the database. This concept has been so successful that most practitioners of relational database technology have forgotten the term “metadata”, and are not aware of the fact that they are actively using it in their day-to-day work.

Contrast this with the current state of spatial data. In the best of cases, a metadata tool is provided which, by the organisation’s operational standards, is to be used to record “metadata” on every file in the organisation (possibly restricted to files containing spatial data). In the worst cases, the concept of metadata is unknown. Even in the best case, the quality of metadata capture is variable. The only compulsion on anyone to “get it right” is some kind of organisational coercion.

Where metadata is maintained, it is usually only at the dataset level, with very few tools being available for the recording of feature level metadata. Furthermore, with metadata tools working to standards such as the previous ANZLIC Metadata Guidelines (ANZLIC 1996), a large number of mandatory fields leads users into coding less-than-useful entries such as "not applicable".

Metadata In RIME:

The philosophy of RIME is for the metadata to be part of the database structure. This approach is taken to allow the maximum automatic generation of metadata, combined with a necessary minimum of data capture effort. For example, such fields as “timestamp” are automatically captured, while fields such as “purpose” must be supplied by the user. The ISO standard IS19115 has the advantage that most fields are optional, so that there is no need to code items with values such as “not applicable”.

The IS19115 allows metadata to be recorded against a “dataset”, a “dataset series”, or against any collection of features (including a single feature). The standard allows metadata to apply to a specific attribute, or set of attributes of a feature. In addition, a set of metadata may exist within a parent set of metadata.

Within RIME the term "qualset" (from "quality set") has been coined, defined as any collection of features with any metadata in common. "Qualset" is thus a generic term for "database", "dataset series", "dataset", and "feature collection". A qualset may represent the whole database, a dataset within the database, or any collection of features. It could consist of a feature collection containing only one feature.

Figure 1 Metadata within RIME

It is allowable for a dataset series to be contained within another dataset series (e.g. a series at 1:25000 within the general topographic series).

Within RIME, qualsets are identified by a "fileIdentifier" (as per the ISO standard). The convention is that a file identifier consists of a number of “facets” separated by “/” characters. Thus “NRM/GDS/TOPO/19115” is a valid fileIdentifier.

By this convention, any qualset that forms a subset of another has a fileIdentifier based on that of its parent, formed by appending “/” and a further facet. For example, the above fileIdentifier identifies a dataset series within the dataset series “NRM/GDS/TOPO”, which is in turn within “NRM/GDS”, etc.

The database allows for any metadata values to be stored at any level of the hierarchy, with the lower level data taking precedence over the higher. For example, a scale value of 1:10000 in a high level metadata item OWNER/CUSTODIAN/TEST would be overridden by the value of 1:12000 in a lower level metadata item OWNER/CUSTODIAN/TEST/OFFSCALE.
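The facet convention means that the enclosing qualsets can be derived from a fileIdentifier alone. The following is a minimal Python sketch of that derivation; the function name `enclosing_qualsets` is an illustrative assumption and is not part of RIME.

```python
def enclosing_qualsets(file_identifier):
    """Return the identifier followed by each enclosing qualset, nearest first.

    Assumes facets are separated by "/" as described in the text.
    """
    facets = file_identifier.split("/")
    # Drop one trailing facet at a time to walk up the hierarchy.
    return ["/".join(facets[:n]) for n in range(len(facets), 0, -1)]


print(enclosing_qualsets("NRM/GDS/TOPO/19115"))
# ['NRM/GDS/TOPO/19115', 'NRM/GDS/TOPO', 'NRM/GDS', 'NRM']
```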

The metadata hierarchy should ideally be defined by the form and characteristics of the features themselves – i.e. the divisions should reflect the type of data stored, NOT the work areas of the department that create or use the data. The exception to this is that the first two levels do determine the ownership and custodianship of the data.

Metadata items at the highest dataset series level (e.g. "NRM", with no "/" character) record amongst other information, the owner of the data.

Metadata items at the next level (e.g. "NRM/GDS", with a single "/" character) record the data custodian (in this example, Geographic Data Services).

The structure of the lower levels depends on the upper levels, and will vary according to the custodian, but for example:

"NRM/GDS/TOPO" contains metadata common to most topographic data within GDS such as the fact that feature names are not available.

"NRM/GDS/TOPO/25K" contains scale and accuracy data that applies to most 1:25000 topographic data.

"NRM/GDS/TOPO/25K/944723" contains metadata that applies to all features in this map sheet region.

The complete set of metadata which applies to a particular feature is composed of information from many of the above levels. Thus the feature’s own individual metadata (if any) will be the first place to look for metadata about the feature. Then each level is searched right up to the database metadata, with the lower levels taking priority for any particular item of information.

As a result of this, it is only necessary to record information such as name and address of owner, copyright information etc, once, at a high level in the hierarchy, allowing it to apply to all features in the database (unless overridden at a lower level).
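The search order described above can be sketched as a lookup that walks from the most specific level to the most general, keeping the first value found for each item. This is only an illustration in Python: the dictionary `store`, the function names, and the owner value "Example Owner" are assumptions, and RIME itself holds this information in database tables.

```python
def lookup_chain(file_identifier):
    """Levels to search, most specific first (facets separated by "/")."""
    facets = file_identifier.split("/")
    return ["/".join(facets[:n]) for n in range(len(facets), 0, -1)]


def effective_metadata(file_identifier, store):
    """Merge metadata from every level; lower levels take priority."""
    resolved = {}
    for level in lookup_chain(file_identifier):
        for key, value in store.get(level, {}).items():
            # setdefault keeps the value already found at a lower level.
            resolved.setdefault(key, value)
    return resolved


# Hypothetical metadata store, using the scale example from the text.
store = {
    "OWNER": {"owner": "Example Owner"},
    "OWNER/CUSTODIAN/TEST": {"scale": 10000},
    "OWNER/CUSTODIAN/TEST/OFFSCALE": {"scale": 12000},
}

print(effective_metadata("OWNER/CUSTODIAN/TEST/OFFSCALE", store))
# {'scale': 12000, 'owner': 'Example Owner'}
```

The owner is recorded once at the top level yet applies to the lowest qualset, while the scale recorded at the lowest level overrides the one above it.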

Active Metadata.

The metadata is not merely intended to document the data; it actually controls the access to, and the updating of the data, and records access to and operations on the database itself.

Access To Data:

Figure 2 Access constraints in metadata

The ISO standard has a section on data access/security requirements. In particular the “legalConstraints” and “securityConstraints” and generic “constraints” objects are used in RIME.

“Legal constraints” include copyright and ownership information, vital in a database like RIME with features from many different sources.

“Security Constraints” allow restrictions to be placed on collections of features for many reasons – but in RIME, they are being used to restrict access to authorised users only. The security constraints that are placed on a particular collection of features will be determined by its owner (with the owner itself being recorded as a metadata item), and so it is possible for a true “multi-custodian” database to be created. That is to say, a custodian may place a collection of data into RIME, but keep control of to whom it is made available.

The generic “constraints” object is used to record locking of data for specific purposes. At present, it records those features which are currently locked for update purposes (see below).

Lineage of Data:

(This is a component of data quality, but treated separately here because it is a significant topic in its own right).

Figure 3 Data Quality and Lineage

The aim in keeping lineage metadata is to keep as complete a record as possible of the source(s) of the data. Significant events in the capture and update of the data should be recorded so that the user may make informed decisions as to its usefulness.

The kinds of events of interest include the initial photography, stereoplotting, verification, ground truthing, conversion to digital form, revision, comparison with independent sources, etc.

Some of the above events will need to be manually entered by an operator, but some can be captured automatically, as the data resulting from the activity is entered into the database. The most obvious case is that when features are updated, the update event becomes part of the lineage statement (recording date, time, operator, and the features affected automatically). The action of loading the data into the RIME database also constitutes an automatically generated "process step".

Quality of Data:

In addition to the above lineage information, there are two major components of quality.

Figure 4 Data Quality results

The first is a summary of the data quality of the particular collection of features. The most useful and obvious entries are “scale” (meaning the largest effective scale at which the data should be used), and “revision date” (giving the timeliness of the data). RIME makes a simplification of the ISO metadata model, allowing only a single data quality summary to be recorded for a single metadata item.

The remaining majority of quality information is in the “data quality results” tables. These tables allow specific tests to be carried out on collections of features; with the results of these tests being recorded, and made available to prospective users of the data.

Thus, for example, a random verification of 2% of the feature names in a data set could be carried out against alternate sources. The result (e.g. 97% correct) could be recorded against the dataset (along with the sample size, and a citation of the alternate source used). This dataset now is available for use by anyone for whom 97% correctness of names is acceptable. If a user cannot accept 97% correctness of names, the dataset is not available without further action.

The sample size used in making the accuracy measure is also made available, in order to allow the user to be satisfied as to the justification of the accuracy measure.
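The selection logic implied by this example can be sketched as follows. This is a hedged illustration in Python only: the class `QualityResult`, its field names, and the function `acceptable` are assumptions made for the sketch and do not reflect the actual RIME table layout.

```python
from dataclasses import dataclass


@dataclass
class QualityResult:
    measure: str            # what was tested, e.g. "feature name correctness"
    value: float            # fraction of the sample that passed the test
    sample_fraction: float  # fraction of the dataset that was sampled
    citation: str           # the alternate source used for the comparison


def acceptable(results, measure, minimum, minimum_sample=0.0):
    """True if some recorded test meets the user's stated requirements."""
    return any(
        r.measure == measure
        and r.value >= minimum
        and r.sample_fraction >= minimum_sample
        for r in results
    )


# The 2% verification from the text, with a hypothetical citation string.
results = [QualityResult("feature name correctness", 0.97, 0.02,
                         "Example alternate source")]

print(acceptable(results, "feature name correctness", 0.95))  # True
print(acceptable(results, "feature name correctness", 0.99))  # False
```

A user who also distrusts small samples could pass a `minimum_sample` value, in which case the 2% sample above would no longer satisfy a requirement of, say, 10%.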

This approach can provide a solution to the question of validity of geometry. Different GIS’s have their own definitions of validity, and data which satisfy one may not necessarily satisfy others. Most have a “clean” operation, but the parameters of each are different (and the concepts do not exactly agree).

For example, ESRI ArcInfo 8 uses "fuzzy tolerance" and "dangle length" (Booth 1999) in generating topology. The ISO standard IS19107 (ISO-TC211 2001) allows implementers to define various tolerances – for example in determining if an object "is_simple".

A possible solution, using the data quality results, is as follows:

For each GIS to be supported, each "qualset" is run against the validation operation.

If it passes, a dq_conformanceResult record is generated, recording a "pass", with a link to a citation for the specific GIS validation documentation.

If it is possible to run a "clean" operation which results, for example, in a clean topology build, a dq_conformanceResult record is generated which also carries the required parameters in its "explanation" field (in a form readable by a GIS user).

If it is not possible to force validity for the GIS in question, this fact is recorded as a dq_conformanceResult record with pass = false.

The fact that validity can or cannot be generated could be used as a selection criterion in searching for available data. (At present this concept has not been proven, but it is theoretically possible.)
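The steps above can be sketched in Python as shown here. Everything in the sketch is hypothetical: the `validate` and `clean` callables stand in for whatever a particular GIS provides, the field names are illustrative, and recording a cleaned qualset as a "pass" with an explanation is an assumption, since the text does not state the pass value for that case.

```python
from dataclasses import dataclass


@dataclass
class ConformanceResult:
    qualset: str
    citation: str          # reference to the GIS validation documentation
    passed: bool
    explanation: str = ""


def assess(qualset, citation, validate, clean=None):
    """Run one GIS's validation against a qualset and record the outcome."""
    if validate(qualset):
        return ConformanceResult(qualset, citation, True)
    if clean is not None:
        # Assumed contract: clean returns the parameters that produced a
        # valid result, or None if validity could not be forced.
        parameters = clean(qualset)
        if parameters is not None:
            text = ", ".join(f"{k}={v}" for k, v in sorted(parameters.items()))
            return ConformanceResult(qualset, citation, True,
                                     "valid after clean with " + text)
    return ConformanceResult(qualset, citation, False,
                             "validity could not be forced")


# Stand-in operations for an imaginary GIS.
def never_valid(qualset):
    return False


def clean_with_tolerance(qualset):
    return {"fuzzy_tolerance": 0.5}


result = assess("NRM/GDS/TOPO/25K", "Example GIS validation guide",
                never_valid, clean_with_tolerance)
print(result.passed, "|", result.explanation)
```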

This approach has significant advantages over the more usual one, of rejecting from the database any data that fails any conceivable validity check, up to full topology testing (based on every GIS client to be supported):

Many users can make valid use of data which is not topologically pure.

It is possible that a user of a particular GIS may have less stringent requirements than others.

In this metadata-based approach, these users can be accommodated, without compromising the users who require pure data.

Citation:

An important part of the standard is the citation. This table allows the title, author, location information, URL, etc. to be recorded for any book, paper, publication, web page, etc. that is of interest to RIME users.

For example, if a topological test is applied as above, the documentation of that test (including what the parameters mean) can be cited, with the details in the citation table.

In addition, data sets can be cited, both external datasets which are used for cross-validation of RIME data and internal data sets. In particular, any collection of features within RIME can be cited. For example, if a study is run using data extracted from RIME, the collection of features extracted can be cited as a data source for the study. (In fact any "qualset" may be cited.)

Responsible Party:

This table is used to carry information on any person or entity that has an interest in RIME.

Interested parties include:

Data Owners (e.g. Geoscience Australia)

Data custodians (e.g. NRM Topo section)

Data users.

Update operators.

Feature Timestamping

All feature information (including ancillary information such as feature names) is timestamped, in line with the ICSM (Intergovernmental Committee on Surveying and Mapping) guidelines for incremental update (ICSM 2003).

It has been demonstrated that the timestamping approach is very flexible, and uniquely suitable for the storage of spatial data in a relational or object-relational database. In fact, even complex topological encoding can be accommodated (van Oosterom 1997).

The approach we take uses, instead of a date/time combination as a time stamp, a "lock number". This is sequentially allocated, and is thus useable as a time stamp, but has further advantages as will be seen. This approach has been used successfully in Queensland's DCDB (Digital Cadastral Data Base) for many years.

Every entry in the main data tables (but not the metadata tables at present – see below) has two special columns: creating_lock_nr and destroying_lock_nr. These integers are allocated in time sequence (see updating – below).

Figure 5 Update of a single feature

 

Figure 6 Internal database table contents

A feature, place name, etc, is only considered to be current if the creating lock number is in the “past” and the destroying lock number is in the “future”.

If a feature is changed, as at a specific lock number, its representation is “retired”, by placing the lock number in its destroying lock number. A new representation is created with a creating lock number of the update lock number, and a destroying lock number of “infinity” (actually 2000000000).
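The retire-and-create rule can be illustrated with a small in-memory sketch in Python. The row layout, the `shape` field, and the lock numbers 100 and 250 are assumptions chosen for the illustration; only the two lock columns and the value used for "infinity" come from the text.

```python
INFINITY = 2_000_000_000  # the value the text gives for "infinity"


def is_current(row, as_at):
    """A row is visible if it was created at or before as_at and not yet destroyed."""
    return row["creating_lock_nr"] <= as_at < row["destroying_lock_nr"]


def update_feature(rows, feature_id, new_shape, lock_nr):
    """Retire the live representation of a feature and add its replacement."""
    for row in rows:
        if row["feature_id"] == feature_id and row["destroying_lock_nr"] == INFINITY:
            row["destroying_lock_nr"] = lock_nr  # retire, never delete
    rows.append({"feature_id": feature_id, "shape": new_shape,
                 "creating_lock_nr": lock_nr, "destroying_lock_nr": INFINITY})


rows = [{"feature_id": 1, "shape": "old line",
         "creating_lock_nr": 100, "destroying_lock_nr": INFINITY}]
update_feature(rows, 1, "new line", 250)

print([r["shape"] for r in rows if is_current(r, 200)])  # ['old line']
print([r["shape"] for r in rows if is_current(r, 250)])  # ['new line']
```

Because the old row is kept with its destroying lock number set, both the before and after images remain available.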

The possibly unique feature of the RIME model is that the lock numbers that are used as creating and destroying time stamps are actually metadata identifiers. Thus, full details of the updates, including before and after images of all database objects affected, are available as part of the lineage statement within the database.

Database Updates:

When a set of features is to be updated, a new metadata item is created containing those features. The metadata item identifier is allocated as the item is created, so that the metadata id is in time sequence. This metadata id becomes a "lock identifier". Part of the "constraints" information for this metadata item is a statement that the features are "locked".

Any attempt to update any of these features by any other user will be blocked by this access restriction. (This blocking action is able to inform the user who has the item(s) locked, since when, what the extent of the lock is etc.)

When the features are actually updated, the metadata id becomes, in effect, an update number, and is used as the creating/destroying lock numbers (as described above).

Thus the details of the update – the person responsible, date, time, reason, environment used, etc, are captured, and become part of the metadata of all features involved in the update; and the metadata item becomes part of the lineage of all these features.
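The locking sequence described in this section might be sketched as follows. This is a simplified Python illustration under stated assumptions: the class `Repository`, its method names, and the in-memory dictionaries are hypothetical, and the real system records locks as "constraints" information in metadata tables.

```python
import itertools


class LockError(Exception):
    """Raised when a requested feature is already locked by another update."""


class Repository:
    def __init__(self, first_id=1):
        self._ids = itertools.count(first_id)  # ids are allocated in time sequence
        self.metadata = {}    # metadata id -> details of the lock/update
        self.locked_by = {}   # feature id -> metadata id holding the lock

    def lock(self, feature_ids, user, reason):
        """Create a metadata item that locks the given features."""
        for fid in sorted(feature_ids):
            holder = self.locked_by.get(fid)
            if holder is not None:
                who = self.metadata[holder]["user"]
                raise LockError(f"feature {fid} is locked by {who} (lock {holder})")
        lock_id = next(self._ids)
        self.metadata[lock_id] = {"user": user, "reason": reason,
                                  "features": set(feature_ids), "locked": True}
        for fid in feature_ids:
            self.locked_by[fid] = lock_id
        return lock_id

    def commit(self, lock_id):
        """Release the lock; the id now serves as the update number."""
        item = self.metadata[lock_id]
        item["locked"] = False
        for fid in item["features"]:
            del self.locked_by[fid]
        return lock_id


repo = Repository(first_id=150012)
first = repo.lock({1, 2}, "user A", "revision")
try:
    repo.lock({2}, "user B", "rename")
except LockError as error:
    print(error)  # feature 2 is locked by user A (lock 150012)
repo.commit(first)
```

The metadata item survives the commit, so the user, reason and affected features stay on record as lineage for the update.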

Database Views "as at time":

Using this mechanism, it is possible to view the database "rolled back" to any prior time since the creation of the database. The actual mechanism is very simple.

The metadata table is used to determine the last update that occurred before the date/time required (finding the metadataId).

All tables being accessed have in their "where clauses" the following tests:

"creating_lock_number <= metadataId AND destroying_lock_number > metadataId".

This gives the state of the database immediately following the update recorded with the identifier "metadataId".

In fact, the above clauses are added to all database accesses, even those extracting the current data, in which case a special value of metadataId is used, larger than any possible real metadata identifiers.
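The two steps can be sketched in Python as below. The list of `(applied_time, metadata_id)` pairs, the function names, and the choice of `INFINITY - 1` as the special "current" value are assumptions for the illustration. Note that, given the test "destroying_lock_number > metadataId", the special value must remain below the "infinity" value of 2000000000 for current rows to be selected.

```python
from bisect import bisect_left
from datetime import datetime

INFINITY = 2_000_000_000
CURRENT = INFINITY - 1  # assumed: above any real id, below "infinity"


def metadata_id_before(updates, when):
    """Id of the last update applied strictly before `when`, or None.

    `updates` is a list of (applied_time, metadata_id) sorted by time.
    """
    times = [applied for applied, _ in updates]
    index = bisect_left(times, when)
    return updates[index - 1][1] if index > 0 else None


def visible(row, metadata_id):
    """The same test that is added to every where clause."""
    return (row["creating_lock_number"] <= metadata_id
            and row["destroying_lock_number"] > metadata_id)


# Hypothetical update log and feature row.
updates = [(datetime(2003, 1, 10), 290), (datetime(2003, 4, 1), 293)]
row = {"creating_lock_number": 293, "destroying_lock_number": INFINITY}

as_at = metadata_id_before(updates, datetime(2003, 3, 1))
print(as_at, visible(row, as_at), visible(row, CURRENT))  # 290 False True
```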

Note – even a join of tables is possible (and quite efficient) "as at" a lock number. For example, if the place table is to be joined to the feature table, to expand the place names, the select statement might be as follows:

select … from feature f, place p where …

AND p.place_id = f.place_id

AND p.creating_lock_number <= metadataId AND p.destroying_lock_number > metadataId

AND f.creating_lock_number <= metadataId AND f.destroying_lock_number > metadataId;

This looks like a significant increase in complexity, but has proved not to noticeably affect access times, and once the "formula" is accepted, does not really complicate the logic.
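The join can be tried out with an in-memory SQLite database from Python, as sketched below. The table columns other than the lock numbers, the place names, and the lock values are invented for the illustration, and SQLite merely stands in for the Informix database used by RIME.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE place (place_id INTEGER, name TEXT,
                    creating_lock_number INTEGER, destroying_lock_number INTEGER);
CREATE TABLE feature (feature_id INTEGER, place_id INTEGER,
                      creating_lock_number INTEGER, destroying_lock_number INTEGER);
-- The place was renamed by update 200; the feature itself never changed.
INSERT INTO place VALUES (1, 'Old Creek', 100, 200);
INSERT INTO place VALUES (1, 'New Creek', 200, 2000000000);
INSERT INTO feature VALUES (7, 1, 100, 2000000000);
""")

AS_AT_JOIN = """
SELECT f.feature_id, p.name
FROM feature f, place p
WHERE p.place_id = f.place_id
  AND p.creating_lock_number <= :m AND p.destroying_lock_number > :m
  AND f.creating_lock_number <= :m AND f.destroying_lock_number > :m
"""


def as_at(metadata_id):
    """Run the join with both tables viewed as at the same lock number."""
    return conn.execute(AS_AT_JOIN, {"m": metadata_id}).fetchall()


print(as_at(150))  # [(7, 'Old Creek')]
print(as_at(250))  # [(7, 'New Creek')]
```

Each table in the join carries its own pair of tests against the same lock number, which is the "formula" referred to above.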

Figure 7 Place and Feature tables for above example

Sequence of Updates

There is a problem in the above approach. Since features are locked for the duration of the update action, and that action can require some time to apply (even several days), the sequence of update numbers is chronological on the time the features are locked, rather than the time the update is applied to the database.

This is not a problem in the feature table itself, since the features involved in an update cannot also be involved in another update that overlaps it in time. (This is the whole reason for locking features – to prevent such temporal overlap).

However, if another, related table is to be updated in the same action, and rows of that table are not locked, significant difficulties can arise. For this reason, in the DCDB, the lock number does not become the update number. At the time the update is committed to the database, a new update number is allocated, and the details transferred from the lock record to the newly created update record. Thus the updates in the DCDB are chronologically sequential in terms of the time the update is actually applied.

It was not considered necessary to do this in RIME at present; however, a rare situation can arise due to this decision:

User A may lock a feature. (lock number 150012)

The feature may be given a new name (stored in the place table)

User B may lock a different feature. (lock number 150014)

This feature may be given the same name as the above feature.

Lock 150014 is written back to the database – creating the place record for the name. (Place_id = 11253)

When lock 150012 is written back – one of three things could happen:

A. The program could see that the name exists (place_id = 11253) and link to it.

B. The program could recognise that the name does not exist as at lock 150012, and create it.

C. The program could recognise that the name has come into existence since lock 150012, and reject the update.

Figure 8 Before Update

 

Figure 9 After update - option A

 

Option A causes a database consistency violation, since the link from the feature to the place is to place_id 11253 as at lock number 150012, which does not exist.

Option C raises difficulties, since it is now not possible to apply what was a valid update.

Figure 10 After update - option B

Option B creates a duplicate place record. This is actually not a serious problem, and given the very low probability of the event, this is the approach currently taken.
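The scenario and the outcome of option B can be reproduced with a short Python sketch. The list-of-dictionaries layout, the function names, the place name "Example Name", and the second place_id (11254) are assumptions; the lock numbers and place_id 11253 follow the example above.

```python
import itertools

INFINITY = 2_000_000_000


def place_as_at(places, name, lock_nr):
    """Find a place record with this name that is visible as at lock_nr."""
    for place in places:
        if (place["name"] == name
                and place["creating_lock_nr"] <= lock_nr < place["destroying_lock_nr"]):
            return place
    return None


def link_name(places, name, lock_nr, next_place_id):
    """Option B: reuse the name only if it exists as at the lock, else create it."""
    existing = place_as_at(places, name, lock_nr)
    if existing is not None:
        return existing["place_id"]
    place_id = next(next_place_id)
    places.append({"place_id": place_id, "name": name,
                   "creating_lock_nr": lock_nr, "destroying_lock_nr": INFINITY})
    return place_id


places = []
ids = itertools.count(11253)

# Lock 150014 is written back first, then the earlier lock 150012.
id_b = link_name(places, "Example Name", 150014, ids)
id_a = link_name(places, "Example Name", 150012, ids)

print(id_b, id_a, len(places))  # 11253 11254 2
```

The record created under lock 150014 is not visible as at 150012, so the second write-back creates its own record, giving two place rows with the same name.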

This type of problem can arise whenever a type of object is allowed to be updated without being locked in advance. It would not occur if the place table was subordinate to the feature table (i.e. if the connection between them was one of "composition").

There is a future project planned to review the place table, and incorporate the "place names" data – rationalising the names currently in use (which are at present marked as "unofficial"). When this is done, any duplications created by the above effect will be removed. At this time, however, it will become critical to address the problem – probably taking the DCDB approach.

Linkage between Features and Metadata: 

RIME uses a very specific linkage mechanism between features and the metadata which applies to them.

Every feature belongs to one and only one specific metadata item, known as the home metadata item.

Each feature may belong to any number of other metadata items. (This linkage is used to connect a feature to its updates, and any other ad-hoc feature collections it belongs to).

Figure 11 Feature metadata linkage

As described above, any information in a metadata record also applies to any child qualsets, (unless it is overridden at a lower level). Thus, for an individual feature, a full set of metadata is obtained by taking its home metadata item, the other (linked) metadata items, and all parent items of these.
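Gathering the metadata items that apply to one feature can be sketched as a walk from the home item and each linked item up through their parents. In this Python illustration the `parent_of` mapping, the feature dictionary, and the identifier "UPDATE-293" are hypothetical stand-ins for the linkage tables.

```python
def applicable_items(feature, parent_of):
    """Home item, linked items and all their parents, home chain first."""
    found = []
    for start in [feature["home"], *feature["linked"]]:
        item = start
        # Stop at the root, or at an ancestor already collected.
        while item is not None and item not in found:
            found.append(item)
            item = parent_of.get(item)
    return found


# Hypothetical hierarchy: an update item filed under the 25K series.
parent_of = {
    "NRM/GDS/TOPO/25K/944723": "NRM/GDS/TOPO/25K",
    "NRM/GDS/TOPO/25K": "NRM/GDS/TOPO",
    "NRM/GDS/TOPO": "NRM/GDS",
    "NRM/GDS": "NRM",
    "UPDATE-293": "NRM/GDS/TOPO/25K",
}
feature = {"home": "NRM/GDS/TOPO/25K/944723", "linked": ["UPDATE-293"]}

print(applicable_items(feature, parent_of))
```

In this sketch the home chain is listed first; how RIME ranks values from linked items against those from the home chain is not specified in the text, so the order here is an assumption.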

Timestamping of Metadata Updates

At present, metadata items themselves are not timestamped. Thus any metadata summary items will appear to apply "for all time".

For example, if the accuracy of a collection of features is improved so that it is effectively of larger scale, the features themselves will be retired and new versions created, the lineage process step of the improvement will be recorded, and the summary record will be updated with the new scale.

Figure 12 Before update

On 1/4/2003, the accuracy is improved, so that the effective scale is now 1:75000. The update metadata item (id = 293) is created, and a process step is created indicating the update. The changed feature(s) are retired as at 293, and new instances created as at 293. The metadata summary now shows the better scale of 1:75000.

Figure 13 After improvement in accuracy

An enquiry on the database "as at" a time before this update (say as at 1/3/2001), will return the features' before images, but the metadata summary will show the effective scale as 1:75000.

This is not considered particularly serious, but in a later version of RIME it is planned to maintain a timestamped record of selected metadata.

(This raises the interesting issue that a change to a metadata item can create a new metadata item).

Some Issues In Timestamping

It is important to realise that timestamping, as a method of recording history, has certain limitations. Unless there is some additional update mechanism to "correct history" (and this is not the case in RIME or the DCDB), what is recorded is a history of the database representation, not of the real world.

That is to say, the database viewed "as at" time t is a snapshot of what the database contained at time t, not what it should have contained. In particular, if an error in the database is corrected at time t, the error still exists in any view of the database as at a time before t.

Even more critical, perhaps, is the fact that a database internal error (such as a topology failure), may be corrected, but will exist for all time in the history. Thus any programs dealing with historic data must be able to cope with any such errors even if they have since been corrected.

Within the concept of timestamped updates, there is scope for variation of detail in the way the operations can be carried out, for example in updating a large feature:

Figure 14 Update of large feature

In this example, the long feature ABC is being updated, replacing section B with section D. The operators may:

Replace the single feature ABC with a new single feature ADC, or

Split the single feature into three new features A, D and C.

There is no restriction on this enforced by the database structure or validation rules. The operators are to be guided by the Policy and Guidelines for Incremental Update (ICSM 2003).

Conclusion

It is hoped that the use of “active metadata” within RIME and other corporate spatial databases will lead to the situation where the term “metadata” is again forgotten by spatial data users.

This will come about when the entry and use of metadata is so much a part of the daily usage of spatial databases that it is no longer seen as an imposition.

References

ANZLIC (1996). Metadata Guidelines, Australia New Zealand Land Information Council.

               

Booth, B. (1999). Getting Started with ArcInfo, ArcInfo 8, Environmental Systems Research Institute Inc.

               

Date, C. (1990). An Introduction to Database Systems, Addison-Wesley.

               

ICSM (2003). Harmonised Data Manual - Policy and Guidelines for Incremental Update, Intergovernmental Committee on Surveying and Mapping.

               

ISO-TC211 (2001). Geographic Information - Spatial Schema. ISO/IS 19107. International Organization for Standardization.

               

ISO-TC211 (2002). Geographic Information - Metadata. ISO/IS 19115. International Organization for Standardization.

               

van Oosterom, P. (1997). Maintaining Consistent Topology Including Historical Data in a Large Spatial Database. Auto Carto 13, Seattle, WA.