Parallel Projections in Databases

Roenbaeck

Background

Some database vendors are adding support for ranges as new data types in their database engines. Ranges are intervals whose boundaries are defined using some other data type. For example [22, 42) is an integer range starting at and including the value 22 and ending with but excluding the value 42. When creating a table, vendors will add some way of specifying that a data type is a range, somewhat similar to the following:

CREATE TABLE myTable (

myNormalColumn int,

myRangeColumn int range

)

For example PostgreSQL will have this support in the upcoming 9.2 version. With the new ANSI/ISO standard SQL:2011, periods are introduced, and a periods are ranges over time. Once vendors implement support for this, periods will be defined using the following syntax:

CREATE TABLE myTable (

myNormalColumn int,

myPeriodStart date,

myPeriodEnd date,

PERIOD FOR myPeriod(myPeriodStart, myPeriodEnd)

)

With the addition of ranges/periods a lot of functionality is also added in order to work with these. For example, specifying constraints that ranges may or may not overlap, conditional operators for comparing if one range follows after another and functions for getting the start or end point of a range.

Motivation

In a certain type of data modeling called bitemporal modeling, currently gaining a lot of traction due to its ability to handle data according to new legislations like the Sarbanes-Oxley Act and BASEL3, ranges play an important role. Bitemporal modeling use two temporal ranges to keep track of the history of the information. One for when different versions of the information was valid, and one for when that information was known to some agent (or stored in some memory). This way, a financial report can for example at any time be recreated exactly as it looked at its original time of creation, even if it contained errors that later were corrected. The same report can also be recreated as it should have looked at its original time of creation given the corrections made.

In order to retrieve the temporally stored information very complex queries have to be written. These can, however, be drastically simplified with the addition of a new type of operation on ranges in the database.

Innovation

Given a number of ranges and some way of ordering these, it is possible to do a parallel projection in the plane to retrieve what these ranges look like seen from the top or bottom.

Say we have the following set of ranges and an additional column used for the ordering:

1, [12, 16]

2, [10, 18]

3, [13, 20]

These ranges can be viewed as an ordered stack of lines:

10 11 12 13 14 15 16 17 18 19 20

1 --------------

2 --------------------------

3 -----------------------

Seeing this stack from the top (using a plane parallel projection) we see the following (now non-overlapping) ranges:

1, [12, 16] (the whole range can be seen)

2, [10, 11] (one of the protruding parts)

2, [17, 18] (the other protruding part)

3, [19, 20] (most of the range is hidden from view)

Seeing this stack from the bottom we instead see the following ranges:

3, [13, 20]

2, [10, 12]

Changing the ordering from ascending to descending produces the same result as for viewing from top or bottom, so these two methods are interchangeable for producing the same result.

There are several different algorithms for doing parallel projections, all of which are low cost in comparison to what would have to be done in a database lacking this functionality to achieve the same result. How to syntactically specify that a projection should be done in a query is left up to the vendor or the SQL standardization committee.

The above example of a projection does not cover all cases. We also need to deal with non-inclusive intervals, for example [12, 16), in which the end point is not included. Open-ended intervals should also be handled, like [12, *), in which there is no specified end point. Finally, the values used in the ordering may not be unique, possibly causing ranges to merge for some ordinal. Neither of these cases is particularly complicated, but the above should convey the idea for now.

Applicability

Assume that a third column is added, for example containing the rating of a financial instrument.

1, [12, 16], A

2, [10, 18], B

3, [13, 20], C

This could represent bitemporal data, where the first column indicates the time when we received the information and the second column the period during which the rating was valid. For example, at time 1 we are told that the rating was A during the period between 12 and 16. In a real-life scenario, times would most likely not be integers but dates.

A parallel projection from the top when ordering over the first column can now be used to find the latest available information:

1, [12, 16], A

2, [10, 11], B

2, [17, 18], B

3, [19, 20], C

Which seen as a timeline becomes:

10 11 12 13 14 15 16 17 18 19 20

B B A A A A A B B C C

The same procedure can be used on a subset, say where the first column is > 1, in order to find what the latest available information looked like at a previous recording time. Such a condition is very simple to add using available database language constructs. Parallel projections in a database makes it very simple to “time travel” in bitemporally modeled data. There may of course also be many other areas in which such projections can be useful.

Roenbaeck

Given the example data, finding the latest information can of course be done using currently available SQL operations, but not very efficiently or very elegantly. Because of this, data is rarely stored in this manner. Instead, you would check if your new information overlaps any existing information with respect to their valid period at insertion time. If so, the valid time period of the existing information is adjusted, or another row is inserted with the old value and a shorter valid time period.

This, of course, is costly, and all bitemporal solutions I am aware of suffer from a performance penalty at insertion time in order to gain performance at retrieval time. By having parallel projections available, data could be retrieved both efficiently and elegantly without having to do any insert checks (and extra inserts or updates). In my opinion, it would also be the most natural way to store the data. I hope this clarifies further why such an operation would be beneficial in bitemporal modeling.

Roenbaeck

What I also failed to mention above is that this operation must be able to partition, similar to windowed functions, and at the same time be applicable to a derived table. In pseudo-code:

select

from (

select

myID,

myValue,

myValidTimePeriod,

myRecordingTime

from

myTable

where

myRecordingTime <= 'point-in-time-to-travel-to'

) t

project(myValidTimePeriod) over (partition by myID order by myRecordingTime desc)

Someone can probably figure out a better syntax, but let's use this as an example.

Given a clustered index on for example (myID asc, myRecordingTime asc) it should be fast to traverse this. Using a bottom-up algorithm, start with the earliest myRecordingTime for a particular myID and keep two vectors:

values

interval_points

Going to the next myRecordingTime, check if the valid period intersects the one already stored. Assume its contained, such that myValidTimePeriod#2 is wholly contained in myValidTimePeriod#1. We then need to adjust the vectors to:

values

interval_points <

start(myValidTimePeriod#1), start(myValidTimePeriod#2),

start(myValidTimePeriod#2), end(myValidTimePeriod#2),

end(myValidTimePeriod#2), end(myValidTimePeriod#1)

Once there are no more rows for this particular myID, we are done, and can produce the result set from the two vectors. In this example three rows (given the input of two rows). If there are many columns myValue1, myValue2, myValue3 etc, the vector of values above would instead be a vector of columns.

I am sure there are better algorithms, but I just want to show that it can be done simply by traversing on a row-by-row basis. No look-ahead or look-behind is necessary.