Bi-temporal Anchor Modeling

Roenbaeck

Since I brought up the topic of bi-temporal modeling, currently being researched for Anchor Modeling, I thought I would start a thread where you can follow the developments, and provide feedback.

We have made a script that demonstrates a 'Simple Bitemporal Test' envisioned by Craig Baumunk using bitemporal Anchor Modeling. We are still developing the bitemporal extensions of Anchor Modeling and the final implementation may differ from the one we have used here. Due to the length of the example I had to link to the code externally: http://pastebin.com/SJ5h2cRu

The code is Transact-SQL and can be run in SQL Server 2008 (I used the free Express edition).

Roenbaeck

Some explanations of the implementation:

Anchor Modeling uses 6NF, so each attribute will reside in its own table. We also follow a naming convention that gives unique names (including some semantics making it possible to derive the type of the table and its immediate relationships).

FI_RAT_FinancialInstrument is an anchor table. It holds only the identities (normally surrogate keys generated by the database) of the entities, in this case different financial instruments. Identities are immutable.

FI_RAT_FinancialInstrument_Rating is an attribute table. It holds values for a single attribute of financial instruments, and in this case its rating. Values may be static or historized depending on if they can change over time. In bitemporal AM attribute values are also able to be recorded more than once over recording time. In the simple bitemporal test, the rating is historized over changing time and recorded over recording time.

Changing time keeps track of when the rating changes in the domain being modeled. FI_RAT_ChangedAt marks the starting point of the interval of time during which an instrument had a certain rating.

Recording time keeps track of when we recorded those changes (presumably in the database). FI_RAT_RecordedAt marks the starting point of the interval of time during which the rating was recorded with a certain reliability.

Information we relied on at one point in recording time may at a later time become unreliable. FI_RAT_Reliability indicates the reliability of the rating. For the example I used two values {0,1} representing completely reliable and totally unreliable information. This could be extended to values in the interval [0,1] if degrees of reliability is desired. Reliability changes over recording time.

Using only the starting points of the intervals allows for the use of only inserts to model 'logical' deletes and updates.

In Anchor Modeling views are created that cater for most use-cases when querying the information. The views are:

* latest - (l) showing the latest available information

* point-in-time - (p) showing information as it was at a given point in time

* difference - (d) showing all changes that occurred in the given interval

When taken over the two time dimensions this yields a total of nine views (ll, lp, ld, pl, pp, pd, dl, dp, dd) of an anchor and its attributes. Three of these are shown in my example.

The views also denormalize the anchor and its attributes from temporal 6NF to 3NF with the chosen temporal perspective. High performance is retained in most scenarios despite the joins thanks to a feature in query optimizers called table (or join) elimination.

In the first day of recording time we are supposed to store the rating 'A' valid from 2011-03-17 to 2011-03-19. When using implicit intervals, a new row must be inserted in order to close the interval. In Anchor Modeling null values are not allowed in the database, which is why I can only assume that the value becomes unknown (?) after the 19th.

Roenbaeck

Another type of implementation:

Another solution I was pondering before settling on this one was to keep separate a 'history' table for every 'actual' table. The 'actual' tables could have update and delete triggers that move affected rows (the history) to the 'history' tables. Views could then be created that union these in order to get a complete history.

By doing so you would always have your latest (over transaction time) information in the 'actual' tables, and query performance could be sped up for those types of queries. You only need to have valid time as part of the primary key in the 'actual' tables, but both (valid + transaction) in the 'history' tables. Deleted information would only exist in the 'history' tables.

There may be other ways to implement each approach. I'm currently looking at a 2 periods approach, and so far it looks quite elegant in comparison.

BasVanDenBerg

Hi Lars,

Your approach for making anchor modeling bi temporal sound very interesting and useful.

Let me summarize for myself, to be sure that I understand you correctly:

1. Add Recording Time to every attribute, tie and knot table (every table except anchors).

2. Also add reliability bit to every table that has Recording Time.

3. Add the recording time to the primary key.

This allows us to:

1. Store revisions of attribute and knot values. (and tie's). E.g. the rating was set to A, but that was based on a wrong calculation. It should be B.

2. Store revisions of the changing timeline. E.g. the rating was valid from the 17th of March, but it should be valid starting from the 16th of march.

It seems like an elegant solution, but I do have some questions and suggestions.

Suggestion 1: retracting and asserting.

The insert statements with reliability = 0 retract previous knowledge, while the statements with reliability = 1 assert new knowledge. (makes me think about Prolog...).

Maybe it's useful to use this terminology. A property of these two concepts is that something that's asserted can always be retracted and

something that's retracted cannot be retracted, only re-asserted.

Observation 1: value is irrelevant for retracted records

Retraction is done by inserting a new record with identical anchor key + valid date, but a new recording date and a reliability value of 0. The value

doesn't really matter.

E.g. On Day3 some late arriving facts are handled as followed

1. retract B rating on the 16th

2. retract A rating on the 18th

3. retract ? rating on the 19th

4. assert C rating on the 15th

5. assert ? rating on the 20th

1. insert into FI_RAT_FinancialInstrument_Rating values (1, 'B', '2011-03-16', '2011-03-19', 0, 42);

has the same meaning as

1. insert into FI_RAT_FinancialInstrument_Rating values (1, 'Z', '2011-03-16', '2011-03-19', 0, 42);

Suggestion 2: ETL

The ETL should be smart enough to retract all existing records that overlaps the changing time interval of the newly asserted row.

(e.g. in day 3 you insert a new C rating valid from the 16th, but this new record overlaps with rating records on the 16,18 and 19 of March,

which all need to be retracted).

So the ETL should be two phased. Retract overlapping records and insert the new record.

Maybe it's usefull to automate the retraction so that it retracts overlapping records by default.

For example by using a stored procedure retractAll (tableName, FromValidDate, AnchorKey).

Only this degrades performance, because it is row based and this will not work on a batch insert.

Same reasoning applies to triggers.

Of course you can make your ETL in such a way that it batch inserts all retractions first and then batch insert the new values.

Because the same table stores assertions and retractions, it might grow very fast and become slower, depending on the rate of retractions.

It might be better to move all outdated rows (with respect to recording time) into another table (like you suggest in another forum post) in one batch.

Then delete them in another batch. and then insert the new values in a third batch.

Question 1: End-dating an interval

On Day1 you show a way to end an attribute interval.

1. insert into FI_RAT_FinancialInstrument_Rating values (1, 'A', '2011-03-17', '2011-03-17', 1, 42);

2. insert into FI_RAT_FinancialInstrument_Rating values (1, '?', '2011-03-19', '2011-03-17', 1, 42);

The A rating is not known anymore starting from the 19th of March.

Why use the '?' character and what do we use when the datatype is datetime or int? Isn't it better to allow NULL values for these cases?

If we look at the latest view, do we want to make a distinction between values that were never known (which are NULL) and values which valid

interval ended, like in the example above?

How does this work for ties?

MariedTo

person, mariedTo, ChangedAt , RecordedAt, Reliability

1 ,2 , 2001-04-01, 2009-02-01, 1

Now at 2004-02-01 the marriage is ended. The prescribed way to model this is to make this tie a knotted tie.

person, mariedTo, ChangedAt , RecordedAt, Reliability, isMaried

1 ,2 , 2001-04-01, 2003-02-01, 1 , 1

1 ,2 , 2004-02-01, 2005-10-01, 1 , 0

It seems like someone made a mistake with the papers, so the marriage was not really ended. The mistake is

corrected at 2005-10-12.

person, mariedTo, ChangedAt , RecordedAt, Reliability, isMaried

1 ,2 , 2001-04-01, 2003-02-01, 1 , 1

1 ,2 , 2004-02-01, 2005-10-01, 1 , 0

1 ,2 , 2004-02-01, 2005-10-12, 0 , 0

So the interval was ended, but this was retracted a few days later and now the interval is open again.

The difference between end dating attributes and ties is that we cannot use NULL values in ties, so we need to add a knot.

The disadvantage is that we have to model ties as knotted if you want to be able to end-date an interval.

Built-in support would be nice here, also because we prefer to handle end-dating in a more generic way.

Suggestion 3: End-Dating an interval

How about adding an isValid bit to every table, that tells us whether the record is valid or not. If we use this for the attribute tables as well,

we don't have to use NULL values.

1. assert A rating from 17th of March.

(anchor key, value, changedAt, recordedAt, Reliability, isValid, metaData)

insert into FI_RAT_FinancialInstrument_Rating values (1, 'A', '2011-03-17', '2011-03-17', 1, 1, 42);

2. assert end of A rating from the 19th of March.

insert into FI_RAT_FinancialInstrument_Rating values (1, '?', '2011-03-19', '2011-03-17', 1, 0, 42);

3. retract 2.

insert into FI_RAT_FinancialInstrument_Rating values (1, '?', '2011-03-19', '2011-03-18', 0, 0, 43);

Suggestion 4: Use different table for retracted values.

Instead of adding a Reliability column to every table we could just move all retracted records into another table

(like suggested by you and Jorg Jansen on the AM forum). When a record is in this table you know that there is another record with

the same anchor key and valid date, but a more recent recording date in the original table.

A record being in this table is equivalent to having a record with reliability = 0 in the other approach.

This way we keep the model simple, because we don't need a reliability column and also because we don't have to insert

a new record in order to retract another record.

The above example becomes as follows:

1. assert A rating from 17th of March.

(anchor key, value, changedAt, recordedAt, isValid, metaData)

insert into FI_RAT_FinancialInstrument_Rating values (1, 'A', '2011-03-17', '2011-03-17', 1, 42);

2. assert end of A rating from the 19th of March.

insert into FI_RAT_FinancialInstrument_Rating values (1, '?', '2011-03-19', '2011-03-17', 0, 42);

3. retract 2.

(anchor key, value, changedAt, recordedAt, isValid, metaData)

insert into HISTORY_FI_RAT_FinancialInstrument_Rating values (1, '?', '2011-03-19', '2011-03-17', 0, 44); *

delete from FI_RAT_FinancialInstrument_Rating where FI_ID = 1 and ChangedAt = '2011-03-19' and RecordedAt = '2011-03-17'

* we can get the retracted timestamp from the metaData in the retracted table.

We can simplify it even further, by removing the recordedAt time:

1. assert A rating from 17th of March.

(anchor key, value, changedAt, isValid, metaData)

insert into FI_RAT_FinancialInstrument_Rating values (1, 'A', '2011-03-17', 1, 42);

2. assert end of A rating from the 19th of March.

insert into FI_RAT_FinancialInstrument_Rating values (1, 'A', '2011-03-19', 0, 42);

3. retract 2.

(anchor key, value, changedAt, recordedAt, isValid, metaData)

insert into HISTORY_FI_RAT_FinancialInstrument_Rating values (1, 'A', '2011-03-19', '2011-03-17', 0, 44); *

delete from FI_RAT_FinancialInstrument_Rating where FI_ID = 1 and ChangedAt = '2011-03-19'

* we can get the retracted timestamp from the metaData in the retracted table.

** we can get the recordedAt time from the metaData in the original table.

This way we have the original recorded time and the recorded time of the retraction.

After we retracted the ending of the A rating at 2011-03-19, we can of course assert a new rating starting

on the 19th of march, or even before this date (keeping in mind that

inserting a new rating before or on the 17th involves retracting the rating record on the 17th as well).

Finally

I have modified your sql in order to test my approach and I get exactly the same results. Get the SQL here: http://pastebin.com/9dZLAfve

Note that I use a stored procedure for retraction, but this can be changed into ETL code that supports batch insertion (if performance is an issue).

Please let me know what you think.

Kind regards,

Bas van den Berg

Roenbaeck

What is appealing with having separate tables for retracted (thanks for the terminology) knowledge is that queries over asserted knowledge will perform better. The downside is that you will close to double the number of tables in the database. But, if these are generated automatically, perhaps it is not such a big deal.

The second downside is that you can no longer handle the general case of 'reliability'. The approach with a separate history table only works well if reliability is absolute, either 1 or 0, and nothing in between. In the approach I have described, reliability can easily be switched to another data type, say a float value in the interval [0,1]. With that you could handle fuzzy knowledge, different degrees of reliability. In this case the latest view (etc.) could take a parameter for the level of reliability that is acceptable for the query.

I don't know if general reliability is something that anyone would use though, but it surely is a nice concept in theory. I am still not decided which way to go with bi-temporal AM. Reliability and validity columns and all information in the same tables? Separate tables for history? Use end-dating? Only support bi-temporality in databases that have a 'period' datatype?

BasVanDenBerg

Can you give me an example scenario in which you want to store fuzzy knowledge?

And how would you calculate this reliability?

Maybe a hospital that stores the patient<-> disease relation. This scenario would only work when the source system stores a reliability value. But would a doctor say that he knows for 60 % certain that you have a disease? In my experience they are always sure what you have until new observations tell them otherwise.

Still if you want to model this using my approach, you would like to have a reliability value in a historized tie. Can't you use a knotted tie for this?

You can change the reliability by retracting the tie and asserting it again.

You can also add a reliability attribute to an anchor, but it's not possible to add a reliability to an attribute (without making a custom extension to anchor modeling).

Bas

Roenbaeck

Thanks Bas, your script convinced me that maybe it's better to keep a separate history table. Inspired by it I created another version, since I had trouble getting table elimination to work for the 'pp' view as you had defined it. My code looks worse, but gets a better execution plan (and since it's automatically generated it doesn't really matter that much).

I also opted for a delete trigger, rather than having a SP to do retracting. I've gotten requests about creating implementations with no metadata, so I had to make some adjustments for that as well, like keeping RecordedAt in the actual tables. The result of our joint effort is here: http://pastebin.com/fci1rWau.

I'd be happy to receive feedback on this as well.

As for your question on reliability, I can't think of that many scenarios either. I think you won't find it that much in domain, but perhaps in the systems you fetch data from. Data from some sources may be considered much less reliable than others, for example.

Roenbaeck

I oversimplified the script. While it does produce the same output as the other one for the 'Simple Bitemporal Test' it actually fails my own test suite. Once the script it updated I will post back here.

In particular, it fails to close an interval over recording time. There are two ways to do this:

a) explicitly using an ErasedAt time column (RetractedAt in Bas' script)

b) implicitly using an Erased bit column

For a) a single row in the history table is sufficient to capture the information, which is positive. The downside is that I cannot pass a value for the erasing time to the delete trigger, so it would have to assume getdate(). In a DW scenario that may not be what you want. A way around that is to use an SP, as in Bas' script.

For b) a second row is needed in order to close the interval. Still the same problem with the trigger, and we get more complex logic when selecting historical information (similar to the nested top 1 select in the first script).

It seems a) is the more desirable solution.

Bas, sorry for not recognizing this in your script right away.

Roenbaeck

The script is now updated and its functionality equivalent to the other alternative. Please note that this is a 3-time column approach (1 valid + 2 transaction) that use no updates and loading/deleting data is done solely using insert statements.

My next stop on the line will be revisiting the first 2-time column approach and see if partitioning on the reliability could give the same advantages as in this approach. The partitioning must be made exclusive though, so that you would not have to touch the 'previously recorded' partition when looking at the 'currently recorded' information.

I am also trying to get a 2-period column solution up and running, if I can get period support to run in PostgreSQL. Anyone familiar with that?

Roenbaeck

And here you can find the version using 3 time columns and a table partitioned on reliability:

http://pastebin.com/aX3NiWRm

I believe it is the most flexible solution so far, and it should perform in par with the two table alternative. The only downside is that partitioning requires Enterprise Edition of SQL Server, but perhaps that is a non-issue for those working with bitemporal data?

I could make the partitioning optional in the tool though, it's very little that changes in the code.

Roenbaeck

Now I need to compare all of these according to my own simple test:

Ordering by when we received information about the capacity, we get:

Arriving at 2005-09-10,

The rating of the instrument is 'A' from 2001-01-01.

Arriving at 2005-10-12,

The rating of the instrument was actually 'B' from 2001-01-01.

Arriving at 2005-10-20,

The rating of the instrument has been set to 'F' from 2003-02-13.

Arriving at 2005-11-09,

The rating of the instrument was increased to 'D' at 2002-04-10.

Also arriving at 2005-11-09,

The rating of the instrument was then increased again to 'E' at 2002-08-20.

Arriving at 2006-09-21,

The rating was never set to 'F', it was an error.

I plotted these six pieces of information over the two time dimensions, getting six points, some of which lie on the same horizontal or vertical lines.

BasVanDenBerg

Hi Lars,

Maybe it's useful to make a distinction between the required functionality

and all different implementations, so that we can make an overview of the pro's

and con's.

Requirements

1. Make Anchor Modeling BI-Temporal. i.e. store valid time and recording time for every fact.

2. Store mutations in the valid time dimension.

3. End-date the valid time dimension. i.e. a rating is valid from T0 till T1. After T1 we know nothing about the validity of this fact.

4. Show intersections in valid time and recording time (create more views).

5. Store revisions of the Valid time mutations (2). We have three kinds of revisions.

a) The value was incorrect. Insert new (correct) value starting on the same valid time

b) The value was incorrect and the correct value is unknown.

c) The entire valid time mutation is incorrect and should be ignored.

6. Store information reliability

Implementations

requirement 1 and 2 are already implemented in anchor modeling, because

we (can) store the recording time in the metadata.

So we have a valid time and recording time for every fact. We can also

store mutations of the valid time dimensions.

But we cannot revise previous valid time mutations (5) and a mutation is

valid forever or until another mutation is found (3).

Implementation A

Step1. move the recording time into the primary key, so that we can store multiple revisions per valid time mutation.

Step2. add reliability (or Isvalid) value to every table

How are the requirements implemented?

5a: add new record with new value and reliability = 1

5b and 5c: add new record with reliability = 0 (value is superfluous,but I recommend to keep it identical to the retracted value)

3: add new record with reliability = 0 (value is superfluous / or you can have multiple value /reliability combinations, depending on your interpretation )

6. add new record with reliability = 0.6

A small variation of this approach is to replace reliability with a

IsValid bit. This will not give you 6, but it's more IO efficient.

Implementation B

Use erased table. Retract by inserting it into the erased table and delete

it from the current table. (when you don't want to incorporate the metadata by design) : Add

RecordedAt to every table and into the primary key.

How are the requirements implemented?

5a: retract and assert new value.

5b and 5c: only retract

3: We need a reliability or Isvalid column in order to implement 3.

Problem: Now we have 2 ways to retract a value. Namely: move to erased

table or add new record with IsValid=0 and changedAt is identical.

Implementation C (just for the record)

add validEndDate to every table gives us a solution to 3, but not 5b and

5c. Also this approach will require a lot of UPDATE statements.

EndDate is redundant, because it's identical to the validDate of the next

mutation or to a very big constant date.

Implementation D (variation of A)

Same as A, but also store erasedAt column and partition on reliability

Pros and Cons

Implementation A

- tables can grow very fast depending on the revision rate. this might be a performance issue.

- tables become more complex (two time dimensions).

- value is superfluous for end dating and retracted records.

Implementation B

- Only current facts (with respect to recording time) are in the attribute

tables. This will be faster.

- Minimal impact to attribute tables. Users who don't use the bi-temporal

aspect can just ignore the erased tables and all will be just the same.

- Now we have 2 ways to retract a value (move to erased or set IsValid=0).

- double the number of tables in database

Implementation D

- tables can grow very fast depending on the revision rate. Partitioning should limit the performance issue.

- tables become very complex (two time dimensions and also an erased column).

- value is superfluous for end dating and retracted records.

Please feel free to correct me if I'm wrong.

Bas

Roenbaeck

Bas, that's a good summary! I'm preparing for the speech tomorrow, but I did have time to make the partition-solution better. Right now I like it the most, and I also get really attractive execution plans, so I believe it will perform really well.

BasVanDenBerg

In implementation D, you renamed reliability into erased (0,1). But because there is also an erasedAt column, the erased bit is redundant (it can be derived. i.e. erased == erasedAt NOT NULL ). But I guess you need this bit to create the partition function?

Also you choose to end-date the rating by using a ? value. This is not a very generic approach in my perspective. (e.g. it does not work for ties ). But maybe this is done because end dating of valid time should be modeled explicitly like you explain in the other topic How to handle data corrections?

I.e. model an isActive bit in a tie, use begin and end date attributes when you know that an anchor is bound to a time period (e.g. a subscription anchor) and use a character like ? in combination with validDate to end-date attributes.

The disadvantage in my view is that the end-dating is domain-specific (or possibly anchor/tie/attribute specific). So you cannot write generic views and etl that handle end-dating.

Maybe this is a topic on itself...

In Implementation D, the transaction time is equal to erasedAt (when not null) or else recordedAt. when erasedAt is not empty the recordedAt value shows the original recording date which has been erased by this record. But this recording date can also be retrieved by looking at the original record. So it's redundant. Can't we merge the recordedAt and erasedAt columns? This also solves the issue that the erased bit is redundant with respect to erasedAt.

The partitioning approach is a nice alternative to having an erased table. You don't have the double amount of tables. A downside might be that tables get a bit more complex and users who don't want to use the recording time, still get to see it. Maybe a solution for this is to make a latest view without recording time. Like you have in current anchor models.

E.g. lFI_FinancialInstrument is based on llFI_FinancialInstrument and filters out the recordedAt fields.

This is also a nice backwards compatibility feature.

I think that more than 90 % of the query's don't use the recording time, so let's not explode the latest views with fields that are rarely used.

Another solution would be to filter it out in your datamart.

I'm just thinking out loud. I hope you can follow me.

Roenbaeck

Bas, all of these are good ideas. And yes, I will be keeping an l-view compatible with mono-temporal AM as well as the new ll-views when using bi-temporal AM.

Unfortunately I need the erased bit for the partitioning function. It is annoying that you cannot use expressions in such functions.

If I understand your suggestion on removing the erasedAt column correctly you would insert an additional row in order to delete something? <1, 'A', 2001, 0> is a valid row, then <1, 'A', 2002, 1> marks it as deleted in 2002? Unfortunately you then lose the benefit of partitioning, since you will have "deleted" rows left in your active partition.

Roenbaeck

After the presentation at Ordina there was a discussion if you could use the overflow partition in order to get rid of the bit-column (Erased) from the example scripts. It turns out that you can!

Here's a short proof of concept: http://pastebin.com/FDASBRAw

This surely makes life easier and the implementation much cleaner.

Roenbaeck

Turns out there's another problem then, a nullable column cannot be part of the primary key, and the partitioning column must be part of the primary key. In other words, it is not possible to have a nullable partitioning column...

Any ideas how to get around that?

Roenbaeck

The only way I could think of is to use a reserved value, and since the maximum value of the different time types vary, I must use the minimum value, 0. In order not to make the output confusing I would add a conversion from 0 to null in the views, as shown in this script:

http://pastebin.com/CWFDf3qH

BasVanDenBerg

Could you please paste the complete code of your final approach?

In your last version, the primary key did not contain erasedAt.

The transaction time is equal to erasedAt (when not null) or else recordedAt. This is the time at which the assertion or retraction of this fact is done. right?

One step I didn't get until our talk at the Ordina presentation is that retraction is done by updating a row, not by inserting a new row. So that's why you need both erasedAt and recordedAt (else you would lose the original recordedAt date)

What if we just use a primary key consisting of recordedAt and an erased bit ? (so we don't use the erasedAt column). If erased=1 then recordedAt is the erased date. Now we only lost the original recording date. Maybe we can make a new column for this (assertedAt).

When new facts are asserted we fill the recordedAt and assertedAt columns.

When facts are retracted we set the erased bit to true and recordedAt to the retraction date.

It admit, this approach is very similar to your initial approach, but the only difference is that transaction time is not spread out over two columns and the primary key can be kept simple (anchor_key, changedAt, recordedAt, erased). The downside is the existence of the assertedAt column. And the fact that you have to update....

Bas

Roenbaeck

I have no code for the final approach yet, as I am building it directly into the tool. You'd have to check out the code from our repository at Google Code and generate a script after switching to "Bitemporal" under the "Settings" menu in order to test it. It's far from complete though, but the tables and some triggers, views and functions are in place.