Anchor Modeling can be used with almost any relational database, even temporal ones, but there are four features that significantly improve query performance: table elimination, clustered indexes, foreign keys, and statistics. Note that some vendors refer to the first two as ‘join removal’ and ‘index organized tables’. In order to get the best performance in Bitemporal Anchor Modeling a fifth feature is necessary: partitioning. It is not required, but when used, it partitions recording time into that which is currently recorded and that which has been erased.
The Anchor Modeler, our online modeling tool, currently only supports the automatic generation of code targeted for Microsoft SQL Server. Support for other vendors is in the works.
Clustered indexes are less important when using solid state disks, but without them the total size of the database, including indexes, will be larger. Table elimination is less important if every query run against the database is crafted by hand rather than automatically generated, for example by an analytical or reporting tool. However, if you want to use the views and table valued functions providing temporal perspectives of the data, table elimination is crucial.
Supported Browsers
Please use a browser that supports HTML5 canvas and its 2D drawing context if you want to use the online modeling tool. Here is a list of tested browsers.
Browser | Version | Performance | Functionality |
---|---|---|---|
Chrome | 6 – 15 | Fast | Full |
Chrome | 5 | Fast | Almost full |
Canary | 17 | Fast | Full, but expect some crashes |
Firefox | 8 | Fast | Full |
Firefox | 4 | Slow | Full |
Firefox | 3.6 | Slow | Full, linefeed bug in generated SQL |
Safari | 5+ | Good | Almost full, cannot load local files |
Opera | 11.5 | Slow | Almost full |
Opera | 10.6 | Good | Partial |
Internet Explorer | 9 | Fast | Almost no functionality |
Table Elimination Support
Anchor Modeling relies on a query optimization feature called table elimination to achieve high performance. Many modern database engines support this technique, as can be seen in the table below. Anchor Modeling can be used with all relational databases, but performance will be better in those that support table elimination.
Database Engine | Support | Example |
---|---|---|
Microsoft SQL Server 2005 | full | SQL Server Example |
Microsoft SQL Server 2008 | full | SQL Server Example |
Oracle 10gR2 Express Edition* | partial | Oracle Example |
Oracle 11gR1 Enterprise/Express Edition | full | Oracle Example |
IBM DB2 v9.5 | full | IBM DB2 Example |
PostgreSQL v8.4 beta | full | PostgreSQL Example |
Teradata v12** | partial | Teradata Example |
MySQL v5.0.70 | none | MySQL Example |
MySQL v6.0.10 alpha | none | MySQL Example |
MariaDB v5.1 | full | MariaDB Example |
Sybase | not tested | – |
Partial support means that table elimination works for non-historized attributes, but not for historized ones. These results are based on the example scripts; it may still be possible to rewrite their SQL to get full support for table elimination. Should anyone be able to do such a rewrite, please let us know!
* As it turns out, our example for Oracle 10g does not work as intended. While you do get table elimination, statistics cannot be used to obtain an optimal plan. It seems that support is limited in 10g and only full in 11g.
** We have received reports that Teradata v13 has full table elimination support when using subqueries in the select part of a statement.
Table Elimination Explained
Table elimination is a feature found in many modern query optimizers. Basically, what table elimination does is remove tables from the execution plan when it is unnecessary to include them. This can, of course, only happen under the right circumstances. As an example, look at the following query:
select A.colA from tableA A left outer join tableB B on B.id = A.id;
When using A as the left table we ensure that the query will return at least as many rows as there are in that table. For rows where the join condition, B.id = A.id, is not met the selected column, A.colA, will contain a NULL value.
However, the result set could actually contain more rows than are found in tableA if there are duplicates of the column B.id in tableB. If A contains a row [1, “val1”] and B the rows [1, “other1a”] and [1, “other1b”], then two rows will match the join condition. The only way to know what the result will look like is to actually touch both tables during execution.
Instead, let’s say that tableB contains rows that make it possible to place a unique constraint on the column B.id, for example, as is often the case, a primary key. In this situation we know that we will get exactly as many rows as there are in tableA, since joining with tableB cannot introduce any duplicates. If, furthermore, as in the example query, we do not select any columns from tableB, touching that table during execution is unnecessary. We can remove the whole join operation from the execution plan.
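For illustration, here is a minimal schema sketch (hypothetical table definitions with assumed column types) under which the join above can be eliminated:

create table tableA (
   id int not null primary key,
   colA varchar(42) not null
);
create table tableB (
   id int not null primary key, -- the uniqueness of B.id is what permits elimination
   colB varchar(42) not null
);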
Let us look at a more advanced query, where some query optimizers fail, and hence only provide partial support for table elimination.
select A.colA
from tableA A
left outer join tableC C
on C.id = A.id
and C.fromDate = (
   select max(sub.fromDate)
   from tableC sub
   where sub.id = A.id
);
In this example we have added another join condition, which ensures that we only pick the matching row from tableC having the latest fromDate. In this case tableC will contain duplicates of the column C.id, so in order to ensure uniqueness the primary key has to contain the fromDate column as well. In other words the primary key of tableC is (C.id, C.fromDate).
Furthermore, since the subselect ensures that we only pick the latest C.fromDate for a given C.id we know that at most one row will match the join condition. We will again have the situation where joining with tableC cannot affect the number of rows in the result set. Since we do not select any columns from tableC, the whole join operation can be eliminated from the execution plan.
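A matching sketch for tableC (again a hypothetical definition): the composite primary key guarantees at most one row per combination of id and fromDate, which together with the subselect is what makes the join eliminable.

create table tableC (
   id int not null,
   fromDate date not null, -- historized data: several versions per id
   colC varchar(42) not null,
   primary key (id, fromDate)
);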
Queries like these arise in two situations. Either when you have a denormalized model consisting of a fact table with several related dimension tables, or when you have a highly normalized model where each attribute is stored in its own table. The example with the subselect is common whenever you store historized/versioned data.
Clustered Indexes
Anchor Modeling uses clustered indexes, also called index organized tables, to gain further performance. Many database engines support this type of indexing, which physically orders the data on the media according to the indexed columns. In other words, such indexes take up no extra space, contrary to regular indexes. In fact, regular indexes are very rarely needed, and they are not a crucial feature for Anchor Modeling, even though they are present in most database engines.
Over time, or when large amounts of data are loaded at once, clustered indexes may become fragmented. In other words, the organization of the data becomes less than optimal, and a defragmentation or recreation of the index is needed in order to solve the problem. Always check the degree of fragmentation after loading large amounts of data, and also on a regular basis.
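In Microsoft SQL Server, the target of our generated code, this can be done along the following lines (a sketch; AC_NAM_Actor_Name is one of the attribute tables from the example model used later on this page):

-- check the degree of fragmentation of every index in the current database
SELECT object_name(ips.object_id) AS table_name,
       i.name AS index_name,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(db_id(), NULL, NULL, NULL, 'LIMITED') ips
JOIN sys.indexes i
ON i.object_id = ips.object_id
AND i.index_id = ips.index_id;

-- reorganize for light fragmentation, rebuild for heavy fragmentation
ALTER INDEX ALL ON AC_NAM_Actor_Name REORGANIZE;
ALTER INDEX ALL ON AC_NAM_Actor_Name REBUILD;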
Note that indexes slow down inserts, so in databases that support it, switching them off before loading large amounts of data and then switching them back on may speed up the loading process significantly, provided that you trust the loading process not to introduce rows that violate the index constraints.
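In SQL Server, switching an index off and on corresponds to disabling and rebuilding it. The sketch below uses hypothetical names; note that disabling a clustered index makes the table itself inaccessible, so only nonclustered indexes should be switched off this way.

ALTER INDEX IX_SomeAttribute ON SomeAttributeTable DISABLE;
-- ... load large amounts of data ...
ALTER INDEX IX_SomeAttribute ON SomeAttributeTable REBUILD;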
Foreign Keys
Having foreign keys declared is paramount for maintaining referential integrity in the database, but they also improve select query performance. There is a second type of table elimination that can be achieved only when foreign keys are declared. In Anchor Modeling, this type will eliminate anchors from queries in which there is at least one where condition on any of their attributes. Foreign keys also help the optimizer in many other ways.
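As a sketch, using tables from the example model in the next section (the constraint name FK_AC_GEN is hypothetical), such a declaration looks as follows:

ALTER TABLE AC_GEN_Actor_Gender
ADD CONSTRAINT FK_AC_GEN
FOREIGN KEY (AC_ID) REFERENCES AC_Actor (AC_ID);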
As with indexes, performance when loading large amounts of data can be improved by temporarily deferring the foreign key checks until after the loading is done.
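In SQL Server, foreign key checks can be switched off and on per table, as in this sketch (reusing the hypothetical constraint name from above; the WITH CHECK option makes the re-enabled constraint verify the loaded rows, so that the optimizer can trust it again):

ALTER TABLE AC_GEN_Actor_Gender NOCHECK CONSTRAINT FK_AC_GEN;
-- ... load large amounts of data ...
ALTER TABLE AC_GEN_Actor_Gender WITH CHECK CHECK CONSTRAINT FK_AC_GEN;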
Statistics
In order to achieve near optimal execution plans, the query optimizer needs to use statistics of your stored data. This can be illustrated using an example. Below is the latest view for the Actor anchor and attributes in Transact-SQL code:
CREATE VIEW lAC_Actor AS
SELECT
   AC.AC_ID,
   AC_NAM.AC_NAM_Actor_Name,
   AC_NAM.AC_NAM_ChangedAt,
   GEN.GEN_ID,
   GEN.GEN_Gender,
   PLV.PLV_ID,
   PLV.PLV_ProfessionalLevel,
   AC_PLV.AC_PLV_ChangedAt
FROM AC_Actor AC
LEFT JOIN AC_NAM_Actor_Name AC_NAM
   ON AC_NAM.AC_ID = AC.AC_ID
   AND AC_NAM.AC_NAM_ChangedAt = (
      SELECT max(sub.AC_NAM_ChangedAt)
      FROM AC_NAM_Actor_Name sub
      WHERE sub.AC_ID = AC.AC_ID
   )
LEFT JOIN AC_GEN_Actor_Gender AC_GEN
   ON AC_GEN.AC_ID = AC.AC_ID
LEFT JOIN GEN_Gender GEN
   ON GEN.GEN_ID = AC_GEN.GEN_ID
LEFT JOIN AC_PLV_Actor_ProfessionalLevel AC_PLV
   ON AC_PLV.AC_ID = AC.AC_ID
   AND AC_PLV.AC_PLV_ChangedAt = (
      SELECT max(sub.AC_PLV_ChangedAt)
      FROM AC_PLV_Actor_ProfessionalLevel sub
      WHERE sub.AC_ID = AC.AC_ID
   )
LEFT JOIN dbo.PLV_ProfessionalLevel PLV
   ON PLV.PLV_ID = AC_PLV.PLV_ID;
Now let’s say that we want to answer how many women have reached the highest professional acting level. This can be done using the following query on the latest view:
SELECT count(*) FROM lAC_Actor WHERE GEN_Gender = 'Female' AND PLV_ProfessionalLevel = 'Oscar';
There is no need for any information from the Name attribute, so thanks to table elimination it will not be touched during execution. Since we are only interested in two particular non-null values, ‘Female’ and ‘Oscar’, from the other attributes, the anchor can also be eliminated, as foreign keys have been declared. Furthermore, when conditions are set on the attributes like this, the outer joins become inner joins. After all these optimizations, there are two possibilities for what the remaining query may look like, or rather, execute like:
SELECT count(*)
FROM AC_GEN_Actor_Gender AC_GEN
JOIN AC_PLV_Actor_ProfessionalLevel AC_PLV
ON AC_GEN.AC_ID = AC_PLV.AC_ID
WHERE AC_GEN.GEN_ID = 1
AND AC_PLV.PLV_ID = 10;
In the query above, the gender attribute will first be scanned and all women picked out, then this intermediate result set will be joined with their professional levels, and finally all those that do not have the highest level are removed. The other choice is to start with the professional levels:
SELECT count(*)
FROM AC_PLV_Actor_ProfessionalLevel AC_PLV
JOIN AC_GEN_Actor_Gender AC_GEN
ON AC_PLV.AC_ID = AC_GEN.AC_ID
WHERE AC_PLV.PLV_ID = 10
AND AC_GEN.GEN_ID = 1;
In order to determine the optimal join order, that is, which of the two queries above to pick, the optimizer needs statistics. If men and women are evenly distributed, then starting with gender will remove about half the rows for the first intermediate result set. If there are ten professional levels with most actors at medium levels, then starting with the professional level will remove most rows for the first intermediate result set. This is much more efficient, and the optimizer will strive to find the join order that produces the smallest intermediate result set first, continuing progressively through the joins with as few rows as possible.
When legacy data is migrated, large numbers of new rows will be created, skewing any already gathered statistics. It is therefore a good idea to update all statistics after loading a large amount of data, and also on a regular basis, if this is not done automatically by the database. Even so, experienced users will often know more about the data than the database does, so it may also be a good idea to check the execution plans of queries from time to time to ensure that they are optimal.
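In SQL Server, for example, statistics can be refreshed like this (a sketch; WITH FULLSCAN samples every row and is the slower but more precise option):

UPDATE STATISTICS AC_PLV_Actor_ProfessionalLevel WITH FULLSCAN; -- a single table
EXEC sp_updatestats; -- or every table in the database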
Also, believe it or not, Italians write in SQL. Translating SQL into Italian is a funny idea, but somewhat impractical.
Thanks for pointing this out. The translations are automatically generated by Google Translate, but apparently it doesn’t do so well, so I have removed the functionality for now.
I have noticed a huge number of logical reads and scans when deploying a simple schema (1 anchor, 3 static attributes, 1 historized attribute) in CRT mode: 76 entities and only 1 attribute posit/annex per entity and attribute. I get about 3000 scans/6000 reads per annex table, and a lot of CPU is burned on the top-1 sort. That does not look like high performance to me.
Would it be possible to provide a sample model and data so I can look into why you are getting that many scans and reads?
Possible error in insert-trigger:
In the instead-of-insert-trigger for an anchor (in our case [itCA_Caller]) a case-expression similar to the following occurs for every historized attribute:
——
CASE
   WHEN [dbo].[rfCA_BRA_Caller_CallerBranch](
      v.CA_BRA_CA_ID,
      v.CA_BRA_Caller_CallerBranch,
      v.CA_BRA_ChangedAt
   ) = 1
   THEN 'R' -- restatement
   WHEN [BRA].CA_BRA_CA_ID is not null
   THEN 'D' -- duplicate
   ELSE 'N' -- new statement
END
——
The function [rfCA_BRA_Caller_CallerBranch] returns 1 if the last value before CA_BRA_ChangedAt and/or the first value after CA_BRA_ChangedAt is equal to the given CA_BRA_Caller_CallerBranch (for a given CA_ID).
The problem is that in some cases duplicates are marked as restatements (which results in a key-violation error when trying to insert).
For example: if in our case the combination [CA_BRA_CA_ID, CA_BRA_Caller_CallerBranch, CA_BRA_ChangedAt] already exists (so it is in fact a duplicate), and there is also an adjacent record with the same [CA_BRA_CA_ID, CA_BRA_Caller_CallerBranch] but a different ChangedAt, then the function [rfCA_BRA_Caller_CallerBranch] returns 1. Because the SQL Server case expression returns the result of the FIRST when-expression that evaluates to true, the inserted record is then marked as a restatement.
Is this indeed the case, or am I missing something?
Kind regards,
Mark Versteegh
—
I think an easy fix would be to move the duplicate-check before the restatement-check.
This bug has already been fixed in the test version (http://www.anchormodeling.com/modeler/test) of the tool, and precisely in the way you suggested, by rearranging the order of the when-expressions. The code generation in the test version is stable, but we have a couple of minor GUI bugs that need to be sorted out before release.
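For reference, the corrected expression simply tests for duplicates before restatements (a sketch based on the code quoted above):

CASE
   WHEN [BRA].CA_BRA_CA_ID is not null
   THEN 'D' -- duplicate, now checked first
   WHEN [dbo].[rfCA_BRA_Caller_CallerBranch](
      v.CA_BRA_CA_ID,
      v.CA_BRA_Caller_CallerBranch,
      v.CA_BRA_ChangedAt
   ) = 1
   THEN 'R' -- restatement
   ELSE 'N' -- new statement
END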
Thanks for the quick response. I tried the test version of the tool, but now I get a duplicate key-error when trying to insert a changed value for a static attribute.
In the release version these attributes were left unchanged (if an attribute value already exists for the same anchor ID then no insert takes place), but in the test version there is a left join on the anchor ID and the attribute value, which works fine for duplicates but handles changed values as ‘new’.
I would expect the value to be updated in this case, but I’m not 100% sure if this is the intended behavior.
This should now be fixed in the test version. The intended behaviour is now to do nothing when using the triggers on the views, but you can still update a value by directly using the attribute table. A warning message will be shown, though, when you try to update static attributes.
Hello,
Please help me find out where I can get an example DWH using an anchor model.
Thanks!
Here is a “skeleton” model used to build a real data warehouse (which had additional attributes added):
http://www.anchormodeling.com/modeler/latest/?id=ahNzfmFuY2hvcm1vZGVsZXItaHJkcg0LEgVNb2RlbBjKshEM
Hi,
At the moment we need to clean some customer data from our anchor model setup. If we just do an update on the last status of a customer, the history is still there. Is there a quick way to remove all data for a given key?
Thanks. Sjacco from M-Wise
If you want to physically remove the data I would suggest doing deletes on the latest views. For example:
declare @r int;
set @r = 1;
while (@r > 0)
begin
   delete lAC_Actor
   where AC_ID in (
      select AC_ID from #AC_ID_toBeDeleted
   );
   set @r = @@ROWCOUNT;
end
This will delete all attribute data as well as the anchor data. The loop is necessary in order to delete all history in historized attributes. If you also have related data that needs to be deleted, you will have to roll your own “cascading delete”, for example:
create table #PE_ID_toBeDeleted (
   PE_ID int not null
);

delete AC_wasCast_PE_in
output deleted.PE_ID_in
into #PE_ID_toBeDeleted
where AC_ID_wasCast in (
   select AC_ID from #AC_ID_toBeDeleted
);
After which you can use #PE_ID_toBeDeleted to do a similar looped delete on the lPE_Performance latest view, as sketched below.
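The second loop mirrors the first one (a sketch, assuming the names used above):

declare @r int;
set @r = 1;
while (@r > 0)
begin
   delete lPE_Performance
   where PE_ID in (
      select PE_ID from #PE_ID_toBeDeleted
   );
   set @r = @@ROWCOUNT;
end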
Of course, you may want to keep performances that involved other actors than the ones deleted. In other words, it is hard to fully automate deletes, since there may be underlying business logic governing them.

I’m attempting to use AM to create 6NF + 3NF views in AWS Redshift. I’m using the PostgreSQL generation scheme, but after the model got a little bit complicated, it fails to generate SQL. Maybe I’m doing something incorrect, or maybe it is a limitation of the modeler? I’d like to have someone take a look at my model to see which of these things is the problem.
Hi,
Your work is awesome! Thanks a lot for sharing it! I’m eager to try “Privacy actions” and “Encryption groups” (can they help with an ABAC implementation?)!
I am analyzing an architecture on top of Anchor Modeling for a big Swiss telecommunication company.
The mnemonic concept is very useful ;) Is it possible to include the descriptions that we enter in the GUI in the _Anchor, _Attribute, _Knot, and _Tie views? I only see them in the XML file:
My address description
Thanks a lot
Waldimiro Rossi, Swiss Impact Ltd.
I have added the descriptions in test version 0.99.4 (now live).
Thank you very much! When is version 1.0 planned?
You can consider the SQL Server support to already be production quality. I was hoping to get support for a couple more databases up to better quality before I announce 1.0.
Hi Lars, can you contact me please via email.