With the help of Nikolay Golov (avito.ru), support for generating unitemporal Vertica implementations has been added to the test version of the online modeling tool. Vertica is an MPP columnar database, and Avito runs a cluster of 12 nodes with 50TB of data in an Anchor model. A relational Big Data solution that outperforms previously tested NoSQL alternatives!
In Vertica there are three distribution models, and they happen to coincide with Anchor modeling constructs.
Broadcast
The data is available locally (duplicated) on every node in the cluster. This suits knots very well, since they may be joined from both ties and attributes anywhere.
Segment
The data is split according to some operation across the nodes. For example, a modulo operation on the identity column in the anchor and the attribute tables could be used to determine on which node data should end up. This keeps an instance of an entity and its history of changes together on the same node.
Project
The data necessary to do a cross-node join is stored across the nodes, for all directions of the join. Ties fit this purpose perfectly and do not introduce any overhead, since they only contain the columns on which joins are done.
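To make the Segment model above concrete, here is a minimal sketch in Java that simulates a modulo operation on the identity column. It is only an illustration and not Vertica code: the class name, the identity value and the table labels are made up, and the node count is taken from the 12-node cluster mentioned earlier. It shows why the anchor row and all attribute rows of one instance end up on the same node.

```java
// Hypothetical simulation of modulo segmentation; this is not Vertica code.
final class ModuloSegmentationSketch {

    // Returns the index of the node that stores rows with the given identity.
    static int nodeFor(long identity, int nodeCount) {
        // floorMod keeps the result in [0, nodeCount) even for negative ids.
        return (int) Math.floorMod(identity, (long) nodeCount);
    }

    public static void main(String[] args) {
        int nodeCount = 12;       // cluster size mentioned in the post
        long restaurantId = 4711; // made-up identity value

        // The anchor row and every attribute row carry the same identity,
        // so the same function sends all of them to the same node.
        String[] tables = {"anchor", "name attribute", "rating attribute"};
        for (String table : tables) {
            System.out.println(table + " row for id " + restaurantId
                    + " -> node " + nodeFor(restaurantId, nodeCount));
        }
    }
}
```

Since the node is a pure function of the identity, historized attribute rows (several rows per identity) are placed together with their anchor row as well.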
Anchor modeling fits MPP extremely well, since the constructs are designed in such a way that the MPP is utilized as efficiently as possible. A knotted attribute or tie can be resolved locally on a node, and a join over a tie cannot be more efficient, while the assembly of the results from different nodes is trivial.
Incidentally, if you search for issues with Vertica, you will find that implementations may suffer from having to create lots of projections or broadcasts of wide tables in order to support ad-hoc querying. This is a natural result of using less normalized models. Anchor modeling, on the contrary, works “out of the box” for ad-hoc querying. Furthermore, what should be projected and broadcast is known beforehand, so it is not tuning work a DBA would have to worry about on a daily basis.
Since Vertica does not have table-valued functions or stored procedures, only CREATE TABLE statements are generated so far. We are still working on providing some form of latest views.
One pattern that I think is common in MPP designs is to distribute entities not by their own ID, but by the ID of another entity that “owns” them (the “one” side of a one-to-many relationship that is the main join path for the entity being distributed). Using a restaurant example, I might want to distribute all restaurants, tables, and seats (and their attributes) by restaurant ID, since seats will always be joined to tables, which will always be joined to restaurants. I *think* anchor modeling would allow that in principle, but I suspect that this functionality is currently not automated via the modeler, right?
Indeed, Anchor imposes no such constraint in theory. However, given the technology we have to create implementations with in practice, limitations may arise. In the code currently generated for Vertica we use the dumb/technical/surrogate/generated id with a modular hash distribution function. I am assuming this will just evenly distribute instances across the nodes, making it close to impossible to achieve co-location of a restaurant and its tables. It does, however, look like it is possible to provide a rich expression instead of the modular hash. The syntax is:
SEGMENTED BY expression { ALL NODES [ OFFSET offset ] | NODES node [ ,... ] }
Presumably, with an expression that returns the same value for a restaurant and its tables, you should be able to co-locate these on the same node. I expect that you will get a performance hit at write-time, but the benefits at read-time may certainly outweigh that.
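The difference between segmenting by an entity's own id and by the id of its owner can be sketched with a small Java simulation. This is an assumption-laden illustration, not Vertica code: the class name and all identities are made up, a plain modulo stands in for the segmentation expression, and it assumes the owning restaurant's id is available in every row being segmented.

```java
// Hypothetical simulation; identities and names are made up for illustration.
final class OwnerSegmentationSketch {

    // Plain modulo stands in for the segmentation expression.
    static int nodeFor(long segmentationKey, int nodeCount) {
        return (int) Math.floorMod(segmentationKey, (long) nodeCount);
    }

    public static void main(String[] args) {
        int nodeCount = 12;
        long restaurantId = 42;               // the owning entity
        long[] tableIds = {1001, 1003, 1004}; // tables owned by restaurant 42

        int restaurantNode = nodeFor(restaurantId, nodeCount);
        System.out.println("restaurant " + restaurantId + " -> node " + restaurantNode);

        for (long tableId : tableIds) {
            // Segmenting on the table's own id scatters the tables.
            int byOwnId = nodeFor(tableId, nodeCount);
            // Segmenting on the owner's id follows the restaurant.
            int byOwner = nodeFor(restaurantId, nodeCount);
            System.out.println("table " + tableId
                    + ": by own id -> node " + byOwnId
                    + ", by owner id -> node " + byOwner);
        }
    }
}
```

The catch is that the expression can only use what is in the row, so the owner's id would have to be carried in each table that should follow the restaurant.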
The only other MPP platform I have some experience with, HBase, also supports this behaviour through writing a custom RegionSplitPolicy, such as:
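// RestaurantSplitPolicy is a custom class extending RegionSplitPolicy (not shown)
HTableDescriptor tableDesc = new HTableDescriptor("example-table");
// use the custom policy when regions of this table are split
tableDesc.setValue(HTableDescriptor.SPLIT_POLICY, RestaurantSplitPolicy.class.getName());
// add column families etc.
// admin is assumed to be an existing HBaseAdmin instance
admin.createTable(tableDesc);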
For some reason, writing custom split policies seems to be an underused feature of HBase, so your mileage may vary.