The idea behind a Massively Parallel Processing Relational Database Management System is that each node that participates in the cluster should be as autonomous as possible. We have shown earlier that, using the distribution techniques available in, for example, HP Vertica, it is possible to co-locate an instance of an anchor and all of its attributes along with any knots they relate to and any ties the instance participates in. Recently, we were asked how to additionally co-locate closely related instances.
In a very simple thought experiment, this is actually not so difficult to do. Imagine a model of restaurants and their menus. It is reasonable to assume that in many cases an instance of a restaurant and its related menu instances would be part of the same result set. If we can co-locate these instances on the same node, the node will become even more autonomous. However, using a simple modular hash on the identities of restaurants and menus will distribute them evenly across the nodes, but for the most part will not keep related instances together.
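To make the problem concrete, here is a minimal sketch of the naive case. It assumes three nodes, placement by id modulo 3, and menu ids drawn from a single shared sequence; the `menu_owner` mapping is made-up illustration data, not taken from any real model.

```python
NODES = 3  # assumed number of nodes in the cluster


def node_of(identity: int) -> int:
    """Node on which an instance is located: id modulo the node count."""
    return identity % NODES


# Hypothetical data: menu id -> restaurant id, where menu ids come from
# one shared sequence 1, 2, 3, ... regardless of the owning restaurant.
menu_owner = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3}

# Menus that happen to land on the same node as their restaurant.
co_located = [m for m, r in menu_owner.items() if node_of(m) == node_of(r)]
print(co_located)  # [1, 6]
```

With this data only two of the six menus end up on the same node as their restaurant, which is what one would expect on average when placement is independent across three nodes.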
Let us look at a possible solution using some pseudo code.
Let 3 be the number of nodes in the cluster.
Let s_restaurant be a sequence starting with 1 incremented by 1.
Let s_menu0 be a sequence starting with 3 incremented by 3.
Let s_menu1 be a sequence starting with 1 incremented by 3.
Let s_menu2 be a sequence starting with 2 incremented by 3.
Let {1, 2, 3, 4, 5, 6} be ids of six restaurants.
Let id modulo 3 determine the node on which to locate an instance.
Then restaurants {3, 6} will be co-located on node 0.
Then restaurants {1, 4} will be co-located on node 1.
Then restaurants {2, 5} will be co-located on node 2.
Let s_menu0 generate ids for menus of restaurants on node 0.
Let s_menu1 generate ids for menus of restaurants on node 1.
Let s_menu2 generate ids for menus of restaurants on node 2.
Find the node of restaurant 4 through 4 modulo 3 = 1.
Use s_menu1 to create menus {1, 4, 7, 10} at restaurant 4.
Then menus {1, 4, 7, 10} will be co-located on node 1.
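The pseudo code above can be sketched in Python as follows. This is only an illustration of the idea, not database code: the names `node_of`, `menu_sequences` and `next_menu_id` are hypothetical, and in a real cluster the sequences would be database sequence objects rather than in-memory counters.

```python
from itertools import count

NODES = 3  # number of nodes in the cluster, as in the pseudo code


def node_of(identity: int) -> int:
    """Node on which an instance is located: id modulo the node count."""
    return identity % NODES


# One menu sequence per node. Sequence k only yields ids with id % NODES == k.
# The sequence for node 0 starts at NODES (3) so that no id is 0.
menu_sequences = {
    k: count(start=k if k != 0 else NODES, step=NODES) for k in range(NODES)
}


def next_menu_id(restaurant_id: int) -> int:
    """Draw a menu id from the sequence belonging to the restaurant's node."""
    return next(menu_sequences[node_of(restaurant_id)])


# Restaurants 1..6 grouped by node.
restaurants_by_node = {
    k: [r for r in range(1, 7) if node_of(r) == k] for k in range(NODES)
}
print(restaurants_by_node)  # {0: [3, 6], 1: [1, 4], 2: [2, 5]}

# Four menus for restaurant 4, which lives on node 1.
menus_of_4 = [next_menu_id(4) for _ in range(4)]
print(menus_of_4)  # [1, 4, 7, 10]
```

Every menu id drawn this way has the same remainder modulo 3 as its restaurant, so the plain modulo placement puts the menus on the restaurant's node without any extra lookup.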
The drawback is that you need as many sequences as you have nodes, and that the number of nodes may not stay the same over time. However, it may be possible to apply similar logic to cover the redistribution of instances when a node is added to the cluster.