20 September 2011

Non-associativity of SQL table joins

Some time back I was trying to understand how Microsoft SQL Server handles bracketing and order of operations on table joins. Here's a breakdown I put together of how it works. Possibly boring, although possibly useful for not shooting oneself in the foot.

In short: the order and bracketing of table joins may matter if you're mixing different types of joins together.

Background Information

This is the boring stuff. Skip down to the next major section if already familiar.

What is associativity

Consider the following equations:
(1 + 2) + 3 = 6
1 + (2 + 3) = 6
Both expressions give the same result irrespective of the bracket grouping. Addition is said to be an associative operation.

However, the same is not true for subtraction.
(1 - 2) - 3 = -4
1 - (2 - 3) = 2
Subtraction is a non-associative operation.

What is commutativity

Consider the following equations:
a + b = b + a
a / b != b / a
Addition is commutative, but division is not. That is, for division the order of the operands matters.

It should be no surprise that in T-SQL inner joins are commutative, but left and right joins are not; rather, they swap roles when the operands are swapped.
  • A inner join B is equivalent to B inner join A
  • A left join B is equivalent to B right join A

Left-hand associativity in the absence of bracketing

T-SQL parses joins using left-hand associativity. That is, there is an implicit grouping to the left.
  • A join-type1 B join-type2 C is equivalent to (A join-type1 B) join-type2 C, which implies that
  • C join-type1 B join-type2 A is equivalent to A reversed-join-type2 (B reversed-join-type1 C)
This is to say, the following discussion applies equally to join ordering as it does to join grouping.

(Non) Associativity of SQL joins

Even if two SQL statements present their tables in the same order, and use the same join types and ON clauses, they may still give different results if the second join is bracketed.

This is because left and right joins produce null values in the absence of matching rows in the auxiliary table. These null rows survive if a second, restrictive, join is applied only to the auxiliary table. However, if the grouping of joins is such that the second, restrictive, join applies to the whole expression, then these null rows don't survive.

In particular:

Left-Hand Bracketed     Right-Hand Bracketed    Equivalence
(A inner B) inner C     A inner (B inner C)     Equivalent
(A left B) inner C      A left (B inner C)      Not equivalent
(A right B) inner C     A right (B inner C)     Equivalent
(A full B) inner C      A full (B inner C)      Not equivalent
(A inner B) left C      A inner (B left C)      Equivalent
(A left B) left C       A left (B left C)       Equivalent
(A right B) left C      A right (B left C)      Equivalent
(A full B) left C       A full (B left C)       Equivalent
(A inner B) right C     A inner (B right C)     Not equivalent
(A left B) right C      A left (B right C)      Not equivalent
(A right B) right C     A right (B right C)     Equivalent
(A full B) right C      A full (B right C)      Not equivalent
(A inner B) full C      A inner (B full C)      Not equivalent
(A left B) full C       A left (B full C)       Not equivalent
(A right B) full C      A right (B full C)      Equivalent
(A full B) full C       A full (B full C)       Equivalent

The same information can be presented (row for row) as:

Forward ordering        Reverse ordering        Equivalence
A inner B inner C       C inner B inner A       Equivalent
A left B inner C        C inner B right A       Not equivalent
A right B inner C       C inner B left A        Equivalent
A full B inner C        C inner B full A        Not equivalent
A inner B left C        C right B inner A       Equivalent
A left B left C         C right B right A       Equivalent
A right B left C        C right B left A        Equivalent
A full B left C         C right B full A        Equivalent
A inner B right C       C left B inner A        Not equivalent
A left B right C        C left B right A        Not equivalent
A right B right C       C left B left A         Equivalent
A full B right C        C left B full A         Not equivalent
A inner B full C        C full B inner A        Not equivalent
A left B full C         C full B right A        Not equivalent
A right B full C        C full B left A         Equivalent
A full B full C         C full B full A         Equivalent

In some cases equivalence can be restored by adding a WHERE clause that removes the null rows; however, this correction can generally only be made in one direction.

The instance of A left B inner C is demonstrated as follows. The first expression has no bracketing, which is equivalent to left-hand bracketing because the T-SQL parser treats joins as left-associative.
with
A(a) as ( select 1 union select 3 union select 5 union select 7 ), -- A contains 1,3,5,7
B(b) as ( select 2 union select 3 union select 6 union select 7 ), -- B contains 2,3,6,7
C(c) as ( select 4 union select 5 union select 6 union select 7 ) -- C contains 4,5,6,7
select a, b, c
from A
left join B on B.b = A.a
join C on C.c = B.b
The query above returns

a b c
7 7 7

However, if we bracket the second join, then the result changes:
with
A(a) as ( select 1 union select 3 union select 5 union select 7 ), -- A contains 1,3,5,7
B(b) as ( select 2 union select 3 union select 6 union select 7 ), -- B contains 2,3,6,7
C(c) as ( select 4 union select 5 union select 6 union select 7 ) -- C contains 4,5,6,7
select a, b, c
from A
left join (
    B
    join C on C.c = B.b
) on B.b = A.a
This query returns:

a b c
1 null null
3 null null
5 null null
7 7 7

The reason here is that when C is inner joined to B first, it constrains B, but then the entire group has no impact on A, due to the left join. Therefore A can return all of its rows. However, in the previous query, the left join between A and B occurs first. The more restrictive inner join is then applied to the combined rows, so the rows from A that were padded with nulls are discarded.
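
The behaviour above can be checked outside SQL Server with a small model. The following is a minimal sketch in Python, not T-SQL: the `Rel` type, the `join` helper and the `left_grouped`/`right_grouped` functions are invented for illustration, NULL is modelled as `None`, and every ON clause is assumed to be a simple equality between adjacent tables (b = a and c = b), as in the queries above. It runs both groupings for every pair of join types over the same A, B and C rows, and also shows the one-directional WHERE-style fix.

```python
from collections import Counter, namedtuple

# A relation is a tuple of column names plus a list of row tuples.
Rel = namedtuple("Rel", "cols rows")

def join(kind, left, right, lcol, rcol):
    """Nested-loop join on left.lcol = right.rcol.

    kind is 'inner', 'left', 'right' or 'full'. A None key never matches,
    which mirrors how NULL = anything is not true in SQL.
    """
    li, ri = left.cols.index(lcol), right.cols.index(rcol)
    rows, matched_right = [], set()
    for l in left.rows:
        hit = False
        for j, r in enumerate(right.rows):
            if l[li] is not None and l[li] == r[ri]:
                rows.append(l + r)
                matched_right.add(j)
                hit = True
        if not hit and kind in ("left", "full"):
            # Outer join keeps the unmatched left row, padded with nulls.
            rows.append(l + (None,) * len(right.cols))
    if kind in ("right", "full"):
        for j, r in enumerate(right.rows):
            if j not in matched_right:
                # Outer join keeps the unmatched right row, padded with nulls.
                rows.append((None,) * len(left.cols) + r)
    return Rel(left.cols + right.cols, rows)

A = Rel(("a",), [(1,), (3,), (5,), (7,)])
B = Rel(("b",), [(2,), (3,), (6,), (7,)])
C = Rel(("c",), [(4,), (5,), (6,), (7,)])

def left_grouped(k1, k2):
    """(A k1 B on b = a) k2 C on c = b"""
    return join(k2, join(k1, A, B, "a", "b"), C, "b", "c")

def right_grouped(k1, k2):
    """A k1 (B k2 C on c = b) on b = a"""
    return join(k1, A, join(k2, B, C, "b", "c"), "a", "b")

KINDS = ("inner", "left", "right", "full")

# Compare the two groupings as multisets of rows, in the table's row order.
not_equivalent = [
    (k1, k2)
    for k2 in KINDS
    for k1 in KINDS
    if Counter(left_grouped(k1, k2).rows) != Counter(right_grouped(k1, k2).rows)
]

# Equivalent of adding "WHERE c IS NOT NULL" to the right-grouped query.
restored = [r for r in right_grouped("left", "inner").rows if r[2] is not None]

print(left_grouped("left", "inner").rows)
print(right_grouped("left", "inner").rows)
print(not_equivalent)
print(restored)
```

On this sample data the seven pairs collected in `not_equivalent` are the same seven rows marked "Not equivalent" in the tables. Filtering the null rows out of the right-grouped result recovers the left-grouped result, but nothing can be filtered out of the left-grouped result to get the missing rows for 1, 3 and 5 back, which is why the fix only works in one direction. Different data or more complicated ON clauses could behave differently, so treat this as a sanity check rather than a proof.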

18 September 2011

Rubik's Snake Combinations

I've had a long-standing attraction to the Rubik's Snake puzzle, including being involved with glSnake. (Scroll to the red bit for final answer)

One question of interest was how many unique configurations of the Snake are possible. The Snake consists of 24 interconnecting prisms, and so 23 interconnects, with each interconnect able to be in one of four possible positions. This leads to a trivial upper bound of 4^23 = 70,368,744,177,664 distinct combinations. But many of these are not actually possible, as they may be configurations where the snake impossibly passes through itself. And most configurations will appear twice as mirror images.

A trivial lower bound can be reached by starting with a straight snake and then only permuting every other junction. Every rotation will be along a parallel axis, giving rise to pairs of interconnects that are always on discrete planes, meaning that the snake can never bend back on itself. Halving guarantees no reflections. (Symmetrical snakes would only be counted once, but this is a lower bound.) This gives 4^11/2 = 2,097,152 as a trivial lower bound.
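
The two trivial bounds are plain arithmetic, and the "approx as exponent" figures used later are just logarithms to base 4. A minimal sketch, where the names `JUNCTIONS`, `POSITIONS` and `as_power_of_four` are made up for illustration:

```python
import math

JUNCTIONS = 23   # 24 prisms are joined by 23 rotating interconnects
POSITIONS = 4    # each interconnect can sit in one of four positions

# Every junction free: counts impossible and mirrored snakes too.
trivial_upper = POSITIONS ** JUNCTIONS

# Only 11 alternate junctions free, then halved to rule out reflections.
trivial_lower = POSITIONS ** (JUNCTIONS // 2) // 2

def as_power_of_four(n):
    """Return x such that 4**x == n (log base 4, computed via log2)."""
    return math.log2(n) / 2

print(trivial_upper, as_power_of_four(trivial_upper))
print(trivial_lower, as_power_of_four(trivial_lower))
```

Both numbers are exact powers of two, so the exponents come out as exactly 23 and 10.5.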

As an enhanced lower bound, we could observe that the snake prisms exist on a 3-dimensional grid of cubes. At any time the next piece could lie in the adjacent voxel in either the x, y or z direction (but can never follow the same axis twice). If we restrict ourselves to only ever advancing in the positive direction on each axis, then we always have a choice of one of (+y or +z) or (+x or +z) or (+x or +y), depending on whether our previous move was +x, +y, or +z respectively. In any case we have 2 choices. This gives us 2^23, which is not any better than before. However we can extend the idea: if we partition the space into planes normal to the x axis, then any time we extend in the +x direction, without loss of generality we can start moving in any of the four +/-y or +/-z directions, i.e. four choices. From there we can move +x again, or either +/- along the axis we didn't previously choose, i.e. 3 choices. In this way we never fold back to previous planes. From that point on we only have 2 choices: to continue moving in a straight diagonal, or move onto the next x plane. We can represent this as the following Markov chain:

[Markov chain transition matrix]
Refined lower bound = 85,149,351,936 (approx 4^18.15)

That's about as far as I could get without using numerical methods. (I also had a complicated, slightly improved, upper bound.) So I wrote a program to walk every possible combination, which backtracks whenever it finds a collision. Written as a single-threaded service to run in the background, this took about five months to execute. This gave the following answer:

13,535,886,319,159 = approx 4^21.81 combinations, including possible mirror image duplicates.
Update: 5th Aug 2022: Corrected to 13,446,591,920,995 = approx 4^21.81. See below.

To handle mirror images, each time the program encounters a complete snake it converts it to a normalized form as follows. The snake is represented as a string of 23 numbers (from 0 to 3). The string is then reversed. The two strings are compared lexicographically and the smaller is the normal form. If the string matches the normal form, then it is counted, otherwise it is skipped. This way potential reflections that appear twice are counted once, and symmetrical snakes also get counted once. Note that I'm interpreting mirror images here as applying the same turns from either end, not as the volume produced being a mirror image. For example, I'm counting a left-handed corkscrew snake and a right-handed corkscrew snake as distinct. But if you just twist one piece 180 degrees at the start or the end of the snake, then I count that as the same snake.
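
The normal-form rule can be sketched in a few lines. This is only an illustration of the counting rule described above, not the original program: `is_normal_form` and `count_without_mirrors` are made-up names, and the brute-force counter ignores collisions entirely, so it only makes sense for very short strings.

```python
from itertools import product

def is_normal_form(turns):
    """True when the turn string is the smaller of itself and its reversal."""
    turns = tuple(turns)
    return turns <= turns[::-1]

def count_without_mirrors(length, positions=4):
    """Count turn strings of the given length, one per reversal pair.

    No collision detection is done here, so every string is treated
    as a valid snake.
    """
    return sum(
        is_normal_form(turns)
        for turns in product(range(positions), repeat=length)
    )

print(is_normal_form((0, 1, 2)))   # smaller than its reversal (2, 1, 0)
print(is_normal_form((2, 1, 0)))   # its reversal is smaller, so it is skipped
print(is_normal_form((1, 3, 1)))   # symmetrical, so it is its own normal form
print(count_without_mirrors(3))
```

For strings of length 3 there are 64 candidates, of which 16 are symmetrical; the other 48 pair up, giving 16 + 24 = 40 counted strings.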

When ignoring reflections like this, the exhaustive search gives this result. Total number of possible snake configurations that don't overlap, and ignoring mirror repeats:

Final Answer = 6,770,518,220,623 = approx 4^21.31
Update: 5th Aug 2022: Corrected to 6,721,828,475,867 = approx 4^21.31. See below.

Behold. OK, it only took another two years to get around to writing this up and telling the world (mainly because I started writing up a detailed paper on the method, which quickly got boring). Hopefully anyone else who has tried this got the same answer.

So table of numbers: (Update: see below for correction)

Method                        Number                  Approx as exponent
Trivial lower bound           2,097,152               4^10.5
Refined lower bound           85,149,351,936          4^18.15
Final answer                  6,770,518,220,623       4^21.31    (Corrected: 6,721,828,475,867)
Exhaustive with duplicates    13,535,886,319,159      4^21.81    (Corrected: 13,446,591,920,995)
Refined upper bound           30,002,572,532,736      4^22.60
Trivial upper bound           70,368,744,177,664      4^23.00

Along the way I also discovered that there are about 64,546,391 (Corrected: 63,970,851) (approx 4^12.97) unique cyclic paths where the head and tail of the snake connect. Also, only a relatively small number of snakes are symmetrical.

I should probably go and get some sun, or social interaction, or something now.


Update: 5th Aug 2022

It turned out that my original calculation was wrong. I wrote another Snake Calculator program while playing around with another problem, and the results didn't quite match. The good thing about now having two independent programs is that each could be used to find and fix bugs in the other. The original program didn't quite handle an edge case correctly. Further inspired by hearing about "The Soul of the Snake", I've re-run the program on a full-length snake.

With both programs now agreeing for various length snakes, I now feel pretty confident about the following answer being correct for length 24:

Final Corrected Answer = 6,721,828,475,867 = approx 4^21.31

Method                        Number                  Approx as exponent
Trivial lower bound           2,097,152               4^10.5
Refined lower bound           85,149,351,936          4^18.15
Final answer                  6,721,828,475,867       4^21.31
Exhaustive with duplicates    13,446,591,920,995      4^21.81
Refined upper bound           30,002,572,532,736      4^22.60
Trivial upper bound           70,368,744,177,664      4^23.00

(Yes, the difference is less than 1%, so the exponents are still pretty much the same)

The Internet is now a little bit more correct and, to offset the drop in entropy, a server somewhere in an AWS data centre has been slightly warmed.

14 September 2011

Covariance contravariance inconvenience

I love the covariance/contravariance support in C# 4. That is the mechanism that lets you implicitly convert from an IEnumerable<Child> to an IEnumerable<Parent> as expected (covariance), and lets an IComparable<Parent> be converted to an IComparable<Child> (contravariance). Here Child obviously inherits from Parent.

One of the biggest problems though is that this only works for classes and interfaces where the type parameters are either all outbound or all inbound, for covariance and contravariance respectively. E.g. no methods on IEnumerable<T> accept T, which allows covariance. And no methods on IComparable<T> return T, which allows contravariance. Eric Lippert wrote a bunch of posts worth reading about this while the compiler team were considering variance.

This is somewhat inconvenient. For example it means that IList<Child> cannot be assigned to IList<Parent>, or vice-versa, because IList<T> methods both accept and return T. And often one may wish to write a method using IList<T> rather than IEnumerable<T> as it allows for direct element access. It is particularly frustrating because I may well only need to use the methods that read from the list, and so if IReadonlyList<T> existed then all might be sweet.

Now I was wrestling with this kind of thing the other day when I read Lippert's post on What is this thing called a Type, which got me thinking about it all again. Here he's again talking about how various operations transform one Type into another.

OK, to the point. Here's the idea I was thinking about: what if it were possible to create modified types, based on existing interfaces and classes, that only present the input or output methods?

For example (and to use Lippert's Giraffe inherits from Mammal inherits from Animal example):
IList<out Mammal> and IList<in Mammal> become types that are declarable in code.

IList<out Mammal> exposes only the subset of IList<Mammal> that is covariant. I.e. the methods for reading from the list, but none of the methods that accept T parameters. IList<out Giraffe> could be assigned to IList<out Mammal>. As a type, IList<out Mammal> itself would be smaller than IList<Mammal>, so it is always possible to convert from IList<Mammal> to IList<out Mammal>. So, finally IList<Giraffe> could be assigned to IList<out Mammal>.

Likewise, IList<in Mammal> exposes only the subset of IList<Mammal> that is contravariant, exposing only the methods for modifying a list. By the same (but contravariant) argument, IList<Animal> would be assignable to IList<in Mammal>.

Some code we could then write:

static void FeedMammals(IList<out Mammal> mammals)
{
    // mammals.Count is available as it doesn't use the type parameter.
    // Using an old-style for-loop to illustrate we don't want to use IEnumerable<T>
    for (int i = 0; i < mammals.Count; i++)
    {
        // only the getter is available, as it is outbound
        mammals[i].FeedMammalFood();
    }

    // But these are illegal:
    mammals[0] = new Mammal(); // setter
    mammals.Insert(0, new Mammal()); // method that accepts T
}


And call with:

List<Giraffe> giraffes = ...;
FeedMammals(giraffes); // yay, cast from List<Giraffe> to IList<out Mammal>


Similarly we could have:

static void AddBabyMammals(IList<in Mammal> mammals)
{
    // New baby born
    Mammal baby = GetNewBabyMammal();
    mammals.Insert(0, baby);

    // Setter is also OK to replace a value
    mammals[0] = baby;

    // But these would be illegal
    Mammal m1 = mammals[0]; // getter
    foreach (var m2 in mammals) { } // call to enumerator, because it's 'out'
}


And call with:

List<Animal> animals = ...;
AddBabyMammals(animals); // yay, cast from List<Animal> to IList<in Mammal>
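
The read-only/write-only split can be approximated in a language with structural typing. The following is a rough sketch in Python rather than C#, using `typing.Protocol`; the names `ReadableList` and `WritableList` are invented stand-ins for the proposed IList<out T> and IList<in T>, and the Animal/Mammal/Giraffe classes are toy versions of the ones above. The variance only matters to a static type checker such as mypy, which is expected (but not verified here) to accept both calls; at runtime Python does not enforce any of it.

```python
from typing import Protocol, TypeVar

T_co = TypeVar("T_co", covariant=True)
T_contra = TypeVar("T_contra", contravariant=True)

class ReadableList(Protocol[T_co]):
    """Stand-in for the proposed IList<out T>: only members that return T."""
    def __len__(self) -> int: ...
    def __getitem__(self, index: int, /) -> T_co: ...

class WritableList(Protocol[T_contra]):
    """Stand-in for the proposed IList<in T>: only members that accept T."""
    def insert(self, index: int, value: T_contra, /) -> None: ...
    def __setitem__(self, index: int, value: T_contra, /) -> None: ...

class Animal:
    pass

class Mammal(Animal):
    def __init__(self) -> None:
        self.fed = False

    def feed_mammal_food(self) -> None:
        self.fed = True

class Giraffe(Mammal):
    pass

def feed_mammals(mammals: ReadableList[Mammal]) -> None:
    # Only len() and the index getter are used, both of which are outbound.
    for i in range(len(mammals)):
        mammals[i].feed_mammal_food()

def add_baby_mammal(mammals: WritableList[Mammal]) -> None:
    # Only members that accept a Mammal are used, all of which are inbound.
    baby = Mammal()
    mammals.insert(0, baby)
    mammals[0] = baby

giraffes: list[Giraffe] = [Giraffe(), Giraffe()]
feed_mammals(giraffes)      # a list of giraffes read as a list of mammals

animals: list[Animal] = [Animal()]
add_baby_mammal(animals)    # a list of animals written to as a list of mammals
```

The difference from the proposal is that here the two views have to be declared by hand as separate protocols, whereas IList<out Mammal> and IList<in Mammal> would be derived from the existing interface automatically.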


This would be a wonderful feature. Anyway, that's my two cents.

It's been a while

...a very long long while since I made a post. One may say that after dipping my toe into the Interblag to test the water for a while, I forgot all about it. And then subsequently forgot my password, followed by the email address I used, the login for that email, and finally (or additionally) the name of the blog according to blogger (since the only thing I could remember was blog.ylett.com). Trying to remember if I used to be this forgetful.

Needless to say, the bar was slightly raised for making any subsequent posts. Over the years, every now and then an idea would pop into my mind, followed by "oh, I could blog about that... except I'll have to deal with this mess." Cue chirping crickets.

Problem now solved. Now all I have to do is to figure out how to get back down to just one Google account (since I somehow managed to get my gmail on a different account, and Google+ on a different one again). And think of something to write about.