Thursday, February 16, 2012

Intersecting Two Sorted Integer Arrays

An interesting problem I've run into recently is the following (I've tried to express it using Jon Bentley's convention):

Input: Two sorted integer arrays A and B in increasing order and of different sizes N and M, respectively.

Output: A sorted integer array C in increasing order that contains elements that appear in both A and B

Constraints: No duplicates are allowed in C

Example: For input A = {3,6,8,9} and B = {4,5,6,9,10,11}, the output should be C = {6,9}

I posted this question on Stack Overflow, along with my original solution, and got a lot of very helpful feedback. I was also pointed to a relevant use of this problem in the Sort-Merge Join algorithm used in relational databases.

Naive Approach
The naive approach to this problem would be to compare all possible pairs of elements from both arrays and output the ones that match. This would take O(N*M). Of course, this might be an acceptable approach if the arrays weren't sorted, but we can definitely use that property to our advantage.
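For reference, the naive pairwise comparison could be sketched as follows (the class and method names are mine, not from the post; note that the duplicate check still relies on the input arrays being sorted):

```java
import java.util.Arrays;

public class NaiveIntersect {
    // Naive O(N*M) sketch: compare every pair of elements.
    // The duplicate check (c[ci - 1] != x) relies on a being sorted.
    public static int[] intersect(int[] a, int[] b) {
        int[] c = new int[Math.min(a.length, b.length)];
        int ci = 0;
        for (int x : a) {
            for (int y : b) {
                if (x == y && (ci == 0 || c[ci - 1] != x)) {
                    c[ci++] = x;
                }
            }
        }
        return Arrays.copyOfRange(c, 0, ci);
    }
}
```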

A Better Approach
My original solution was to keep two pointers, one for each array, and scan the arrays from left to right in tandem, picking out elements that match. When the current element of one array is larger than that of the second array, we keep incrementing the second array's pointer until we either find the first array's current element or pass it (find one larger). I keep all matches in a separate array, which is returned once we reach the end of either input array.

This approach takes O(N+M) time, since we always scan the smaller array to the end and, in the worst case, go to the end of the larger array as well.

This is my Java implementation of this approach:
      public static int[] intersectSortedArrays(int[] arrayA, int[] arrayB) {  
           // worst case size for the intersection array  
           int[] c = new int[Math.min(arrayA.length, arrayB.length)];  
           int ai = 0, bi = 0, ci = 0;  
   
           while (ai < arrayA.length && bi < arrayB.length) {  
                if (arrayA[ai] == arrayB[bi]) {  
                     // check if element already exists  
                     if (ci == 0 || arrayA[ai] != c[ci - 1]) {  
                          c[ci++] = arrayA[ai];  
                     }  
                     ai++;  
                     bi++;  
                } else if (arrayA[ai] > arrayB[bi]) {  
                     bi++;  
                } else if (arrayA[ai] < arrayB[bi]) {  
                     ai++;  
                }  
           }  
   
           return Arrays.copyOfRange(c, 0, ci);  
      }  

Alternate Scenario
Some of the suggestions on Stack Overflow were to scan one of the arrays linearly, while using binary search to find a match in the second array. This would mean O(N*log(M)) time, if we scan A and, for each of its N elements, binary search on B (O(log(M)) per lookup).

So, when would O(N*log(M)) be better than O(N+M)? To find out, I made a 3D plot of the two functions and the running time in Mathematica:
Instead of using M directly, I represented it as a function of N, because, as you can see, the binary search approach gets better when M is a certain number of times larger than N. Specifically, when N is 1 million, the intersection point is roughly at k = 25. This means that only when M is larger than 25 million does it make sense to use the binary search method to intersect the two arrays in this case.

Also notice that for smaller N, the intersection point is slightly lower (somewhere in low 20s) and it's growing logarithmically with N. This is the benefit of having a 3D plot.

We can find the exact intersection point for N = 1 million using Wolfram Alpha:
Of course, this is only a theoretical hypothesis based on asymptotic analysis, i.e. it ignores the practical constants. The next step is to find the actual break-even point, using my implementations on my machine.
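As a quick sanity check on the theoretical crossover, here is a small sketch (the class and method names are mine, not from the post) that finds the smallest k where the binary search cost N*log2(k*N) drops below the linear cost N + k*N:

```java
public class BreakEven {
    // Smallest k where the binary-search cost N*log2(k*N)
    // drops below the linear-merge cost N + k*N.
    public static int findBreakEven(double n) {
        for (int k = 1; k <= 1000; k++) {
            double linear = n + k * n;
            double binary = n * (Math.log(k * n) / Math.log(2));
            if (binary < linear) return k;
        }
        return -1; // no crossover found in range
    }

    public static void main(String[] args) {
        System.out.println(findBreakEven(1_000_000)); // prints 24 for N = 1 million
    }
}
```

For N = 1 million this lands at k = 24, consistent with the k ≈ 25 intersection read off the plot.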

Binary Search Method Implementation
Before I can do real world experiments, I first need to implement the Binary Search Method. Since I already have the simple Binary Search implemented, the rest is easy:

      public static int[] intersectSortedArrays2(int[] arrayA, int[] arrayB) {  
           // worst case size for the intersection array  
           int[] c = new int[Math.min(arrayA.length, arrayB.length)];  
           int ci = 0;  
           for(int a : arrayA){  
                if(BinarySearch.search(arrayB, a) >= 0){  
                     if(ci == 0 || c[ci-1] != a){  
                          c[ci++] = a;  
                     }  
                }  
           }  
           return Arrays.copyOfRange(c, 0, ci);  
      }  
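The BinarySearch.search call above refers to my own implementation, which isn't shown in this post. A minimal sketch of what it is assumed to do (return a non-negative index on a hit, a negative value otherwise; java.util.Arrays.binarySearch would serve the same purpose) would be:

```java
public class BinarySearch {
    // Returns the index of key in the sorted array a, or -1 if absent.
    public static int search(int[] a, int key) {
        int lo = 0, hi = a.length - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2; // avoids overflow of (lo + hi)
            if (a[mid] == key) return mid;
            else if (a[mid] < key) lo = mid + 1;
            else hi = mid - 1;
        }
        return -1;
    }
}
```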


The Experiment
In order to compare the two methods with respect to the multiplication factor k, the number of times array B is larger than array A, I set up an experiment that runs both methods using different values for k. Hopefully, at some k, the Binary Search method overtakes the Linear intersection method.

Here is my Java code for the experiment setup:
      public void experiment(){  
           int N = 1000000, M;  
           int[] A = new int[N];  
             
           Random rand = new Random();  
             
           int k = 1;  
           while(k <= 200){  
                M = k*N;  
                int[] B = new int[M];  
                  
                for(int i = 0; i < M; i++){  
                     B[i] = rand.nextInt();  
                     if(i < N) A[i] = rand.nextInt();  
                }  
                  
                Arrays.sort(A); Arrays.sort(B);  
                  
                long tStart = System.currentTimeMillis();  
                int[] c1 = ArrayIntersect.intersectSortedArrays(A, B);  
                long t1 = System.currentTimeMillis() - tStart;  
                  
                tStart = System.currentTimeMillis();  
                int[] c2 = ArrayIntersect.intersectSortedArrays2(A, B);  
                long t2 = System.currentTimeMillis() - tStart;  
                  
                System.out.println(k + "," + t1 + "," + t2);  
                  
                assertArrayEquals(c1, c2);  
                  
                if(k == 1) k += 9;  
                else k += 10;  
           }  
      }  

In this experiment, the numbers are generated within the whole range of Integer, and we keep N (the size of array A) fixed at 1 million, while M (the size of array B) goes from 1 million to 200 million in increments of 10 million.

Plotting the results in MATLAB gives us the following chart:
From the graph above, the break-even point seems to be around k = 70. This means that when intersecting two sorted integer arrays, the Binary Search method performs better than the Linear intersection method when one array is roughly 70 times larger than the other (at least when the smaller array has roughly 1 Million elements, but this factor should not grow too fast with N, as we've seen from the theoretical 3D function plot earlier).

Thursday, February 2, 2012

Bitmap Sort

In Jon Bentley's Programming Pearls book, the first Column introduces a sorting problem. As we learn more about the problem and clearly define its constraints, the solution transitions from a disk-based Merge Sort to an efficient Bitmap Sort.

This algorithm uses a bitmap (or a bit vector) to represent a finite set of distinct integers. For example, if we have an integer range 0-5, we can represent it using a 6-bit array, e.g.
[2,3,5] becomes 0 0 1 1 0 1
[1,3,4] becomes 0 1 0 1 1 0

In order to sort an array of integers, we first need to initialize a bit array of size corresponding to the range and fill it with zeroes (in Java, this is the default value). Then we go through the input array and set the corresponding bit in our bitmap to 1 for each input integer. Finally, we can scan through the bitmap and output the number for each bit that's set to 1. Since we're scanning the bitmap in order, we print all the original integers in a sorted order.

Implementation
This sounds like a very simple algorithm that also runs in linear O(N) time! However, when I started to implement it in Java, a number of implementation details popped up, and the shortcomings of the algorithm also became apparent.

Most languages I know of won't let you simply define a bit array -- the smallest primitive type is usually the 8-bit byte. For a range N, we need to define a byte array of size N/8. And for each input number i, we set the (i mod 8)'th bit of the floor(i/8)'th element of the bitmap. For example, if we had the range 0-15 (N=16) and the input integer was 13, we would set the 5th bit of the 1st element of the bitmap (both zero-indexed).

When we iterate through the bitmap, we go through each bit of each element and print out the number i * 8 + j for every j'th bit set in the i'th element.

This is my Java code:
      public int[] sortDistinctIntegers(int[] a, int min, int max){  
           int N = (max-min) / 8 + 1;  
           byte[] bitmap = new byte[N]; //initialized to 0  
             
           for(int i = 0; i < a.length; i++)  
                 bitmap[(a[i]-min)/8] |= 1 << ((a[i]-min) % 8); // offset by min  
             
           int k = 0;  
           for(int i = 0; i < N; i++){  
                for(int j = 0; j < 8; j++){  
                     if((bitmap[i] & (1 << j)) > 0){  
                          a[k] = i * 8 + j + min;  
                          k++;  
                     }  
                }  
           }  
             
           return a;  
      }  

In order to save space, I use the original array to "print" the sorted numbers, by keeping the counter k. Also notice that I've defined the range with the min and max variables; I use min as the offset both when setting bits in the bitmap and when writing back the sorted integers. Note that this also works for negative integers.

Shortcomings
This is where the shortcomings become apparent. First, we need to know the range in advance. If we don't, we cannot use this method for 64-bit longs, since the range of the long type is larger than the largest allowable array size in Java (the maximum integer value). The maximum representable range would be ((2^31)-1)*8 -- a byte array of the maximum number of elements, each representing 8 numbers. Of course, we can increase the effective range by adding more dimensions.

Actually, the code above in its current form will not work for the maximum range between Integer.MIN_VALUE and Integer.MAX_VALUE, because the (max-min) integer calculation overflows. In order to fix that, we need to hack the expression so it evaluates the intermediate value as a long and converts back to integer after the division by 8 (this works because the division puts the range back into integer territory), so it will look like this:
 int N = (int)(((long)max-min) / 8 + 1);  
We will also need to do this in the bitmap location calculation, which will then look like this:
 bitmap[(int)(((long)a[i]-min)/8)] |= 1 << (int)(((long)a[i]-min) % 8);  

Second, even if we know the range, this method will be inefficient for sparse input (with very wide range and very small number of elements) -- and not just in space, since it takes a while to initialize and fill a 2 billion element byte array.
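As an aside, Java's standard library already packages exactly this bit manipulation in java.util.BitSet. A sketch of the same sort built on it (my own variant, assuming distinct inputs and a range max-min that fits in an int) would be:

```java
import java.util.BitSet;

public class BitSetSort {
    // Same idea as the hand-rolled bitmap: set one bit per distinct value,
    // then stream the set bits back out in increasing order.
    public static int[] sort(int[] a, int min, int max) {
        BitSet bits = new BitSet(max - min + 1);
        for (int v : a) {
            bits.set(v - min); // offset by min, as in the byte[] version
        }
        int k = 0;
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            a[k++] = i + min; // write sorted values back into the input array
        }
        return a;
    }
}
```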

Performance Analysis
To evaluate the performance of this algorithm, I compare it to Java's built-in sort function. My testing code is as follows:
      public static void main(String[] args) {  
           BitmapSort bs = new BitmapSort();  
             
           Random gen = new Random();  
           Set<Integer> numbers = new HashSet<Integer>();  
           int input[];  
             
           for(int N = 10; N <= 2000000000l; N*=10){  
                System.out.print(N + ",");  
                  
                while(numbers.size() < N){  
                     numbers.add(gen.nextInt());  
                }  
                  
                input = new int[N];  
                int i = 0;  
                for(Integer num : numbers){  
                     input[i] = num;  
                     i++;  
                }  
                  
                 int temp[] = input.clone(); // copy, so sorting temp doesn't also sort input  
                long t1 = System.currentTimeMillis();  
                Arrays.sort(temp);  
                System.out.print((System.currentTimeMillis() - t1) + ",");  
                  
                t1 = System.currentTimeMillis();  
                bs.sortDistinctIntegers(input, Integer.MIN_VALUE, Integer.MAX_VALUE);  
                System.out.println(System.currentTimeMillis() - t1);  
           }  
      }  

I use a HashSet to keep distinct integers generated from the Random class and then time Java's sort execution and my Bitmap Sort implementation.

The chart below shows my results:

Note that the code did not execute for input size larger than 10 million as the HashSet structure ate all the memory in the generation phase (even with 6GB allocation).



Java's sort performance is much faster for small sizes, but its running time grows quickly as we move into the millions. I have a feeling that Bitmap Sort might overtake Java's default sorting algorithm for larger input sizes (although I'm not sure whether Java switches algorithms at a certain input size).

This experiment was of course for the worst case of Bitmap Sort -- with integers within the maximum range. However, if the range is much better defined (as would be the case in a lot of practical situations), the algorithm should perform much better.

Running the experiment with numbers generated in the range between 0 and N * 2 gives the following results:



In this case, Bitmap Sort outperforms Java's system sort for all input sizes and looks like it scales better too!