Thursday, February 16, 2012

Intersecting Two Sorted Integer Arrays

An interesting problem I've run to recently is the following (I tried to express it using Jon Bentley's convention):

Input: Two sorted integer arrays A and B in increasing order and of different sizes N and M, respectively.

Output: A sorted integer array C in increasing order that contains elements that appear in both A and B

Contraints: No duplicates are allowed in C

Example: For input A = {3,6,8,9} and B = {4,5,6,9,10,11}, the output should be C = {6,9}

I posted this question on Stack Overflow, along with my original solution, and got a lot of very helpful feedback. I was also pointed to the relevant use of this problem in the Sort-Merge Join algorithm in RDBMS.

Naive Approach
The naive approach to this problem would be to compare all possible pairs of elements from both arrays and output the ones that match. This would take O(N*M). Of course, this might be an acceptable approach if the arrays weren't sorted, but we can definitely use that property to our advantage.

A Better Approach
My original solution was to keep two pointers, one for each array, and scanning the arrays from left to right interchangeably, while picking out elements that match. So when we the current element of one array is larger than the second array, we keep incrementing the pointer of the second array until we either find the current first array element or overpass it (find one larger). I keep all matched in a separate array, which is returned once we reach the end of either one of the input arrays.

This approach will take O(N+M), since we always scan the smallest array to the end and in the worst case go  the end of the largest array as well.

This is my Java implementation of this approach:
      public static int[] intersectSortedArrays(int[] arrayA, int[] arrayB) {  
           // worst case size for the intersection array  
           int[] c = new int[Math.min(arrayA.length, arrayB.length)];  
           int ai = 0, bi = 0, ci = 0;  
   
           while (ai < arrayA.length && bi < arrayB.length) {  
                if (arrayA[ai] == arrayB[bi]) {  
                     // check if element already exists  
                     if (ci == 0 || arrayA[ai] != c[ci - 1]) {  
                          c[ci++] = arrayA[ai];  
                     }  
                     ai++;  
                     bi++;  
                } else if (arrayA[ai] > arrayB[bi]) {  
                     bi++;  
                } else if (arrayA[ai] < arrayB[bi]) {  
                     ai++;  
                }  
           }  
   
           return Arrays.copyOfRange(c, 0, ci);  
      }  

Alternate Scenario
Some of the suggestions on Stack Overflow were to scan one of the arrays linearly, while using binary search to find a match in the second array. This would mean O(N*log(M)) time, if we scan A and for each of its N elements binary search on B (O(log(M)) time).

So, when would O(N*log(M)) be better than O(N+M)? To find out, I made a 3D plot of the two functions and the running time in Mathematica:
Instead of using M directly, I represented it as a function N, because as you can see, the binary search approach gets better when M is a certain number of times larger than N. Specifically, when N is 1 million, the intersection point is roughly at k = 25. This means that only when M is larger then 25 million would it make sense to use the binary search method to intersect the two arrays in this case.

Also notice that for smaller N, the intersection point is slightly lower (somewhere in low 20s) and it's growing logarithmically with N. This is the benefit of having a 3D plot.

We can find the exact intersection point for N = 1 million using Wolfram Alpha:
Of course, this is only a theoretical hypothesis using asymptotic analysis, i.e. we are ignoring the practical constants. Next step is to find out the actual break point, using my implementations on my machine.

Binary Search Method Implementation
Before I can do real world experiments, I first need to implement the Binary Search Method. Since I already have the simple Binary Search implemented, the rest is easy:

      public static int[] intersectSortedArrays2(int[] arrayA, int[] arrayB) {  
           // worst case size for the intersection array  
           int[] c = new int[Math.min(arrayA.length, arrayB.length)];  
           int ci = 0;  
           for(int a : arrayA){  
                if(BinarySearch.search(arrayB, a) >= 0){  
                     if(ci == 0 || c[ci-1] != a){  
                          c[ci++] = a;  
                     }  
                }  
           }  
           return Arrays.copyOfRange(c, 0, ci);  
      }  


The Experiment
In order to compare the two methods with respect to the multiplication factor k, the number of times array B is larger than array A, I setup an experiment that runs both methods using different values for k. Hopefully, at some k, the Binary Search method overtakes the Linear intersection method.

Here is my Java code for the experiment setup:
      public void experiment(){  
           int N = 1000000, M;  
           int[] A = new int[N];  
             
           Random rand = new Random();  
             
           int k = 1;  
           while(k <= 200){  
                M = k*N;  
                int[] B = new int[M];  
                  
                for(int i = 0; i < M; i++){  
                     B[i] = rand.nextInt();  
                     if(i < N) A[i] = rand.nextInt();  
                }  
                  
                Arrays.sort(A); Arrays.sort(B);  
                  
                long tStart = System.currentTimeMillis();  
                int[] c1 = ArrayIntersect.intersectSortedArrays(A, B);  
                long t1 = System.currentTimeMillis() - tStart;  
                  
                tStart = System.currentTimeMillis();  
                int[] c2 = ArrayIntersect.intersectSortedArrays2(A, B);  
                long t2 = System.currentTimeMillis() - tStart;  
                  
                System.out.println(k + "," + t1 + "," + t2);  
                  
                assertArrayEquals(c1, c2);  
                  
                if(k == 1) k += 9;  
                else k += 10;  
           }  
      }  

In this experiment, the numbers are generated within the whole range of Integer, and we keep N (size of array A) fixed at 1 million, while (size of array B) goes from 1 Million to 200 Million in increments of 10 Million.

Plotting the results in MATLAB gives us the following chart:
From the graph above, the break-even point seems to be around k = 70. This means that when intersecting two sorted integer arrays, the Binary Search method performs better than the Linear intersection method when one array is roughly 70 times larger than the other (at least when the smaller array has roughly 1 Million elements, but this factor should not grow too fast with N, as we've seen from the theoretical 3D function plot earlier).