Friday 13 February 2015

BINARY SEARCH.


A wheem away a wheem away a wheem away a wheem away.....In the Searches, the Binary Searches, the Data lurks today.......

The mighty Binary Search, the final part of Part One of "Sort and search algorithms using C#".

Let's put into perspective what it means to run a binary search on sorted data.
Assume that we have a data set one million entries long, and that reading one entry takes 1 millisecond.
Let's say that the data we are looking for is roughly in the middle (but we don't know that yet).
A straight search would take approximately 500,000 milliseconds, or 500 seconds, which is 8 minutes and 20 seconds. This happens because you have to read the data sequentially, either from the front or from the back, with each read and compare taking 1 millisecond.
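
For reference, a plain linear search might look something like the sketch below (my own illustration, not code from this series). It is those one-at-a-time reads and compares that add up to the 8 minutes and 20 seconds:

 // A plain linear search: read and compare one element per step.
 public int LinearSearch(int[] array, int key)
 {
    for (int i = 0; i < array.Length; i++)
    {
       if (array[i] == key) return i;   // found the key at index i
    }
    return -1;                          // the key is not in the array
 }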

Can we drop that substantially to a few milliseconds? Yes we can! Enter, the 'Binary Search'.

How this works is rather simple, actually. In the initial condition you find the mid point, start and end indexes. We will call the start and end indexes the left and right index. The value we want to find we will call the "key".

We start our binary search by comparing the key with the value at the mid index. If the key is bigger than the mid value, we move the left index to one more than the mid index; otherwise we move the right index down to the mid index. Either way we then recompute the mid index as the point halfway between the new left and right indexes, and compare again. Eventually we reach the case where left == right, and either we have found a match or we report that the data does not contain the value searched for. We are essentially halving the problem domain at each pass. So how does this look on a one million entry data set? Like this:

Initial problem domain size: 1,000,000
Problem domain size after 1st pass: 500,000
Problem domain size after 2nd pass: 250,000
Problem domain size after 3rd pass: 125,000
Problem domain size after 4th pass: 62,500
Problem domain size after 5th pass: 31,250
Problem domain size after 6th pass: 15,625
Problem domain size after 7th pass: 7,813 (we round up the decimal point)
Problem domain size after 8th pass: 3,907
Problem domain size after 9th pass: 1,954
Problem domain size after 10th pass: 977
Problem domain size after 11th pass: 489
Problem domain size after 12th pass: 245
Problem domain size after 13th pass: 123
Problem domain size after 14th pass: 62
Problem domain size after 15th pass: 31
Problem domain size after 16th pass: 16
Problem domain size after 17th pass: 8
Problem domain size after 18th pass: 4
Problem domain size after 19th pass: 2
Problem domain size after 20th pass: 1, and it is here that we either find the data or we say we can't find it.

If each comparison takes 1 millisecond, then the job is done in 20 milliseconds. Compare this with the linear search taking 8 minutes and 20 seconds.
We also only had to perform 20 passes instead of 500,000.
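
To sanity check that figure, here is a small sketch of my own (the method name PassesNeeded is just for illustration) that counts how many times the problem domain can be halved before a single element remains:

 public int PassesNeeded(int dataSetSize)
 {
    // Counts how many halving passes are needed until only one element is left.
    // For dataSetSize = 1000000 this returns 20, matching the table above.
    int passes = 0;
    while (dataSetSize > 1)
    {
       dataSetSize = (dataSetSize + 1) / 2;   // halve, rounding up like the table does
       passes++;
    }
    return passes;
 }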

Now for the binary search code itself!

 public int BinarySearch(ref int[] array, int key)
 {
    // Guard against an empty array so array[m] below cannot go out of range.
    if (array.Length == 0) return -1;

    int m = array.Length / 2;      // mid point index
    int l = 0;                     // left (start) index
    int r = array.Length - 1;      // right (end) index

    while (l < r)
    {
       if (key > array[m])
       {
          // The key can only be in the upper half.
          l = m + 1;
          m = (l + r) / 2;
       }
       else
       {
          // The key is in the lower half, or sitting at m itself.
          r = m;
          m = (l + r) / 2;
       }
    }

    // l == r: only one candidate left, check whether it is the key.
    if (array[m] == key) return m;
    else return -1;
 }

How the code works
At the initial stage m is set to half the array length. "l" and "r" are set to 0 and one less than the array length respectively.
The code then enters the while loop, which only runs while l is smaller than r. The key is compared with the value indexed by m. If the key is bigger, l is set to one more than the middle index; otherwise r is adjusted down to m. Either way a new m is calculated from the new problem domain range.
Notice that there is no key == array[m] check inside the while loop. It is not needed: whenever the key is less than or equal to array[m] we set r = m, so a matching element (if one exists) always stays inside the [l, r] range. The loop keeps shrinking that range until l == r, the loop exits, and the final test array[m] == key decides the outcome. If it is true, the index where the key was found gets returned; otherwise -1 is returned to indicate the key was not found.
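
As a quick usage sketch (my own example, which assumes the method above sits in a class I'll hypothetically call Searcher), note that because of the ref parameter the array has to be passed with the ref keyword:

 int[] data = { 2, 5, 8, 12, 16, 23, 38, 56, 72, 91 };   // has to be sorted already
 var searcher = new Searcher();

 int hit  = searcher.BinarySearch(ref data, 23);   // returns 5, the index of 23
 int miss = searcher.BinarySearch(ref data, 40);   // returns -1, 40 is not in the array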

The pro is that this search is very fast. The con is that the data has to be sorted first. This is also why I put this post at the end, after talking about the sorting algorithms first.

This was the last post for part one. Part two will focus on taking these same algorithms and making them accept all data types, including abstract types.

Thank you for reading and I hope things were clear.

Previous: Heap Sort part two.
Landing page: Part One.