Introduction
I'm not sure if someone has published an article about big O notation in code project or not. In one of its newsletters it did tout
one: Big-O notation explained
by a self-taught programmer.
Actually, considering his background and what Justin covered, it was a very good job. I loved his example of using the gumball machine to show how the order of magnitude
is an indicator of size. I do wish he had gone into the math he used to come up with 200 gumballs. I also think it is a good read, covering the basic concepts
of big O (except for one small gaff).
One of his points was that big O is a predictor of performance. That, to me, is the main point for using it. I am going to expand his coverage to two other notations.
O(LogN) is fairly well documented in other articles. I've just seen passing mention of O(NLogN) without seeing an article about it. IE performance increasingly gets worse
for notations: 1, LogN, N, NLogN, and N^2. (N^3 would be worse yet, but I haven't seen any algorithms using that or articles talking about it.)
If you know of other (relatively common) notations, I'd like to hear about them.
Background
Two years ago, the only Big O I knew about was a Manga cartoon about a man and his
"cyborg". (Not accurate, but neither is robot. Another member of that show really is a robot .)
Then in a job quiz I was asked about big O. I admitted to not knowing anything about it and then looking it up on the web and explained what I had found out about it and what
I knew about it before reading about it.
Some false information is available on the web, basically because people aren't informed about what they are writing about. (It seems there is a lot of good and bad information on the web, on almost any subject.)
For instance, per web comments, O(LogN) requires the use of the natural log to make log calculations. That is completely false. I admit that a lot of text uses logn to refer to log base e (the natural log.) More often it is listed as "ln" or "loge". Just listing as a "log" generally refers to base 10. Either base can be used here but neither has a direct relationship with O(LogN), the direct relationship for O(LogN) is the log base 2 of N. The base has nothing to do with the predictive nature of LogN. Every log base that exists will give the exact
same prediction. (N1 takes "x" seconds, N2 should take ?)
I may have been told the reason for the definition of e and forgot it, but there doesn't seem anything natural or even reasonable about it to me. Wikipedia doesn't even explain where e comes from. I have yet to figure out what earthly good a natural log has over any other log (Other than being built-in with most calculators.)
UPDATE! Thanks to reader comments, I know where e comes from. I said the ratio of values will be the same no matter what log base number is used. The slope of the graph of a constant to a power at that power will be e^(loge(constant^power)). Again, any two logs using the same base when comparing two numbers will give you the same ratio. The natural log is immaterial in this calculation, the factor will be different, but the ratio is exactly the same.
Logs are based on the power of the base number. The thing about logs is if you want to multiply two values together, you add the two log values together then you take the base number to that power, it is the same as multiplying the two original numbers together without using logarithms. So when comparing numbers the ratio of the logs will be exactly the same no matter what base number for the log is used. IE 2 vs. 8, 10 vs. a thousand, a thousand vs. a billion are all one to three ratios whatever logarithm base number is used. 2 vs. 1024 is a one to ten ratio.
Justin has an O(N) example finding 1 value in an array of values. You can change it to LogN by passing a sorted array (instead of a random distribution) of values. Set the start(st) to 0 and the end(en) to the length minus 1. Verify the number is in range and not the end points and then keep getting the halfway index(mi = (st+en)/2) just like the Binary Sort logic does to split the array into sorted halves. You can return true if the number matches. If mi equals st and you haven't found a record return false If not set either st (low) or en (high) to mi to cut the list in half. If you have 1024 items you can find the answer in 10 steps. The same number of steps for any number of items over 512. With a Billion item list, you can find out in 30 steps if you have a match. That is a basic explanation of how O(LogN) works.
A log is a log, it works exactly the same if the base is e, 10, 2, or any other number greater than 1. The log of 2 in base 2 is 1, 4 is 2 and 8 is 3. Take the natural log of 2, multiply by 3, raise e to that power, the answer is 8. That will always be the answer no matter which log base number you use.
One O notation that is hardly ever mentioned is NlogN. This is N times the log of N and it seems to be more common than you'd think because of its coverage in big O articles.
Using the algorithm
I'm not going into detail about the code, but I will supply code that runs O(N^2) (Bubble sort squares N to process. You double N, the time should take 4 times as long to process) and code that runs O(NLogN) (Binary sort).
The following chart displays test statistics showing the performance differences between the Binary Sort and the Bubble Sort.
"Number" is the number of int fields I put in an array. I stopped testing the Bubble sort after 200K items. (It consistently quadrupled times when items doubled, takes well over 2 minutes at 200K while the binary sort takes 4 hundredths of a second to sort the same amount.) I was going to find out how many items could be sorted in the same 2 minute time limit using the binary sort but ran out of juice (RAM) at 200M. Basically the binary sort is an NlogN process. IE N times the logN. Like the article says, it is a measure of the maximum time it will take. On my machine using the base 10 log, multiplying N * Log * 3.84E-8 is an estimate in seconds for sorting using the base 10 logN. If you wanted the number multiplier for a different log base, use the same math to get that base value. (IE The natural log will come up with a different number that you would use in place of 3.8...)
I estimate my machine could sort 150M with a bubble sort in 2.5 years.
Update! Quicksort was a link provided by a user. It is another term describing the binary sort and goes into more detail about the algorithm. It uses some math I don't really understand, but it shows a maximum performance hit being 39 percent slower than optimum. IE if you re-run the sort on a sorted array, it should take between 72 and 100 percent of the time the original sort of a randomly filled array would take. I would guess a heavy probability of less than 85%.
You should be able to easily tweak the test software to run a second sort test on an already sorted array to see how much faster it is.
Number: Number of items being sorted
Bubble: Time to sort using bubble sort
Binary: Time to sort using Binary Sort algorithm
Note that I stopped Bubble testing at 200K,
I estimate 150M (750 times larger) would take 79M seconds (2.5 years) on my machine.
LOG(Num;10)*Num: Basically what it says, the log base 10 of the Number times the Number (NLogN)
Secs/(LOG(Num;10)*Num): The ratio of the time taken for the binary sort divided
by NLogN. If exactly that kind of process, this should be constant.
(IE a sorted array should sort faster because it will consistently divide
in half. The random distribution shouldn't divide exactly in half
but it should be close enough to look constant.)
On my machine, a good estimation using base 10 is 3.8E-8 or 0.000000038. The time only advances in 1 millisecond increments on my machine so the first two bad results are because
of timing round-off errors. Note that a different log base would have a different constant, but it should consistently provide the same estimate. IE The base 2 log can be calculated by dividing the base 10 log N by the base 10 log of 2.(Close to .3)
Number | Bubble | Binary | LOG(Num;10)*Num | Secs/(LOG(Num;10)*Num) | BinSecs |
1,000 | 3ms | 0 | 3000 | 0.00E+000 | 0.000 |
10,000 | 304ms | 2ms | 40000 | 5.00E-008 | 0.002 |
100,000 | 30.667s | 19ms | 500000 | 3.80E-008 | 0.019 |
200,000 | 140.418s | 40ms | 1060206 | 3.77E-008 | 0.040 |
1,000,000 | | 228ms | 6,000,000 | 3.80E-008 | 0.228 |
10,000,000 | | 2.645s | 70,000,000 | 3.78E-008 | 2.645 |
90,000,000 | | 27.481s | 715,881,825.85 | 3.84E-008 | 27.481 |
150,000,000 | | 46.372s | 1,226,413,688.86 | 3.78E-008 | 46.372 |
Points of Interest
Don't try writing code that is provided by the system for free unless you have a very good reason or are playing. (I was playing.
Array.Sort
was consistently 1.5 times faster than my binary sort. IE it too is O(NLogN) but more efficient.) Justin's gaff? Writing about O(1) and then mentioning having a million items not impacting it. O(1) can't process N items because the best you can do then is O(LogN). If you apply O(1) to a million items, it becomes O(N). The math for the gumball globe supports Justin's 200 number. I counted 12 gumballs going across. Circumference is 2 pi R, so R is approximately 4. Except that 12 gumballs weren't going across in a straight line, the spherical nature of the gumball means you can't pack them in without air-space, so make R 3.5
I had to look up the formula for a sphere – 4/3 pi r^3. Rounding again, >170 gumballs for a Radius of 3.5. R of 4 is >250. An order of magnitude wag of 200 is pretty good. When I first read about the binary sort, that author said it didn't do so well with a sorted list and I've forgotten the details of how he did it, but he basically picked the first item in the list as his middle value. (Making an already sorted list an N^2 process) I said to myself "To heck with that, make it work well with sorted or unsorted lists". I also figured, instead of moving all the smaller numbers to the left, just assume it will hit the middle and then shift the middle value left or right.
Code
The following is the C# code for the Bubble Sort, an alternate Bubble Sort (one forth the loops but still N^2,) the Binary Sort routine and testing code (could be run on a console application).
public static void BubbleSort(int[] ar)
{
for (int pass = 1; pass < ar.Length; pass++)
for (int i = 0; i < ar.Length - 1; i++)
if (ar[i] > ar[i + 1])
Swap(ar, i);
}
public static void AltBubbleSort(int[] ar)
{
int endpoint = ar.Length, endOfFrst = ar.Length / 2 + 1;
for (int i1 = 0; i1 < endOfFrst; i1++)
{
endpoint--;
for (int i2 = i1; i2 < endpoint; i2++)
{
if (ar[i2] > ar[i2 + 1])
Swap(ar, i2);
if (ar[i1] > ar[i2])
Swap(ar, i1, i2);
}
}
}
private static void Swap(int[] ar, int first)
{
int hold;
hold = ar[ first ];
ar[first] = ar[first + 1];
ar[ first + 1 ] = hold;
}
private static void Swap(int[] ar, int i1, int i2)
{
int hold = ar[i1];
ar[i1] = ar[i2];
ar[i2] = hold;
}
public static void BinSort(int[] ar, int start, int end)
{
int middle, middlel, middlevalue, hold, left, right;
if (start > end)
{
hold = start;
start = end;
end = hold;
}
if (start < ar.GetLowerBound(0)) start = ar.GetLowerBound(0);
if (end > ar.GetUpperBound(0)) end = ar.GetUpperBound(0);
middle = start + (end - start) / 2;
left = middle - 1;
right = middle + 1;
if (ar[start] > ar[end]) Swap(ar, start, end);
if (ar[middle] > ar[start] && ar[middle] > ar[end]) Swap(ar, middle, end);
else if (ar[middle] < ar[start] && ar[middle] < ar[end]) Swap(ar, middle, start);
middlel = middle;
middlevalue = ar[middle];
bool finished = false;
while (!finished)
{
while (left >= start && ar[left] < middlevalue) left--;
while (right <= end && ar[right] > middlevalue) right++;
if (left >= start && ar[left] == middlevalue) Swap(ar, left--, --middlel);
if (right <= end && ar[right] == middlevalue) Swap(ar, right++, ++middle);
if ((left >= start && ar[left] > middlevalue) &&
(right <= end && ar[right] < middlevalue))
Swap(ar, left--, right++);
if (left < start)
{
if (right <= end && ar[right] < middlevalue)
{
Swap(ar, middlel, ++middle);
if (right != middle)
Swap(ar, right, middlel++);
else middlel++;
right++;
}
if (right > end) finished = true;
}
else if (right > end)
{
if (left >= start && ar[left] > middlevalue)
{
Swap(ar, middle, --middlel);
if (left != middlel) Swap(ar, left, middle--);
else middle--;
left--;
}
if (left < start) finished = true;
}
}
if (middlel - start > 1) BinSort(ar, start, --middlel);
else if (ar[middlel] < ar[start]) Swap(ar, middlel, start);
if (end - middle > 1) BinSort(ar, ++middle, end);
else if (ar[middle] > ar[end]) Swap(ar, middle, end);
}
static void testSort(string[] args)
{
DateTime start = DateTime.Now;
List<int> arraySize = new List<int>();
int size;
if (args.Length > 0)
{
foreach (string a in args)
if (int.TryParse(a, out size)) arraySize.Add(size);
}
else arraySize.Add(20000);
foreach (int sizes in arraySize)
{
int[] testArray3 = GenRandVals(sizes), testArray2 = new int[2],
testArray1 = new int[2], testArray4 = new int[2];
TimeSpan span = new TimeSpan(DateTime.Now.Ticks - start.Ticks);
Console.WriteLine(string.Format(
"Time to generate and copy arrays: {0} of size: {1}", span.ToString(), sizes));
start = DateTime.Now;
List<int> listArray = new List<int>(sizes);
if (sizes <= 200000)
{
testArray2 = CopyArray(testArray3);
testArray1 = CopyArray(testArray3);
testArray4 = CopyArray(testArray3);
for (int i = 0; i < sizes; i++) listArray.Add(testArray3[i]);
span = new TimeSpan(DateTime.Now.Ticks - start.Ticks);
Console.WriteLine(string.Format("Time to load Arrays: {0}, Length {1}",
span.ToString(), listArray.Count()));
start = DateTime.Now;
listArray.Sort();
span = new TimeSpan(DateTime.Now.Ticks - start.Ticks);
Console.WriteLine(string.Format(
"Plus time to sort List<int> listArray: {0}", span.ToString()));
start = DateTime.Now;
BubbleSort(testArray1);
span = new TimeSpan(DateTime.Now.Ticks - start.Ticks);
Console.WriteLine(string.Format(
"Time to run original BubbleSort: {0}", span.ToString()));
start = DateTime.Now;
AltBubbleSort(testArray2);
span = new TimeSpan(DateTime.Now.Ticks - start.Ticks);
Console.WriteLine(string.Format(
"Time to run alternate BubbleSort logic: {0}", span.ToString()));
int misMat = 0;
for (int i = 0; i < sizes; i++)
if (testArray1[i] != testArray2[i]) misMat++;
if (misMat > 0)
Console.WriteLine("Alternate bubbleSort routine has " +
"an error. {0} missmatches found", misMat);
start = DateTime.Now;
SortedLinkedNumbers xxx, yyy = CopyLinkedArray(testArray3);
span = new TimeSpan(DateTime.Now.Ticks - start.Ticks);
Console.WriteLine(string.Format(
"Time to run SortedLinkedNumbers Sort logic: {0}, Length {1}",
span.ToString(), sizes));
}
start = DateTime.Now;
BinSort(testArray3, 0, sizes - 1);
span = new TimeSpan(DateTime.Now.Ticks - start.Ticks);
Console.WriteLine(string.Format(
"Time to run binary Sort logic: {0}, Length {1}", span.ToString(), sizes));
start = DateTime.Now;
if (sizes <= 200000)
{
int highCount = 0, lowCount = 0, misMatch = 0;
for (int i = 0; i < sizes; i++)
{
if (testArray3[i] < testArray4[i]) lowCount++;
if (testArray3[i] > testArray4[i]) highCount++;
if (testArray3[i] != testArray2[i] || testArray1[i] != testArray3[i])
misMatch++;
}
Console.WriteLine(string.Format("{0} Values were moved earlier in the array, " +
"{1} were moved later {2} didn't move.",
lowCount, highCount, sizes - (highCount + lowCount)));
if (misMatch > 0)
Console.WriteLine(string.Format(
"Error in sorting - array didn't exactly match. Error count {0}.", misMatch));
}
if (sizes > 5000000)
{
testArray3 = null;
Thread.Sleep(20);
testArray3 = GenRandVals(sizes);
start = DateTime.Now;
Array.Sort(testArray3);
span = new TimeSpan(DateTime.Now.Ticks - start.Ticks);
Console.WriteLine(string.Format("Time to run Array Sort logic: {0}, Length {1}",
span.ToString(), sizes));
testArray3 = GenRandVals(10);
Thread.Sleep(20);
}
}
}
public static int[] GenRandVals(int sizeArray)
{
if (sizeArray <= 0) throw (new Exception(string.Format("GenRandVals(int sizeArray)
called. sizeArray MUST be > 0. Value= {0} sent.", sizeArray)));
int[] vals = new int[sizeArray];
int range = int.MaxValue, diff = range / 2;
Random rand = new Random();
for (int i = 0; i < sizeArray; i++)
vals[i] = rand.Next(range) - diff;
return vals;
}
public static int[] CopyArray(int[] ar)
{
int len = ar.Length;
int[] Cpy = new int[len];
for (int i = 0; i < len; i++)
Cpy[i] = ar[i];
return Cpy;
}
History
- Spent some time tweaking the article before asking to publish. Asked for publish review 9/23/2013.
- Based on feedback and links provided in feedback changed lower case o to upper case. Corrected BinSort logic to allow other than beginning index of 0 and remove the possible int overflow finding the middle index.
Republished 9/30/2013.
- Added more detail in the article from feedback on the article. Publish 10/1/2013