Introduction
The algorithm described herein is designed to
implement sorting of variable length strings in O(n) time. The implementation sorts
an input file of strings. When processing the input file, this program uses two linked lists, one for the strings, and another for the attributes about the strings. This allows a more elegant implementation for processing strings of variable length, as opposed to static or
heap-dynamic arrays. Once the file has been read, the data is transitioned into the desired format, that is, a char array holds all the strings, and another array contains the start and size information for each string.
Algorithm Description
Sorting strings in O(n) time, requires a little
thought before it can be accomplished successfully. First, we note that a popular
and effective method for sorting integral keys is Radix sort[1]
. And we
also note that Radix
sort can be customized in a number of ways. In this case,
we are being asked to sort variable length strings, with possible values
being A-Z. That is, varying length keys, with Radix 26. The character '.' is not being considered a potentially value, and is used only to aid in the processing of the strings
from the input file.
To accomplish this particular sorting problem,
we need to first sort the strings based on their length, then build an array of
indexes corresponding to sub-lists of strings of a given length. This allows us
to sort alphabetically, considering those strings of length > w, were w is going
to range from k-1 down to 0. In this way, we explicitly consider only those strings that
are big enough to be considered with respect to size at each pass. This
directly implies that we will retain the O(N)
property of Radix
sort for the task at
hand.
Looking at the code, we see a properly coded
radix sort for sorting the strings based on their length. This routine explicitly
does not use linked lists for its buckets, as that may incur overhead of
run-time memory allocation for each bucket. We opt to dynamically
allocate a single array, and work from there. What we are doing is first
initializing an array, called counts to zero, then mapping the occurrence of the
hexadecimal digit i=k-1 down to 0, to its appropriate bucket and incrementing the
number of occurrences of a particular digit. We then pass through, and accumulate
the appropriate indexes, corresponding to correct positions in the array tmp[]
, based
on the w'th digit. We then map the elements to their respective buckets, using
the indexes we created. And when we are done, we concatenate the strings placing
them from tmp[]
into A[]
. When w == 0
, we will have performed a Radix sort based
on size.
Once we have the list sorted by size, we need to
obtain the starting indexes for each string of size > k-1 down to 0. This is
trivial, as we have a sorted list, we simply apply the counting
mechanism employed in the bucket sorting phase of Radix
sort, and then
calculate the indexes. These indexes, stored in sizes[]
, correspond to
sublists, of strings that can be sorted at each pass of the
alphabetical Radix
sort, for any given value of w, in the range of k -1 down to
0.
We then call a specialized Radix
sort to sort
alphabetically, this is done by considering ONLY elements of size > w at
any given time, the statement:
lower = sizes[w];
allows us to only consider the w'th character in that
sublist for strings large enough to be considered, that is, strings that actually have
a w'th character. When we are done, at each pass, we concatenate everything
back into A[]
, considering only those slots that range from lower to n -1,
as those are the ones we have sorted. When w ==0, the strings are now
sorted in alphabetical order. We also note that we have not changed the ordering
of size, therefore, the string ABS will always come before ABSOLUTE,
as desired.
NOTE: The obvious, and incorrect answers for
this problem are the following:
- Explicitly padding the strings to
be the size of the maximal length string results in O(n^2) runtime.
- Considering all the strings for
the w'th character when some strings are not large
enough. We could have tried to sort based on this, and
for example considered a zero character when strings are
too short; however, this is incorrect, as when w is
large, we may inadvertently sort strings which do not need to
be sorted. Consider when n-1 strings are size,1, and the n'th
string is size n. This approach would make the
algorithm O(n^2) as it effectively is the same as padding, without
the overhead of explicitly increasing size of
each string with zero characters.
Algorithm Analysis
Time Complexity
First, the radix sort based on string size,
takes O(n + k) time. In this particular implementation, k is 8 (8 * 4
=> 32 bit int when radix is 16). This is a constant independent of n, hence sorting the
strings based on their size via Radix
sort is O(n).
Second, we use a variation of bucket sort, to
obtain an array of starting addresses of elements of length 1,2,3,... k. This
is O(n).
Third, when we do the radix sort of the strings
based on alphabetical order, we have a list ordered by size, and an array of
starting addresses for each set of strings that happen to be > w in length,
thus, we proceed for w = k -1 (k -1 being the maximal length string,
LSD first) and only consider the set of strings > w, that is,
lower
= sizes[w];
When we are done, we have sorted the list. This routine is
O(n + k). Even though k is going to vary in length due to the fact that the strings are of varying length, k is treated as
constant with respect to doing the sort on the k'th character during each
pass as only individual sublists large enough to be considered are sorted at any
given value of k. We do not consider elements too small, that is, when A[i].T
< w, nor do we explicitly pad the strings to be of a fixed size.
When we are done at each pass, we concatenate the results into
A[] for only those strings that we have sorted. With this and the fact that k is independent of n, the total time
complexity for this routine is going to be
O(n).
Thus we have, O(n) + O(n) + O(n) for the size
radix sort, for the counting of the indexes, and for the alphabetical radix sort
respectively. This algorithm is O(n).
NOTE: The strings themselves are not touched
except for the consideration of the w'th character, which takes O(1) time to accomplish,
so it play no part in the sorting. To be more explicit, during the
sorting, we only access the strings by doing the operation CH(i,w) == C[A[i].S + w]
,
which takes O(1) time. We do not move the strings from their location in C[]
, only
their attributes A[]
are moved, as described in the handout.
Space Complexity
The algorithm uses O(n) space to store the
strings in C[]
. O(n) space to store the attributes of the strings in A[]
( starting
indexes and size ) and during each step of the algorithm in execution, that is,
for the steps described above, we use array's of either
O(k), O(n),
O(RADIX +1)
space for
sizes[]
,
tmp[]
, and
count[]
, respectively. The total space complexity is
O(n).
Using the Code
The algorithm uses a Radix-Sort[1]
to sort the strings based on size, with a Radix of 16, ie, Hexadecimal. The reason is, we can binary shift the bits by a factor of four and bitwise mask with the number 15 ( binary all ones in positions 3,2,1,0 respectively ) to obtain the
next digit in Hex ( see ./include/sort.h). This is considerably faster than using 10 as a Radix. But is machine-dependent, if you are using a BIG-ENDIAN or LITTLE-ENDIAN machine, kindly define
the appropriate included macro. If you don't, YOU WILL NOT BE ABLE TO COMPILE.
Files:
Makefile -- Makefile to build on Linux/Unix
f -- Sample input file
include -- Include directory containing headers
include/sort.h -- Header for sorting routines
include/list.h -- Header for linked list code
include/defs.h -- Defines used by this program
include/types.h-- Typedefs used by this program
list.c -- Linked list code for processing input file f
main.c -- function main
sort.c -- The Algorithm, and supporting code
References:
[1] Sedgewick, R. 1997. Algorithms in C, Parts 1-4:
Fundamentals, Data Structures, Sorting, Searching. 3rd Edition. Addison-Wesley