|
You should try and take advantage of all available information...
|
|
|
|
|
Create a 100 bit array, set a bit by the value if that bit is not already set, otherwise report the duplicate number.
Dave.
|
|
|
|
|
Sorry to double post, the last reply seemed to hang. I also edited the response:
Create a 100 bit array, set a bit by each value if that bit is not already set, otherwise report the duplicate number.
Dave.
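[A sketch of the bit-array idea in C; the function name, signature, and the assumption that the values run 1..100 with exactly one duplicate are illustrative, not code from the thread:]

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the bit-array idea, assuming values are 1..100 with one of
   them duplicated: mark each value's bit; a bit that is already set
   exposes the duplicate. */
int find_duplicate_bitmap(const int *a, int n)
{
    uint32_t bits[4];                  /* 128 bits, enough for 1..100 */
    memset(bits, 0, sizeof bits);
    for (int i = 0; i < n; i++) {
        int v = a[i] - 1;              /* map value 1..100 to bit 0..99 */
        if (bits[v / 32] & (1u << (v % 32)))
            return a[i];               /* bit already set: the duplicate */
        bits[v / 32] |= 1u << (v % 32);
    }
    return -1;                         /* no duplicate present */
}
```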
|
|
|
|
|
Dave,
Thanks for the response. Your solution is basically the same as mine, except that you use an array of bits where I use an array of booleans. For n numbers, we both need an additional array of n elements. The person interviewing me claims it can be done with O(1) additional space and in time faster than O(n log n). I still do not see how.
Bob
|
|
|
|
|
I suggest you have a better look at my earlier reply...
|
|
|
|
|
Bob,
I dug back 35 years and came up with the answer. Since the numbers are consecutive, sum 1 to 100 into a single location, then clear another location and add up all the numbers in the array. That second sum will exceed the calculated result by the value of the duplicate number.
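[The summing trick might look like this in C; the function name and the assumption that the array holds 1..100 plus one extra, duplicated entry (101 values) are mine, for illustration:]

```c
#include <stddef.h>

/* Sketch of the summing trick, assuming the array holds the numbers
   1..100 plus one extra, duplicated value (101 entries in all). The
   actual sum exceeds the expected sum 1+2+...+100 by the duplicate. */
int find_duplicate_by_sum(const int *a, size_t n)
{
    long expected = 0, actual = 0;
    for (long v = 1; v <= (long)n - 1; v++)
        expected += v;                 /* sum of 1..(n-1), i.e. 1..100 */
    for (size_t i = 0; i < n; i++)
        actual += a[i];
    return (int)(actual - expected);   /* excess is the duplicated value */
}
```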
This was my solution to a problem in the 60's, when we booted the mainframe with a deck of binary cards. The decks became huge (more than half a tray of cards), and operators would drop the deck and spend hours trying to ensure it was back in order (smart operations people always kept a spare deck, just in case). The solution was to initialize a location with 0 and then, for each card read, increment the location. That location was accumulated into a sum, and another location was summed with the actual card sequence numbers. If the two sums came out the same, then each card had been read once and once only. Appropriate measures had to be taken if a card mis-read caused a backspace and reread (don't consider the card read until you got a good status). The only downside was that we had to ensure that overlay loading was not attempted (you could not use the ORG directive to back up into a previously defined table and then restore it), since you could never be sure when, in the sequence of loading, any particular location was loaded.
Quite soon after that we went to booting from a tape. Note: we saved an autoload file that was the startup image, and there was a one-card boot that could read in that image instead of the whole boot deck, but when necessary the deck had to be used, at least until the tape boot was implemented.
Dave.
|
|
|
|
|
Right, summing all the numbers is the cheapest and fastest way to identify the duplicate number.
The sum trick only works because you know the range of valid numbers; if the problem were "we have 100 different numbers and 1 got duplicated", then you would need a bitmap. Hence: use all available information. BTW, to sum the numbers 1..N you don't even need an accumulator; the sum equals N*(N+1)/2, which for N = 100 gives 5050.
As for the cards: the punch cards I used a long time ago had 80 columns; columns 73-80 were not relevant to the program, so they typically got used for numbering the cards. And sorting machines were available, so dropping a tray only cost a few passes through the sorting machine to get everything back in order. IIRC they had around 100 output bins, so for decimal digits they could handle two digits at a time.
|
|
|
|
|
Luc,
As "Murphy" predicted, the tray usually got dropped in the middle of the night, when the Tab room (with the sorter) was locked up, so the routine was to just put the cards back in some order and correct the order when an out-of-order card was read. The summing solution did the trick, at least until tape boot was perfected.
Dave.
|
|
|
|
|
This sounds like a COBOL deck - it wasn't true of other decks.
I once did some AlgolW stuff on an IBM 360 where all 80 columns could contain code.
But I still think your first reply has it spot on - do a sum and use all available information.
Regards
David Rice
|
|
|
|
|
So you are the first who really read my replies in this thread...
Actually I used punch cards only for Fortran (Fortran IV, WATFOR, WATFIV) programs;
they had an optional C in column 1 for comment cards, an optional statement number in columns 2-5,
an optional continuation character in column 6, the actual statement(s) in columns 7-72, and
free comment in columns 73-80, normally used for sequentially numbering the cards.
|
|
|
|
|
I'd forgotten about Fortran - I last used it in 1975.
Not with punched cards, but with a paper tape reader on a Teletype terminal.
|
|
|
|
|
The only program I ever used paper tape for was a PDP-8 assembly program on a machine without a
disk; so the "development cycle" consisted of:
- load editor
- load source
- edit
- save source
- load assembler
- load source
etc etc
Things were simple back then; now they are comfortable but complex.
|
|
|
|
|
First of all, I don't claim to have THE solution to your problem.
You've stated there are two possible approaches: 1. sorting the array, 2. keeping a separate bit array. I think the actual solution combines both approaches.
When you're sorting an array, there are two basic operations:
1. comparisons
2. swaps
Let's suppose your algorithm looks like this:
main loop {
    int i = ...;
    int j = ...;
    if (A[i] > A[j])
        swap(A[i], A[j]);
}
However, since you only have to find the repeated element, you might change the algorithm so it looks like this:
main loop {
    int i = ...;
    int j = ...;
    if (A[i] == A[j])
        return A[i];
    else if (A[i] > A[j])
        swap(A[i], A[j]);
}
Provided the sorting algorithm you use is in-place (for example, heapsort), you only need O(1) additional space.
Even though the worst case is still O(n log n), the average case's complexity HAS TO (read: might... hehe) be lower, since you stop the algorithm once the repeated element is found.
Now, you have to choose the sorting algorithm that gives you the best average case while still being O(n log(n)) in the worst case. And that's way too advanced computer science for me.
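[A toy illustration of the early-exit idea, using insertion sort purely for brevity (it keeps the O(1) extra space but not the O(n log n) worst case; heapsort would give both). The function name is illustrative:]

```c
/* In-place sort that stops as soon as a comparison sees two equal
   elements; insertion sort keeps the sketch short, but Eduardo's point
   applies to any in-place comparison sort such as heapsort. */
int find_duplicate_sorting(int *a, int n)
{
    for (int i = 1; i < n; i++) {
        for (int j = i; j > 0; j--) {
            if (a[j - 1] == a[j])
                return a[j];           /* comparison found the repeat */
            if (a[j - 1] < a[j])
                break;                 /* element is in place */
            int t = a[j - 1]; a[j - 1] = a[j]; a[j] = t;  /* swap */
        }
    }
    return -1;                         /* no duplicate present */
}
```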
Hope that helps,
Eduardo
-----------------------------------------------------------------------
EDIT: Wow. I just read the sum one. Wow. Forget everything I said.
|
|
|
|
|
Computing the sum is an easy approach if there is only one duplicate number (though I prefer to formulate the problem as having a missing number rather than a duplicate). Now for the fun part:
Assume there are 65534 numbers in the range 0 to 65535 without duplicates, so two numbers are missing. Can you write a program to identify both missing numbers on a single pass through the data, that could run on a machine with 64K of code storage and 256 bytes of data storage (the proper solution shouldn't require anywhere near 4K, much less 64K; if your code size isn't ridiculous assume it will fit)?
What if three numbers are missing? How about four?
|
|
|
|
|
supercat9 wrote: Can you ...
2 yes
3 provisional no
|
|
|
|
|
3 yes
4 probably
|
|
|
|
|
I wonder what the 'best' approaches are? The amount of state one would have to keep wouldn't be large (256 bytes of state would be enough state to allow for over 100 missing numbers) if one had unlimited time to convert the state into a list. One method would be to use two bytes to explicitly keep track of whether the values 65521-65535 had been observed, and for all other values compute the sums (x mod 65521), (x^2 mod 65521), (x^3 mod 65521) ... (x^n mod 65521). I believe the resulting combination of sums would uniquely identify the missing numbers, though I have no idea how one would determine which numbers were missing except via trial and error.
I have a different approach which would use somewhat more storage (256 counters capable of counting up to the number of missing items) but I don't know how many missing numbers it would be able to support before the count patterns ceased to be unique. The count patterns are easy to resolve when n=2; when n=3, they are more complicated to resolve, but it will always be possible. For n=4, I think that remains true. Not sure about n=5.
What's your approach?
|
|
|
|
|
You could replace 65521 (the largest prime in range) by 65413 (the next prime) with the same result, forgoing the extra counters/flags. Intuitively I agree summing powers mod prime would hold all the information, and extracting it seems not obvious.
However IMO all that is way too complex. Here is mine:
you can accumulate the sum, that's good enough for 1 missing/extra number.
For two numbers, you can accumulate the sum and also the product, like so:
- treat zero separately (with a simple flag);
- use a double ("d0") to multiply all numbers from 1 to 65535
- use a double ("d2") to multiply all numbers present
- rescale both to prevent overflow (say divide both by 65536 each time d0 exceeds 65536)
- finally divide d0 by d2 (or d2 by d0) to get the product.
- even though d0 and d2 are both inaccurate (they lost most of the digits), their quotient
accurately represents the product of the missing (or extra) numbers.
With a 32-bit mantissa you can discriminate all possible products that have the same sum, the
extreme case is 65532*65535 against 65533*65534.
Having sum and product, a simple equation generates the two numbers.
For three numbers (say x y z):
- accumulate the sum x+y+z
- accumulate the product xyz
- accumulate the product (x+1)(y+1)(z+1)
- the products need 48-bit mantissa, a double still is OK
- a simple substitution yields xy+yz+zx
From these three (x+y+z, xy+yz+zx, xyz) you can find x, y and z either analytically,
or by trial and error: just try all possible values of z (1..65535) and solve what remains
for x and y. Most z values will have no solution; you end up finding six permutations of the
one real solution.
For four numbers, you need a 64-bit mantissa, which a double does not offer, so you need a higher-precision math package. Then accumulate the product and two shifted products, in the end leading to a fourth-order set of equations for which I haven't yet figured out whether there is an acceptable way to solve it.
|
|
|
|
|
Your approach for n=3 seems to devolve into trial and error; the same sort of trial and error would allow for a solution using the (mod 65521) maths. Using a smaller modulus would require having more flags to handle the numbers between the modulus and 65535.
A nice (but sub-optimal) bit-based approach for handling the two-number case would be to keep sixteen words, holding the xor of all the items in the list with bit 0 set, all the items with bit 1 set, etc. At least one of the numbers must have at least one bit set which is clear in the other number, and so that number may be read out easily.
For the case of n=3, a similar approach could work, except that instead of doing an xor one would have to hold a count of how many numbers had the various bits set. If there is exactly one missing number which had a particular bit set, that number may be easily identified. With three numbers, though, it's possible that a number of bits will each be present in two missing numbers and none in exactly one. This situation may be resolved, however, by noting that there must be exactly three missing numbers, and so if there are three different patterns of missing bits they must derive from the missing numbers. I don't quite know how to describe the analysis algorithmically, but it works out.
I think a similar approach will work with n=4, though it starts to get messier. I have no idea about n=5 and beyond.
|
|
|
|
|
supercat9 wrote: Your approach for n=3 seems to devolve into trial and error; the same sort of trial and error would allow for a solution using the (mod 65521) maths. Using a smaller modulus would require having more flags to handle the numbers between the modulus and 65535.
I tend to disagree. First of all there is an analytical solution for one root of a cubic equation;
second, when a trial loop over z is executed, what is left is a simple quadratic equation, so each z-value needs an IsSquare(discriminant) test, which is much more straightforward than your modulo-prime arithmetic.
I did look into xoring the numbers for n=2 earlier, but decided against it, the product being very
easy to handle. I am not sure I see how xoring helps you for n=2 or higher. For n=2, if the two
missing numbers are identical except for 1 bit, then it's easy; otherwise...
Just try numbers 0-7 with missing 2 and 3 (easy), 2 and 5 (now what?).
|
|
|
|
|
For the n=2 case, if there were 16 xors produced (xor of all numbers with bit 0 set, with bit 1 set, etc.) then if the missing numbers were 2 and 5, then xor[0] and xor[2] would both be 5, while xor[1] would be 2. Easy.
If the numbers shared a bit but each had at least one bit the other lacked (e.g. the values were 3 and 5), then the xors would have three non-zero values: xor[0] would be 6 (3 xor 5), xor[1] would be 3, and xor[2] would be 5. The simplest thing to do in that scenario is to ignore the non-zero xor value which does not have the 'indicated' bit set (i.e. xor[0] in this case, since the value 6 has bit 0 clear). The other two non-zero xor values are the missing numbers.
If one number is a subset of the other (e.g. 2 and 3), then one of the non-zero xors will have its corresponding bit clear (e.g. xor[1] will equal 1, which has bit 1 clear) and the other will have it set. The one that has the bit set is one of the missing numbers; xor that with the other one to get the other missing number.
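[The sixteen-xor-words scheme for n=2 might be sketched like this over a small range 0..N-1 (function name and signature are mine, and the buckets are primed with the full range's contribution so only the missing pair survives). Any bit set in a^b belongs to exactly one of the pair, and that bit's bucket then reads that number out directly:]

```c
/* bucket[b] holds the xor of every value whose bit b is set, expected
   range minus observed values; "all" similarly holds a ^ b for the two
   missing numbers a and b. Assumes exactly two numbers are missing. */
void find_two_missing_xor(const int *a, int count, int N, int *x, int *y)
{
    unsigned bucket[16] = {0}, all = 0;
    for (int v = 0; v < N; v++) {          /* expected contributions */
        all ^= v;
        for (int b = 0; b < 16; b++)
            if (v & (1u << b)) bucket[b] ^= v;
    }
    for (int i = 0; i < count; i++) {      /* cancel what was observed */
        all ^= a[i];
        for (int b = 0; b < 16; b++)
            if (a[i] & (1u << b)) bucket[b] ^= a[i];
    }
    int k = 0;                             /* any bit set in a ^ b ... */
    while (!(all & (1u << k))) k++;
    *x = bucket[k];                        /* ... isolates one number */
    *y = all ^ bucket[k];
}
```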
For n=3, one will have a count of bit values, mod 4. Bits that are missing from none of the numbers or from all three will be trivial to place accurately. Bits that are missing from exactly one of the numbers will be easy to place; if any such bits exist, the problem then decomposes to n=2. The only hard case is where three different bits are each used by exactly two of the missing numbers (e.g. if the missing numbers are 3, 5, and 6, then tally[0][2..0] = (1,1,2); tally[1][2..0] = (1,2,1); tally[2][2..0] = (2,1,1)). The tally values make it possible to determine which bits are isomorphic; that in turn makes it possible to determine which numbers are missing.
|
|
|
|
|
Another approach:
run 17 sums: one simple sum (S) as before, and then S0 is the sum of all numbers with bit 0 set, S1 the sum of all with bit 1 set, etc.
Now we can precompute the expected answers for S0 .. S15, call these E0 .. E15.
If there are two numbers missing, then we get a+b from E-S, and then by looking at En-Sn, these can take the values {0,a,b,a+b} and as a and b must differ in at least one place, a or b must appear. So we can always solve the 2 case.
It is possible to do 3 and 4 in most cases, and I suspect you can prove you can do 3 in all cases but someone may prove otherwise.
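[The seventeen-sums scheme for n=2 might be sketched like this over a small range 0..N-1 (names and signature are mine). Each expected-minus-observed bucket deficit is one of {0, a, b, a+b}; a deficit strictly between 0 and a+b pins down one of the pair, and if none exists the pair must be {0, a+b}:]

```c
/* total tracks E - S (the sum of the two missing numbers); def[b] tracks
   Eb - Sb for the values with bit b set. Assumes exactly two missing. */
void find_two_missing_sums(const int *a, int count, int N, int *x, int *y)
{
    long total = 0, def[16] = {0};
    for (int v = 0; v < N; v++) {          /* expected sums */
        total += v;
        for (int b = 0; b < 16; b++)
            if (v & (1 << b)) def[b] += v;
    }
    for (int i = 0; i < count; i++) {      /* subtract observed sums */
        total -= a[i];
        for (int b = 0; b < 16; b++)
            if (a[i] & (1 << b)) def[b] -= a[i];
    }
    *x = 0; *y = (int)total;               /* covers the pair {0, a+b} */
    for (int b = 0; b < 16; b++)
        if (def[b] > 0 && def[b] < total) {
            *x = (int)def[b];              /* one number isolated */
            *y = (int)(total - def[b]);
            break;
        }
}
```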
Peter
"Until the invention of the computer, the machine gun was the device that enabled humans to make the most mistakes in the smallest amount of time."
|
|
|
|
|
Very interesting indeed.
With numbers a, b and c missing, the difference between the 17 Ei and Si values could take the values
{0, a, b, c, a+b, a+c, b+c, a+b+c} which may or may not be 8 different numbers (a could equal b+c). Some of these should appear in the Ei-Si values; how many different numbers are we expecting? Not sure yet.
If a=arbitrary, b=arbitrary and c=0, then c would not help in providing different numbers, so
we would find at most {0, a, b, a+b}; we are sure to find a+b in E-S
If a=arbitrary, b=65535-a and c=0, then we would find at most {0, a, 65535-a, 65535}
Another point of view: if it were to always provide the unique solution for n=3, and n=4, and...
at what value of n would it start to fail, and why? Is 16 a relevant number, i.e. does widening
the numbers enable the algorithm to find more missing numbers? I guess not, since all the missing
numbers COULD be in the range [0, 65535] however large the word width, so adding bits does not
strengthen the algorithm.
|
|
|
|
|
Just adding together numbers is apt to cause some confusion as bits carry from one place to the next. For n=3 or n=4 I would think it would be best to do something like:
for (i = 0; i < 16; i++)
{
    if (x & (1 << i))                  /* x has bit i set */
    {
        count1[i] ^= (count0[i] & x);  /* carry into the high count bit */
        count0[i] ^= x;                /* low count bit, per bit position */
    }
}
That will indicate whether a bit is set in 0, 1, 2, 3 (mod 4) of the missing numbers that have a particular bit set. The case where a bit is missing in all four numbers can be resolved by noting that there must be some other bit which does not appear in all four numbers, so the count0/1[] values for that bit will show that it's missing in at least some numbers.
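[To make the two-word counter concrete, here is a sketch of it wrapped in a small harness (the `Tally` struct and function names are mine, not from the thread). `tally_get(i, j)` returns, mod 4, how many values fed in so far had both bit i and bit j set:]

```c
#include <stdint.h>

/* Per bit position i, count0[i]/count1[i] form a two-bit (mod 4) counter
   for each bit j, counting values that had both bit i and bit j set. */
typedef struct { uint16_t count0[16], count1[16]; } Tally;

void tally_add(Tally *t, uint16_t x)
{
    for (int i = 0; i < 16; i++)
        if (x & (1 << i)) {
            t->count1[i] ^= (t->count0[i] & x);  /* carry bit */
            t->count0[i] ^= x;                   /* low bit   */
        }
}

int tally_get(const Tally *t, int i, int j)
{
    return (((t->count1[i] >> j) & 1) << 1) | ((t->count0[i] >> j) & 1);
}
```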
|
|
|
|
|
Modify the process slightly to use one of supercat's ideas: as well as having the 17 sums, also keep 17 counts, so that you know how many numbers went into each of the Sn sums. Also (this may not be necessary, but I'm lazy) set a flag to detect 0.
For n=3: if 0 is missing, treat as n=2; done. Otherwise, if any of the Sn is missing only one number, you get one missing number directly; simply add it back in and process as for 2. If any of the Sn is missing two numbers, then since E-S gives you a+b+c you can work out the third number; add it back in and process for 2. Done. As the numbers are not all the same, the sums can't all be down by 0 or 3.
For n=4: if 0 is missing, treat as n=3; done. Otherwise, if any of the Sn is missing only one number, you get one missing number directly; simply add it back in and process as for 3. Done. Otherwise, if any of the Sn is missing three numbers, you get one missing number from (a+b+c+d) - (a+b+c), so process as if one were found. Done. Otherwise all the Sn are missing either 0, 2 or 4. Can anyone with more time than me show that under these conditions you can get three different pairs? My guess is that it is likely.
[edit] I guessed wrongly - {1,3,5,7} breaks it. [/edit]
Peter
"Until the invention of the computer, the machine gun was the device that enabled humans to make the most mistakes in the smallest amount of time."
|
|
|
|
|