I have three files, F1, F2 and F3, each containing millions of records. These files contain numbers. My requirement is to check for duplicates across the files, e.g. a number N1 should be unique in all the files. I am doing this by putting all the records in a HashSet and processing it. So my question is: can I put around 100 million records in a HashSet? Will there be any memory problem?
I can't use a database, so if there is any other option to do this, please tell me.

1 solution

In principle you can, but you will push the limits of the available RAM on some computers. You don't need as much memory as you might think if you compose your hash set wisely.

Instead of keeping the content of a file record in the values of the hash set, keep only its position in the file (or perhaps the position and the size of the record). I don't know what your key type would be and how much memory the keys will take.
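
If you want a concrete number for your own machine, a simple way is to fill a set with representative keys and compare the managed heap size before and after. Here is a minimal sketch, assuming the keys are plain long values (my assumption, not something stated in the question); on a 64-bit runtime it usually comes out at a few tens of bytes per entry once the set's internal arrays are counted:

using System;
using System.Collections.Generic;

class HashSetMemoryEstimate
{
    static void Main()
    {
        const int Count = 10 * 1000 * 1000; // 10 million entries; extrapolate from here

        long before = GC.GetTotalMemory(true); // force a full collection first

        var numbers = new HashSet<long>();
        for (long i = 0; i < Count; i++)
            numbers.Add(i);

        long after = GC.GetTotalMemory(true);

        Console.WriteLine("Entries:         {0:N0}", numbers.Count);
        Console.WriteLine("Approx. bytes:   {0:N0}", after - before);
        Console.WriteLine("Bytes per entry: {0:N1}", (after - before) / (double)Count);
    }
}

Multiply the reported bytes-per-entry figure by 100 million to judge whether the full set would fit in your RAM.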

By the way, you probably meant the type System.Collections.Hashtable. You should never use this type for any new development, nor any other non-specialized, non-generic collection type. It was rendered obsolete as early as .NET version 2.0, when generics were introduced. It wasn't formally marked with the [Obsolete] attribute only because there is nothing wrong with keeping it in well-working legacy code. Non-generic types require type casts and hence are potentially more dangerous than the generic classes you really need to use; a short illustration follows the links below. You should pick one of these three:
http://msdn.microsoft.com/en-us/library/xfhwa508.aspx[^],
http://msdn.microsoft.com/en-us/library/ms132259.aspx[^],
http://msdn.microsoft.com/en-us/library/ms132319.aspx[^].
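
To illustrate the type-safety point with a minimal sketch (my own example, not taken from the links): the non-generic Hashtable hands values back as object, so a wrong cast compiles and fails only at run time, whereas the generic Dictionary<string, int> rejects the same mistake at compile time:

using System;
using System.Collections;
using System.Collections.Generic;

class GenericVsNonGeneric
{
    static void Main()
    {
        // Non-generic: values come back as object, so a wrong cast
        // compiles cleanly and only fails at run time.
        var table = new Hashtable();
        table["answer"] = 42;
        try
        {
            string wrong = (string)table["answer"]; // InvalidCastException here
            Console.WriteLine(wrong);
        }
        catch (InvalidCastException e)
        {
            Console.WriteLine("Run-time failure: " + e.Message);
        }

        // Generic: the compiler knows the value type, so no cast is needed
        // and the equivalent mistake does not compile at all.
        var dictionary = new Dictionary<string, int>();
        dictionary["answer"] = 42;
        int value = dictionary["answer"];
        // string wrong2 = dictionary["answer"]; // compile-time error CS0029
        Console.WriteLine("Value read without a cast: " + value);
    }
}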

The major difference between all those key-indexed containers is the trade-off between computational complexity (practically, the time of an operation) and memory overhead. As your situation is most likely critical with respect to memory overhead, you will need to study this trade-off to make the right choice.

I don't know if you can use the class System.Collections.Generic.HashSet<T> for your purpose.
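
For what it is worth, here is a minimal sketch of the in-memory approach described in the question. The file names and the one-number-per-line layout are my assumptions, not something stated in the post; the useful detail is that HashSet<T>.Add returns false when the value is already present, which is exactly the duplicate test needed:

using System;
using System.Collections.Generic;
using System.IO;

class DuplicateFinder
{
    static void Main()
    {
        // Assumed layout: one number per line in each of the three files.
        string[] files = { "F1.txt", "F2.txt", "F3.txt" };

        var seen = new HashSet<long>();

        foreach (string file in files)
        {
            // File.ReadLines streams the file line by line instead of loading it whole.
            foreach (string line in File.ReadLines(file))
            {
                long number = long.Parse(line.Trim());

                // Add returns false when the number is already in the set,
                // i.e. it is a duplicate somewhere among the three files.
                if (!seen.Add(number))
                    Console.WriteLine("Duplicate {0} found in {1}", number, file);
            }
        }

        Console.WriteLine("Distinct numbers: {0:N0}", seen.Count);
    }
}

Because File.ReadLines streams the input, only the set itself has to fit in memory.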

Now, the remaining question is: what if you still need to keep more data than you can hold in RAM? Well, I would certainly solve such a problem, but it would take more work. The idea is simple: you can learn how associative containers work and implement one that uses disk space as its main storage. Please see:
http://en.wikipedia.org/wiki/Hash_table[^].
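
One workable variant of that idea, sketched below under my own assumptions (the same file names and layout as before, and a bucket count you would tune to your RAM): partition the numbers into bucket files so that all copies of a given number land in the same bucket, then deduplicate each bucket independently with an ordinary in-memory HashSet<long>:

using System;
using System.Collections.Generic;
using System.IO;

class DiskPartitionedDedup
{
    const int BucketCount = 64; // tune so that one bucket fits comfortably in RAM

    static void Main()
    {
        string[] inputs = { "F1.txt", "F2.txt", "F3.txt" }; // assumed file names

        // Pass 1: scatter every number into a bucket file chosen by its value,
        // so all copies of the same number end up in the same bucket.
        var writers = new StreamWriter[BucketCount];
        for (int i = 0; i < BucketCount; i++)
            writers[i] = new StreamWriter("bucket" + i + ".txt");

        foreach (string file in inputs)
            foreach (string line in File.ReadLines(file))
            {
                long number = long.Parse(line.Trim());
                int bucket = (int)((ulong)number % (ulong)BucketCount);
                writers[bucket].WriteLine(number);
            }

        foreach (StreamWriter writer in writers)
            writer.Dispose();

        // Pass 2: each bucket is small enough to deduplicate in memory.
        for (int i = 0; i < BucketCount; i++)
        {
            var seen = new HashSet<long>();
            foreach (string line in File.ReadLines("bucket" + i + ".txt"))
            {
                long number = long.Parse(line);
                if (!seen.Add(number))
                    Console.WriteLine("Duplicate: {0}", number);
            }
        }
    }
}

This is essentially external hash partitioning; the linked article describes the in-memory building blocks it relies on.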

However, I would stop here. First and foremost, I'm not quite sure that your whole approach is reasonable. To me, all solutions that involve huge memory consumption are suspicious. If I knew your exact goals, I would probably try to review the whole architecture.

—SA
 
 
Comments
prejval2006 2-Jan-13 1:32am    
My goal is to remove duplicates across all the files without using any database, e.g. a number N1 should be unique in all the files. So I am putting all the records in a single HashSet. And the number of records is quite high, at most 50 million. So is there any other optimised way to do it?
Sergey Alexandrovich Kryukov 2-Jan-13 1:49am    
I think I basically answered, don't you think so? Do some estimates and predictions. If your memory is enough, stay with memory. If you agree that my advice makes sense, consider accepting the answer formally (green button).
—SA

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


