I have three files, F1, F2 and F3, each containing millions of records. These files contain numbers. My requirement is to check for duplicates across the files, e.g. a number N1 should be unique in all the files. I am doing this by putting all the records in a HashSet and processing it. So my question is: can I put around 100 million records in a HashSet? Will there be any memory problem?
I can't use a database, so if there is any other option to do this, please tell me.

1 solution

In principle you can, but you will push the limits of the available RAM on some computers. You don't need as much memory as you might think if you compose your hash set wisely.

Instead of keeping the content of a file record in the values of the hash set, keep only its position in the file (or perhaps the position and the size of the record). I don't know what your key type would be and how much memory the keys will take.
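
If you want a concrete number for your own machine, a simple way is to fill a set with representative keys and compare the managed heap size before and after. Here is a minimal sketch, assuming the keys are plain long values (my assumption, not something stated in the question); on a 64-bit runtime it usually comes out at a few tens of bytes per entry once the set's internal arrays are counted:

using System;
using System.Collections.Generic;

class HashSetMemoryEstimate
{
    static void Main()
    {
        const int Count = 10 * 1000 * 1000; // 10 million entries; extrapolate from here

        long before = GC.GetTotalMemory(true); // force a full collection first

        var numbers = new HashSet<long>();
        for (long i = 0; i < Count; i++)
            numbers.Add(i);

        long after = GC.GetTotalMemory(true);

        Console.WriteLine("Entries:         {0:N0}", numbers.Count);
        Console.WriteLine("Approx. bytes:   {0:N0}", after - before);
        Console.WriteLine("Bytes per entry: {0:N1}", (after - before) / (double)Count);
    }
}

Multiply the reported bytes-per-entry figure by 100 million to judge whether the full set would fit in your RAM.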

By the way, you probably meant the type System.Collections.Hashtable. You should never use this type for any new development, nor any other non-specialized, non-generic collection type. It was rendered obsolete as early as .NET version 2.0, when generics were introduced. It wasn't formally marked with the [Obsolete] attribute only because there is nothing wrong with keeping it in well-working legacy code. Non-generic types require type casts and hence are potentially more dangerous than the generic classes you really need to use; a short illustration follows the links below. You should pick one of these three:
http://msdn.microsoft.com/en-us/library/xfhwa508.aspx[^],
http://msdn.microsoft.com/en-us/library/ms132259.aspx[^],
http://msdn.microsoft.com/en-us/library/ms132319.aspx[^].
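
To illustrate the type-safety point with a minimal sketch (my own example, not taken from the links): the non-generic Hashtable hands values back as object, so a wrong cast compiles and fails only at run time, whereas the generic Dictionary<string, int> rejects the same mistake at compile time:

using System;
using System.Collections;
using System.Collections.Generic;

class GenericVsNonGeneric
{
    static void Main()
    {
        // Non-generic: values come back as object, so a wrong cast
        // compiles cleanly and only fails at run time.
        var table = new Hashtable();
        table["answer"] = 42;
        try
        {
            string wrong = (string)table["answer"]; // InvalidCastException here
            Console.WriteLine(wrong);
        }
        catch (InvalidCastException e)
        {
            Console.WriteLine("Run-time failure: " + e.Message);
        }

        // Generic: the compiler knows the value type, so no cast is needed
        // and the equivalent mistake does not compile at all.
        var dictionary = new Dictionary<string, int>();
        dictionary["answer"] = 42;
        int value = dictionary["answer"];
        // string wrong2 = dictionary["answer"]; // compile-time error CS0029
        Console.WriteLine("Value read without a cast: " + value);
    }
}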

The major difference between all those key-indexed containers is the trade-off between computational complexity (practically, the time of an operation) and memory overhead. As your situation is most likely critical with respect to memory overhead, you will need to study this trade-off to make the right choice.

I don't know if you can use the class System.Collections.Generic.HashSet<T> for your purpose.
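
For what it is worth, here is a minimal sketch of the in-memory approach described in the question. The file names and the one-number-per-line layout are my assumptions, not something stated in the post; the useful detail is that HashSet<T>.Add returns false when the value is already present, which is exactly the duplicate test needed:

using System;
using System.Collections.Generic;
using System.IO;

class DuplicateFinder
{
    static void Main()
    {
        // Assumed layout: one number per line in each of the three files.
        string[] files = { "F1.txt", "F2.txt", "F3.txt" };

        var seen = new HashSet<long>();

        foreach (string file in files)
        {
            // File.ReadLines streams the file line by line instead of loading it whole.
            foreach (string line in File.ReadLines(file))
            {
                long number = long.Parse(line.Trim());

                // Add returns false when the number is already in the set,
                // i.e. it is a duplicate somewhere among the three files.
                if (!seen.Add(number))
                    Console.WriteLine("Duplicate {0} found in {1}", number, file);
            }
        }

        Console.WriteLine("Distinct numbers: {0:N0}", seen.Count);
    }
}

Because File.ReadLines streams the input, only the set itself has to fit in memory.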

Now, the remaining question is: what if you still need to keep more data than you can hold in RAM? Well, I would certainly solve such a problem, but it would take more work. The idea is simple: you can learn how associative containers work and implement one that uses disk space as its main storage. Please see:
http://en.wikipedia.org/wiki/Hash_table[^].
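
One workable variant of that idea, sketched below under my own assumptions (the same file names and layout as before, and a bucket count you would tune to your RAM): partition the numbers into bucket files so that all copies of a given number land in the same bucket, then deduplicate each bucket independently with an ordinary in-memory HashSet<long>:

using System;
using System.Collections.Generic;
using System.IO;

class DiskPartitionedDedup
{
    const int BucketCount = 64; // tune so that one bucket fits comfortably in RAM

    static void Main()
    {
        string[] inputs = { "F1.txt", "F2.txt", "F3.txt" }; // assumed file names

        // Pass 1: scatter every number into a bucket file chosen by its value,
        // so all copies of the same number end up in the same bucket.
        var writers = new StreamWriter[BucketCount];
        for (int i = 0; i < BucketCount; i++)
            writers[i] = new StreamWriter("bucket" + i + ".txt");

        foreach (string file in inputs)
            foreach (string line in File.ReadLines(file))
            {
                long number = long.Parse(line.Trim());
                int bucket = (int)((ulong)number % (ulong)BucketCount);
                writers[bucket].WriteLine(number);
            }

        foreach (StreamWriter writer in writers)
            writer.Dispose();

        // Pass 2: each bucket is small enough to deduplicate in memory.
        for (int i = 0; i < BucketCount; i++)
        {
            var seen = new HashSet<long>();
            foreach (string line in File.ReadLines("bucket" + i + ".txt"))
            {
                long number = long.Parse(line);
                if (!seen.Add(number))
                    Console.WriteLine("Duplicate: {0}", number);
            }
        }
    }
}

This is essentially external hash partitioning; the linked article describes the in-memory building blocks it relies on.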

However, I would stop here. First and foremost, I'm not quite sure that your whole approach is reasonable. To me, all solutions that involve huge memory consumption are suspicious. If I knew your exact goals, I would probably try to review the whole architecture.

—SA
 
 
Comments
prejval2006 2-Jan-13 1:32am    
My goal is to remove duplicates across all the files without using any database, e.g. a number N1 should be unique in all the files. So I am putting all the records in a single HashSet. And the number of records is quite high, at most 50 million. So is there any other optimised way to do it?
Sergey Alexandrovich Kryukov 2-Jan-13 1:49am    
I think I basically answered, don't you think so? Do some estimates and predictions. If your memory is enough, stay with memory. If you agree that my advice makes sense, consider accepting the answer formally (green button).
—SA

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


