The average sort complexity of nth_element is linear with respect to _Last – _First.
It's the same link as the one you gave me.
Also, now that I think of it again, you're probably right. It does work. I was trying to think of cases where the collected data would be inaccurate, but as long as the "middle" parameter of nth_element() in the final filter (filtering the already-filtered data) is less than or equal to the "middle" parameter used in the initial filtering, there is no problem.
Thanks Cyrilix,
How do you think nth_element could be implemented with linear time complexity, O(n)? I cannot imagine an O(n) implementation.
Any ideas?
regards,
George
Isn't it part of a standard C++ library in Visual C++? Why does it have to be re-implemented on your end?
Thanks Cyrilix,
I am not going to re-implement it; I am interested in learning how nth_element is implemented, to improve my algorithm skills.
Any ideas to share?
regards,
George
I'm probably the worst person to ask, but I can't think of anything off the top of my head.
Never mind, Cyrilix.
I will start another thread to discuss this.
regards,
George
Hi DavidCrow,
I agree that sorting with qsort in memory is fast. The issue for me is that the data is larger than physical memory: I cannot read tens of gigabytes of data into memory to sort, since my physical memory is only 2 GB.
Currently I have several million data samples, but in the near future it will grow to tens of gigabytes.
Any comments or ideas?
regards,
George
George_George wrote: I cannot read tens of gigabytes of data into memory to sort.
And I would not have suggested such had you mentioned this requirement up front.
"Normal is getting dressed in clothes that you buy for work and driving through traffic in a car that you are still paying for, in order to get to the job you need to pay for the clothes and the car and the house you leave vacant all day so you can afford to live in it." - Ellen Goodman
"To have a respect for ourselves guides our morals; to have deference for others governs our manners." - Laurence Sterne
Sorry DavidCrow,
My mistake for causing confusion earlier. When I asked the question, I thought the data would be several megabytes, but after some experiments it turns out to be several gigabytes.
Do you have any new ideas or comments given the new situation?
regards,
George
George_George wrote: Do you have any new ideas or comments given the new situation?
Yes, and it would only require that you keep 1,000 of those numbers in memory. If you are familiar with a binary tree or a linked list, I would go that route.
Hi DavidCrow,
I have a new idea: use STL nth_element to select the largest 1000 elements from N elements, and divide the several gigabytes of elements into tens of groups of N elements each. What do you think of this solution?
I have heard that the time complexity of nth_element is O(n), but frankly I do not believe it. Do you know how nth_element could be implemented in O(n)?
regards,
George
George_George wrote: What do you think of this solution?
I tested it with 16,000,000 integers and it took ~1 second.
Hi DavidCrow,
I can hardly believe your system is so fast -- it sounds like a supercomputer. What machine are you using?
regards,
George
George_George wrote: What machine are you using?
It is a Dell Optiplex 170L with a 2.4GHz CPU and 1GB of RAM. I also tried it on an IBM ThinkPad A21m with an 800MHz CPU and 512MB of RAM. It took ~4 seconds.
Thanks DavidCrow,
I'd appreciate it if we could return to the issue that the data (tens of gigabytes, not several) cannot be held in physical memory all at once. Any good ideas are appreciated.
regards,
George
George_George wrote: Any good ideas are appreciated.
One has already been offered. See here.
Thanks DavidCrow,
I am still looking for more efficient solutions; what do you think of using nth_element? It could save sorting time, since I only want the 1000 largest elements and have no need to sort them.
regards,
George
George_George wrote: I am still looking for more efficient solutions...
Unless you have actual metrics from each solution, how do you know what is efficient and what is not?
George_George wrote: what do you think of using nth_element?
How do you plan on using that with multiple files? At some point, you'll need the largest 1,000 from each file in memory, or in a separate file, so that the largest 1,000 overall can be selected.
Thanks DavidCrow,
I appreciate your comments above. I'll get my hands dirty and evaluate further.
regards,
George
So what have you discovered with this?
Hi,
I didn't check the correctness of the answers above, but you could use the automatic sorting of std::multiset. (A std::map keyed by the values themselves would silently merge duplicates, because equal keys overwrite one another; a multiset keeps them.)
std::multiset<int> theSet;
int data[10] = { 1, 7, 3, 8, 1, 9, 12, 11, 13, 6 };
for (int iIndex = 0; iIndex < 10; iIndex++)
{
    if (theSet.size() < 4)
    {
        // Still filling up: keep the first 4 values unconditionally.
        theSet.insert(data[iIndex]);
    }
    else if (data[iIndex] > *theSet.begin())
    {
        // Larger than the smallest kept value: insert it, then evict the smallest.
        theSet.insert(data[iIndex]);
        theSet.erase(theSet.begin());
    }
}
// theSet now holds the four largest values: 9, 11, 12, 13.
I don't know whether this performs well on millions of items, but it is a simple, straightforward algorithm.
codito ergo sum
Thanks BadKarma,
Your solution works if we can hold all the data in memory (a set instance), but in my situation the data cannot fit in memory. I'd appreciate it if you could share some more insights.
regards,
George
In my solution I only keep four integers in memory.
The example uses an int array, but that is just to provide input data; the data could also come from a file or any other input source.
Thanks BadKarma,
I understand your solution now. Do you have any ideas for further improvements?
regards,
George
Hello!
I'm not enough of an expert to provide scientific solutions, and I'm trying not to lean on what others have said before (nth_element and std::map don't mean anything to me yet), but I'm thinking of two approaches.
You say you have the data in files. Then why not read them line by line (or through a buffer; I don't know the format of your file(s))? That way you only ever work on several MB of RAM; the rest has either already been read or is still waiting in the file.
--
For example, use a buffer of 5000 elements. Read the first 5000 elements, and while reading them start building the Highest1000 vector (which I suggest you keep sorted after each insertion, for efficiency). Then read the next 5000 elements, update the Highest1000 vector, and so on.
---
It might take some time, I know, but you won't kill your computer, and if you need the application to do something else while sorting, try making it multithreaded.
--Even more effective would be one thread handling the reading/writing and another doing the sorting and inserting of new values (I'm afraid I can't say much about synchronizing the two threads; shared variables are all that comes to mind).
The second solution builds on the first by saving your partial results to another file. So if today you have 1 GB of data, save the highest 1000 values; the next time you run your app, with 2 GB of data, process only the remaining 1 GB.
If timing is such a concern, remember that this is exactly how Windows overcomes a lack of RAM: by paging data to the hard disk. The system may slow down while reading from and writing to files, but that is the nature of your application. With such amounts of data you cannot expect a pocket PC to handle it quickly while serving other applications. You say you have 2 GB of memory, which is enough.
Please provide feedback; I've been thinking about your problem and I'm interested in your results.
Best Wishes,
Shpid3r
PS: insertion is fastest with a binary search. Say the first position of the Highest1000 vector holds the lowest value and the last holds the highest; then compare the new value with the last element, then with the middle one, and so on.