Hi,

I am producing some test data.

I need:

Forename,
Surname,
Nationality,
Gender,
DoB,
Email Address,

I have lists of popular forenames by gender and nationality and surnames by nationality. I can produce all the details from a randomized combination of forename and surname.

What I would like to do is use the 'Frequency' column I have alongside my list of names.
How should I store the frequency with each name and use it to weight my random person generator?

Here is some code

C#
private static Dictionary<string, double> _enMForenames = new Dictionary<string, double>
        {
            {"Oliver",1.939}, //about 499 more
        };
private static Dictionary<string, double> _enFForenames = new Dictionary<string, double>
        {
            {"Amelia",1.638}, //about 499 more
        };
private static Dictionary<string,double> _enSurnames = new Dictionary<string, double>
        {
            {"SMITH",0.0074062771845516}, //about 30k more
        };


Thanks ^_^


As the title suggests. I don't have to use the frequency but it sounds like fun (well, to me anyway :)).


EDIT: I have created my own dictionary for producing a weighted random type.
Will this work?

C#
private class WeightedRandomLists<T> : Dictionary<T, double>
        {
            static Random rand = new Random((int)(DateTime.Now.Ticks % int.MaxValue));

            private static Dictionary<T, double> _totals;


            public T NextRandomItem
            {
                 get
                {
                    if (!this.Any())
                        return default(T);

                    if (_totals == null)
                    {

                        double total = Values.Sum();

                        double runningTotal = 0.0;
                        _totals = this.Select(kvp =>
                        {
                            runningTotal += (kvp.Value/total)/100; //because percent
                            return new KeyValuePair<T, double>(kvp.Key, runningTotal);
                        }).ToDictionary(kvp => kvp.Key, kvp => kvp.Value);
                    }

                    var random = rand.NextDouble();
                    var index = Array.BinarySearch(_totals.Values.ToArray(), random);
                    return _totals.Keys.Cast<T>().ToArray()[index];
                }
            }
        }
Updated 21-May-15 4:45am

I would store probabilities for each name:
C#
var names = new [] {"Adam", "Barbara", "Cecilia", "Daniel", "Emily", "Felix"};
var probabilities = new [] {0.15, 0.2, 0.1, 0.35, 0.15, 0.05};

System.Random.NextDouble() returns a random value x, uniformly distributed on the interval 0 <= x < 1. We can transform this to our custom distribution.
C#
// running sum of probabilities
var thresholds = new [] {0.15, 0.35, 0.45, 0.8, 0.95, 1.0};

var random = new Random();
var randomValue = random.NextDouble();
var index = Array.BinarySearch(thresholds, randomValue);

// a negative result is the bitwise complement of the index of the next larger threshold
if(index < 0)
{
    index = ~index;
}

var name = names[index];

Array.BinarySearch returns the index of the item if it is found, or the bitwise complement of the index of the next larger item if it is not.
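
To make the thresholds step concrete, here is a minimal sketch that computes the running sum from the probabilities instead of typing it in by hand. The names ThresholdDemo, BuildThresholds and IndexFor are made up for this illustration and are not part of the answer above. The clamp at the end is an extra guard I am assuming is wanted, in case rounding leaves the last threshold slightly below 1.0.

```csharp
using System;

public static class ThresholdDemo
{
    // Running sum of the probabilities, e.g. {0.15, 0.2, 0.1} -> {0.15, 0.35, 0.45}.
    public static double[] BuildThresholds(double[] probabilities)
    {
        var thresholds = new double[probabilities.Length];
        double runningTotal = 0.0;
        for (int i = 0; i < probabilities.Length; i++)
        {
            runningTotal += probabilities[i];
            thresholds[i] = runningTotal;
        }
        return thresholds;
    }

    // Maps a value from [0, 1) to the index of the interval it falls into.
    public static int IndexFor(double[] thresholds, double value)
    {
        int index = Array.BinarySearch(thresholds, value);
        if (index < 0)
        {
            index = ~index; // complement of the index of the next larger threshold
        }
        // Guard (assumption): never run past the last item if rounding
        // left the final threshold a hair below 1.0.
        return Math.Min(index, thresholds.Length - 1);
    }

    public static void Main()
    {
        var names = new[] { "Adam", "Barbara", "Cecilia", "Daniel", "Emily", "Felix" };
        var probabilities = new[] { 0.15, 0.2, 0.1, 0.35, 0.15, 0.05 };

        var thresholds = BuildThresholds(probabilities);
        var random = new Random();
        Console.WriteLine(names[IndexFor(thresholds, random.NextDouble())]);
    }
}
```

A value that lands exactly on a threshold is assigned to the lower interval here; with double-precision random values that case is rare enough not to matter for test data.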
 
Comments
Andy Lanng 21-May-15 10:10am    
Is this an issue if the probabilities don't add up to 1.0?
Tomas Takac 21-May-15 10:21am    
Yes. But probabilities are basically percentages, so you can calculate it as frequency / sum of all frequencies.
Andy Lanng 21-May-15 10:23am    
ah ok - i'll give it a second go
Andy Lanng 21-May-15 10:26am    
I have updated my question. What do you think?
Tomas Takac 21-May-15 10:39am    
Seems like an overkill to generate the array every time. In your other question you say you need this for 2M names so this might be a problem. Do you have the requirement to add/remove names or update frequencies on the fly? My point is: do you need the dictionary at all? Why not create your own class from scratch?
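
One way to read the "create your own class from scratch" suggestion, as a rough sketch only: build the cumulative array once in the constructor, normalizing by the sum of all frequencies as described in the comment above, and then only do a binary search per pick. WeightedPicker, Pick and Next are hypothetical names, and apart from Oliver's 1.939 from the question the weights in Main are placeholders, not real data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; not code from the question or the answer.
public sealed class WeightedPicker<T>
{
    private readonly T[] _items;
    private readonly double[] _thresholds; // cumulative weights, normalized to end at 1.0
    private readonly Random _random = new Random();

    public WeightedPicker(IEnumerable<KeyValuePair<T, double>> weightedItems)
    {
        var pairs = weightedItems.ToArray();
        if (pairs.Length == 0)
            throw new ArgumentException("At least one item is required.", nameof(weightedItems));

        double total = pairs.Sum(p => p.Value);
        if (total <= 0.0)
            throw new ArgumentException("Weights must sum to a positive value.", nameof(weightedItems));

        _items = new T[pairs.Length];
        _thresholds = new double[pairs.Length];
        double runningTotal = 0.0;
        for (int i = 0; i < pairs.Length; i++)
        {
            runningTotal += pairs[i].Value / total; // frequency / sum of all frequencies
            _items[i] = pairs[i].Key;
            _thresholds[i] = runningTotal;
        }
        _thresholds[pairs.Length - 1] = 1.0; // absorb rounding error in the running sum
    }

    // Picks a random item; higher weight means higher probability.
    public T Next()
    {
        return Pick(_random.NextDouble());
    }

    // Maps a value from [0, 1) to an item; exposed so the mapping can be checked.
    public T Pick(double value)
    {
        int index = Array.BinarySearch(_thresholds, value);
        if (index < 0) index = ~index;
        return _items[Math.Min(index, _items.Length - 1)];
    }
}

public static class PickerDemo
{
    public static void Main()
    {
        var forenames = new[]
        {
            new KeyValuePair<string, double>("Oliver", 1.939), // from the question
            new KeyValuePair<string, double>("NameB", 1.5),    // placeholder weight
            new KeyValuePair<string, double>("NameC", 0.5),    // placeholder weight
        };
        var picker = new WeightedPicker<string>(forenames);
        Console.WriteLine(picker.Next());
    }
}
```

Because the arrays are built once, each pick costs a binary search and nothing else, which should matter when generating a couple of million names. A Dictionary<T, double> can be passed straight to the constructor, but its enumeration order is not guaranteed, so nothing here relies on that order.
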
One of the best candidates for storing just the frequency would be something like
System.Collections.Generic.Dictionary<string, uint>, where the first generic parameter is the name and the second is the frequency. (Not really a "frequency", just the number of occurrences, but this is what you really need.) It's also possible that you will need to make the second parameter a class containing the frequency and/or other data (a reference type is much better here than a value-type structure).

This class gives you time complexity of O(1) for finding by name. As you need to update frequency on each repeated name, this is what you want.
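
As a small illustration of that update step, assuming the input is simply a sequence of raw names (the question does not say where the data comes from), the occurrences could be counted like this. OccurrenceCounter and CountOccurrences are made-up names for the sketch.

```csharp
using System;
using System.Collections.Generic;

public static class OccurrenceCounter
{
    // Counts how many times each name occurs in the input.
    public static Dictionary<string, uint> CountOccurrences(IEnumerable<string> names)
    {
        var counts = new Dictionary<string, uint>();
        foreach (var name in names)
        {
            uint current;
            counts.TryGetValue(name, out current); // current stays 0 when the name is new
            counts[name] = current + 1;            // O(1) lookup and update on average
        }
        return counts;
    }

    public static void Main()
    {
        var counts = CountOccurrences(new[] { "Oliver", "Amelia", "Oliver" });
        foreach (var pair in counts)
        {
            Console.WriteLine(pair.Key + ": " + pair.Value);
        }
    }
}
```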

A lot is unclear in your post, first of all where you get the input data, but that is not part of your question. ;-)

—SA
 
Comments
Andy Lanng 21-May-15 9:01am    
Cool - I have the dictionaries set up for en names (examples in the question).

I only have the percent of population (baby names 2014) at my disposal. Can I work with that?

I would like to know how to implement the "weighting" of the random name generator
Sergey Alexandrovich Kryukov 21-May-15 9:05am    
Yes, you can work with it. I'll probably answer how to weight, if you explain the problem, but this is a different question. Will you accept this answer?
—SA
Andy Lanng 21-May-15 9:24am    
well..

My question did say "and use it to weight my random person generator?", but as it's you SA, no problem ^_^

It's probably better to break it down anyway.
Andy Lanng 21-May-15 9:36am    
See this continued here:
http://www.codeproject.com/Questions/993726/Just-for-fun-part-II-Produce-real-names-by-nationa
Sergey Alexandrovich Kryukov 21-May-15 9:43am    
Very good. I basically understand that you want to generate random numbers, but with weights. I just want to add a little explanation. Say you have a set of names with weights, normalized or not. You want to pick a random name from the list, with a higher weight giving a higher probability. Now the distribution. Consider a uniform distribution where each name is repeated a number of times proportional to its weight, and you then choose an index into the whole list of names, repetitions included; this gives you a "uniform" distribution with weights. Or the distribution can be different... Generally, you can take a random generator with uniform distribution and make any distribution out of it, using some mapping function. The function corresponding to the situation described above is just a step-like function...
—SA
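
The repeated-names idea from the comment above can be sketched as follows, assuming the weights are whole-number occurrence counts. ExpandedListDemo and Expand are hypothetical names and the counts in Main are placeholders, not real frequencies.

```csharp
using System;
using System.Collections.Generic;

public static class ExpandedListDemo
{
    // Repeats each name as many times as its count, so that a uniformly
    // chosen index hits a name with probability count / total.
    public static List<string> Expand(IEnumerable<KeyValuePair<string, uint>> counts)
    {
        var expanded = new List<string>();
        foreach (var pair in counts)
        {
            for (uint i = 0; i < pair.Value; i++)
            {
                expanded.Add(pair.Key);
            }
        }
        return expanded;
    }

    public static void Main()
    {
        // Placeholder counts for illustration only.
        var counts = new Dictionary<string, uint> { { "Adam", 3 }, { "Barbara", 1 } };
        var expanded = Expand(counts);

        var random = new Random();
        Console.WriteLine(expanded[random.Next(expanded.Count)]);
    }
}
```

This trades memory for simplicity: the list grows with the sum of all counts, so for large or fractional weights the cumulative-threshold (step function) approach is probably the more practical choice.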

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


