Welcome to the Inaugural Post for my Learning Python Series
This post is the start of a series that walks through, from start to finish, my journey into data analysis and data science with Python.
The Tools and Loading the Data
- What are the tools to download in order to get started building in Python?
- How do I load the data and construct the domain?
- How do I do some basic analysis on the data to get a feel for the relationships?
The Next Part
The next post will utilize Pandas to perform quicker, more structured data analysis.
The Tools
I downloaded and installed the Anaconda distribution along with the Visual Studio Python Tools.
The Anaconda distribution is a Python distribution that comes with many of the scientific libraries we will need.
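As a quick sanity check that the install worked (a minimal sketch; the exact version numbers will depend on your Anaconda release), you can print the versions of the libraries used later in this series:
# Verify the scientific stack that ships with Anaconda is importable.
import sys
import numpy
import matplotlib
import pandas

print(sys.version)               # the interpreter Anaconda installed
print(numpy.__version__)         # numerical arrays
print(matplotlib.__version__)    # plotting
print(pandas.__version__)        # data frames (used in the next post)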
The Data
Fortunately, there is a massive amount of public data that you can have fun and experiment with. The UCI Machine Learning Repository alone hosts a large collection of datasets.
Examples
- Mice Protein Expression
- Expression levels of 77 proteins measured in the cerebral cortex of 8 classes of control and Down syndrome mice exposed to context fear conditioning, a task used to assess associative learning.
- Car Evaluation Data Set
- Derived from a simple hierarchical decision model, this database may be useful for testing constructive induction and structure discovery methods.
- Adult Data Set
- Predict whether income exceeds $50K/yr based on census data. Also known as “Census Income” dataset.
I am going to use the Adult Data Set in my examples.
Loading the Data and Graphing the Data
Grab the data from:
http://archive.ics.uci.edu/ml/datasets/Adult
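If you would rather download the file from a script, the standard library can do it; this is just a sketch, and the exact file URL (the adult.test file in the dataset's data folder) and the local path are assumptions you may need to adjust:
# Sketch: fetch the test split with urllib. Adjust the URL if the data
# folder for the Adult dataset has moved, and the local path to taste.
import urllib.request

url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test'
urllib.request.urlretrieve(url, r'C:\adult.test')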
import csv

# Read the file row by row and pull out the first few columns by position.
with open(r'C:\adult.test', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        age = row[0]
        workclass = row[1]
        fnlweight = row[2]
        education = row[3]
        educationnum = row[4]
        maritalstatus = row[5]
        occupation = row[6]
        print(workclass)
This will print out the workclass column in the data, resulting in output like:
State-gov
Federal-gov
Private
Private
Private
Local-gov
Private
Local-gov
To start looking into the data, we can use some of Python's built-in tools to bucket the values and create some histograms. Histograms show how the data is distributed and start to give us clues about its shape. To create a histogram, we will use the Counter class from the collections module to tally the data and matplotlib to draw it.
import csv
from collections import Counter
import matplotlib.pyplot as plt

def create_histogram(labels, values, bucket_size, title):
    # bucket_size is unused in this simple version; each distinct age gets its own bar.
    plt.bar(labels, values)
    plt.title(title)
    plt.show()

agelist = list()
with open(r'C:\Users\Jon\Documents\adult.test', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        try:
            age = row[0]
            agelist.append(age)
        except IndexError:
            pass  # skip blank or malformed rows

agedist = Counter(agelist)
labels, values = zip(*agedist.items())
valueslistfloat = [float(x) for x in values]
labelslistfloat = [float(x) for x in labels]
create_histogram(labelslistfloat, valueslistfloat, 5, "Age Distribution Simple")
Once I have read the file, I take my agelist and run it through Counter. Counter allows for rapid tallying of data: it returns a dict-like object that maps each age to the number of times it occurred. We then unzip the items into labels and values; the * operator unpacks the (age, count) pairs so zip can regroup them into a tuple of ages and a tuple of counts. Finally, we call plt.bar(labels, values) and plt.show() to display the graph.
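To make the Counter and unzip steps concrete, here is a tiny standalone example with made-up ages:
from collections import Counter

sample_ages = [39, 50, 39, 23, 50, 39]    # made-up ages for illustration
agedist = Counter(sample_ages)            # Counter({39: 3, 50: 2, 23: 1})
labels, values = zip(*agedist.items())    # e.g. (39, 50, 23) and (3, 2, 1)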
We can see that the ages form a right-skewed distribution covering the income-earning population: the counts start at around 18, climb to a peak near 40, and then gradually tail off.
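One rough way to back that up numerically (reusing the agelist built above, and assuming the ages all parse as numbers) is to compare the mean and the median; in a right-skewed distribution the mean sits above the median:
# Rough skew check: for right-skewed data the mean exceeds the median.
import statistics

ages = [float(x) for x in agelist]
print(statistics.mean(ages), statistics.median(ages))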
First Refactoring – Making the Code a Bit More Compact and Readable
I mapped each row to a named tuple in order to iterate through the data a bit more intuitively.
First, I created a list of the column names in the data.
import collections

economic_columns = ['age', 'workclass',
                    'fnlwght', 'education', 'educationNum',
                    'maritalStatus', 'occupation', 'relationship',
                    'race', 'sex', 'capitalGain', 'capitalLoss',
                    'hoursPerWeek', 'nativeCountry', 'income']

EconRecord = collections.namedtuple('econ', economic_columns)
I then created a named tuple instance for each row in the file.
rowlist = list()
with open(r'C:\Users\Jon\Documents\adult.txt', 'r') as f:
    # _make builds an EconRecord from each row's sequence of values.
    for econ in map(EconRecord._make, csv.reader(f)):
        rowlist.append(econ)
I used EconRecord and the map function to apply EconRecord._make to every row in the file, creating a new EconRecord for each one.
The result is being able to aggregate the items a bit more cleanly, with more concise, readable code.
agelistint = [int(x.age) for x in rowlist]
agedist = Counter(agelistint)
labels, values = zip(*agedist.items())
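With the counts rebuilt from the named tuples, the same create_histogram function from earlier can redraw the plot; the labels are already ints here, so no string conversion is needed (the title string is just an example):
# Redraw the age distribution from the namedtuple-based counts.
create_histogram(list(labels), list(values), 5, "Age Distribution Refactored")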
In the next post in the series, we will start to scatter plot and look for relationships in the data using Pandas.