The World Health Organization (WHO) compiles data on social, economic, health and political indicators, and makes them available as a file called WHO.csv. As the name indicates, this is a Comma Separated Values (CSV) file. It has 358 columns and 202 rows, one row pertaining to each country of the world. As examples, some of the columns are titled "Country", "Continent", "Population (in thousands) total", and "Number of confirmed poliomyelitis cases".
In this article, we try to extract some meaningful information from this file using the R programming language. About two years ago, I attended an online course called The Analytics Edge, offered by MIT on the edX online platform. The course introduced the R language using a reduced form of the WHO CSV file mentioned above. Here, we introduce the R language using a different reduced version of the WHO file. Before we embark on our journey with R, let us take a look at the contents of this reduced WHO data file. It has the same 202 rows as above, but only 15 columns, which makes it simpler to understand. This reduced WHO data file is available for download as WHOReduced.csv, at the top of this page. The columns of this reduced WHO data file are: Country, CountryID, Continent, AdultLiteracyRate, GNI, Population, PopGrowth, UrbanPop, BPLPop, MedianAge, Above60, Below15, FertilityRate, HospitalBeds, NumberOfPhysicians.
We use this reduced data set to understand some nuances of the data, and use the R programming language for this.
R is a software environment for data analysis, statistical computing and graphics. It is also a programming language, which enables one to code a set of steps to achieve a statistical or machine learning outcome. R is open-source. Though there are many choices for data analysis software, such as SAS, Stata, SPSS, Microsoft Excel, MATLAB, Minitab and pandas, we will be using R in this article.
In the remainder of this article, we pose a series of questions about the data and answer them using R commands. This will serve as our introduction to R.
- How do I read in the CSV data into R?
Data from a CSV file can be loaded into R by reading it into a data frame. Before getting into data frames, we need to know what a vector is. A vector is a sequence of values of the same type, such as numbers or character strings, stored as a single object. For example, the R command v = c(1, 2, 3, 4, 5) creates a vector named v, and this vector has five elements, the numbers 1, 2, 3, 4, 5. Characters and numbers should not be combined in the same vector; if they are, R converts the numbers to character strings. Two or more vectors of the same length can be combined into a data frame, which is an important data structure in R. If we consider two vectors v1 = c(1, 2, 3, 4, 5) and v2 = c(100, 200, 300, 400, 500), then these two can be combined into a single data frame which has five rows and two columns, with the first column being the first vector v1, and the second column being the second vector v2. In its simplest form, a data frame can be construed as a matrix. However, a data frame is more general than a matrix, since its columns can hold different data types, as we see below.
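The ideas in the paragraph above can be tried out directly at the console. The following is a minimal sketch: v1 and v2 are the vectors from the text, while the names df, mixed and label, and the letters stored in label, are made up for illustration.

```r
# Two numeric vectors of equal length (the ones used in the text)
v1 = c(1, 2, 3, 4, 5)
v2 = c(100, 200, 300, 400, 500)

# Combine them column-wise into a data frame
df = data.frame(v1, v2)
dim(df)          # 5 rows, 2 columns

# Mixing numbers and characters coerces every element to a character string
mixed = c(1, "a", 3)
class(mixed)     # "character"

# Unlike a matrix, a data frame can hold columns of different types
# (the labels below are hypothetical)
df$label = c("a", "b", "c", "d", "e")
str(df)
```

The last call, str(), is discussed in more detail later in this article.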
Since we are working with a CSV file, we can use a simple R command that reads the entire CSV file into a single data frame. Before executing this command, you will have to change the working directory to the folder where the file WHOReduced.csv is located, using either the R menu or the setwd() command.
> who = read.csv("WHOReduced.csv")
This command loads the entire CSV file into the data frame named who. Type the command at the R console (the leading > is the console prompt and is not typed) and hit Enter to run it.
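To experiment with read.csv() without the WHO file at hand, one can write a tiny stand-in CSV to a temporary file and read it back. This is only a sketch: the country names and values below are invented and are not taken from WHOReduced.csv, and the names path and toy are hypothetical.

```r
# Write a small made-up CSV to a temporary file
path = tempfile(fileext = ".csv")
writeLines(c("Country,Continent,UrbanPop",
             "Alpha,1,55",
             "Beta,2,",
             "Gamma,2,71"), path)

# stringsAsFactors = TRUE turns text columns into factors on any R version
toy = read.csv(path, stringsAsFactors = TRUE)

nrow(toy)             # number of rows read
class(toy$Country)    # "factor"
is.na(toy$UrbanPop)   # the empty field in the second row is read as NA
```

The same call works for the real file once the working directory points at the folder that contains it.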
Next, we take a look at the structure of this data.
- How do I start understanding the structure of this data?
R has a useful command called str, which enables one to understand the structure of the data loaded into a data frame.
> str(who)
Upon running this command, the R console prints the following output.
Looking at this output, we learn that there are 202 observations of 15 variables. What this means is that there are 202 rows, with each row having 15 variables. The 15 different variables in this data frame are Country, CountryID, Continent, AdultLiteracyRate, GNI, Population, PopGrowth, UrbanPop, BPLPop, MedianAge, Above60, Below15, FertilityRate, HospitalBeds, NumberOfPhysicians. Some of these variables are of type int, containing integer values. Some others are of type num, containing floating point values. The first variable, Country, is of type Factor, which is a categorical variable. The output shows that Country has 202 categories, also known as levels, with each level being a unique country name. (In R 4.0.0 and later, read.csv() reads text columns as character vectors by default, so Country is shown as chr unless stringsAsFactors = TRUE is passed to read.csv().)
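To see what a factor and its levels look like on a small scale, here is a brief sketch with a hypothetical vector of country names (the variable country and its values are made up for illustration).

```r
# Hypothetical country names stored as a factor; "Beta" appears twice
country = factor(c("Alpha", "Beta", "Gamma", "Beta"))

nlevels(country)      # number of distinct categories
levels(country)       # the category labels, in sorted order
as.integer(country)   # the integer codes R stores for each element
str(country)          # compact description, as str() gives for a data frame
```

In the WHO data every country appears exactly once, which is why the number of levels equals the number of rows.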
A small note on the continent labeling in this file, which is shown in the following table. Strictly speaking, these are not the names of continents, but we will use them for the purpose of this article.
Continent Label | Continent Name
1 | Eastern Mediterranean
2 | Europe
3 | Africa
4 | North America
5 | South America
6 | Western Pacific
7 | Asia
Next, we take a look at the summary of this data.
- How do I get a summary of this data?
R has another useful command called summary, which gives a summary of the data loaded into a data frame.
> summary(who)
Upon running this command, the R console prints the following output:
Looking at this output, we find that R has printed a summary of all the 15 different variables within this data frame. For variables which have numerical values, R reports the minimum value, the first quartile (the value below which 25 percent of the values fall), the median (the value below which 50 percent of the values fall), the mean, the third quartile (the value below which 75 percent of the values fall), and the maximum value. For example, for the variable MedianAge, these values are Minimum = 15.00, First quartile = 20.00, Median = 25.00, Mean = 26.74, Third quartile = 35.00, and Max = 43. We also see an entry called NA's : 23 corresponding to the variable MedianAge. This indicates that there are 23 entries for which the median age is not listed in the data set, and hence in the data frame. These 23 values are not available. In a similar manner, the summary of all the other 13 integer/numerical variables can be understood. For the factor variable Country, the summary lists the first six entries.
The R commands str() and summary() are very helpful for getting information on the structure of the data and the summary of the data, respectively.
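The way summary() reports missing values can be checked on a single made-up column. In this sketch, the vector age and its values are hypothetical and only loosely echo the MedianAge figures quoted above.

```r
# Made-up ages with two missing entries
age = c(15, 20, 25, NA, 35, 43, NA)

summary(age)              # minimum, quartiles, median, mean, maximum and NA count
sum(is.na(age))           # number of missing values: 2
mean(age)                 # NA, because missing values propagate
mean(age, na.rm = TRUE)   # 27.6, the mean of the five available values
```

The na.rm = TRUE argument appears again later in this article, when averages are computed per continent.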
Next, we pose some interesting questions on this data, and seek their answers.
- Which countries have the minimum and maximum percentage of population under 15 years of age?
To answer this question for the minimum, we first need to identify the index (row number) of the country. The R command for this is:
> which.min(who$Below15)
Upon running this command, the R console outputs the answer as 4. Now, the country name is found using the following command:
> who$Country[4]
The answer is Andorra. The above two commands can be combined into a single command as:
> who$Country[which.min(who$Below15)]
This yields the same answer, Andorra, as the country which has the minimum percentage of population under 15 years of age.
Similarly, the following command can be used to find the country which has the maximum of this number:
> who$Country[which.max(who$Below15)]
The answer to this is Uganda.
- Which countries have the minimum and maximum percentage of population over 60 years of age?
To answer these questions, as before, we type the command:
> who$Country[which.min(who$Above60)]
This yields United Arab Emirates as the country which has the minimum percentage of population above 60 years of age.
Similarly, the following command can be used to find the country which has the maximum of this number:
> who$Country[which.max(who$Above60)]
The answer to this is Japan.
- Is there a country whose entire population is urban?
Looking at the summary of the data, we see that the maximum value of the variable UrbanPop is 100. To find a country whose entire population is urban, we type, as before, the command:
> who$Country[which.max(who$UrbanPop)]
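This yields the answer Monaco. Note that which.max() returns only the first index at which the maximum occurs, so any other countries with an UrbanPop of 100 are not listed by this command.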
Similarly, the following command is used to find the country which has the minimum value for this number:
> who$Country[which.min(who$UrbanPop)]
The answer to this is Burundi.
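Since which.max() and which.min() report only the first matching position, ties need a slightly different approach. The following sketch uses an invented data frame (the name toy, the country names and the values are all hypothetical) to show one way of listing every row that attains the maximum.

```r
# Toy data (hypothetical values): two countries share the maximum
toy = data.frame(Country  = c("Alpha", "Beta", "Gamma", "Delta"),
                 UrbanPop = c(100, 64, 100, NA))

# which.max() reports only the first position of the maximum
toy$Country[which.max(toy$UrbanPop)]

# Comparing against the maximum finds every matching position;
# which() drops the rows where the comparison is NA
top = toy$Country[which(toy$UrbanPop == max(toy$UrbanPop, na.rm = TRUE))]
top
```

Applied to the who data frame, the same pattern would list all countries whose UrbanPop equals the maximum, assuming more than one exists.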
- What does a plot of GNI vs Fertility Rate look like?
To answer this question, we plot the data using the command:
> plot(who$GNI, who$FertilityRate)
This yields the plot shown below:
We see that this is largely a triangular plot, implying that countries with a lower GNI tend to have a higher fertility rate, and vice versa. However, there are some countries that have both a high GNI and a high fertility rate. We investigate this question next.
- Which countries have a high GNI and a high Fertility Rate?
To answer this question, we take a subset of the original data as follows:
> HighVals = subset(who, GNI > 10000 & FertilityRate > 2.5)
This creates a subset of the data where the GNI is greater than 10000 and Fertility Rate is greater than 2.5. To find out the number of countries which fall in this category, we use the command:
> nrow(HighVals)
This gives the output 9, indicating that there are 9 such countries. To identify the countries which fall in this category, we use the command:
> HighVals[c("Country", "GNI", "FertilityRate")]
This gives the result:
This lists the 9 countries along with their GNI and Fertility Rate values.
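The subset() call above can be compared with plain bracket indexing on a small example. This is a sketch on invented data: the data frame toy and all of its values are hypothetical, and only the column names GNI and FertilityRate and the thresholds come from the article.

```r
# Toy data with made-up values, including one missing GNI
toy = data.frame(Country       = c("Alpha", "Beta", "Gamma", "Delta", "Epsilon"),
                 GNI           = c(12000, 800, 25000, NA, 40000),
                 FertilityRate = c(3.1, 5.2, 1.6, 2.9, 2.8))

# subset() keeps rows where the condition is TRUE and drops NA results
high = subset(toy, GNI > 10000 & FertilityRate > 2.5)

# Equivalent bracket form; which() is what drops the NA rows here
high2 = toy[which(toy$GNI > 10000 & toy$FertilityRate > 2.5), ]

nrow(high)   # 2 rows match

# Show the matches sorted by decreasing GNI
high[order(-high$GNI), c("Country", "GNI", "FertilityRate")]
```

Sorting the result, as in the last line, can make a longer list such as the 9 countries above easier to read.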
- Which countries have the highest and lowest number of doctors per person?
To answer this question, we add a new column to the original data frame using the command:
> who$DrsPop = who$NumberOfPhysicians / who$Population
Here, the ratio of the variable NumberOfPhysicians to the variable Population is taken, and stored as a separate vector DrsPop within the same data frame who. To answer the above question, we use the commands:
> who$Country[which.min(who$DrsPop)]
> who$Country[which.max(who$DrsPop)]
The answers given by these two commands are, respectively, Malawi (lowest number of physicians per person) and San Marino (highest number of physicians per person).
A look at the structure of the data using the str() command will now show 202 observations of 16 variables, the 16th being the newly added DrsPop.
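Adding a derived column works the same way on any data frame. The sketch below repeats the steps on invented data; the data frame toy and its values are hypothetical, while the column names mirror those used in the article.

```r
# Toy data with made-up values, including one missing physician count
toy = data.frame(Country            = c("Alpha", "Beta", "Gamma"),
                 NumberOfPhysicians = c(50, 400, NA),
                 Population         = c(1000, 2000, 500))

# Element-wise division creates the new column; an NA input gives an NA ratio
toy$DrsPop = toy$NumberOfPhysicians / toy$Population

ncol(toy)                              # 4, one more column than before
toy$Country[which.min(toy$DrsPop)]     # Alpha (ratio 0.05)
toy$Country[which.max(toy$DrsPop)]     # Beta (ratio 0.2)
```

Both which.min() and which.max() skip the NA ratio, so a country with missing data is never reported as an extreme.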
- What does the histogram of the number of Hospital Beds look like?
To answer this question, we plot the histogram using the command:
> hist(who$HospitalBeds)
This produces the histogram shown in the following figure:
We see that this histogram is highly skewed, with a large number of countries having a low value for the number of hospital beds.
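The numbers behind a histogram can also be inspected without drawing it. The following is a small sketch with a made-up vector (beds and its values are hypothetical) that asks hist() to return the bins and counts instead of plotting them.

```r
# Made-up, right-skewed values with one missing entry
beds = c(2, 3, 3, 5, 8, 9, 12, 15, 30, 75, NA)

# plot = FALSE returns the histogram object instead of drawing it;
# breaks is only a suggestion for the number of bins
h = hist(beds, breaks = 10, plot = FALSE)

h$breaks        # bin boundaries chosen by R
h$counts        # number of values falling in each bin
sum(h$counts)   # 10: the missing value is left out
```

A skewed data set like this one piles most of its counts into the first few bins, which is the pattern seen in the HospitalBeds histogram.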
- What does a box plot of Population Growth against Continent look like?
To answer this question, we draw the box plot using the command:
> boxplot(who$PopGrowth ~ who$Continent,
xlab = "Continent", ylab = "Population Growth")
This produces the box plot shown in the following figure:
From this box plot, we see that there are some continents where the population growth rate is indeed negative. There are some continents where the interquartile range (the vertical height of the box) is quite small, indicating that there is not much of a difference between the population growth rates across the continent. A point that lies more than 1.5 times the interquartile range below the first quartile or above the third quartile is treated as an outlier, and is shown as a circle in the above plot.
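The outlier rule used by boxplot() can be checked numerically with boxplot.stats(), which computes the same statistics without drawing anything. This is a sketch on invented growth rates (the vector growth is hypothetical); note that R uses hinges, which are close to but not always identical to the quartiles.

```r
# Made-up growth rates with one unusually large value
growth = c(-0.5, 0.8, 1.1, 1.3, 1.6, 2.0, 2.4, 9.0)

# coef = 1.5 is the default multiplier of the box height
s = boxplot.stats(growth)

s$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out     # values drawn as circles in the box plot: 9
```

Here the hinges are 0.95 and 2.2, so the box height is 1.25 and only values below -0.925 or above 4.075 are flagged.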
- How are the values of the Above60 variable distributed across continents?
To answer this question, we use the table command as follows:
> table(who$Above60, who$Continent)
This produces the table shown below:
For example, this table shows that 11 countries in Continent 2 (Europe) have 22 percent of their population above 60 years of age.
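A two-way table is easier to read on a handful of rows. The sketch below builds one from invented data (the data frame toy, the name tab and all values are hypothetical) and looks up a single cell by its row and column labels.

```r
# Toy data with made-up values, including one missing Above60 entry
toy = data.frame(Continent = c(1, 2, 2, 2, 3, 3),
                 Above60   = c(5, 22, 22, 18, 5, NA))

# Rows are the distinct Above60 values, columns are the continents;
# rows with NA are left out by default
tab = table(toy$Above60, toy$Continent)
tab

# Number of countries in continent 2 with an Above60 value of 22
tab["22", "2"]
```

Each cell counts the countries that share one Above60 value within one continent, which is how the figure of 11 for Europe is read off the full table.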
- Can we find out the average urban population for each Continent?
To answer this question, we use the tapply command as follows:
> tapply(who$UrbanPop, who$Continent, mean, na.rm=TRUE)
The tapply(arg1, arg2, arg3) command takes three arguments: it groups arg1 by arg2 and applies the function arg3 to each group. In this case, the tapply command groups the variable UrbanPop by the variable Continent and computes the mean of each group. The parameter na.rm=TRUE in the above command tells R to exclude the NA values from the computation.
This produces the table below:
We see that the mean urban population is highest in Continent 1 (Eastern Mediterranean), though Continent 4 (North America) is not far behind.
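The grouping done by tapply(), and the effect of na.rm=TRUE, can be verified by hand on a few values. In this sketch the vectors urban and continent are invented for illustration.

```r
# Made-up urban population percentages and their continent labels
urban     = c(80, 60, NA, 40, 50)
continent = c(1, 1, 2, 2, 2)

# Without na.rm, one missing value makes the whole group mean NA
tapply(urban, continent, mean)

# With na.rm = TRUE, the mean uses only the available values:
# continent 1 gives (80 + 60) / 2 = 70, continent 2 gives (40 + 50) / 2 = 45
m = tapply(urban, continent, mean, na.rm = TRUE)
m
```

The result is a named vector with one entry per group, so m["2"] picks out the mean for continent 2.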
- Can we find out the average population growth for each Continent?
To answer this question, we again use the tapply command as follows:
> tapply(who$PopGrowth, who$Continent, mean, na.rm=TRUE)
This produces the table below.
We see that Continent 3 (Africa) has the highest average population growth, whereas Continent 2 (Europe) has the lowest.
In this article, we were introduced to exploring data in a CSV file using simple commands in R. The example file we used was WHOReduced.csv, which is a reduced version of the WHO data as of 2017. I have attempted to give an introduction to R by posing a set of simple but important questions on the data. We were introduced to the commands read.csv(), str(), summary(), which.min(), which.max(), plot(), subset(), nrow(), hist(), boxplot(), table(), tapply(). I plan to continue writing articles on this in future, and cover other important analytics tools using the R language.
Meanwhile, I urge you to load your own CSV files, try out the commands listed above, and let me know your feedback on this.