Sunday, August 7, 2016

Italian Centenarians

Becoming a centenarian is getting easier, in a way I didn't expect. You don't believe it? Have a look at this:

The data come from ISTAT, the Italian National Institute of Statistics, and show how the number of centenarians in Italy has boomed over the last few decades.

The cool thing about it, especially if you are not Italian, is not so much the information you get as the way I created the picture: starting from the raw data, using Python and the matplotlib library.

Here is the source code:
import matplotlib.pyplot as plt # 1

# 2
years = [
    1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991,
    1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001,
    2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
    2012, 2013, 2014, 2015, 2016
]
population = [
    302, 229, 345, 368, 482, 649, 703, 922, 960, 1120,
    2852, 3053, 3273, 3562, 3690, 4015, 4377, 4516, 4765, 5435,
    6153, 7265, 7614, 8797, 9470, 9470, 10728, 11319, 12243, 13544,
    15029, 16390, 17884, 19095, 18765
]

plt.plot(years, population) # 3
plt.xlabel('Year') # 4
plt.xlim(years[0], years[-1]) # 5
plt.ylabel('Population')
plt.title('Italian Centenarians') # 6
plt.fill_between(years, population, 0, color='blue', alpha=0.5) # 7

plt.show() # 8
1. This is the standard way to import the pyplot sub-package. No one wants to carry around such a long name, so the plt moniker is used instead.
2. If you don't believe me, you can get the data yourself from ISTAT. The demographic figures come from demo.istat.it.
3. I tell pyplot that I want to plot my data as a continuous line, putting years on the X axis and population on the Y axis.
4. Help the viewer understand what's going on by showing a descriptive label.
5. Trim the picture so that it starts on the left with the first year and ends on the right with the last one. Pyplot likes to leave some room around the data.
6. It is always a good idea to give a name to the picture.
7. A nice touch: a bit of color to improve readability.
8. After everything has been set up on the plotting area, it is time to show it. This is the moment the image is actually generated.
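As a side note, if you want the chart saved to a file instead of (or besides) popping up in a window, matplotlib also provides savefig(). A minimal sketch, where centenarians.png is just a made-up name for a file in the current directory, is to call it right before show():
plt.savefig('centenarians.png') # write the figure to a PNG file on disk
plt.show()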

Friday, August 5, 2016

Who is going to have dessert?

In the previous post we saw how easy it is to calculate the BMI for a bunch of people when using numpy arrays. Let's go a step further and use those data.

Given a numpy array of BMIs, let's check which people are fit enough to afford a dessert. Basically, what we want is a boolean array where True means that the associated person has a BMI lower than 25. It is easier to write the Python code than to explain it:
   import numpy

   # ...
   # given bmis, a numpy array containing BMI for each person under surveillance

   dessert = bmis < 25
Now, if we want to check whether Person One could get a slice of cake, we can just check the corresponding element in the dessert numpy array:
   if dessert[1]:
      print('Enjoy the dessert!')
   else:
      print('No dessert for you!')
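A nice side effect is that a boolean array like dessert can also be used directly on the original data. A minimal sketch, assuming the same bmis array as above (the variable names here are just illustrative):
   light_eaters = bmis[dessert]  # the BMI values below 25
   how_many = dessert.sum()      # True counts as 1, so this is how many people get dessert
Both are standard numpy features: a boolean mask selects elements, and summing booleans counts the True values.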

Calculating Body Mass Index

The Body Mass Index is a good tool to estimate the body fat for a person from their height and weight. It is quite easy to calculate:
   bmi = weight / height ** 2
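Plugging in some plausible numbers, say 65 kilos and 1.78 meters (the same first person used in the numpy example below), it would be:
   bmi = 65 / 1.78 ** 2  # roughly 20.5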
Things are getting a bit more complicated if I want to get the BMI for a bunch of people. I'd like to write something like this:
   bmis = weights / heights ** 2
The problem is that I can't do that with standard Python lists, since there is no overload of the arithmetic operators that works on them. I would have to write a piece of code that works on the single components and puts the results in another list.
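Just to be concrete, a sketch of that plain-list approach could be a comprehension over the zipped lists, nothing more than this:
   bmis = [w / h ** 2 for w, h in zip(weights, heights)]
It works, but every new operation needs this kind of manual element-by-element handling.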

Luckily I can use the numpy library, designed expressly to provide better numeric capabilities to Python. It is just a matter of using its array class. Here is an example:
   import numpy

   heights = numpy.array([1.78, 1.52, 1.63, 1.68])
   weights = numpy.array([65, 82, 63, 65])
   bmis = weights / heights ** 2

   print(bmis)
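Running it prints the four BMIs in one shot, roughly 20.5, 35.5, 23.7 and 23.0, each one calculated element-wise without any explicit loop.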

Monday, April 18, 2016

Saving R plot to png

On Linux, if we want to save a plot to a file, we have to specify, before plotting, which file and which format we want to write to.

Say that I want to generate a PNG named myPlot.png in my current directory:
png('myPlot.png')
plot(df$var1, df$var2)
dev.off()
The last line tells R that I want to switch back to the default device, meaning X11, which shows the result on screen.

Sunday, April 17, 2016

Simple data manipulation on a R data frame

Given a data frame df with a variable named Var, to extract the vector containing all the observations for that variable I use the dollar sign to connect the data frame to its variable:
df$Var
Now, for that variable I can get its mean:
mean(df$Var)
Standard deviation:
sd(df$Var)
Summary (that works also for a complete data frame):
summary(df$Var)
Which observation has the minimum value for the passed variable:
which.min(df$Var)
Which observation has the maximum value for the passed variable:
which.max(df$Var)

It is easy to generate a scatter plot that correlates two variables of a data frame:
plot(df$Var1, df$Var2)
Here Var1 would get the X axis while Var2 the Y axis.

We can extract a subset from a data frame by evaluating conditions on one or more variables:
sub = subset(df, Var1 > 100 & Var2 < 50)
Notice that the AND logical operator is an ampersand. To see how many observations are in this subset (or in any data frame) we can use the nrow function:
nrow(sub)

To generate a histogram in R we use the hist() function. The boxplot() function generates box plots, which are quite useful to see the statistical range of a variable.


Reading and writing CSV files in R

Before loading a file in R it is often useful to change the working directory in the environment; this is done with:
setwd('pathname')
If you are in doubt about what your current working directory is, just print it:
getwd()
Reading from a CSV file to a data frame is pretty simple:
df = read.csv('path/to/file.csv')
Now we can get the structure of the dataframe:
str(df)
It gives us information on the number of observations (rows) and variables (columns), the names of the variables, a few of their values and, when a variable is detected as a 'factor', also the number of 'levels' on which it is structured.
Another useful function is:
summary(df)
It tries to provide a useful summary for each variable, giving the levels in case of a factor, or a few statistical measures otherwise (min, max, mean, median, first and third quartile).

We can create a subset from a data frame by selecting a specific value for a variable, like this:
sub = subset(df, MyVariable == 'a value')
Then we can save this subset to a CSV file:
write.csv(sub, 'path/to/subFile.csv')

Localization problems in R

Just a hint. If you see your R environment behaving strangely, it could be a problem of localization.

Look for the Sys.getlocale function in the R documentation; a handy online copy is hosted courtesy of the Zurich polytechnic.

In my case, to solve the problem I called:
Sys.setlocale("LC_ALL", "C")

Sunday, March 13, 2016

Coefficient of variation

Write a python function that calculates the coefficient of variation for a list of numbers.

This is a trivial case if we can use the numpy library. However, let's do it without numpy first. Even better, I'll start with a dummy that always returns zero:
def cv(data):
    """
    :param data: a list of numbers
    :returns: its coefficient of variation, or NaN.
    :rtype: float
    """
    return 0.0
Now I can write a few test cases for it:
import math
import unittest

class CV(unittest.TestCase):
    def test_none(self):
        coll = None
        self.assertTrue(math.isnan(cv(coll)))

    def test_empty(self):
        self.assertTrue(math.isnan(cv([])))

    def test_zero_mean(self):
        coll = [1, 2, 0, -1, -2]
        self.assertTrue(math.isnan(cv(coll)))

    def test_std_var_0(self):
        coll = [42, 42, 42]
        self.assertEqual(cv(coll), 0)

    def test_1(self):
        coll = [0, 0, 6, 6]
        self.assertEqual(cv(coll), 1)

    def test_dot5(self):
        coll = [10, 4, 12, 15, 20, 5]
        self.assertAlmostEqual(cv(coll), 0.503, delta=0.001)
Following the line of the previous post, I have decided that my function should return NaN if the caller passes in None or an empty list. I captured this requirement in the first two test cases, test_none and test_empty.
The other test cases should look quite clear once we know what the coefficient of variation is. In a few words, it is the standard deviation of a population divided by its mean. This implies that we can't calculate it when the mean is zero; in that case the function should return NaN, as shown by the test_zero_mean test case.
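As a quick sanity check on this definition, take the data of test_1, [0, 0, 6, 6]: the mean is 3, each value is exactly 3 away from it, so the population standard deviation is also 3 and the coefficient of variation is 3 / 3 = 1, just what the test expects.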

That said, this implementation should look quite straightforward:
import math

def cv(data):
    """
    :param data: a list of numbers
    :returns: its coefficient of variation, or NaN.
    :rtype: float
    """
    # 1
    if not data:
        return float('NaN')
    # 2
    mean = sum(data) / float(len(data))
    if mean == 0:
        return float('NaN')
    # 3
    sq_sum = 0.0
    for d in data:
        sq_sum += (d - mean) ** 2
    stddev = math.sqrt(sq_sum / len(data))
    # 4
    return stddev / mean
1. When the user passes a None or an empty list, NaN is returned.
2. Calculate the mean. If it is zero, NaN is returned.
3. Calculate the standard deviation.
4. Return the coefficient of variation.
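A quick manual check, reusing the data from test_dot5, could look like this:
print(cv([10, 4, 12, 15, 20, 5]))  # prints roughly 0.503, as the test expects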

As I hinted above, using numpy makes the code so much cleaner:
import numpy

def cv_(data):
    """
    :param data: a list of numbers
    :returns: its coefficient of variation, or NaN.
    :rtype: float
    """
    if not data:
        return float('NaN')

    mean = numpy.mean(data)
    if mean == 0.0:
        return float('NaN')

    return numpy.std(data) / mean
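Note that numpy.std() computes the population standard deviation by default (its ddof parameter is 0), so this version agrees with the hand-rolled loop above.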
Full code and test cases are on github.

Saturday, March 12, 2016

Strings standard deviation

Implement a function that gets in input a list of strings and gives back the standard deviation of the lengths of the strings.

I have found this problem while following the Introduction to Computational Thinking and Data Science MIT course hosted by edX.

As usual, I wrote a first implementation for the function, to be extended after setting up a series of test cases:
def standard_deviation(strings):
    """
    :param strings: a list of strings
    :returns: the standard deviation of the lengths of the strings, or NaN.
    :rtype: float
    """
    return 0.0
The function docstring specifies what I expect as input parameter and what the caller should expect to get as output.
For the moment, my no-brainer implementation always returns a floating point zero.

Then I wrote a number of test cases, many of them given informally as part of the problem:
import math
import unittest

class StdDevTest(unittest.TestCase):
    def test_none(self):
        self.assertTrue(math.isnan(standard_deviation(None)))

    def test_empty(self):
        strings = []
        self.assertTrue(math.isnan(standard_deviation(strings)))

    def test_1(self):
        strings = ['a', 'z', 'p']
        self.assertEqual(standard_deviation(strings), 0)

    def test_2(self):
        strings = ['apples', 'oranges', 'kiwis', 'pineapples']
        self.assertAlmostEqual(standard_deviation(strings), 1.8708, delta=0.0001)

    def test_3(self):
        strings = ['mftbycwac', 'rhqbqawnfl', 'clgzh', 'ilqy', 'ckizvsgpnhlx', 'kziugguuzvqarw', 'xqewrmvu', 'ktojfqkailswnb']
        self.assertEqual(standard_deviation(strings), 3.5355339059327378)

    def test_4(self):
        strings = ['zgbljwombl', 'slkpmjqmjaaw', 'nddl', 'irlzne', '', 'poieczhxoqom', 'waqyiipysskxk', 'dloxspi', 'sk']
        self.assertEqual(standard_deviation(strings), 4.447221354708778)

    def test_bad_data(self):
        with self.assertRaises(TypeError):
            standard_deviation([1, 2, 3])
test_none, test_empty: What the function should do when None is passed in is not specified by the problem. I decided to make it behave as if it got an empty list of strings. Notice the use of the isnan function from the standard math library.
test_1: If all the strings have the same size, a standard deviation of zero should be returned.
test_2: The interesting point in this test case is that the problem definition gives us an approximate expected result value (I guess it was just a bit of sloppiness). For this reason I used the "almost equal" assertion to check it.
test_3, test_4: Vanilla tests. Just check everything goes as expected.
test_bad_data: We are not required to behave politely if the user feeds us garbage. Here I stress the fact that, when a list of numbers is passed in, our function is expected to react by throwing a TypeError exception.

Running these test cases on the current implementation of standard_deviation() should give a bunch of failures and a single success, on test_1. We can do better, but first we have to understand the problem requirements better.

Checking what the standard deviation is, we see that we have to calculate the mean of the elements in the collection, subtract it from each element, square each difference, sum all these values, divide the sum by the size of the collection, and finally extract the square root.

Converting this into Python, you should get something like this:
lengths = [len(s) for s in strings] # 1
mean = math.fsum(lengths) / len(lengths) # 2

# 3
sq_sum = 0.0 
for l in lengths:
    sq_sum += (l - mean) ** 2

return math.sqrt(sq_sum / len(lengths)) # 4
1. Let's prepare the input data, converting the strings to their lengths, which is what really matters to us.
2. Calculating the mean is easy. Just be careful to get it as a floating point number; to achieve that, I used the fsum() math function instead of the plain sum(). An explicit cast to float would have worked just as well.
3. Now, the trickiest part of the algorithm. I feel that if I were more expert in Python I could have written these three lines more succinctly (see the sketch after this list). Anyway, it is just a matter of summing the squared difference of each string length against the mean.
4. Finally, I just have to divide by the size of the collection, extract the square root, and return the result to the caller.
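For the record, the three lines of point 3 could be squeezed into one by passing a generator expression to fsum(); a possible, equivalent form would be:
sq_sum = math.fsum((l - mean) ** 2 for l in lengths)  # sum of squared deviations from the mean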

Just one thing more. Before starting to calculate the standard deviation, I'd better check the input against empty collections. I could do that in this way:
if not strings:
    return float('NaN')
This would catch both the case of an empty list of strings and a None passed by mistake.

Full python 2.7 code with test cases is available on github.