Sunday, March 13, 2016

Coefficient of variation

Write a python function that calculates the coefficient of variation for a list of numbers.

This is a trivial case, if we can use the numpy library. However let's do it without first. Even better, I start with a dummy that always returns zero:
def cv(data):
    """
    :param data a list of numbers
    :returns its coefficient of variation, or NaN.
    :rtype float
    """
    return 0.0
Now I can write a few test cases for it:
class CV(unittest.TestCase):
    def test_none(self):
        coll = None
        self.assertTrue(math.isnan(cv(coll)))

    def test_empty(self):
        self.assertTrue(math.isnan(cv([])))

    def test_zero_mean(self):
        coll = [1, 2, 0, -1, -2]
        self.assertTrue(math.isnan(cv(coll)))

    def test_std_var_0(self):
        coll = [42, 42, 42]
        self.assertEqual(cv(coll), 0)

    def test_1(self):
        coll = [0, 0, 6, 6]
        self.assertEqual(cv(coll), 1)

    def test_dot5(self):
        coll = [10, 4, 12, 15, 20, 5]
        self.assertAlmostEqual(cv(coll), 0.503, delta=0.001)
Following the line of the previous post, I have decided that my function should return NaN if the caller passes a None or an empty list in. I remarked this requisite with the first two test cases, test_none and test_empty.
The other test cases should be look quite clear, once we know what the coefficient of variation is. In a few words, it is the standard deviation of a population divided by its mean. This implies that we can't calculate it when the mean is zero. In that case the function should return NaN, as showed by test_zero_mean test case.

Said that, this implementation should look quite straightforward:
def cv(data):
    """
    :param data a list of numbers
    :returns its coefficient of variation, or NaN.
    :rtype float
    """
#1
    if not data:
        return float('NaN')  
#2
    mean = sum(data) / float(len(data))
    if mean == 0:
        return float('NaN')
#3 
    sq_sum = 0.0
    for d in data:
        sq_sum += (d - mean) ** 2
    stddev = math.sqrt(sq_sum / len(data))
#4
    return stddev / mean
1. When the user passes a None or an empty list, NaN is returned.
2. Calculate the mean. If it is zero, NaN is returned.
3. Calculate the standard deviation.
4. Return the coefficient of variation.

As I hinted above, using numpy makes the code so much cleaner:
def cv_(data):
    """
    :param data a list of numbers
    :returns its coefficient of variation, or NaN.
    :rtype float
    """
    if not data:
        return float('NaN')

    mean = numpy.mean(data)
    if mean == 0.0:
        return float('NaN')

    return numpy.std(data) / mean
Full code and test cases are on github.

Saturday, March 12, 2016

Strings standard deviation

Implement a function that gets in input a list of strings and gives back the standard deviation of the lengths of the strings.

I have found this problem while following the Introduction to Computational Thinking and Data Science MIT course hosted by edX.

As usual, I wrote a first implementation for the function, to be extended after setting up a series of test cases:
def standard_deviation(strings):
    """
    :param strings: a list of strings
    :returns the standard deviation of the lengths of the strings, or NaN.
    :rtype float
    """
    return 0.0
The function docstring specifies what I expect as input parameter and what the caller should expect to get as output.
For the moment, my no-brainer implementation always returns a floating point zero.

Then I wrote a number of test cases, many of them given informally as part of the problem:
class StdDevTest(unittest.TestCase):
    def test_none(self):
        self.assertTrue(math.isnan(standard_deviation(None)))

    def test_empty(self):
        strings = []
        self.assertTrue(math.isnan(standard_deviation(strings)))

    def test_1(self):
        strings = ['a', 'z', 'p']
        self.assertEqual(standard_deviation(strings), 0)

    def test_2(self):
        strings = ['apples', 'oranges', 'kiwis', 'pineapples']
        self.assertAlmostEqual(standard_deviation(strings), 1.8708, delta=0.0001)

    def test_3(self):
        strings = ['mftbycwac', 'rhqbqawnfl', 'clgzh', 'ilqy', 'ckizvsgpnhlx', 'kziugguuzvqarw', 'xqewrmvu', 'ktojfqkailswnb']
        self.assertEqual(standard_deviation(strings), 3.5355339059327378)

    def test_4(self):
        strings = ['zgbljwombl', 'slkpmjqmjaaw', 'nddl', 'irlzne', '', 'poieczhxoqom', 'waqyiipysskxk', 'dloxspi', 'sk']
        self.assertEqual(standard_deviation(strings), 4.447221354708778)

    def test_bad_data(self):
        with self.assertRaises(TypeError):
            standard_deviation([1, 2, 3])
test_none, test_empty: What the function should do in case of None passed as input parameter is not specified by the problem. I decided to let it behaves as it gets an empty list of strings in. Notice the use of the function isnan from the standard math library.
test_1: If all the strings have the same size, a standard deviation of zero should be returned.
test_2: The interesting point in this test case is that in the problem definition we are given an approximated expected result value (I guess it was just a bit of sloppiness). For this reason I used the "almost equal" assertion to check it.
test_3, test_4: Vanilla tests. Just check everything goes as expected.
test_bad_date: We are not required to behave politely if the user gives us garbage in. Here I stress the fact that in case of a list of numbers in, our function is expected to react throwing a TypeError exception.

Running these test cases on the current implementation for standard_deviation() should give a bunch of failures and a single success on test_1. We could do better. But we have to understand better the problem requisites.

Checking what standard deviation is, we see how we have to calculate the mean of the elements in the collection, then subtract it from each element, square the result, sum all these values, divide them by the size of the collection, and finally extract the square root.

Converting this in Python, you should get something like this:
lengths = [len(s) for s in strings] # 1
mean = math.fsum(lengths) / len(lengths) # 2

# 3
sq_sum = 0.0 
for l in lengths:
    sq_sum += (l - mean) ** 2

return math.sqrt(sq_sum / len(lengths)) # 4
1. Let's prepare the data in input, converting the strings to their sizes, that is what really matter to us.
2. Calculating the mean it's easy. Be careful only to get it as a floating point number. To achieve it, I used the fsum() math function instead of the plain sum(). An explicit cast to float would have worked equivalently.
3. Now, the trickiest part of the algorithm. I feel like if I were more expert in Python I could have written these three lines more succinctly. Anyway, it is just a matter of summing the squared difference of each string lengths against the mean.
4. Finally, I just have to divide by the size of the collection, extract the square root, and returning the result to the caller.

Just one thing more. Before starting to calculate the standard deviation, I'd better check the input against empty collections. I could do that in this way:
if not strings:
    return float('NaN')
This would catch both the case of an empty list of strings and a None passed by mistake.

Full python 2.7 code with test cases is available on github.