Saturday, March 12, 2016

Strings standard deviation

Implement a function that gets in input a list of strings and gives back the standard deviation of the lengths of the strings.

I have found this problem while following the Introduction to Computational Thinking and Data Science MIT course hosted by edX.

As usual, I wrote a first implementation for the function, to be extended after setting up a series of test cases:
def standard_deviation(strings):
    """
    :param strings: a list of strings
    :returns the standard deviation of the lengths of the strings, or NaN.
    :rtype float
    """
    return 0.0
The function docstring specifies what I expect as input parameter and what the caller should expect to get as output.
For the moment, my no-brainer implementation always returns a floating point zero.

Then I wrote a number of test cases, many of them given informally as part of the problem:
class StdDevTest(unittest.TestCase):
    def test_none(self):
        self.assertTrue(math.isnan(standard_deviation(None)))

    def test_empty(self):
        strings = []
        self.assertTrue(math.isnan(standard_deviation(strings)))

    def test_1(self):
        strings = ['a', 'z', 'p']
        self.assertEqual(standard_deviation(strings), 0)

    def test_2(self):
        strings = ['apples', 'oranges', 'kiwis', 'pineapples']
        self.assertAlmostEqual(standard_deviation(strings), 1.8708, delta=0.0001)

    def test_3(self):
        strings = ['mftbycwac', 'rhqbqawnfl', 'clgzh', 'ilqy', 'ckizvsgpnhlx', 'kziugguuzvqarw', 'xqewrmvu', 'ktojfqkailswnb']
        self.assertEqual(standard_deviation(strings), 3.5355339059327378)

    def test_4(self):
        strings = ['zgbljwombl', 'slkpmjqmjaaw', 'nddl', 'irlzne', '', 'poieczhxoqom', 'waqyiipysskxk', 'dloxspi', 'sk']
        self.assertEqual(standard_deviation(strings), 4.447221354708778)

    def test_bad_data(self):
        with self.assertRaises(TypeError):
            standard_deviation([1, 2, 3])
test_none, test_empty: What the function should do in case of None passed as input parameter is not specified by the problem. I decided to let it behaves as it gets an empty list of strings in. Notice the use of the function isnan from the standard math library.
test_1: If all the strings have the same size, a standard deviation of zero should be returned.
test_2: The interesting point in this test case is that in the problem definition we are given an approximated expected result value (I guess it was just a bit of sloppiness). For this reason I used the "almost equal" assertion to check it.
test_3, test_4: Vanilla tests. Just check everything goes as expected.
test_bad_date: We are not required to behave politely if the user gives us garbage in. Here I stress the fact that in case of a list of numbers in, our function is expected to react throwing a TypeError exception.

Running these test cases on the current implementation for standard_deviation() should give a bunch of failures and a single success on test_1. We could do better. But we have to understand better the problem requisites.

Checking what standard deviation is, we see how we have to calculate the mean of the elements in the collection, then subtract it from each element, square the result, sum all these values, divide them by the size of the collection, and finally extract the square root.

Converting this in Python, you should get something like this:
lengths = [len(s) for s in strings] # 1
mean = math.fsum(lengths) / len(lengths) # 2

# 3
sq_sum = 0.0 
for l in lengths:
    sq_sum += (l - mean) ** 2

return math.sqrt(sq_sum / len(lengths)) # 4
1. Let's prepare the data in input, converting the strings to their sizes, that is what really matter to us.
2. Calculating the mean it's easy. Be careful only to get it as a floating point number. To achieve it, I used the fsum() math function instead of the plain sum(). An explicit cast to float would have worked equivalently.
3. Now, the trickiest part of the algorithm. I feel like if I were more expert in Python I could have written these three lines more succinctly. Anyway, it is just a matter of summing the squared difference of each string lengths against the mean.
4. Finally, I just have to divide by the size of the collection, extract the square root, and returning the result to the caller.

Just one thing more. Before starting to calculate the standard deviation, I'd better check the input against empty collections. I could do that in this way:
if not strings:
    return float('NaN')
This would catch both the case of an empty list of strings and a None passed by mistake.

Full python 2.7 code with test cases is available on github.

No comments:

Post a Comment