Lab 4: code quality and introduction to sequences

The purpose of this week’s lab is to:

  • understand the indexing of (1-dimensional) sequences;
  • do some computations over sequences that require iterating through them using loops;
  • get used to aspects of good code quality.

Like previous labs, this lab has more problems than we expect everyone to finish during the lab time. If you do not have time to finish all problems, you can continue working on them later. You can do this at home on your own computer, or in the CSIT lab computers during the drop-in sessions. You may also use the InfoCommons computers (available in the university libraries and other teaching spaces).

In order to help you out to develop and/or check your understanding of programs that perform computations with sequences, we highly encourage you to use the online Python tutor along this lab (and beyond this lab). We consider the online Python tutor to be an splendid tool to satisfy this goal. Click here for more details.

Codebench#

All exercises below (except Exercise 5(c)) are available at CodeBench under the Lab4 exercise set. Make sure you SUBMIT your programs for these programming problems before you close CodeBench. Ask your tutor for help if you struggle with CodeBench.

Sequence types#

We have already seen a number of times that all values in python have a type, such as int, float, str, etc. To determine the type of a value we can use the function type(_some expression_). python has three built-in sequence types: lists (type list), strings (type str) and tuples (type tuple). These sequence types are used to represent different kinds of ordered collections.

To write a list literal, write its elements, separated by commas, in a pair of square brackets:

In [1]: my_list = [1, 2, 3, 4, 5, 6]

In [2]: type(my_list)
Out [2]: ...

The elements that you write can be expressions. These are evaluated, and the resulting values become the elements of the list:

In [3]: my_list = [2, 2 + 1, 2 * 2, 2 + 3]

In [4]: my_list
Out [4]: ...

Warm-up exercise: Indexing sequences#

All of list, tuple and string are called sequence data types. Every element in a sequence has an index (position). The first element is at index 0. The length of a sequence is the number of elements in the sequence. The index of the last element is the length minus one. The built-in function len returns the length of any sequence.

Indexing a sequence selects a single element from the sequence (for example, a character if the sequence is a string). Python also allows indexing sequences from the end, using negative indices. That is, -1 also refers to the last element in the sequence, and -len(seq) refers to the first.

This exercise is to try some operations on the list sequence type. Execute the following in the python shell. For each expression, try to work out what the output will be before you evaluate the expression, and then check if your guess was right.

In [1]: my_list =  [1, 2, 3, 4, 5, 6]

In [2]: my_list[1]
Out [2]: ...

In [3]: my_list[4]
Out [3]: ...

In [4]: my_list[-1]
Out [4]: ...

In [5]: L = len(my_list)

In [6]: my_list[L - 1]
Out [6]: ...

In [7]: my_list[1 - L]
Out [7]: ...

They should all run without error. Is the result of each expression what you expected?

Exercise 1: iteration over sequences#

Python has two kinds of loop statements: the while loop, which repeatedly executes a suite as long as a condition is true, and the for loop, which, in its for element in sequence: form, executes a suite once for every element of a sequence. (To be precise, the for loop works not only on sequences but on any type that is iterable. All sequences are iterable, but later in the course we will see examples of types that are iterable but not sequences.)

Both kinds of loop can be used to iterate over a sequence. Which one is most appropriate to implement some function depends on what the function needs to do with the sequence. The for loop is simpler to use, but if you use the loop variable to traverse the element themselves (and not the indices of the elements), it only allows you to look at one element at a time. The while loop is more complex to use (you must initialise and update an index variable, and specify the loop condition correctly) but allows you greater flexibility over a for loop that uses the loop variable to traverse the elements; for example, you can skip elements in the sequence (increment the index by more than one) or look at elements in more than one position in each iteration. In any case, you can also achieve this with a for loop if you use the loop variable to traverse the indices of the elements.

Exercise 1(a): rewriting a while loop as a for loop for sequence traversal#

The following function takes one argument, a sequence, and counts the number of elements in it that are negative. It is implemented using a while loop.

def count_negative(sequence):
    count = 0
    index = 0
    while index < len(sequence):
        if sequence[index] < 0:
            count = count + 1
        index = index + 1
    return count

Note that the function will work on any sequence type (e.g., both list and tuple), as long as it contains only numbers.

Rewrite this function so that it uses a for loop instead.

To test your function, you can use the following inputs:

  • [-1, 0, -2, 1, -3, 2] (3 negative numbers)
  • [i for i in range(-10, 10, 3)] (4 negative numbers)
  • [-1, -1, -1, -1, -1] (5 negative numbers)
  • [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] (0 negative numbers)

You can create more test cases by making variations of these, or using other list creating functions.

Exercise 1(b): determine if the elements in a sequence are laid out in increasing order#

Write a function called is_increasing that takes a sequence (of numbers) and returns True if and only if the elements in the sequence are in (non-strict) increasing order. This means that every element is less than or equal to the next one after it. For example,

  • for [1, 5, 9] the function should return True
  • for [3, 3, 4] the function should return True
  • for [3, 4, 2] the function should return False

Is it best to use a for loop or a while loop for this problem? (Note: Downey’s book describes different solutions to a very similar problem in Section “Looping with Indices” in Chapter 9.)

Test your function with the examples above, and with the examples you used for exercise 1(a).

Also test your function on an empty sequence (that is a list with no elements). An empty list can be created with the expression []. Does your function work? Does it work on a sequence with one element?

Exercise 1(c): determine the closest number in a sequence to the average of the sequence#

The average (or mean) of a sequence of numbers is the sum of the numbers divided by the length of the sequence. You can calculate the average of a sequence of numbers using python’s built-in function sum (which works on any sequence type, as long as it contains numbers), or writing your own function using a loop over the sequence.

Write a function most_average(numbers) which finds and returns the number in the input that is closest to the average of the numbers. (You can assume that the argument is a sequence of numbers.) By closest, we mean the one that has the smallest absolute difference from the average. You can use the built-in function abs to find the absolute value of a difference. For example, most_average([1, 2, 3, 4, 5]) should return 3 (the average of the numbers in the list is 3.0, and 3 is clearly closest to this). most_average([3, 4, 3, 1]) should also return 3 (the average is 2.75, and 3 is closer to 2.75 than is any other number in the list).

Exercise 2: Improving code quality#

The two function definition statements below, although functional, are poorly written and documented. They are examples of bad quality code.

Assume that the sequences involved consist only of ints.

def count_occurrences (xx, x):
    j = -1 # Initialize j to -1
    # Determine whether element x is in sequence xx
    if x not in xx:
        return 0
    for xxx in xx:
        if xxx == x:
            if j != -1:
                j = j + 1 # Increment j
            else:
                j = 1     # Initialize j to 1
    return j

def most_frequent_element(zzz):
    j = None
    # Count how many times z is in zzz
    for z in zzz:
        if count_occurrences(zzz, j) != 0:
            if count_occurrences(zzz, z) > count_occurrences(zzz, j):
                j = z
        else:
            j = z
    return j

The objective of this exercise is to significantly improve the quality of both function definition statements accordingly to the guidelines given in the lecture, including:

  1. Good commenting and documentation;
  2. Good variable and function naming;
  3. Good code organisation.

To this end, the first step is to understand what the functions do. Can you tell what the functions do from their names? You may also use the online Python tutor to help you understanding most_frequent_element. You can use, e.g., the call most_frequent_element([1,2,3,2,3]).

After understanding what the functions do, then improve the code quality along the following lines:

  • Write a docstring for count_occurrences and most_frequent_element functions. The docstrings should describe, concisely, what the function does and what the restrictions on its arguments are.

  • The names of the local variables are quite cryptic. Based on your understanding of what the functions are doing, propose better names for the variables.

  • What other changes can you do to improve the code quality? Some questions to think about:

    1. Is it possible to simplify the code inside the for loop in count_occurrences? (e.g., think how would it help to initialize j to 0 in count_occurrences)
    2. Same question as the previous one for the for loop in most_frequent_element; note that solving (2) may in turn help to reduce the number of calls to count_occurrences.

Once you are done, it might be convenient that you discuss with your tutor about the resulting clean code to see if there might still be yet unexplored changes to improve code quality.

Exercise 3: Closest matches#

Exercise 3(a)#

Write two functions, smallest_greater(seq, value) and greatest_smaller(seq, value), that take as argument a sequence and a value, and find the smallest element in the sequence that is greater than the given value, and the greatest element in the sequence that is smaller than the given value, respectively.

For example, if the sequence is [13, -3, 22, 14, 2, 18, 17, 6, 9] and the target value is 4, then the smallest greater element is 6 and the greatest smaller element is 2.

  • You can assume that all elements in the sequence are of the same type as the target value (that is, if the sequence is a list of numbers, then the target value is a number).
  • You can not assume that the elements of the sequence are in any particular order.
  • You should not assume that the sequence is of any particular type; it could be, for example, a list, or some other sequence type. Use only operations on the sequence that are valid for all sequence types.
  • Does your function work if the sequence and target values are strings?
  • What happens in your functions if the target value is smaller or greater than all elements in the sequence?

Exercise 3(b)#

Same as above, but assume the elements in the sequence are sorted in increasing order; can you find an algorithm that is more efficient in this case?

Exercise 4: Counting duplicates in a sequence#

If the same value appears more than once in a sequence, we say that all copies of it except the first are duplicates. For example, in [-1, 2, 4, 2, 0, 4], the second 2 and second 4 are duplicates; in the string “Immaterium”, the ‘m’ is duplicated twice (but the ‘i’ is not a duplicate, because ‘I’ and ‘i’ are different characters).

Write a function count_duplicates(seq) that takes as argument a sequence and returns the number of duplicate elements (for example, it should return 2 for both the sequences above). Your function should work on any sequence type (for example, both lists and strings), so use only operations that are common to all sequence types. For the purpose of deciding if an element is a duplicate, use standard equality, that is, the == operator.

Exercise 5: Generating a bin-based histogram and visualize it with matplotlib#

A histogram is way of summarising (1-dimensional) data that is often used in descriptive statistics. Given a sequence of values, the range of values (from smallest to greatest) is divided into a number of sections (called “bins”) and the number of values that fall into each bin is counted. For example, if the sequence is [2.09, 0.5, 3.48, 1.44, 5.2, 2.86, 2.62, 6.31], and we make three bins by placing the dividing lines at 2 and 4, the resulting counts (that is, the histogram) will be the sequence 2, 4, 2, because there are 2 elements less than 2, 4 elements between 2 and 4, and 2 elements > 4.

Exercise 5(a): counting elements within a bin#

Write a function count_in_bin(values, lower, upper) that takes as argument a sequence and two values that define the lower and upper sides of a bin, and counts the number of elements in the sequence that fall into this bin. You should treat the bin interval as open on the lower end and closed on the upper end; that is, use a strict comparison lower < element for the lower end and a non-strict comparison element <= upper for the upper end.

Exercise 5(b): generating a bin-based histogram with your own code#

Write a function histogram(values, dividers) that takes as argument a sequence of values and a sequence of bin dividers, and returns the histogram as a sequence of a suitable type (say, a list) with the counts in each bin. The number of bins is the number of dividers + 1; the first bin has no lower limit and the last bin has no upper limit. As in (a), elements that are equal to one of the dividers are counted in the bin below.

For example, suppose the sequence of values is the numbers 1,..,10 and the bin dividers are [2, 5, 7]; the histogram should be [2, 3, 2, 3].

To test your function, you can create lists of random values using the random module:

In [1]: import random
In [2]: values = [random.randint(0, 100) for i in range(50)]

This creates a list of 50 integers between 0 and 100. The following creates 10 evenly sized bins covering the range of values:

In [2]: interval = max(values) - min(values)
In [3]: dividers = [min(values) + i * (interval / 10) for i in range(1, 10)]

As you increase the size of the list of values, you should find that the histogram becomes more symmetrical and more even.

Exercise 5(c): generating the same bin-based histogram as in 5(b) and visualizing it using matplotlib library#

Test your function (and visualize the result) by comparing it against the histogram function provided by the matplotlib library. To this end, first type import matplotlib.pyplot, then help(matplotlib.pyplot.hist) (recall that you can also surf the web for examples of usage, etc.). You will have to carefully read the documentation to understand how to call this function so that it has the same semantics as the one you have written. Once you have called the matplotlib.pyplot.hist function, you can visualize the histogram with the matplotlib.pyplot.show() call.

bars search times arrow-up