In this lab we discuss the topic of algorithmic complexity is, and learn how to determine the complexity of a particular algorithm. We also learn how to use big-O notation to describe complexity.

Note that this lab is more open ended than previous labs, and you’re encouraged to try and experiment with the last exercise.

## Pre-Lab Checklist

• You have completed both Lab09 and Lab10, and are comfortable with binary trees, binary search trees, and the binary search ordering constraint. (We will reuse functions from these labs.)
• You are comfortable with higher order functions.
• You are familiar with pattern matching against abstract data types.

## Getting Started

1. Go to the project’s dashboard under the Project tab and click on the Fork button. The gitlab url for this lab is https://gitlab.cecs.anu.edu.au/comp1100/2023s2/studentfiles/lab11-1100_s2_2023

2. You will be asked where to fork the repository. Click on the user or group to where you’d like to add the forked project. You should see your own name here. Once you have successfully forked a project, you will be redirected to a new webpage where you will notice your name before > project-name. This means you are now the owner of this repository. The url in your browser should reflect the change: https://gitlab.cecs.anu.edu.au/uXXXXXXX/lab11-1100_s2_2023

3. Finally, clone your forked repository to your computer according to the instructions in the Week 2 Lab.

This lab contains additional exercises for COMP1130 students. COMP1100 students are encouraged to attempt them once they have completed the compulsory exercises to receive the participation marks for the lab.

## Complexity

Through the course, we have mentioned several times the concept of “efficiency” of code, and in this lab we hope to make this concept more concrete.

Intuitively, some algorithms are “easy”, in the sense that the time taken grows slowly as the input size grows. “Hard” problems are ones where the time taken grows very quickly with the input size.

The point of determining the algorithmic complexity of an algorithm, is to try and quantify precisely how fast the algorithm runs as a function of input size. A precise description of the complexity is usually too difficult to obtain, so we usually strive for an upper bound and a lower bound, expressed via big O notation.

## Best, Worse, Average case

One problem with big O notation is that for an input of size $$n$$, the time taken to run may vary wildly with what the particular input is. Take for example the following code, the definition of the elem function. The “size” $$n$$ of the input will be defined as the length of the input list.

elem :: (Eq a) => a -> [a] -> Bool
elem e list = case list of
[] -> False
x:xs
|x == e -> True
|otherwise -> elem e xs


If the element we are looking for is at the start of the list, then it would only take a single operation to find it, and the complexity would be $$O(1)$$.

If the element is at the end of the list, or not even present in the list at all, then we have to recurse through the entire list to (attempt to) find it. In this case, the complexity would be $$O(n)$$.

On average, assuming the element we are looking for is equally likely to be anywhere in the input list, we have to look though on average half the list until we find it, which is $$O(n/2)$$, which is still $$O(n)$$.

We resolve these problems by talking about the complexity for the best, worse, and average cases separately. In this course, we use the notation $$O(g)$$ to refer to all three, and explicitly state whether this is best, worse or average case complexity.

Many sources use different notation $$(\Omega, \Theta, O)$$ for best case and average/tight case, rather than using big-$$O$$ for all three. Be careful!

When talking about algorithms, sometimes it is also useful to consider the space complexity, that is, how the memory used by the function increases with the size of the input. Unfortunately it is very difficult to reason about the space complexity of programs in Haskell due to lazy evaluation. Consequently, we will not discuss space complexity here.

Be very careful when you are talking about the average running time: “average” means that there is some probability distribution on the inputs. For example, suppose we were playing cards, and we were trying to find the fastest way to sort the deck of cards in order. It’s reasonable to assume that when the cards are shuffled, the deck is more or less uniformly random. But if we were trying to sort a database after inserting a few more people, then the database might only be slightly out of order. The more you know about what the “average” input looks like, the better you can write an algorithm targeting those kinds of input.

If you want to calculate the time taken for a function to run, you can run :set +s on your terminal before you run the function. If you want revert it back you can run :unset +s or kill the terminal and open another.

### Exercise 1

Open the file Complexity.hs.

Edit the file to replace all instances of O(?) with the correct complexity, as shown for elem below.

The first has been done for you. For functions that take a list as input, treat the size of the input as the length of that list. For functions that take an integer as input, treat the integer itself as the size $$n$$.

-- | Checks if an element is present in a list.
-- best O(1), worst O(n), average O(n)
elem :: (Eq a) => a -> [a] -> Bool
elem e list = case list of
[] -> False
x:xs
|e == x -> True
|otherwise -> elem e xs

-- | Computes the length of the input list
-- best O(?), worst O(?), average O(?)
length :: [a] -> Integer
length list = case list of
[] -> 0
_:xs -> 1 + length xs

-- | Takes in a list, and throws the list away, and returns zero.
-- best O(?), worst O(?), average O(?)
zero :: [a] -> Integer
zero list = case list of
[] -> 0
_:xs -> zero xs

-- | Returns the first element in a list
-- best O(?), worst O(?), average O(?)
head list = case list of
[] -> error "head: Empty list has no first element"
x:_ -> x

-- | Returns the last element in a list
-- best O(?), worst O(?), average O(?)
last :: [a] -> a
last list = case list of
[] -> error "last: Empty list has no last element"
[x] -> x
x:xs -> last xs

-- | Takes a list, and removes the duplicate elements
-- best O(?), worst O(?), average O(?)
nub :: (Eq a) => [a] -> [a]
nub list = case list of
[] -> []
x:xs
|elem x xs ->  nub xs
|otherwise -> x : nub xs

-- | Concatenates two lists together
-- best O(?), worst O(?), average O(?)
(++) :: [a] -> [a] -> [a]
[]      ++ ys = ys
(x:xs)  ++ ys = x:(xs ++ ys)

-- | Reverses a list
-- best O(?), worst O(?), average O(?)
rev1 :: [a] -> [a]
rev1 list = case list of
[] -> []
x:xs -> (rev1 xs) ++ [x]

-- | Reverses a list
-- best O(?), worst O(?), average O(?)
rev2 :: [a] -> [a]
rev2 list = revHelper list []
where
revHelper list acc = case list of
[] -> acc
x:xs -> revHelper xs (x:acc)


Remember that labs are not tests: feel free to discuss these with your fellow students, and with your tutors.

Submission required: Exercise 1 Complexity

## Exponential and logarithmic classes

Most of the complexity classes that we have seen so far appear all to be polynomial classes, usually linear or quadratic. Logarithmic classes appear when the size of the problem halves after each operation, and exponential classes appear when increasing the input size by one step, makes the problem twice (or worse!) as hard to solve.

Consider the method of bisection search described in Lab 10. At each stage, the number of remaining elements to search through is (approximately) halved. If the number of elements in the tree to begin with was doubled, then it would only take one extra operation to try and find an element in this tree. This means that the complexity of trying to find an element in a binary tree is $$O(\log n)$$.

Exponential cases show up when the problem grows twice as large for each increase in the size of the input problem. For example, consider the rather silly function exp2.

exp2 :: Integer -> Integer -> Integer
exp2 n
|n < 0  = error "exp2: undefined for negative arguments"
|n == 0 = 1
|n > 0  = exp2 (n - 1) + exp2 (n - 1)


Each evaluation of exp2 calls itself twice more, giving complexity $$O(2^n)$$.

### Exercise 2

See Complexity.hs.

State the complexity of the following functions: fib1, fib2, power by replacing all the instances of O(?) appropriately.

Submission required: Exercise 2 fib1, fib2, power

### Exercise 3

See Complexity.hs.

The function power has reasonable complexity, but it could be made even more efficient. Note the following identities: $$x^{2n} = (x^n)^2 \text{ and } x^{2n+1} = x^{2n} \cdot x = (x^n)^2 \cdot x$$

We can use the following identities to compute exponentiation more cheaply. For example, if we wanted to compute $$x^{10}$$, the power function would compute $$x^{10} = x \cdot x \cdot x \cdot x \cdot x \cdot x \cdot x \cdot x \cdot x \cdot x$$ using ten multiplications.

Using this identity, we can do better: $$x^{10} = (x^5)^2 = ((x^2)^2 \cdot x)^2$$ where squaring a term can be evaluated by multiplying a term by itself. Now powerFast only needs four multiplications, performed in the following order.

x^2 := x * x
x^4 := x^2 * x^2
x^5 := x^4 * x
x^10 := x^5 * x^5


Use these identities to write a more efficient function powerFast that computes exponentiation by a whole number. How many multiplications must it perform to compute x^n?

You should NOT use the operator (**) or (^) to define this function.

Submission required: Exercise 3 powerFast

## Implementing Sets

A set is a collection of unordered elements without duplicates. The concept of a “first” or “last” element is meaningless for a set, as the elements are stored without concern of the ordering. So the set $$\{1,2,3\}$$ and the set $$\{2,3,1\}$$ represent the same thing.

We’d like to implement sets in Haskell in two different ways: By using lists, and by using binary search trees.

We’ve seen before that binary search trees can be far more efficient than lists for various operations, and so it’s often useful to use trees as the underlying data structure. Have a look at Set_With_Lists.hs, where we have defined a data type Set that is implemented using a list, with the constraint that the list shouldn’t contain duplicates.

Note that much like we cannot enforce the binary search ordering constraint for binary search trees, we cannot enforce that sets cannot contain duplicates as part of the definition of the type. It is up to the user to ensure that this constraint is respected.

As we will see, lists are a clunky and inefficient way to implement sets, as it’s slow to find elements, insert elements, delete elements, and so on.

### Exercise 4

Have a look at the SetsWithLists.hs file.

Some of the functions have already been implemented for you, and their time complexities have been stated.

In SetsWithLists.hs, complete the functions setEquals, addElement, setUnion, and write down the best, worse and average case time complexity for each. Try to be as efficient as you can.

Submission required: Exercise 4 setEquals, addElement, setUnion

### Exercise 5

Have a look at BinaryTree.hs, which contains a module for all the functions we wrote in Labs 9 and 10. We will need these functions for the rest of the exercises, so copy them over. If you haven’t finished any, now is the time to do so.

Defining the complexity for trees can be difficult, as for a list we can assume that “on average” the element we are looking for is in the middle of the list. For trees, it’s hard to estimate what the average tree looks like, so we will only consider best and worst case. The tree being balanced or not will wildly affect the time complexity, so we consider that seperately.

Copy your solutions to Lab 9 and Lab 10 into BinaryTree.hs. State the best and worse time complexity of each function, with and without the assumption of the tree being balanced.

Your implementations should be as efficient as you can make them, if they were not already.

Submission required: Exercise 5 BinaryTree.hs

### Exercise 6

Have a look at the SetsWithTrees.hs file. It contains all the same functions as SetsWithLists.hs, but now the underlying data structure for a set is a binary search tree. Some of the functions have been done for you, using functions from BinaryTree.hs.

In SetsWithTrees.hs, complete the functions setEquals, addElement, setUnion. State the best and worse time complexity of each function, with and without the assumption of the tree being balanced.

You can use any of the functions imported from BinaryTree.hs to help you.

You may find analysing the time complexity difficult. Don’t worry too much if so, as trees are much more difficult to reason about intuitively than lists. Even so, take a stab, you should at least think about how many operations each function might take: best case, worst case, and everything in between. Again, check and discuss your results with your tutor and your classmates if you are unsure.

Submission required: Exercise 6 SetsWithTrees.hs

## Extension (Optional)

### Extension 1

In SetsWithLists.hs, complete the functions removeElement, setIntersection, setDifference, and write down the best, worse and average case time complexity for each.

Be efficient!

Submission optional: Extension 1 SetsWithLists.hs

### Extension 2

In SetsWithTrees.hs, complete the functions removeElement, setIntersection, setDifference, and write down the best, worse and average case time complexity for each.

Be efficient!

Think about under what circumstances trees will perform better than lists, and when trees will perform worse.

Submission optional: Extension 2 SetsWithTrees.hs

## Real World Data (Optional)

This section is quite long, and it provides you with the opportunity to experiment with running your code on some “real data”. Feel free to spend as much or as little time on it as you wish.

It can be hard to see how our new implementation of sets performs against lists when we only ever run it on short simple inputs. Here, we provide a testing suite SetTimer.hs and two books, “Alice’s Adventures in Wonderland” (AAIW) and “Through the Looking Glass” (TTLG). The code in SetTimer.hs has been provided for you to run your implementation of Set against a battery of tests, which are then timed for how long they take.

Module SetTimer.hs is initially written to test your SetsWithLists code, and can be easily modified to work with SetsWithTrees by changing a single import statement at the top.

You can compile this code using the command

ghc -Wall -O2 SetTimer.hs


which behaves similarly to the cabal build command you have used in assignments. You can then run the compiled program with the command

./SetTimer


Note that if you haven’t done all the extensions, then the code will crash halfway through execution as soon as it comes across the undefined extension functions. This is normal, and you can still see the timing for all other functions up to that point.

Do not attempt to run SetTimer.hs using GHCi. Although interpreters are great for quick, interactive testing and debugging of code, they are not as reliable for testing speed as a compiled program.

You don’t need to (and are not expected to) understand how the code in SetTimer.hs works. Reading from files is a side-effecting action, which requires use of the IO () monad, well beyond the scope of this course.

The code proceeds by the following steps:

1. The two text files are read in and converted into lists of lower-case words, of type [String].

2. Two distinct sets are created, one for each list of words (and therefore each text file), by repeatedly using the addElement function in your Set code.

3. Same as Step 2, but the words are inserted into the set in alphabetical order.

4. The input lists of words are first split up into sub-lists of size 300 or less. These sub-lists are individually turned into “small” sets using the same technique as step b. Finally, these small sets are repeatedly combined into one large set using the set union operation.

5. Every word from the texts is repeatedly removed from their respective sets, resulting in the empty set.

6. Set difference in used repeatedly on the complete sets and all the small sets created above.

7. All the words that appear in either AAIW or TTLG are calculated, and it is checked which of those words are in both texts, using containsElement.

8. Same as step 7, but this time we take the set intersection of the word sets for AAIW and TTLG.

Run the code several times. Do the reported times change? If so, by how much? What does this suggest to you?

### Exercise 7 (Optional)

And now for the actual task of the exercise. Your challenge is to experiment with your code, and to attempt to come up with ways to make it more efficient. You will find that small changes in your code should produce measurable changes in the timing information—and large changes (such as finding a way to perform an operation that is $$O(n)$$ rather than $$O(n^2)$$, for example) should produce significant changes.

We are deliberately leaving this exercise open-ended. There are many, many different ways to optimise your code, and they will depend almost entirely on how you have written it in the first place. Here are some first-line things to consider:

1. Are there any unnecessary operations performed by a function? For example, does your containsElement function run through a complete list, or does it terminate when (if) it finds the desired element?

2. Are you scanning through your data structures more often than necessary? For example, by calling a length function you already went through an entire list once without doing anything but counting the number of elements—not necessarily a clever thing to do if you also have to read or process the actual elements in the list somehow.

3. Do you spot the same function call in multiple spots inside one function? Rather than actually running it multiple times, it might be cleverer to factor it out into a where clause. This will usually enhance both readability and performance.

4. Can you perform an operation using a (possibly higher-order) function from the Prelude or one of the various Haskell libraries? Often, these operations will be more efficient and tightly-coded than yours, because years worth of work has already gone into tuning them. But be careful: most libraries come with no guarantees about efficiency or logical correctness— make sure you understand where a specific library came from and how well it was tested or verified. Libraries which are used by pretty much every user of the language on a daily basis (like the core language modules in Haskell) usually provide a higher level of confidence.

5. How does the performance change when you modify the code to use your SetsWithTrees module rather than the SetsWithLists module? You would hopefully find that it gets better, but is this always the case? Do any operations take longer to perform?

6. How does your tree code perform when you can’t guarantee that elements are inserted or removed in random order? For example, in Step e of the code (removing all elements one by one), the removal is done in exactly reverse order to insertion. What impact will this have on the performance of your tree operations?

7. While the whole texts which you process are only about 150 KB long, you might find that your program uses large amounts of memory to process it. If you want to find out, compile your program with

ghc -W -O2 --make -rtsopts SetTimer.hs


and run it with

./SetTimer +RTS -sstderr


This will give you a report at the end about the total usage of memory and the time the Haskell run-time system spends mopping up no longer used memory blocks (“garbage collection” or “GC”). Would your program process a 150 KB long text by using hundreds of MB of memory (which costs time to handle)? Unfortunately there is rather little you can do about this problem from within an “automatically managed memory” language like Haskell.

You should, of course, feel free to discuss these problems with your tutor and with your class-mates. If you find any particularly interesting ways that you can optimise your code, consider posting on the course forum and describing what you’re doing and how you understand the difference. There are no right or wrong answers to this exercise, but there is plenty to be learnt from experimenting.