Outline

In this week’s lab, you will:

  • Learn to use pthread read/write locks
  • Optimize a parallel application with fine-grained synchronization
  • Learn different strategies for dividing work among multiple threads

Preparation#

Do an upstream pull as usual to get the latest files for this lab in your comp2310-labs repository.

Introduction#

The sum_parallel program uses multiple threads to compute the sum_n(n) function for different values of n. Each invocation of sum_n(n) computes the sum of all natural numbers up to n. The n values are stored in a global array called numbers. These initial n values are over-written by the results of sum_n(n) at the end of the program execution. The goal is to use multiple threads to populate the numbers array with results as quickly as possible.

Work Division: Each thread increments a shared global variable called shared_idx to obtain work. Specifically, the thread uses the index to read the value (n) stored in the numbers array at that index. It then computes the sum for n. The thread then stores the result into the same location in the numbers array where previously the number n was stored.

Caching Optimization: The sum_n function caches the last computed sum in a static variable. (Think of this optimization as a small result or compute cache.) If the next value of n is one more than the previous value, then we use the fast path to simply add the previous result stored in last_sum to n to compute the new sum for n+1. Otherwise, the sum_n function uses the slow path to iteratively compute the sum.

Synchronization Strategy: The program uses the P and V semaphore operations to protect access to shared data in the sum_n function. This strategy is an example of coarse-grained synchronization where a single (heavy or giant) lock is used to provide mutual exclusion between threads. Unfortunately, this strategy essentially serializes the execution of multiple threads and even if there are multiple processors (or cores) available the program cannot run faster than if it were executing on a single processor.

Pthread mutex and read/write locks#

As noted in the lectures, when sharing resources between multiple threads, it is safe to allow multiple readers to access the resource simultaneously as they would not alter any data, but only one writer should be allowed to access the resource at any given time.

The standard mutex, pthread_mutex_t, allows only one thread to access the resource irrespective of whether the thread is a reader or writer. The behavior even though safe, is not efficient, as the readers are made to wait even though they are not going to modify the data. We have neither seen pthread_mutex_t in lectures or labs. Please use your favorite search engine to learn about pthread mutex.

In summary, pthread mutex is part of the pthreads standard while the semaphore API (through semaphore.h) may or may not be present on your OS.

You should learn to use the pthread mutex instead of P and V operations. The basic API is below,

int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex); 

Next, to allow multiple readers to access a resource at the same time we can use pthread_rwlock_t in pthreads. The pthread_rwlock_t which is a read-write lock allows multiple readers to access the resource, but allows only one writer at any given time.

pthread_rwlock_t rwlock;

We can initialize rwlock as follows,

pthread_rwlock_init(&rwlock,NULL);

We can destroy the lock to avoid memory leaks as follows,

pthread_rwlock_destroy(&rwlock);

Next, we can apply a read or a write lock depending on the scenario as follows,

pthread_rwlock_rdlock(&rwlock);
pthread_rwlock_wrlock(&rwlock);

Recall from lectures that it is alright to have multiple readers enter a critical section if there is no writer inside the critical section. On the other hand, only one writer can be inside the critical section at any given point. It is not considered optimal to apply a write lock using pthread_rwlock_wrlock if multiple threads only need to read a shared variable. For such situations, it is better to use a read lock pthread_rwlock_rdlock.

Documentation#

You can use your favorite search engines to learn more about pthread mutex and read/write locks. Here is one link where you can see the documentation for the pthreads API.

Task 1#

Your first task is to compile and run the program for different values of N and T and make sure the program behaves as expected. Also make sure you understand the code fully. How many times is the fast path called? What about the slow path? Can you see how the numbers array is being initialized?

No Makefile is provided. By now you should be able to compile programs or write a Makefile on your own.

Task 2#

The program uses the semaphore API but we would like to replace the P and V operations with pthread reader/writer-optimized locks. Replace the P and V operations with equivalent calls for using pthread_mutex. You will need to figure out the proper API for using a pthread mutex.

Task 3#

Your next task is to optimize the synchronization inside the sum_n function and make it as fine-grained as possible. You can make use of read/write locks. You need to reason, for example, is it necessary to put a giant lock inside the while(1) loop? Can you use write lock for updates to shared_idx and quickly unlock so others threads can make progress? Do you need read or write access to the numbers array when reading n to compute the next sum? Apply lock and unlock operations to make the synchronization as fine-grained as possible? You also need to have a strategy for making sure the resulting code is correct. For example, can you still keep the caching optimization discussed above? Is there a safe way to retain the optimization? Or is it the best to just get rid of the optimization?

Do not make any changes to the main() function in this task except initializing the lock variables.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include "csapp.h"

typedef unsigned long long sum_t;

#define N 120
#define T 4 

sum_t numbers[N];
int shared_idx = -1;
void *sum_n(void *vargp);
void fill_array();
void print_array();
sum_t fastpath(sum_t n, sum_t prior);
sum_t slowpath(sum_t n);

pthread_rwlock_t lock;

int main() {
	pthread_rwlock_init(&lock, NULL);

	fill_array();

	print_array();

	pthread_t tid[T];
	for (int i = 0; i < T; i++)
		Pthread_create(&tid[i], NULL, sum_n, (void*) i);

	for (int i = 0; i < T; i++)
		Pthread_join(tid[i], NULL);

	print_array();

	return 0;
}

/* Thread routine */
void *sum_n(void *vargp)
{
	sum_t *last_n = calloc(sizeof(sum_t),0);
	sum_t *last_sum = calloc(sizeof(sum_t),0);	
	int exit = 0;

	while (1) {
		int temp = 0;
		pthread_rwlock_wrlock(&lock);
		shared_idx++;
		temp = shared_idx;
		pthread_rwlock_unlock(&lock);

		pthread_rwlock_rdlock(&lock);
		if (temp >= N)
			exit = 1;
		sum_t n = numbers[temp];
		pthread_rwlock_unlock(&lock);
		sum_t r;
		if (n == *last_n + 1)
			r = fastpath(n, *last_sum);
		else
			r = slowpath(n);

		pthread_rwlock_wrlock(&lock);
		numbers[temp] = r;
		pthread_rwlock_unlock(&lock);

		*last_n = n;
		*last_sum = r; 
		//printf(" %d %ld %d \n", (int)vargp, last_sum, shared_idx);

		if(exit)
			break;
	}
    free(last_n);
    free(last_sum);
}

sum_t fastpath(sum_t n, sum_t prior) {
	//printf("fastpath \n");
	return n + prior;
}

sum_t slowpath(sum_t n) {
	//printf("slowpath \n");
	sum_t sum = 0;
	if (n < 0)
		perror("wrong input!");
	else {
		for (int i = 1; i <= n; ++i) 
            		sum += i;
	}
	return sum;
}

void fill_array() {
	for(int i = 0; i < N; i++)
		numbers[i] = (rand() % 100) + 1;
	numbers[(N/5)+1] = numbers[(N/5)] + 1;
	numbers[(N/10)+1] = numbers[(N/10)] + 1;
	numbers[(N/20)+1] = numbers[(N/20)] + 1;
}

void print_array() {
	for(int i = 0; i < N; i++)
		printf("numbers[%d] = %llu \n", i, numbers[i]);
}

Task 4#

Your next task is to come up with a new work division strategy so multiple threads do not require any synchronization at all. Can you come up with such a strategy? In this task, you are free to make changes to the main() function as well.

bars search times arrow-up