Outline

In this week’s lab, you will:

Learn the system calls for file I/O in Linux
Learn the concept of standard I/O streams, and learn about file I/O utilities in the C standard I/O library
Understand the typical uses and advantages of file streams
Understand the uses and advantages of memory mapped I/O and the mmap() system call
Practice the uses of file I/O and mmap() by writing some interesting C programs

Preparation#

This lab uses a new repository, make sure you fork and clone it!

Clone the lab pack 3 repo from here.

Introduction

Input/Output (I/O): This specific tutorial covers file input/output (I/O) in C. You will learn how to read from and write to files stored on a storage device such as a disk drive. A file is a versatile abstraction on Unix-like systems. Many objects (including regular files containing user data, devices, and pipes for inter-process communication) are represented as files. We are concerned in this tutorial with the so-called regular files. A regular file contains bytes of data organized into a linear array called a byte stream. Linux does not impose any other structure on the files beyond the byte stream. Bytes may have value and represent anything from students’ marks to photos.

Today, we will do basic file operations such as open, close, read, and write. We will also learn two ways of performing read/write I/O on modern computer systems. The first method uses system calls for I/O (affectionately known as syscall I/O). The second method is memory-mapped I/O (or mmio for short). It uses the virtual memory abstraction and the operating system’s page faulting mechanism for performing I/O. The former is sometimes called explicit I/O because it explicitly uses system calls for reading and writing files. Mmio is also called implicit I/O because, from the programmer’s perspective, there is no distinction between accessing regular memory (i.e., allocated via malloc) and disk-resident files. In implicit I/O, both malloc’ed memory and files can be accessed via pointers, and pointer arithmetic works for both. Today, you will learn to perform this magic!

Somewhat annoyingly, we use disk, drive, storage, device, and storage device as alternative ways to refer to a canonical block device to distinguish it from byte-addressable main memory. Main memory is also referred to as physical memory or just memory or DRAM main memory.

System calls for file I/O

We will use the most basic method of accessing files via read() and write() system calls. Before a file can be read or written it must be opened via the open() system call. Once the program is done using the file, it must be closed via the close() system call.

Opening files#

A file is opened via the open() system call.

int open (const char *name, int flags);

The open() system call returns a file descriptor (an integer) to make it easy for the program to do operations on the file. The filepath and its name are only needed for opening a file. The operating system kernel maintains a per-process list of open files, called the file table. This table is indexed via file descriptors (a.k.a., fds). An fds has an int type in C. Each Linux process has a maximum number of files that it may open. File descriptors start at 0.

As an example, you can open a file somewhere in your home directory as follows:

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int fd;
fd = open ("/home/students/myfile", O_RDONLY);

Note that we have opened the file for reading only by specifying the O_RDONLY flag. We can also open the file for writing only with O_WRONLY flag. Finally, we can open the file for both reading and writing with the O_RDWR flag. You can see the full list of flags with the man open command. You should at least try to read about the flags: O_CREAT, O_APPEND, O_DIRECT, and O_SYNC.

If there is an error in opening a file, the open() system call return -1.

Reading files#

Reading data from the file is done via the read() system call.

ssize_t read (int fd, void *buf, size_t len)

The system call reads the number of bytes specified by len into a buffer set up by the program from an open file with descriptor fd (or -1 on error). The buffer must be large enough to read the required number of bytes. On success, the number of bytes written into buf is returned. The file position is advanced by the number of bytes read from fd.

#include <unistd.h>

unsigned long word;
ssize_t nr;

nr = read (fd, &word, sizeof (unsigned long));
if (nr == -1)
	/* error code */

There are two problems with the above example.

The read() system call may return with less than len bytes. We will not delve into the full list of reasons for that scenario here, but common reasons are: (a) not enough bytes (i.e., equal to len) are available, (b) the system call is rudely interrupted due to some reason.
There is a possibility of return value of 0 when using read(). A return value of 0 means either the end-of-file (EOF) is reached.

Note that we need to handle the above two cases. We also need to handle any potential errors. The following code solves these problems.

ssize_t ret;

while (len != 0 && (ret = read (fd, buf, len)) != 0) {
	if (ret == -1) {
        	if (errno = EINTR)
                	continue;
  	    	perror("read");
            	break;
    	}
    	len -= ret;
    	buf += ret;
}

The loop reads len bytes from the current file position of fd into buf, which should be len bytes in length. It continues reading until it reads all len bytes, or until EOF is reached. Note the adjustments to len and buf in the case more than zero, but less than len bytes are read. perror prints a message on standard error.

What’s the deal with errno (again)?#

The C library provides support for detecting and reacting to errors in the execution of system calls. Below is a description of errno from the Linux manual page.

The <errno.h> header file defines the integer variable errno,
       which is set by system calls and some library functions in the
       event of an error to indicate what went wrong.

errno
       The value in errno is significant only when the return value of
       the call indicated an error (i.e., -1 from most system calls; -1
       or NULL from most library functions); a function that succeeds is
       allowed to change errno.  The value of errno is never set to zero
       by any system call or library function.

In our example code, EINTR means that a signal was received before any bytes were read and the system call can be reissued.

Writing files#

ssize_t write (int fd, const void *buf, size_t count);

As with reads, the most basic usage of the write system call is simple (see below). We leave it to you to check for possible occurrence of partial writes and errors.

const char *buf = "I am a good student!";
ssize_t nr;

nr = write(fd, buf, strlen(buf));

Closing files#

After a program has finished working with a file descriptor, it can unmap (close) the file descriptor from the associated file via the close() system call.

#include <unistd.h>

int close (int fd);

Seeking with `lseek()`#

Each open file has an associated meta-data called the file position or offset. When the file is first opened, the file position is zero. Each read and write operation starts at a specific offset (current position). As bytes in the file are read from or written to (e.g., via read() and write() syscalls), the file offset advances accordingly.Typically, I/O occurs linearly through a file, and the file position updates automatically. Some applications may want to jump around (seek) in a file, accessing random locations. The lseek() system call sets the file position of a file descriptor to a given value. It performs no other action and initiates no I/O.

Read the description of lseek() from the manual page. To save time, skip the discussion around seeking file data and holes.

Open, compile, and run the program seek_io.c. Read the comments/documentation to understand what the program is doing.

Page Cache and Buffering

Storage devices such as disks and solid-state drives are much slower than byte-addressable main memory (DRAM). The operating system kernel maintains an in-memory data structure called the page cache that stores (caches) recently accessed files from disk. The page cache allows the kernel to fulfill subsequent read requests for the file content from memory, avoiding repeated disk accesses. Therefore, when you issue the read() system call in the above examples, the operating system first transfers file data from the disk into the page cache. It then copies data from the page cache to the user-specified buffer in the application memory. We call the data structure page cache because a page is the basic unit of memory allocation inside the kernel.

The kernel also optimizes write I/O operations by buffering them in the page cache. Imagine a program issuing a batch of small writes (e.g., 32 bytes). If the kernel accesses the slow disk on every write, then the system operates very inefficiently. Indeed, the kernel performs write buffering. When a program issues a write request, the data is copied into an in-memory buffer, and the buffer is marked dirty, denoting that the in-memory copy is newer than the disk-resident copy. Subsequent writes hit in the buffer, avoiding disk accesses again. At some point, the kernel decides to flush the dirty buffer(s) to disk. In modern Linux kernels, the page cache and the buffer subsystem have been unified as a single structure (page cache). Before Linux kernel 2.4, there was both a page cache and a buffer cache.

In summary, syscall I/O results in an extra copy from the kernel space to the user space (in addition to disk transfer). We refer to buffering in the page cache as kernel-level buffering.

File Streams

To understand the motivation behind file streams, we need to briefly visit block devices and filesystem blocks.

What is a block device? What is a filesystem block?#

We consider storage devices, such as disks and solid-state drives (SSDs) block devices. The smallest addressable unit on a block device is the sector. We can access any (random) block. The sector is a physical attribute of the device (set at manufacturing time). Historically, disks have a typical sector size of 512 bytes. All I/O operations on a block device must occur at the granularity of one or more sectors. The other type of I/O device is the character device. A character device delivers or accepts a stream of characters. A keyboard is an example of a character device. If the user types “dog,”, for example, an application would want to read from the keyboard device the d, the o, and finally the g, in exactly that order. A block structure would make little sense and reading/writing characters in any order would also make little sense. Other examples of character devices include rats (for pointing), printers, and network interfaces.

A filesystem is a collection of files and directories organized into a hierarchy. Filesystems (that exist on top of block devices) use the abstraction of blocks for transferring file data to and from storage devices. The smallest addressable unit on a filesystem is a block. Blocks are power-of-two multiple of sector sizes. The most common block size on Linux is 4 KB. Inside the kernel, all I/O operations happen at the granularity of blocks. If an application requests 32 bytes, the kernel must transfer a block from the underlying storage (block) device. The same is true for writes. Everything with regards to storage I/O happens at the block granularity.

The usage of the term “block” is overloaded in the study of computer systems. The context clarifies it. For example, a physical block on a disk drive refers toa sector, and is different from a filesystem block (multiple of sector size). Historically, disk drives are called block devices and not sector devices.

If an application performs partial block operations, i.e., reading and writing less than the block size, this results in an inefficiency. The operating system needs to do extra work to make things happen at the block boundary. On the other hand, user-space applications operate in terms of high-level abstractions, such as, structs and arrays and strings, whose size is typically much smaller than the block.

There is one more issue with small I/O requests, e.g., requesting a single byte from a device. Each such request results in a system call, and too many system calls are bad for performance and slow down applications. Programs that need to issue many small I/O requests typically resort to user-buffered I/O. This type of buffering refers to buffering done in the user space, either manually by the programmer, or transparently in a library. Note that this user-level buffering is on top of the buffering done by the kernel in the page cache. The two serve different purposes!

Discuss with your neighbors the need for kernel-level buffering and user-level buffering.

Here, we will explore a platform-independent user-buffering solution provided by the standard C I/O library (called stdio). This means that C library routines internally perform buffering of user requests, and only invoke the kernel via system calls for reading/writing blocks. The result is a reduction in the frequency of system calls. The downside of user-level buffering is that there are two copies in addition to the disk transfer. One from the page cache to the standard library buffer and the second from the standard I/O library buffer to the buffer provided by the user (programmer).

Standard I/O library routines do not work directly with file descriptors. They instead use file pointers which are represented by the FILE object defined in stdio.h.

Opening and reading/writing file streams#

Files are opened for reading or writing via fopen():

#include <stdio.h>

FILE * fopen (const char *path, const char *mode);

This function opens the file path with the behavior given by mode and associates a new stream with it. The mode argument describes how to open the give file. The mode argument can be specified as one of the following strings (reproduced from the man page):

       r      Open text file for reading.  The stream is positioned at
              the beginning of the file.

       r+     Open for reading and writing.  The stream is positioned at
              the beginning of the file.

       w      Truncate file to zero length or create text file for
              writing.  The stream is positioned at the beginning of the
              file.

       w+     Open for reading and writing.  The file is created if it
              does not exist, otherwise it is truncated.  The stream is
              positioned at the beginning of the file.

       a      Open for appending (writing at end of file).  The file is
              created if it does not exist.  The stream is positioned at
              the end of the file.

       a+     Open for reading and appending (writing at end of file).
              The file is created if it does not exist.  Output is
              always appended to the end of the file.  POSIX is silent
              on what the initial read position is when using this mode.
              For glibc, the initial file position for reading is at the
              beginning of the file, but for Android/BSD/MacOS, the
              initial file position for reading is at the end of the
              file.

Once a file stream is opened, you can read from a stream via fread() and write to a stream via fwrite().

The standard I/O library provides a number of powerful routines for manipulating file data via streams. You should explore the full set of routines here.

Open, compile, and run the program fstream_write.c and fstream_read.c. Can you guess what these two programs are doing? Ok, read below for more details.

In the first program, we create a new file data.bin in the repository. We declare a structure threeNum with three numbers - n1, n2 and n3, and define it in the main function as num. Now, inside the for loop, we store the value into the file using fwrite(). The first parameter takes the address of num and the second parameter takes the size of the structure threeNum. Since we’re only inserting one instance of num, the third parameter is 1. And, the last parameter fptr points to the file we’re storing the data. Finally, we close the file.

In the second program, we read the same file data.bin and loop through the records one by one. In simple terms, you read one threeNum record of threeNum size from the file pointed by fptr into the structure num.

Memory-Mapped File I/O

We will now discuss a second way to perform I/O that does not require system calls and the C standard I/O library. This alternative to standard file I/O is to map a file into memory, which establishes a one-to-one correspondence between a memory address and a word in a file. The programmer can then access the file directly through memory, similar to any other memory-resident data. The ability to map disk-backed files into memory is truly phenomenal. It allows direct computation over files via load/store instructions. Let’s see how we can map a storage-backed file into memory.

The `mmap` system call#

Linux implements the POSIX-standard mmap() system call for mapping files into memory. mmap() is a system call that can be used by a user process to ask the operating system kernel to map a file into the memory (i.e., address space) of that process. The mmap() system call can also be used to allocate memory (an anonymous mapping), which was the subject of the first assignment. It is important to remember that the mapped pages are not actually brought into physical memory until they are referenced. Therefore, mmap() lazily loads pages into memory (a.k.a., demand paging).

The following figure illustrates the use of the mmap() system call:

mmio

Here is the function prototype for mmap():

void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);

There are six arguments to the mmap() system call:

addr - This argument is a hint to the operating system kernel to use this address at which the virtual mapping should start in the virtual memory (i.e., the virtual address space) of the process. The value can be specified as NULL to indicate that the kernel can place the virtual mapping anywhere it sees fit. If not NULL, then addr should be a multiple of the page size.
length - This argument specifies the length as number of bytes for the mapping. This length should be a multiple of the page size.
prot - The protection for the mapped memory. The value of prot is the bitwise or of various of the following single-bit values:
- PROT_READ - Enable the contents of the mapped memory to be readable by the process.
- PROT_WRITE - Enable the contents of the mapped memory to be writable by the process.
- PROT_EXEC - Enable the contents of the mapped memory to be executable by the process as CPU machine instructions.
flags - Various options controlling the mapping. Some of the more common flags values are described below:
- MAP_ANONYMOUS (or MAP_ANON) - Allocate anonymous memory; the pages are not backed by any file.
- MAP_FILE - The default setting; it need not be specified. The mapped region is backed by a regular file.
- MAP_FIXED - Don’t interpret addr as a hint: place the mapping at exactly that address, which must be a multiple of the page size.
- MAP_PRIVATE - Modifications to the mapped memory region are not visible to other processes mapping the same file.
- MAP_SHARED - Modifications to the mapped memory region are visible to other processes mapping the same file and are eventually reflected in the file.
fd - The open file descriptor for the file from which to populate the memory region. If MAP_ANONYMOUS is specified, then fd should be given as -1.
offset - If this is not an anonymous mapping, the memory mapped region will be populated with data starting at position offset bytes from the beginning of the file open as file descriptor fd. Should be a multiple of the page size.

On success, mmap() returns a pointer to the mapped area in virtual memory. On error, the value MAP_FAILED (i.e., (void *)(-1)) is returned and errno is set to indicate the reason. You can find full details on the mmap() system call by using the man mmap command.

The `munmap` system call#

munmap() is a system call used to unmap memory previously mapped with mmap(). The call removes the mapping for the memory from the address space of the calling process process:

int munmap(void *addr, size_t length);

There are two arguments to the munmap() system call:

addr - The address of the memory to unmap from the calling process’s virtual mapping. Should be a multiple of the page size.
length - The length of the memory (number of bytes) to unmap from the calling process’s virtual mapping. Should be a multiple of the page size.

On success, munmap() returns 0, on failure -1 and errno is set to indicate the reason. If successful, future accesses to the unmapped memory area will result in a segmentation fault (SIGSEGV).

Advantages of memory-mapped I/O#

There are three primary advantages of using mmap() to access files on disk.

Avoiding needless loading: One advantage of using mmap() is lazy loading. If no memory within a certain page is ever referenced, then that page is never loaded into physical memory. This can be crucial in certain applications in terms of saving both memory and time. In standard file I/O (or syscall-based I/O), programmers may end up bringing file data that is never used.
Performance: On modern systems, memory mapped I/O is typically faster compared to syscall-based I/O especially for large files. There are at least two sources of speed improvement:
- Eliminating syscall overhead: Traditional I/O involves a lot of system calls (e.g., calls to read()) to load data into memory. We have seen in the lectures that there are costs associated with systems calls, including switching from user to kernel mode, and such costs as error checking within the functions themselves.
- Eliminating copy overhead: Loading data into main memory also has to go through several layers of software abstraction, with the data being copied around in various buffers in the operating system before finally being placed in memory, which clearly will slow down the program. In fact, in the case of user-buffered I/O, there are two copies, one from the kernel page cache to the C library buffer, and another one from the C library buffer to the programmer’s buffer in the application code. Using mmap() avoids the overheads due to extra copying.
Programming convenience: Programmers find it convenient to manipulate files using pointers and pointer arithmetic due to its versatility. Also, there is no need for programmers to maintain additional user-level buffers as the operating system manages the buffers in the page cache and returns a pointer to the user.

Let’s see an example of memory-mapped I/O and do some exercises.

The following simple example program appears in your lab repo as exercise1.c. This program demonstrates a typical use case of mmap() for reading data from a file (the #include lines in this example have been omitted here for simplicity of presentation). Unfortunately, there are two bugs in the program. Can you fix these two bugs and try to run the program and check the output?

/*
 * Requirements:
 *   This program's source code must reside in the current working directory
 *   in a readable file named "exercise1.c".
 *
 * Output:
 *   Reads this program's source code, via mmap(), and prints that source
 *   code to stdout.
 */
int
main(void)
{
        struct stat stat;
        int fd, size;
        char *buf;

        // Open the file and get its size.
        fd = open("exercise1.c", O_RDONLY);
        if (fd < 0 || fstat(fd, &stat) < 0) {
                fprintf(stderr, "open() or fstat() failed.\n");
                return (1);
        }
        size = stat.st_size;

        // Map the file to a new virtual memory area.
        buf = mmap(NULL, size, PROT_NONE, MAP_PRIVATE, fd, 0);
        if (buf == MAP_FAILED) {
                fprintf(stderr, "mmap() failed.\n");
                return (1);
        }

        // Print out the contents of the file to stdout.
        for (int i = 0; i < size; i++) {
                printf("%c", buf[i++]);
        }

        // Clean up.  Ignore the return values because we exit anyway.
        (void)munmap(buf, size);
        (void)close(fd);

        return (0);
}

This program calls the open() system call to open its own source code file (the file exercise1.c) and then gets the size (in bytes) of this file by calling the fstat() system call. The fstat() system call returns a number of different pieces of information about the status of the file open on the given file descriptor. (The stat() system call is identical to fstat(), except that stat() takes the name of the file as its first argument rather than a file descriptor open to that file, as in fstat().) The program then uses mmap() to map that file into the process’s memory, and then prints out the contents of the file from the memory into which the file has been mapped.

Exercise 1

The example program above is provided as exercise1.c in your lab repository. This program should print its own source code to standard output. Make this program using

	 make exercise1

When you run it, you will notice that it causes a segmentation fault. Fix this bug in the program and run it. This time the program runs but it does not do what it is supposed to do, i.e., print out the contents of the file from the memory into which the file has been mapped. Fix the remaining bug so that it will properly print its own source code to standard output.

Exercise 2

In the file exercise2.c in your tutorial repository, we provide an incomplete implementation of a program to perform a fast copy of all of the data in a source file, indicated by the first supplied filename on the command line, to a destination file, indicated by the second supplied filename. Fill in the rest of the code so that file copying works. You can confirm if things are working properly by using the Unix/Linux diff command (see “man diff”, if you want).

Hint #1: You will need to adjust the size of the destination file so that it is large enough to accept all of the bytes from the source file, with no extra bytes at the end. You might find ftruncate() useful here.

Hint #2: You can alternatively find the size of the source file with lseek. Try it!

Hint # 3: You may find memcpy() useful (see “man memcpy” for details)

Exercise 3

Write a program to copy the source file into a destination file using syscall I/O. Don’t use user-buffered I/O (a.k.a., FILE object and streams). Directly make the system calls into the kernel from your user code.

Exercise 4

The program provided as exercise4.c in your lab repository maps a file (called tweets) into program’s virtual memory. This file contains a few random tweets (one per line). The program then modifies a random tweet. Run the program to verify that indeed the changes made by the program are reflected in tweets. Now, we would like local (i.e., per-process) modifications to a tweet to not be reflected in the actual tweets file. Change the code in exercise4.c such that changes made by the process are not reflected in the actual tweets file. There are different ways to confirm if your solution is working or not. Discuss two possible ways of checking if your solution is working (with your neighbors or tutors).

Exercise 5

The program provided as exercise5.c in your lab repository builds a linked list of student records in memory. The program contains an empty function, namely serialize_records. The function should write the names and ids of all students in the linked list to a file called studentdata. Use mmio to write the names and ids to a file and complete the definition of the serialize_records function. Note that the empty function takes a pointer to the linked list of records and a file descriptor. This process of writing in-memory data structures to a storage-resident file after eliminating needless meta-data, such as, links between student records (pointers), is called serialization. (The inverse of serialization is deserialization.) Optionally, write a deserialize_records function that reads a file with student records (similar to studentsdata) from storage and populates a linked list data structure in memory.

Exercise 6

Write a program that reads tweets from a file on disk and builds a words data structure in memory. This data structure contains all the unique words encountered so far (and their occurrence) linked together as a list. The data structure appears in memory as follows.

In-Memory Representation of Records

Use the C library function fgets to read one tweet at a time. Take proper care of buffer size for storing the newly read tweet, and for detecting the end of word, end of line, and end of file.

Hint # 1: A non-alphanumeric character begins a new word. You will find C library function isalnum() useful for detecting alphanumeric characters.

Hint # 2: You will find it useful (for reasons of modularity) to write a lookup() function to check if a word already exists in the linked list.

Hint # 3: Do not forget to insert a ‘\0’ after the word.

Hint # 4: It would be useful to create a struct for storing word and count. You can assume a maximum word size of 1024 characters. Since doing so is surely going to waste space, you can optimize for memory space by storing unique words differently.

Hint # 5: Write printing utilities to make sure your solution is working correctly.

Search this site

Lab 7: Syscall and Memory-Mapped File I/O

Outline

Preparation#

Introduction

System calls for file I/O

Opening files#

Reading files#

What’s the deal with errno (again)?#

Writing files#

Closing files#

Seeking with `lseek()`#

Page Cache and Buffering

File Streams

What is a block device? What is a filesystem block?#

Opening and reading/writing file streams#

Memory-Mapped File I/O

The `mmap` system call#

The `munmap` system call#

Advantages of memory-mapped I/O#

Exercise 1

Exercise 2

Exercise 3

Exercise 4

Exercise 5

Exercise 6

Outline

Preparation#

Introduction

System calls for file I/O

Opening files#

Reading files#

What’s the deal with errno (again)?#

Writing files#

Closing files#

Seeking with lseek()#

Page Cache and Buffering

File Streams

What is a block device? What is a filesystem block?#

Opening and reading/writing file streams#

Memory-Mapped File I/O

The mmap system call#

The munmap system call#

Advantages of memory-mapped I/O#

Exercise 1

Exercise 2

Exercise 3

Exercise 4

Exercise 5

Exercise 6

Seeking with `lseek()`#

The `mmap` system call#

The `munmap` system call#