Outline#

  • Deadline: 13 October 2024, 11:59 pm
  • Assessment template: link
  • Specification: keep reading 🙂
  • Weighting: 5%
  • Marked out of: _ / 5

Time Until Deadline (13 October 2024, 11:59 pm):#

...

Rules and Policies#

  • this is an individual assessment task, ensure you fork your repo as private fork guide
  • you may re-use code from your labs
  • late submission is not permitted without an extension

Background#

In this checkpoint you will practice file I/O in C. This checkpoint focuses on three things: serializing text-based input, deserializing binary data and performing queries on it

The specific data you will be working with is a dictionary that contains the definitions of words and phrases (not to be confused with the data structure). You will have to implement three main pieces of functionality:

  1. The ability to convert a text-based input file of terms and definitions into a “binary file”
  2. The ability to convert a binary file produced by the serialization step back into text-based form.
  3. The ability to process commands on a binary file that has been loaded into memory.

Text File Format#

Each line of the input file will contain a term, followed by a variable number of ;-separated definitions1. For example, consider the following:

computer;a machine for performing calculations automatically;an expert at calculation (or at operating calculating machines)

The term being defined is “computer”, which has two definitions:

  • “a machine for performing calculations automatically”
  • “an expert at calculation (or at operating calculating machines)”

Throughout this checkpoint specification the combination of a term and its variable number of associated definitions is referred to as an “entry”.

Note that both terms and definitions can contain whitespace characters!

Open one of the tests/test-N.text files in your Checkpoint Repository to familiarize yourself will the file layout.

Binary File Format#

Running queries against the text file directly is not optimal. Your code would have to perform excessive text processing (e.g., dealing with end of line and semicolon separators). We hence introduce a binary file format that eliminates some of the extra details required to make the data human readable.

In the binary file format your serialization code will produce, each entry is stored as:

  • Null-terminated term string, and potentially some additional bytes of padding
  • A four byte integer that indicates how many bytes the remaining data associated with this term takes up.
  • The number of definitions associated with this term.
  • Each definition as a null-terminated string.

Multiple entries are stored consecutively in the produced binary file.

Using a tool such as xxd or hexdump open one of the binary files in the tests/ directory of your repository to familiarize yourself with the binary file format. Can you see where each entry starts?

Description of Tasks#

Your task is to write a program dictionary in C that can:

  1. Convert data from the specified text format into the specified binary format
  2. Convert data from the specified binary format into the text format.
  3. Process commands that query the dictionary data loading into memory from a binary file.

Each of these three features are invoked by passing their respective names as the first argument to dictionary:

  • ./dictionary to_binary file1 file2 will convert the contents of file1 into binary format written to file2.
  • ./dictionary to_text file1 file2 will convert the contents of file1 into text format and write the results to file2
  • ./dictionary process file1 file2 will load the binary data stored in file1 into memory, and process a number of queries read from file2. The results of these queries will be written to stdout.

You are provided with template code in dictionary.c. The main function that handles argument parsing has already been written for you, and you will not need to modify it.

Do not modify the function signatures of any functions given to you in dictionary.c, as the additional test cases we run may call these functions directly.

It is a good idea to review Exercise 3 from Lab 7, which covers serialization and deserialization. The processes to convert between text and binary in this checkpoint will be conceptually similar to these concepts.

Conversion to Binary#

./dictionary to_binary input-file.text output-file.binary

When the program is invoked with the arguments above, your program will need to:

  1. Use File Stream I/O to open input-file.text and output-file.binary. If either file is unable to be opened, your program should use the function perror to print an appropriate error message and exit with return code 1.
  2. Use a function such as getline or fgets to read data from the file stream associated with input-file.text one line at a time.
  3. For each line in the file, you will need to transform the input string to conform to the above format. The result of this conversion will be written to the file stream associated with output-file.binary.
  4. When all input lines have been read, close the two file streams and free any dynamically allocated memory used by your program2

All of this behaviour will need to be contained in your create_binary_file function which accepts two file names as arguments. You may implement and call helper functions from within create_binary_file if needed.

You may assume the following:

  • Every term in the input file has at least one definition.
  • No term will have more than 2147483647 (maximum value of signed 4-byte integer) definitions associated with it.
  • Similarly, the total size of all definitions associated with a term will not exceed this value.
  • Each term has at least one character in it, and they may contain spaces.
  • It is possible that input-file.text will have read permissions enabled, but not write permissions, and it is possible that the converse is true for output-file.binary. Your code should still be able to read from and write to these files in this case, so make sure you set the mode you open the files in appropriately!

Implement the create_binary_file function in dictionary.

Alignment#

After you have written the key string to the destination file stream but before you write the offset of next term and the number of definition you may need to insert additional padding. This is to ensure that the two integers values included in the binary file format are always stored at a 4-byte aligned offset in the file.

What are the benefits of adding additional padding so the integer values are all written at offsets that are a multiple of 4?

To determine what offset you are currently “up to” when serializing data you can use the ftell or fseek functions.

Converting Terms and Definitions#

Supposing you had the following string as the first line read from input-file.text:

term;1st definition;2nd definition

When converted to binary using to_binary, the binary data when inspected with a hex editor (or a tool such as xxd) will look like the following:

          00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F | Decoded Text:
00000000: 74 65 72 6d 00 00 00 00 26 00 00 00 02 00 00 00 |  term....&.......
00000010: 31 73 74 20 64 65 66 69 6e 69 74 69 6f 6e 00 32 |  1st definition.2
00000020: 6e 64 20 64 65 66 69 6e 69 74 69 6f 6e 00 -- -- |  nd definition.
  1. Everything that appears before the first ; of the input is the term (the word “term” in this case), which is written as a null-terminated string starting at offset 0. Determining where the term string starts and ends is a matter of looking for the first ; in the input string.
  2. This term string will take up five bytes of memory, meaning the file position will be 0x5. To ensure the next two values written are aligned to a multiple of four-byte offsets, three additional null terminating bytes are written to the file.
  3. The next value written is the int value 0x26 (38 in base ten). This value indicates that the start of the next term stored in the binary file will be located at the current offset plus 0x26. You can calculate this value as the number of bytes taken up by the definition strings (null terminating characters included), plus the number of bytes needed to store the definition count (equal to sizeof(int), assumed to be four bytes).
  4. The next value is another integer representing the number of definitions associated with the term. You can easily determine this value by counting the number of occurrences of the character ; in the string.
  5. Each definition is then written as a null-terminated string. There is no need to pad the start position of each string in this case, as we only care about the alignment of the integer values. For the final definition, make sure you are not writing trailing newline character.
  6. Subsequent lines are processed in the exact same way, starting at the byte immediately after the end of the final definition string. In the above example, the term of the next line will be written starting at offset 0x2E.

To process the input lines, you may find the string utilities you implemented in Lab 7 or their C standard library equivalents useful.

Conversion to Text#

You are given pre-generated binary files for testing in your repository. As a result, you can complete this component independently of the first task.

The conversion to text process is invoked as:

./dictionary to_text input-file.binary output-file.text

Your second task is to convert binary data created by the previous step back into the given text format. You will need to complete two functions: load_binary_data (which will also be used in the “Process” step) and create_text_file. The output file created by create_text_file should be identical to the corresponding input file used to create the binary file. For example, running the following commands (assuming input.text conforms to the specified text file format):

./dictionary to_binary input.text output.binary
./dictionary to_text output.binary output.text
diff input.text output.text

Should result in diff declaring the two files identical.

Loading Binary Data#

The load_binary_data function will take the filename of a binary file as an argument and load its data into memory using Memory Mapped I/O. This function will need to set two global variables DATA and data_len to the pointer returned by mmap and the total number of bytes the loaded binary data takes up. If there is any error when opening the binary file or with mmap itself, you should print an appropriate error message and exit with return code 1.

Do not modify the behaviour of load_binary_data and make sure you set these global values correctly, as additional test cases we run may depend on its behaviour.

For the purposes of traversing the binary data stored in memory (both for deserializing and for processing queries), it is up to you whether you wish to traverse the memory mapped pages directly or create your own data structures that are populated in the load_binary_data function (e.g. a linked list of pointers to the different terms).

Complete the load_binary_file function as described above.

Writing Text-ified Output#

The create_text_file function takes a binary file name and a destination file name as arguments. It will call your load_binary_data function with the given binary file name to load its data into memory. It will also open the destination file in whichever I/O method you prefer for the purposes of writing the “deserialized” results.

You may assume:

  • Any binary file you are given to convert to text conforms to the specified format. This means you will not be tested on files that report an inaccurate number of definitions/total size, for example.
  • There is always 1 or more definition associated with a term, and all terms contain at least one character.

Complete the create_text_file function as described above.

Processing#

The third and final piece of functionality to implement is the ability to process queries on the dictionary data you load from memory.

The process_commands function accepts the name of a binary file and the name of a file containing different commands to process. The format of this command file will be newline separated commands, similar to that seen in Checkpoint 1. Each line starts with a “command identifier”, which contains no whitespace characters, and is followed by either 1 or 2 command arguments, the final of which is the term to perform the query on.

The three query types you will need to implement will print different information about a specific term:

  1. count term: This will print the number of definitions associated with term.

  2. all term: This will print all definitions associated with the given term, each separated by newlines. See the Example below for the exact format to use.

  3. definition N term: This will print definition N (counting starts at 1) associated with the given term. N is given as a signed integer value.

When processing commands, the following error conditions can occur:

  • For all three query types, if the given term does not appear in the dictionary you should print the string “Term 'term' not present in dictionary.
  • For a definition command with an invalid definition number (either less than 1 or greater than the number of associated definitions), the string Term 'term' only has N definition[s]. will be printed. If N=1 the s in “definitions” should be omitted.
  • If any invalid query type is given, the program should print Invalid command given: 'command', where command is replaced with the invalid command identifier.

Example#

Consider the following text input:

term 1;definition 1
term 2;definition 2;definition 3

Running the following commands on the binary data generated from the above will have the following results written to standard output:

count term 1
count term 2
definition 2 term 2
definition 2 term 1
all term 2
all term 1
count non-existent term
definition 5 term 2
Term 'term 1' has 1 definition.
Term 'term 2' has 2 definitions.
Definition 2 of term 'term 2':
  - definition 3
Term 'term 1' only has 1 definition.
Definition(s) of 'term 2':
  - definition 2
  - definition 3
Definition(s) of 'term 1':
  - definition 1
Term 'non-existent term' not present in dictionary.
Term 'term 2' only has 2 definitions.

Complete the process_commands function to match the above specification.

Error Handling#

While you may assume certain things about the content of the files you are given, you will still need to handle potential errors with the I/O functions you use. For example, your program should gracefully handle the case where you attempt to read from an input file that does not exist.

If any error occurs in a function that sets the global errno (such as fopen, mmap or stat), your code should print an appropriate error message to standard error using the perror function that includes the name of the function that failed. It should then call exit with status 1, regardless of the specific cause of the error.

Testing#

Some of the tests will compare the results of your serialization code against a pre-generated binary file. These test cases will assume that your machine is little endian (which it almost certainly will be).

We have provided some test cases for you to check the behaviour of your code by comparing the output with what is expected. These five test cases will be run by the CI when you push your changes to GitLab. To run these tests locally, we also provide a bash script that will run each of these five test cases for you. You can run this script with the command (or with the -n flag if you wish to disable colorized output):

./run_tests.sh

The following is a summary of the five different test cases you are given:

Test Number Behaviour
1 This will covert to binary the contents of tests/test-1.text and compare the results with tests/test-1.binary
2 This will covert the contents of tests/test-2.binary to text and compare the results with tests/test-2.binary
3 This will process a number of queries on the binary data from tests/test-1.binary. The expected results of these queries is in test-3.expected.
4 This will process a number of queries on the binary data from tests/test-2.binary. The expected results of these queries is in test-4.expected.
5 This will check how your code behaves when the to_binary command is invoked on a non-existent text file.

Tips, Tricks and Hints#

  • It may be difficult to read the contents of the binary files you create directly using an editor designed for reading text files. You can use a command line tool like hexdump or xxd to view the raw bytes contained in a file directly. Run man hexdump or man xxd for information on how to use these tools. If you are using VsCode, there is an Extension that allows you to read binary data easily as well.
  • If you are failing the serialization test, it can be difficult to use diff or cmp to see which bytes differ. You can try comparing the output of xxd on the expected and actual file contents instead:
xxd out/test1-output.binary > out/test1-output.hexdump 
xxd tests/test-1.binary > out/test-1-expected.hexdump
diff out/test1-output.hexdump out/test-1-expected.hexdump
  • Remember there is a difference between reading and writing an integer value and the ASCII representation of an integer. When creating the database, integers must be written as their value. When printing values as a response to a query or printing the database, integers must be written as their ASCII representation. Also remember that some keys could contain numeric characters - these should always remain in their ASCII representation.

  • The getline function can help you read in lines of arbitrary length. To understand how to use this, run man getline. You can also consider using stdlib functions fgets, strsep and strtok.

Marking#

Like Checkpoint 1, your total mark out of 5 will be determined by two things:

  1. The five test cases provided in the CI. The input used for these five test cases are the same as the tests you are given to run locally.
  2. Additional test cases not provided to you, to both verify that you have read this specification carefully and have not hardcoded the intended behaviour.

Each of the five test cases present in the CI will award 0.5 marks for passing, and 0 for failing.

Depending on the specifics of your machine, it may be the case that test cases which pass locally fail on the CI. Therefore it is very important that you check the CI results of your checkpoint, as failing tests will not be awarded partial marks.

Submission#

Submission is through GitLab, the most recently pushed commit of your fork of the assessment template before the deadline is taken to be your assessment submission.

Getting Started#

  1. read this assessment page completely
  2. fork and clone the assessment template
    • ensure you fork your project as private
    • do NOT change the name or path of the repo, or it may get missed by our software private fork
  3. work on each task, testing, committing and pushing as you go
  4. make a mistake or get stuck, then ask a good question on the course forum.

Completion Checklist#

  • you have completed the code in the files provided and have not renamed or moved them
  • you have run your files against local test(s) and they pass successfully
  • you have created your own tests for your files and they pass successfully
  • you have saved, committed and pushed your files to gitlab
  • you have filled out, committed, and pushed your statement of originality
  • you have checked the gitlab ci tests and they are passing
  • you have checked the pdf artifact on gitlab to ensure it is correct

Gitlab CI and Artifacts#

You should already be familiar with the CI jobs from Assignment 1 and Checkpoint 1. As a reminder:

To view the CI pipeline for your repo, click on the little icon to the right of your most recent commit.

ci icon

Then you’ll be presented with a list of the jobs, you’ll want to make sure that the jobs are passing once you’ve completed the corresponding task.

Filecheck stages ensure that the file exists and has been updated from the default. Test stages run the submitted files against our tests to measure correctness.

ci job status

FAQ#

My code don’t work, can I email you for help?#

Sorry, you won’t get help over email or Teams. We provide a course forum which is the only way we are able to help outside of labs.

Forum posts related to your assessment submission must be “private” if they contain code or other sensitive information (as for any individual assessment task).

It’s [5 minutes, 60 minutes, 12 hours] before the deadline and my CI Jobs aren’t finishing!#

Unfortunately on the day that an assessment is due, when many students are pushing updates at once, the CI servers can’t keep up. You may not see your CI jobs finish before the deadline. You will just have to manually check that your files have been submitted correctly and that you are passing tests locally.

The best way to avoid this issue is to start early and finish early 😇

If there’s any issues with your git repository after the deadline. Please let us know (after the deadline) through a private forum post and there may be something we can do.

How do I know my assessment has been submitted?#

If:

  1. the files in your fork of the assessment are correct (i.e., the files you intend to submit) when checking on the gitlab website
  2. the time is before the deadline

then your assessment has been submitted (well done!).

Please don’t ask us to “check”, we would be just doing exactly the same thing as the above steps which you can do yourself.

  1. All terms and definitions used in test files for this checkpoint were sourced from https://github.com/wordset/wordset-dictionary 

  2. Remember to free any memory allocated by standard library functions that call malloc themselves, such as getline

bars search times