Description#

The first lab is intended to familiarise you with the build environment and the structure of the project files while you implement the scanner for the MoJo language. The scanner takes input from the standard input stream (`System.in`) and writes output to the standard output stream (`System.out`). It should print each recognized token and its corresponding lexeme to the standard output, one per line. Input files can be any sequence of MoJo tokens and are not required to be valid MoJo programs.

This lab is not assessed, but it will be important preparation for the first assessed project assignment, in which you will build the parser for MoJo.

Getting Started#

You should fork the Lab 1 repository on GitLab. It contains a template file, a grading script, and some example test cases. The JavaCC template is in the file `src/mojo/Parser.jj` and should look like this:

```java
/* Copyright (C) 1997-2023, Antony L Hosking.
 * All rights reserved.  */

options {
  DEBUG_PARSER = false;
  DEBUG_LOOKAHEAD = false;
  DEBUG_TOKEN_MANAGER = false;
  STATIC = false;
  JDK_VERSION = "1.9";
}
PARSER_BEGIN(Parser)
public class Parser {}
PARSER_END(Parser)

/**************************************************
 * The lexical spec starts here                   *
 **************************************************/

TOKEN_MGR_DECLS :
{
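  // Counters you may find handy for tracking nesting depth (e.g. of comments) in the lexical rules you will add.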
  int comment, pragma;
  public static void main(String[] args) {
    SimpleCharStream stream = new SimpleCharStream(System.in);
    ParserTokenManager scanner = new ParserTokenManager(stream);
    while (true) {
      try {
        Token token = scanner.getNextToken();
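        // Print any special tokens (if any are defined with SPECIAL_TOKEN) chained onto this token via specialToken.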
        for (Token t = token.specialToken; t != null; t = t.specialToken)
          System.out.println(tokenImage[t.kind] + " " + t);
        if (token.kind == EOF) break;
        System.out.println(tokenImage[token.kind] + " " + token);
      } catch (TokenMgrError e) {
        System.err.println(e.getMessage());
        System.exit(-1);
      }
    }
  }
}

/* WHITE SPACE */
SKIP : { " " | "\t" | "\n" | "\r" | "\13" | "\f" }

/* KEYWORDS */
TOKEN :
{
  "break" | "class" | "const" | "else" | "extends" | "for" | "if" | "loop" | "method" |
  "override" | "proc" | "return" | "struct" | "type" | "until" | "val" | "var" | "while"
}

/* OPERATORS */
TOKEN :
{ "||" | "<"  | "<=" | "+" | "-" | "{" | "}" | ";" | ","
| "&&" | ">"  | ">=" | "*" | "/" | "(" | ")" | ":" | "."
| "==" | "!"  | "!=" | ".."| "%" | "[" | "]" | ":="| "=" | "^" }

/* TODO: comments */
SKIP : { "/* */" }

/* TODO: identifiers */
TOKEN : { < ID : "TODO" > }

/* TODO: numbers */
TOKEN : { < NUMBER : "42" > }

/* TODO: characters */
TOKEN : { < CHAR : "'a'" > }

/* TODO: texts (strings) */
TOKEN : { < TEXT : "\"TODO\"" > }
```

To run the scanner, first compile the project with `make`, then invoke the class
`ParserTokenManager`, which reads from standard input (`System.in`):

```sh
java -cp bin ParserTokenManager
```

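For example, you could pipe a short sequence of MoJo tokens straight into the scanner; the input below is illustrative and is not one of the supplied test cases:

```sh
echo "val x := 42" | java -cp bin ParserTokenManager
```
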
Expected Output#

Your program should tokenize its input, ignoring whitespace and comments. The output should be the tokens your scanner recognizes, one per line, like this:

```
<TOKEN> lexeme
<TOKEN> lexeme
```

For example, if you enter `'c'`, you should see the recognised token echoed back as `<CHAR> 'c'`.

The lexeme portion comes from the input program: it is the actual characters that were matched for that token. If the input contains an invalid token, your program should print `!ERROR` on the standard output and then exit (an error message will also be written to the standard error).
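
As a fuller illustration, suppose the input is `val x := 42` and you have completed the identifier and number rules using the `ID` and `NUMBER` names from the template. With the main loop shown above, the output would look roughly like this (keyword and operator tokens defined as string literals are echoed as their quoted images):

```
"val" val
<ID> x
":=" :=
<NUMBER> 42
```

The exact form of the first column depends on how you name your tokens, so treat this as a sketch rather than the graded reference output.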

Testing#

The repository you forked contains some test cases and their expected outputs. You can run these tests using the `grade.sh` script at the top level of the repository. Feel free to devise additional test cases (we don’t promise that what we have given you is comprehensive).
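
If you want to inspect a single case by hand, you can also run the scanner over a test input directly and compare its output with the expected output yourself. The file names below are placeholders; substitute whatever the repository actually provides:

```sh
# Placeholder paths: substitute the actual test input and expected-output files from your fork.
java -cp bin ParserTokenManager < tests/example.in > example.out
diff example.out tests/example.expected
```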

If you’re stuck, you can reach out for help at any time; the course help page or discussion forum is a good place to start.
