Computer Science 236

Datalog Scanner


Note: Projects are to be completed by each student individually (not by groups of students).

Note: Projects must be passed off on the pass-off website or with a TA to be given credit.

Note: You may not use a regular expression library for this project.

Write a scanner that reads a sequence of characters from a text file, identifies the datalog language tokens found in the file, and outputs each token.

Example Input


Queries:
   marriedTo ('Bea' , 'Zed')?

Rules:
   marriedTo( X,Y ) :- marriedTo(Y,X) .


Example Output

(QUERIES,"Queries",2)
(COLON,":",2)
(ID,"marriedTo",3)
(LEFT_PAREN,"(",3)
(STRING,"'Bea'",3)
(COMMA,",",3)
(STRING,"'Zed'",3)
(RIGHT_PAREN,")",3)
(Q_MARK,"?",3)
(RULES,"Rules",5)
(COLON,":",5)
(ID,"marriedTo",6)
(LEFT_PAREN,"(",6)
(ID,"X",6)
(COMMA,",",6)
(ID,"Y",6)
(RIGHT_PAREN,")",6)
(COLON_DASH,":-",6)
(ID,"marriedTo",6)
(LEFT_PAREN,"(",6)
(ID,"Y",6)
(COMMA,",",6)
(ID,"X",6)
(RIGHT_PAREN,")",6)
(PERIOD,".",6)
(EOF,"",8)
Total Tokens = 26

Example Input

,
'a string'
# a comment
Schemes
FactsRules
::-

Example Output

(COMMA,",",1)
(STRING,"'a string'",2)
(COMMENT,"# a comment",3)
(SCHEMES,"Schemes",4)
(ID,"FactsRules",5)
(COLON,":",6)
(COLON_DASH,":-",6)
(EOF,"",7)
Total Tokens = 8

Testing

Here are some ideas for tests.

  1. An empty input file.
  2. A colon immediately followed by another token (no space between the colon and the next token).
  3. An identifier that contains a number.
  4. An identifier that contains a keyword.
  5. An empty string (nothing between the quotes '').
  6. An unterminated string.

Design

You will build a Datalog parser in the next project, and that parser will read its tokens from this scanner. Design the scanner so that the parser can easily retrieve the tokens it produces.
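One possible way to make tokens easy for a parser to consume is to collect them in a list and hand them out one at a time. The sketch below is only an illustration under stated assumptions: the names 'Token', 'TokenStream', 'peek', and 'advance' are hypothetical and not part of the project requirements, and the token list is assumed to be non-empty and to end with an EOF token.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token record: type name, matched text, and starting line.
struct Token {
    std::string type;
    std::string value;
    int line;
};

// Hypothetical wrapper that lets a parser walk the scanner's tokens in order.
// Assumes the list is non-empty and that its last token is EOF.
class TokenStream {
public:
    explicit TokenStream(std::vector<Token> tokens) : tokens_(std::move(tokens)) {}

    // Returns the current token without consuming it.
    const Token& peek() const { return tokens_.at(index_); }

    // Returns the current token and moves to the next one.
    // The position never moves past the final (EOF) token.
    const Token& advance() {
        const Token& current = tokens_.at(index_);
        if (index_ + 1 < tokens_.size()) {
            ++index_;
        }
        return current;
    }

private:
    std::vector<Token> tokens_;
    std::size_t index_ = 0;
};
```

With this shape, the parser only needs 'peek' to decide what comes next and 'advance' to consume it. Other designs (for example, returning the whole vector) would work equally well.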

White Space

White space is a sequence of space, tab, or newline characters. Your scanner should always skip over white space between tokens. White space cannot be ignored entirely, however, because it is sometimes needed to separate tokens. In C++, an easy way to recognize white space characters is the 'isspace' function.
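Because every token is reported with a line number, newlines need to be counted while white space is skipped. The following is a minimal sketch, assuming the whole input has been read into a 'std::string'. The function name 'skipWhitespace' and its parameters are hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Returns the index of the first non-white-space character at or after pos.
// Increments line once for every newline that is skipped.
std::size_t skipWhitespace(const std::string& input, std::size_t pos, int& line) {
    // The cast to unsigned char avoids undefined behavior for negative char values.
    while (pos < input.size() &&
           std::isspace(static_cast<unsigned char>(input[pos]))) {
        if (input[pos] == '\n') {
            ++line;
        }
        ++pos;
    }
    return pos;
}
```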

Output Format

The expected output is a list of the tokens found in the input file followed by a count of the number of tokens found. The tokens are output one token per line.

Each line has the form:

(type,"value",line)

The 'type' must be one of the types listed in the Token Types table. The 'value' is the actual input text that forms the token. The 'line' is the line number where the token begins. Notice there are no spaces on either side of the commas separating the token's elements.

The last line of output has the form:

Total Tokens = N

where 'N' is the number of tokens found.
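The output format above can be sketched as two small formatting helpers. This is an illustration only: the 'Token' struct and the names 'toString' and 'totalLine' are assumptions, and the token type is stored as a string here purely to keep the example short.

```cpp
#include <cstddef>
#include <string>

// Hypothetical token record: type name, matched text, and starting line.
struct Token {
    std::string type;
    std::string value;
    int line;
};

// Formats one token as (type,"value",line) with no spaces around the commas.
std::string toString(const Token& token) {
    return "(" + token.type + ",\"" + token.value + "\"," +
           std::to_string(token.line) + ")";
}

// Formats the final summary line.
std::string totalLine(std::size_t count) {
    return "Total Tokens = " + std::to_string(count);
}
```

For example, a COMMA token found on line 3 would be formatted as (COMMA,",",3).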

Input Errors

When the input contains errors, output tokens with the type UNDEFINED.

Undefined tokens are:

  1. A single character that cannot be the first character of a valid token.
  2. A string that is not terminated.
  3. A block comment that is not terminated.

Example Input

Facts:
$
Rules:

Example Output

(FACTS,"Facts",1)
(COLON,":",1)
(UNDEFINED,"$",2)
(RULES,"Rules",3)
(COLON,":",3)
(EOF,"",4)
Total Tokens = 6
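An unterminated string can be detected in the same routine that scans well-formed strings. The following is a minimal sketch that follows the STRING rule in the Token Types table (two adjacent single quotes inside a string stand for an apostrophe, and the token's line is the line where it begins). The names 'Token' and 'scanString' and the position and line parameters are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical token record: type name, matched text, and starting line.
struct Token {
    std::string type;
    std::string value;
    int line;
};

// Scans the token that starts at input[pos], which must be a single quote.
// On return, pos is just past the token and line has been advanced for
// every newline found inside it.
Token scanString(const std::string& input, std::size_t& pos, int& line) {
    const std::size_t start = pos;
    const int startLine = line;
    ++pos;  // consume the opening quote
    while (pos < input.size()) {
        if (input[pos] == '\'') {
            if (pos + 1 < input.size() && input[pos + 1] == '\'') {
                pos += 2;  // two adjacent quotes: an apostrophe, keep scanning
                continue;
            }
            ++pos;  // consume the closing quote
            return {"STRING", input.substr(start, pos - start), startLine};
        }
        if (input[pos] == '\n') {
            ++line;
        }
        ++pos;
    }
    // End of input was reached before a closing quote.
    return {"UNDEFINED", input.substr(start), startLine};
}
```

Note that the value keeps both adjacent quotes exactly as they appear in the input. A block comment can be handled with the same pattern by searching for the closing |# instead.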

Token Types

The following table describes the types of tokens your scanner must recognize.

Token Type    Description            Examples
COMMA         The ',' character      ,
PERIOD        The '.' character      .
Q_MARK        The '?' character      ?
LEFT_PAREN    The '(' character      (
RIGHT_PAREN   The ')' character      )
COLON         The ':' character      :
COLON_DASH    The string ":-"        :-
MULTIPLY      The '*' character      *
ADD           The '+' character      +
SCHEMES       The string "Schemes"   Schemes
FACTS         The string "Facts"     Facts
RULES         The string "Rules"     Rules
QUERIES       The string "Queries"   Queries
ID An identifier is a letter followed by zero or more letters or digits, and is not a keyword (Schemes, Facts, Rules, Queries).

Note that for the input "1stPerson" the scanner would find two tokens: an 'undefined' token made from the character "1" and an 'identifier' token made from the characters "stPerson".

Valid Identifiers    Invalid Identifiers
Identifier1          1stPerson
Person               Schemes

STRING A string is a sequence of characters enclosed in single quotes. White space (space, tab) is not skipped when inside a string. Two adjacent single quotes within a string denote an apostrophe. The line number for a string token is the line where the string begins. If a string is not terminated (end of file is encountered before the end of the string), the token becomes an undefined token.

The 'value' of a token printed to the output is the sequence of input characters that form the token. For a string token this means that two adjacent single quotes in the input are printed as two adjacent single quotes in the output. (In other words, don't convert two adjacent single quotes in a string to just one apostrophe in the output.)

Examples:

'This is a string'

'' -- (The empty string)

'This isn''t two strings'

COMMENT A line comment starts with a hash character (#) and ends at the end of the line or end of the file.

A block comment starts with #| and ends with |#. Block comments may cover multiple lines. Block comments can be empty and multiple comments can appear on the same line. The line number for a comment token is the line where the comment begins. If a block comment is not terminated (end of file is encountered before the end of the comment), the token becomes an undefined token.

Examples:

# This is a comment

#||#

#| This is a
multiline comment |#

#| This is an illegal block comment
because it ends with end of file

UNDEFINED Any character not tokenized as a string, comment, keyword, identifier, symbol, or white space is undefined. Additionally, any unterminated string or unterminated block comment is undefined. In both of these cases, end of file is reached before the end of the string or comment.

Examples:

$&^ (Three undefined tokens)

'a string that does not end

EOF The end of the input file.
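
The ID and keyword rows of the table can be handled together: read the longest run of letters and digits, then check whether the whole run is exactly one of the four keywords. The following is a minimal sketch of that idea. The function names 'scanWord' and 'classifyWord' are hypothetical, and 'scanWord' assumes the caller has already checked that the first character is a letter.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Returns the token type for a complete run of letters and digits.
// Only an exact match is a keyword; anything else is an identifier.
std::string classifyWord(const std::string& word) {
    if (word == "Schemes") return "SCHEMES";
    if (word == "Facts") return "FACTS";
    if (word == "Rules") return "RULES";
    if (word == "Queries") return "QUERIES";
    return "ID";
}

// Reads the run of letters and digits that starts at input[pos] and
// advances pos past it. Assumes input[pos] is a letter.
std::string scanWord(const std::string& input, std::size_t& pos) {
    const std::size_t start = pos;
    while (pos < input.size() &&
           std::isalnum(static_cast<unsigned char>(input[pos]))) {
        ++pos;
    }
    return input.substr(start, pos - start);
}
```

Because the whole run is read before it is classified, "FactsRules" is a single ID rather than two keywords. For "1stPerson", the caller would not invoke 'scanWord' at the digit, so "1" becomes an undefined token and scanning resumes at "stPerson".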