Datalog Scanner
Note: Projects are to be completed by each student individually (not by groups of students).
Note: Projects must be passed off on the pass-off website or with a TA to receive credit.
Note: You may not use a regular expression library for this project.
Write a scanner that reads a sequence of characters from a text file, identifies the Datalog language tokens found in the file, and outputs each token.
Example Input

Queries:
marriedTo ('Bea' , 'Zed')?

Rules:
marriedTo( X,Y ) :- marriedTo(Y,X) .

Example Output
(QUERIES,"Queries",2)
(COLON,":",2)
(ID,"marriedTo",3)
(LEFT_PAREN,"(",3)
(STRING,"'Bea'",3)
(COMMA,",",3)
(STRING,"'Zed'",3)
(RIGHT_PAREN,")",3)
(Q_MARK,"?",3)
(RULES,"Rules",5)
(COLON,":",5)
(ID,"marriedTo",6)
(LEFT_PAREN,"(",6)
(ID,"X",6)
(COMMA,",",6)
(ID,"Y",6)
(RIGHT_PAREN,")",6)
(COLON_DASH,":-",6)
(ID,"marriedTo",6)
(LEFT_PAREN,"(",6)
(ID,"Y",6)
(COMMA,",",6)
(ID,"X",6)
(RIGHT_PAREN,")",6)
(PERIOD,".",6)
(END,"",8)
Total Tokens = 26
Example Input
,
'a string'
# a comment
Schemes
FactsRules
::-
Example Output
(COMMA,",",1)
(STRING,"'a string'",2)
(COMMENT,"# a comment",3)
(SCHEMES,"Schemes",4)
(ID,"FactsRules",5)
(COLON,":",6)
(COLON_DASH,":-",6)
(END,"",7)
Total Tokens = 8
Testing
Here are some ideas for tests.
- An empty input file.
- A colon immediately followed by another token (no space between the colon and the next token).
- An identifier that contains a number.
- An identifier that contains a keyword.
- An empty string (nothing between the quotes '').
- An unterminated string.
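Two of the ideas above, an identifier that contains a number and an identifier that contains a keyword, both come down to how a word is scanned. The following is a minimal sketch rather than a required design. The helper name 'scanWord' is hypothetical, and the sketch assumes the whole input has already been read into a std::string.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Returns {type name, token text} for the word that starts at input[pos]
// and moves pos just past it. Assumes input[pos] is a letter.
std::pair<std::string, std::string> scanWord(const std::string& input,
                                             std::size_t& pos) {
    const std::size_t start = pos;
    // Consume the longest run of letters and digits.
    while (pos < input.size() &&
           std::isalnum(static_cast<unsigned char>(input[pos]))) {
        ++pos;
    }
    const std::string text = input.substr(start, pos - start);
    // Only an exact match is a keyword, so "FactsRules" stays an ID.
    if (text == "Schemes") return {"SCHEMES", text};
    if (text == "Facts")   return {"FACTS", text};
    if (text == "Rules")   return {"RULES", text};
    if (text == "Queries") return {"QUERIES", text};
    return {"ID", text};
}
```

Because the whole word is consumed before it is compared with the keywords, an input such as FactsRules yields a single ID token, which matches the second example output above.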
Design
You will build a Datalog parser in the next project. The parser will read tokens from this scanner, so design the scanner so that the parser can easily get the tokens from it.
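One possible way to make tokens easy for a parser to consume is sketched below. It is only an illustration: the names 'TokenType', 'Token', and 'TokenStream' are hypothetical, and the sketch assumes the scanner collects its tokens into a vector that always ends with an END token.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// The enumerators mirror the token types listed in the Token Types table.
enum class TokenType {
    COMMA, PERIOD, Q_MARK, LEFT_PAREN, RIGHT_PAREN, COLON, COLON_DASH,
    MULTIPLY, ADD, SCHEMES, FACTS, RULES, QUERIES, ID, STRING, COMMENT,
    UNDEFINED, END
};

struct Token {
    TokenType type;
    std::string value;  // exact input text that forms the token
    int line;           // line where the token begins
};

// A thin wrapper that hands out tokens one at a time.
// Assumes the vector is non-empty and ends with an END token.
class TokenStream {
public:
    explicit TokenStream(std::vector<Token> tokens)
        : tokens_(std::move(tokens)) {}

    // Look at the current token without consuming it.
    const Token& peek() const { return tokens_.at(pos_); }

    // Consume the current token; never moves past the last token.
    void advance() {
        if (pos_ + 1 < tokens_.size()) ++pos_;
    }

private:
    std::vector<Token> tokens_;
    std::size_t pos_ = 0;
};
```

Keeping the scanner's result as plain data (a vector of tokens) means the parser does not need to know how the characters were read.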
White Space
White space is a sequence of space, tab, or newline characters. Your scanner should always skip over white space between tokens. White space is not completely ignored, because it is sometimes needed to separate tokens. In C++, an easy way to recognize white space characters is the 'isspace' function from <cctype>.
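Since every token reports a line number, skipping white space is also a natural place to count newlines. The following is a minimal sketch under the assumption that the input is held in a std::string and that the current position and line are tracked by the caller; 'skipWhitespace' is a hypothetical helper name.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Advances pos past any white space in input, incrementing line once
// for every newline that is skipped.
void skipWhitespace(const std::string& input, std::size_t& pos, int& line) {
    // The cast avoids undefined behavior for characters with negative values.
    while (pos < input.size() &&
           std::isspace(static_cast<unsigned char>(input[pos]))) {
        if (input[pos] == '\n') ++line;
        ++pos;
    }
}
```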
Output Format
The expected output is a list of the tokens found in the input file followed by a count of the number of tokens found. The tokens are output one token per line.
Each line has the form:
(type,"value",line)
The 'type' must be one of the types listed in the table. The 'value' is the actual input text that forms the token. The 'line' is the line number where the token is found. Notice there are no spaces on either side of the commas separating the token's elements.
The last line of output has the form:
Total Tokens = N
where 'N' is the number of tokens found.
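The output format can be produced with simple string concatenation. This is a sketch only; 'formatToken' and 'formatTotal' are hypothetical helper names, and the type is passed as a string that is assumed to already hold one of the names from the token table.

```cpp
#include <cstddef>
#include <string>

// Builds one output line of the form (type,"value",line),
// with no spaces on either side of the commas.
std::string formatToken(const std::string& type, const std::string& value,
                        int line) {
    return "(" + type + ",\"" + value + "\"," + std::to_string(line) + ")";
}

// Builds the last line of the output.
std::string formatTotal(std::size_t count) {
    return "Total Tokens = " + std::to_string(count);
}
```

The value is written exactly as it appeared in the input, so a string token keeps its surrounding single quotes inside the double quotes.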
Input Errors
When the input contains errors, output tokens with the type UNDEFINED.
Undefined tokens are:
- A single character that cannot be the first character of a valid token.
- A string that is not terminated.
- A block comment that is not terminated.
Example Input
Facts:
$
Rules:
Example Output
(FACTS,"Facts",1)
(COLON,":",1)
(UNDEFINED,"$",2)
(RULES,"Rules",3)
(COLON,":",3)
(END,"",4)
Total Tokens = 6
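The single-character case can be handled as the fallback of the code that recognizes punctuation. The sketch below is one way to do it, not a prescribed design. The helper name 'scanSymbol' is hypothetical, and it assumes the caller has already dealt with white space, letters, single quotes, and the hash character.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Returns {type name, token text} for the symbol that starts at input[pos].
// Assumes pos < input.size().
std::pair<std::string, std::string> scanSymbol(const std::string& input,
                                               std::size_t pos) {
    const char c = input[pos];
    switch (c) {
        case ',': return {"COMMA", ","};
        case '.': return {"PERIOD", "."};
        case '?': return {"Q_MARK", "?"};
        case '(': return {"LEFT_PAREN", "("};
        case ')': return {"RIGHT_PAREN", ")"};
        case '*': return {"MULTIPLY", "*"};
        case '+': return {"ADD", "+"};
        case ':':
            // Look one character ahead to tell ":-" from ":".
            if (pos + 1 < input.size() && input[pos + 1] == '-') {
                return {"COLON_DASH", ":-"};
            }
            return {"COLON", ":"};
        default:
            // Any other character becomes a one-character UNDEFINED token.
            return {"UNDEFINED", std::string(1, c)};
    }
}
```

The caller would advance its position by the length of the returned text, so the input ::- produces a COLON followed by a COLON_DASH.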
Token Types
The following table describes the types of tokens your scanner must recognize.
Token Type | Description | Examples
---|---|---
COMMA | The ',' character | ,
PERIOD | The '.' character | .
Q_MARK | The '?' character | ?
LEFT_PAREN | The '(' character | (
RIGHT_PAREN | The ')' character | )
COLON | The ':' character | :
COLON_DASH | The string ":-" | :-
MULTIPLY | The '*' character | *
ADD | The '+' character | +
SCHEMES | The string "Schemes" | Schemes
FACTS | The string "Facts" | Facts
RULES | The string "Rules" | Rules
QUERIES | The string "Queries" | Queries
ID | An identifier is a letter followed by zero or more letters or digits, and is not a keyword (Schemes, Facts, Rules, Queries). Note that for the input "1stPerson" the scanner would find two tokens: an 'undefined' token made from the character "1" and an 'identifier' token made from the characters "stPerson". |
STRING | A string is a sequence of characters enclosed in single quotes. White space (space, tab) is not skipped when inside a string. Two adjacent single quotes within a string denote an apostrophe. The line number for a string token is the line where the string begins. If a string is not terminated (end of file is encountered before the end of the string), the token becomes an undefined token. The 'value' of a token printed to the output is the sequence of input characters that form the token. For a string token this means that two adjacent single quotes in the input are printed as two adjacent single quotes in the output. (In other words, don't convert two adjacent single quotes in a string to just one apostrophe in the output.) | 'This is a string'<br>'' (The empty string)<br>'This isn''t two strings'
COMMENT | A line comment starts with a hash character (#) and ends at the end of the line or end of the file. | # This is a comment
COMMENT | A block comment starts with #\| and ends with \|#. Block comments may cover multiple lines. Block comments can be empty and multiple comments can appear on the same line. The line number for a comment token is the line where the comment begins. If a block comment is not terminated (end of file is encountered before the end of the comment), the token becomes an undefined token. | #\|\|#<br>#\| This is a multiline comment \|#<br>#\| This is an illegal block comment because it ends with end of file
UNDEFINED | Any character not tokenized as a string, keyword, identifier, symbol, or white space is undefined. Additionally, any non-terminating string or non-terminating block comment is undefined. In both of the latter cases you reach the end of the file before finding the end of the string or the end of the comment. | $&^ (Three undefined tokens)<br>'a string that does not end
END | The end of the input file. |
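The STRING and COMMENT rows have the most detailed rules, so a small sketch may help. It is an illustration under stated assumptions, not a reference solution. The names 'Scanned', 'scanString', and 'scanComment' are hypothetical, the input is assumed to be in a std::string, and each function assumes the caller has already checked the first character.

```cpp
#include <cstddef>
#include <string>

// Result record for this sketch: type is a name from the table, value is
// the exact input text, line is where the token begins.
struct Scanned {
    std::string type;
    std::string value;
    int line;
};

// Scans a string. Assumes input[pos] is the opening single quote.
Scanned scanString(const std::string& input, std::size_t& pos, int& line) {
    const std::size_t start = pos;
    const int startLine = line;
    ++pos;  // consume the opening quote
    while (pos < input.size()) {
        const char c = input[pos++];
        if (c == '\n') {
            ++line;
        } else if (c == '\'') {
            if (pos < input.size() && input[pos] == '\'') {
                ++pos;  // two adjacent quotes: an apostrophe, keep scanning
            } else {
                return {"STRING", input.substr(start, pos - start), startLine};
            }
        }
    }
    // End of input was reached before a closing quote.
    return {"UNDEFINED", input.substr(start), startLine};
}

// Scans a line or block comment. Assumes input[pos] is '#'.
Scanned scanComment(const std::string& input, std::size_t& pos, int& line) {
    const std::size_t start = pos;
    const int startLine = line;
    if (pos + 1 < input.size() && input[pos + 1] == '|') {
        pos += 2;  // consume "#|"
        while (pos < input.size()) {
            if (input[pos] == '|' && pos + 1 < input.size() &&
                input[pos + 1] == '#') {
                pos += 2;  // consume "|#"
                return {"COMMENT", input.substr(start, pos - start), startLine};
            }
            if (input[pos] == '\n') ++line;
            ++pos;
        }
        // End of input was reached before "|#".
        return {"UNDEFINED", input.substr(start), startLine};
    }
    // Line comment: runs up to, but does not include, the newline.
    while (pos < input.size() && input[pos] != '\n') ++pos;
    return {"COMMENT", input.substr(start, pos - start), startLine};
}
```

Both functions record the starting line before consuming anything, so a multi-line token is reported on the line where it begins while the caller's line counter still advances past the newlines inside it.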