Proparse Scanner

Contents

Introducing The Scanner
Using the Scanner
Using Queries with the Scanner

Introducing The Scanner

The scanner is like a dumbed-down version of the parser. It generates a list of nodes, a little like the list of tokens that come from the lexer as input to the parser.

The scanner doesn't do nearly as much as the lexer. For example, the lexer has built-in preprocessing; the scanner does not. The lexer matches start and end symbols to glob together long strings of characters into single tokens; the scanner does not. Take /* hello world */ as an example. The lexer turns that into a single token with token type COMMENT. The scanner instead sees a COMMENTSTART, WS (whitespace), ID (identifier - any text that might be an identifier), WS, ID, WS, and finally COMMENTEND. Because the scanner doesn't generate "tokens" appropriate for handing off to the parser, we sometimes call the nodes "symbols" instead.
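To make the difference in granularity concrete, here is a minimal toy sketch in Python. It is not Proparse's scanner or its API: the patterns and the `scan` helper are hypothetical, and only the symbol names are borrowed from the example above.

```python
import re

# Hypothetical patterns, chosen only to mirror the example in the text.
# This is NOT how Proparse's scanner is implemented.
SYMBOL_PATTERNS = [
    ("COMMENTSTART", r"/\*"),
    ("COMMENTEND", r"\*/"),
    ("WS", r"\s+"),
    ("ID", r"[^\s/*]+"),
    ("OTHER", r"."),  # fallback so every character lands in some symbol
]
SCANNER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in SYMBOL_PATTERNS))


def scan(text):
    """Split text into (symbol_type, text) pairs without matching start/end pairs."""
    return [(match.lastgroup, match.group()) for match in SCANNER.finditer(text)]


symbols = scan("/* hello world */")
print([kind for kind, _ in symbols])

# An unterminated comment is no problem: the scan simply has no COMMENTEND.
print([kind for kind, _ in scan("/* hello world {&endofcomment}")])
```

Because every character ends up in exactly one symbol, joining the symbol texts gives back the original source, which is what makes this kind of scan usable for editing text.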

Why?
The parser's purpose is to allow us to work with context and semantics. The scanner's purpose is to allow us to work with the text in source files. The scanner allows us to scan any source file, whether it compiles or not. What if the example above had been /* hello world {&endofcomment}, and {&endofcomment} was not defined in this file? With start and end symbols matched as the lexer does, the comment would never be closed, and the entire rest of the source file would be considered comment.

The finer granularity is necessary to allow us to work with any source file, without having to know anything about context. So what advantage does the scanner give us? It gives us a mechanism for looking at source files and for changing them. Changing text in memory is tedious when you have to work with one character at a time. The symbol scanner makes things a little easier by allowing us to work with one symbol at a time instead.

The scanner does try to recognize keywords. For example, from "def var mychar as char" the scanner recognizes that we have DEFINE, VARIABLE, ID, AS, and CHARACTER symbols (whitespace symbols omitted here). However, /* def var */ would likewise yield COMMENTSTART, DEFINE, VARIABLE, and COMMENTEND symbols, even though the keywords sit inside a comment, so the lack of context is limiting.
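As an illustration of context-free keyword recognition, the following toy Python sketch maps abbreviated words to keyword symbol types. The keyword table and the `classify` and `symbol_types` helpers are hypothetical and cover only the words in the example. For brevity the sketch splits on whitespace and drops it, whereas the real scanner emits WS symbols.

```python
# Hypothetical table: full keyword -> shortest abbreviation accepted in this sketch.
KEYWORDS = {
    "DEFINE": "DEF",
    "VARIABLE": "VAR",
    "AS": "AS",
    "CHARACTER": "CHAR",
}


def classify(word):
    """Return a symbol type for one word, knowing nothing about its context."""
    if word == "/*":
        return "COMMENTSTART"
    if word == "*/":
        return "COMMENTEND"
    upper = word.upper()
    for full, shortest in KEYWORDS.items():
        # A word counts as a keyword if it is a prefix of the full keyword
        # and at least as long as the shortest accepted abbreviation.
        if full.startswith(upper) and len(upper) >= len(shortest):
            return full
    return "ID"


def symbol_types(text):
    """Classify each whitespace-separated word (whitespace itself is dropped here)."""
    return [classify(word) for word in text.split()]


print(symbol_types("def var mychar as char"))
print(symbol_types("/* def var */"))
```

The second call shows the limitation described above: the words inside the comment are still reported as DEFINE and VARIABLE, because nothing tracks that a comment is open.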

We now have a powerful combination to work with: context and semantics from the parser, plus a granular method of working with text from a source file. With that in mind, see what your imagination can come up with.

Using the Scanner

Use parserParseCreate to create a scan, parserParseDelete to delete a scan, and parserParseGetTop to get the first node in the scan results. Other than those, the API for the scanner results is the same as the API for the parser results.

It is important to remember that a call to parserParse() still clears everything. That includes all handles, all queries, and all parse instances from parserParseCreate().

Scanner results are somewhat different from parser results. The scanner generates a one-dimensional list of sibling nodes, so, for example, parserGetFirstChild() will never return a child node.

Some scanner "symbol types" are different from the node types found in a tree generated by the parser. For example, COMMENTSTART is a symbol type from the scanner, but you will never find a COMMENTSTART node in the parser's tree.

The first node from a symbol scan is a "Scanner_head" node, and the last node is a "Scanner_tail" node. These are the first and last siblings in the chain, and are the only "synthetic nodes" in the scan results.
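The shape of a scan result can be pictured with a small Python model. This is a sketch only, not the Proparse API: the `Node` class and the `build_scan` and `source_text` helpers are hypothetical, and it assumes the synthetic head and tail nodes carry no text of their own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    type: str
    text: str = ""
    next_sibling: Optional["Node"] = None  # flat chain: there are no child links


def build_scan(symbols):
    """Chain (type, text) pairs between synthetic head and tail nodes."""
    head = Node("Scanner_head")
    current = head
    for kind, text in symbols:
        current.next_sibling = Node(kind, text)
        current = current.next_sibling
    current.next_sibling = Node("Scanner_tail")
    return head


def source_text(head):
    """Walk the sibling chain and rebuild the source text."""
    parts = []
    node = head
    while node is not None:
        parts.append(node.text)  # assumed empty for the synthetic nodes
        node = node.next_sibling
    return "".join(parts)


head = build_scan([("DEFINE", "def"), ("WS", " "), ("VARIABLE", "var")])
print(repr(source_text(head)))
```

Walking from the head node to the tail node by next sibling visits every symbol exactly once, which is the only traversal a flat scan result needs.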

NOTE: If you are using the Progress 4GL API to write the contents of the scanner out to a file with PUT UNFORMATTED, be sure to use the BINARY option on your OUTPUT TO statement. Otherwise, Progress tends to double up the end-of-line characters.

Using Queries with the Scanner

A possible way of working with the parser and scanner together is:

  1. Parse a compile unit.
  2. Use a query to find a node of interest in the syntax tree.
  3. Use the parser node's filename in order to create a scan result for that file.
  4. Use the parser node's line number to find the related line and scanner node.
  5. Change the text of that scanner node.
  6. Write the scan results back out to the source file - with your new change.
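
Steps 4 to 6 above can be sketched with a toy model in Python. None of this is the Proparse API: the `Symbol` class and every helper name are hypothetical, and the scan result is modeled as a plain list of symbols that each carry a line number.

```python
from dataclasses import dataclass


@dataclass
class Symbol:
    type: str  # symbol type, e.g. "ID" or "WS"
    text: str
    line: int


def first_on_line(symbols, line):
    """Step 4: index of the first symbol on the given line, or -1 if none."""
    for index, symbol in enumerate(symbols):
        if symbol.line == line:
            return index
    return -1


def replace_on_line(symbols, line, old_text, new_text):
    """Step 5: change the text of the first matching symbol on that line."""
    start = first_on_line(symbols, line)
    if start < 0:
        return False
    for symbol in symbols[start:]:
        if symbol.line != line:
            break  # walked past the line of interest
        if symbol.text == old_text:
            symbol.text = new_text
            return True
    return False


def render(symbols):
    """Join all symbol texts back into source text."""
    return "".join(symbol.text for symbol in symbols)


def write_back(symbols, path):
    """Step 6: write the text out; newline="" leaves line endings untouched."""
    with open(path, "w", newline="") as out:
        out.write(render(symbols))


scan = [
    Symbol("ID", "x", 1), Symbol("WS", "\n", 1),
    Symbol("ID", "mychar", 2), Symbol("WS", " ", 2),
    Symbol("ID", "y", 2), Symbol("WS", "\n", 2),
]
changed = replace_on_line(scan, 2, "mychar", "myChar")
print(changed, repr(render(scan)))
```

Only the one symbol's text changes. Every other symbol, including whitespace, is written back exactly as it was scanned, so the rest of the file is left untouched.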

To find a node by line number in the scan result, you must iteratively step through the nodes in the scan result until you come to a node with the correct line number. From the 4GL, this kind of iterative call to the DLL is very slow. In order to make it possible to quickly find the first node of a given line number in a scan result, a built-in function was added, and it is accessed via the existing queries API. For example:
parserQueryCreate(topNode, "myQuery", "first_where_line=" + STRING(myLine)).
...where topNode is the first node in a scan result, and myLine is an integer line number.

Queries with option first_where_line= are only sensible and only supported for scanner results.