Adding UTF8 IDENTIFIERS to Flex...

SeeScreen...
Posted: Sat Oct 24, 2009 8:48 pm
Guest
The solution is based on the GREEN portions of the first chart shown
on this link:
http://www.w3.org/2005/03/23-lex-U

UTF8_BYTE_ORDER_MARK [\xEF][\xBB][\xBF]

D [0-9]
ASCII [\x00-\x7F]

U1 [a-zA-Z_]
U2 [\xC2-\xDF][\x80-\xBF]
U3 [\xE0][\xA0-\xBF][\x80-\xBF]
U4 [\xE1-\xEC][\x80-\xBF][\x80-\xBF]
U5 [\xED][\x80-\x9F][\x80-\xBF]
U6 [\xEE-\xEF][\x80-\xBF][\x80-\xBF]
U7 [\xF0][\x90-\xBF][\x80-\xBF][\x80-\xBF]
U8 [\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
U9 [\xF4][\x80-\x8F][\x80-\xBF][\x80-\xBF]

L {U1}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
U {ASCII}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}

%{

#include <stdio.h>
#include <string.h>
#include <math.h>
#include "y.tab.h"
#define YY_NO_UNISTD_H
int lineNumber = 0;

%}

%%

{UTF8_BYTE_ORDER_MARK} { /* Byte Order Mark */ }

"int" { return (INT); }

{L}({L}|{D})* { return (IDENTIFIER); }
";" { return(';'); }
")" { return(')'); }
"(" { return('('); }
"=" { return('='); }
[ \t\v\f] { }
\r\n|\r|\n { lineNumber++; }
. { /* ignore bad characters */ }

%%




int yywrap(void)
{
    return(1);
}

***********************************************
The U pattern above matches every well-formed UTF-8 byte sequence
(one to four bytes). The L pattern matches the same set, except that
it excludes the ASCII characters that cannot be used in {C, C++, or
Java} IDENTIFIERS. Every non-ASCII code point is accepted as an
identifier character.
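
For anyone who wants to check the byte ranges outside of Flex, here is a minimal C sketch of the same table. It is only an illustration: the names utf8_seq_len and in_range are hypothetical and not part of the scanner above. Each branch corresponds to one of the U2 to U9 definitions.

```c
#include <stddef.h>

/* Returns 1 if lo <= b <= hi, else 0. */
static int in_range(unsigned char b, unsigned char lo, unsigned char hi)
{
    return b >= lo && b <= hi;
}

/*
 * Length in bytes (1 to 4) of the well-formed UTF-8 sequence that starts
 * at s, or 0 if the bytes do not form one. n is the number of bytes
 * available at s. (Hypothetical helper, mirrors the U2..U9 ranges.)
 */
size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    unsigned char lo = 0x80, hi = 0xBF;   /* allowed range of the 2nd byte */
    size_t len, i;

    if (n == 0)
        return 0;
    if (s[0] <= 0x7F)
        return 1;                                        /* ASCII */

    if (in_range(s[0], 0xC2, 0xDF))      { len = 2; }             /* U2 */
    else if (s[0] == 0xE0)               { len = 3; lo = 0xA0; }  /* U3 */
    else if (in_range(s[0], 0xE1, 0xEC)) { len = 3; }             /* U4 */
    else if (s[0] == 0xED)               { len = 3; hi = 0x9F; }  /* U5 */
    else if (in_range(s[0], 0xEE, 0xEF)) { len = 3; }             /* U6 */
    else if (s[0] == 0xF0)               { len = 4; lo = 0x90; }  /* U7 */
    else if (in_range(s[0], 0xF1, 0xF3)) { len = 4; }             /* U8 */
    else if (s[0] == 0xF4)               { len = 4; hi = 0x8F; }  /* U9 */
    else
        return 0;   /* 0x80-0xC1 and 0xF5-0xFF never start a sequence */

    if (n < len)
        return 0;                        /* truncated sequence */
    if (!in_range(s[1], lo, hi))
        return 0;                        /* 2nd byte has the tightened range */
    for (i = 2; i < len; i++)
        if (!in_range(s[i], 0x80, 0xBF))
            return 0;                    /* remaining continuation bytes */
    return len;
}
```

The tightened second-byte ranges are what reject overlong forms (E0, F0), surrogates (ED) and values above U+10FFFF (F4), which is the same job the narrower ranges do in U3, U5, U7 and U9.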

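This solution also skips the UTF-8 byte order mark (if present at the
beginning of the text file), but only when the BOM is followed by at
least one blank character or one blank line. The BOM bytes EF BB BF
(U+FEFF) also match U6, so if identifier characters follow the BOM
directly, the longer IDENTIFIER match wins over the BOM rule.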
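
One way around that restriction is to consume the BOM before the scanner ever sees the input. The following is a rough sketch, not something from the original scanner: skip_utf8_bom is a hypothetical helper, and it assumes the input is a seekable file (not a pipe or terminal).

```c
#include <stdio.h>

/*
 * Consumes a UTF-8 byte order mark (EF BB BF) at the current position
 * of fp. Returns 1 if a BOM was skipped, 0 otherwise. When no BOM is
 * found, the stream is moved back to where it started.
 * Assumption: fp is seekable, so ftell/fseek succeed.
 */
int skip_utf8_bom(FILE *fp)
{
    unsigned char buf[3];
    long start = ftell(fp);
    size_t got = fread(buf, 1, sizeof buf, fp);

    if (got == 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return 1;

    /* Not a BOM (or fewer than 3 bytes in the file): undo the read. */
    clearerr(fp);
    fseek(fp, start, SEEK_SET);
    return 0;
}
```

It would be called once on yyin after opening the file and before the first yylex() call, so the first token can start immediately after the BOM.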

The above works with very old versions of Flex, as long as the -l (lex
compatibility) flag is not specified. The -8 flag (generate an 8-bit
scanner) was also specified, even though it may be the default.
 
Hans Aberg...
Posted: Sun Oct 25, 2009 2:06 pm
Guest
SeeScreen wrote:
Quote:
The solution is based on the GREEN portions of the first chart shown
on this link:
http://www.w3.org/2005/03/23-lex-U

I hacked this together; it converts Unicode character ranges to
Flex-like expressions:
http://lists.gnu.org/archive/html/help-flex/2005-01/msg00043.html

(For single characters <char>, one can just feed a UTF-8 .l file to Flex
in 8-bit mode, with "<char>" expressions.)

Hans
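
For the single-character case, an alternative to pasting raw UTF-8 into the .l file is to spell the bytes out as \xHH escapes, which keeps the file plain ASCII. Here is a minimal sketch of that conversion. It is not the program linked above and it does not handle ranges; utf8_encode and flex_pattern_for are hypothetical names.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Encodes code point cp as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 for surrogates
 * (U+D800 to U+DFFF) and values above U+10FFFF.
 */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp <= 0x7F) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    if (cp <= 0xFFFF) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/*
 * Writes a Flex pattern for the single code point cp into buf, one
 * \xHH escape per UTF-8 byte. Returns the pattern length in characters,
 * or 0 if cp cannot be encoded or buf is too small.
 */
size_t flex_pattern_for(unsigned long cp, char *buf, size_t bufsize)
{
    unsigned char bytes[4];
    size_t n = utf8_encode(cp, bytes), i;

    if (n == 0 || bufsize < 4 * n + 1)
        return 0;
    for (i = 0; i < n; i++)
        sprintf(buf + 4 * i, "\\x%02X", (unsigned)bytes[i]);
    return 4 * n;
}
```

For example, U+00E9 encodes to the bytes C3 A9, so the pattern written to buf is \xC3\xA9, which an 8-bit Flex scanner matches as two consecutive bytes.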
 
 
All times are GMT