Character Encoding

Content characters

The content of a Stan program must be coded in ASCII. All identifiers must consist of only ASCII alpha-numeric characters and the underscore character. All arithmetic operators and punctuation must be coded in ASCII.

Compatibility with Latin-1 and UTF-8

The UTF-8 encoding of Unicode and the Latin-1 (ISO-8859-1) encoding share the first 128 code points with ASCII and thus cannot be distinguished from ASCII. That means you can set editors, etc., to use UTF-8 or Latin-1 (or the other Latin-n variants) without worrying that the content of a Stan program will be destroyed.

Comment characters

Any bytes on a line after a line-comment sequence (// or #) are ignored up until the ASCII newline character (\n). They may thus be written in any character encoding which is convenient.

Any content after a block comment open sequence in ASCII (/*) up to the closing block comment (*/) is ignored, and thus may also be written in whatever character set is convenient.

String literals

The raw byte sequence within a string literal is escaped according to the C++ standard. In particular, this means that UTF-8 encoded strings are supported, however they are not tested for invalid byte sequences. A print, reject, or fatal_error statement should properly display Unicode characters if your terminal supports the encoding used in the input. In other words, Stan simply preserves any string of bytes between two double quotes (") when passing to C++. On compliant terminals, this allows the use of glyphs and other characters from encodings such as UTF-8 that fall outside the ASCII-compatible range.

ASCII is the recommended encoding for maximum portability, because it encodes the ASCII characters (Unicode code points 0–127) using the same sequence of bytes as the UTF-8 encoding of Unicode and common ISO-8859 extensions of Latin.