Unicode is the solution to work with character sets other than ASCII or, in the
case of Ada, beyond Latin-1, which was the character set chosen for the
Character
type. There are a number of considerations for the use of Unicode
in Ada and in Alire crates, detailed in this section.
Alire v2.0 and onwards expects sources to be UTF-8 (through the -gnatW8
switch). Input/Output must be done using specialized crates (e.g. vss
,
uxstrings
) or with Wide_Wide_Text_IO
(for unencoded/UTF-32 strings) or with
Streams_IO/GNAT.IO
(for UTF-8-encoded strings). Avoid Text_IO
unless you
are sure of what you’re doing (using Latin-1 strings). Some friction may happen
with older sources not using UTF-8.
Read on for the gory details.
By default, any crate initialized via alr init
will have the -gnatW8
switch
in its build configuration, which presumes UTF-8 encoding of sources, and
tweaks some internals of the compiler accordingly.
This means that source files must use UTF-8 encoding when not using plain ASCII.
Since GNAT compiles specifications and generic bodies in the context of the
client project, once internalization is enabled in some parts of a build, it
becomes necessary for all parts of the build to use the same -gnatW8
setting.
Otherwise, a file containing non-ASCII literals could be interpreted
differently depending on the compilation context (as a standalone library or
from a client of such a library).
Most old GNAT projects are likely not to have -gnatW8
enabled (although an
UTF-8 file with BOM marker will have the same effect). For crates that do not
contain string literals outside of ASCII or engage on I/O of such strings, this
should not make any difference. For those for which internationalization
matters, however, there is no sensible way forward but to embrace Unicode, with
UTF-8 being the standard encoding nowadays for files and terminals.
This means that a certain amount of breakage might happen for ‘legacy’
libraries not yet adapted to -gnatW8
. From the Alire project we are trying to
get in front of this problem with early detection of such libraries in the
Alire ecosystem, and universally using -gnatW8
from version 2.0 on.
You can use a library designed to shield you from Unicode details, such as
vss
or uxstrings
.
Otherwise, these recommendations should keep you safe:
-gnatW8
to compile (this is Alire’s default) and save your sources in UTF-8 encoding.Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode
to get UTF-8 strings from Unicode literals.-gnatW8
is in effect.chcp 65001
in your terminal.GNAT.IO
or Ada.Streams.Stream_IO
to output properly encoded strings, as they don’t manipulate the bytes.Ada.Text_IO
to output UTF-8-encoded String
variables, as Latin-1 encoding is expected.You can experiment with the utf8test
crate at
https://github.com/mosteo/utf8test to check how your environment behaves in
regard to UTF-8 output.
Properly working with Unicode in Ada relies on these bits of info:
Character
type is Latin-1 [3], so:Ada.Text_IO
expects Latin-1-encoded strings (superset of ASCII), incompatible with UTF-8.Ada.Wide_Wide_Text_IO
expects UTF-32 strings, that is, regular Wide_Wide_Strings
in practice.It is common to use String
to store byte sequences in different encodings.
This is a source of problems when going out of ASCII, as depending on how these
strings are populated, they may easily end being either Latin-1 or UTF-8 or
something else.
UTF-8-encoded strings are more memory-efficient, but cannot be used to iterate
over characters without help of support libraries. Wide_Wide_Strings
retain
the 1:1 index-to-character ratio, so they are efficient for such iterations.
Wide_Strings
, which are defined to hold 2-byte Unicode code points (although
conceivably they could also be used with UTF-16), are probably a middle-ground
not useful in general anymore.
iconv
).-gnatW8
to your compiler switches.String
literals containing characters outside of the Latin-1 range (see next section).Ada.Text_IO
and replace it with Ada.Wide_Wide_Text_IO
or GNAT.IO
or Ada.Streams.Stream_IO
.
Ada.Characters.Conversions.To_Wide_Wide_String
to convert Latin-1 strings to UTF-32.Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Decode
to convert UTF-8 strings to UTF-32.Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode
to convert UTF-32 strings to UTF-8.A particularity of -gnatW8
is that it may affect compilation units in other
projects, for example when specifications or generic bodies are ‘withed’, as
these are compiled also in the context of the client project. This may cause
issues when not all projects use -gnatW8
, even if projects in isolation work
properly.
Trouble may thus arise from inconsistencies between source file encoding and
-gnatW8
being in effect, or even simply by saving Latin-1 files as UTF-8 in
order to enable -gnatW8
:
Character
instead of
Wide_Wide_Character
.
X : String := "€";
-gnatW8
, resulting in a three-byte string.-gnatW8
, as ‘€’ is out of Character
range.Wide_Wide_String
/Wide_Wide_Character
, or the string manually encoded as UTF-8 (which will not be apt for use with Ada.Text_IO
):
subtype UTF8_String is String;
X : UTF8_String := Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode ("€");
See a few more examples:
O_Acute : constant Character := 'ó';
-- With Latin-1 source and -gnatW8, it will fail as 'ó' won't be a proper UTF-8 sequence in the input file.
-- With UTF-8 source and -gnatW8, it will work **but** it will be stored in Latin-1 encoding in memory.
-- It can be converted to a Wide_Wide_String (that will be in UTF-32)
-- with `Ada.Characters.Conversions.To_Wide_Wide_String`. In this case,
-- using `Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Decode` would be wrong,
-- as the String is Latin-1 and not UTF-8 despite the source file encoding being UTF-8.
Bye : constant String := "Adiós";
-- Without -gnatW8 but UTF-8 sources without a BOM marker, this will end being an UTF-8 byte sequence,
-- as it is interpreted literally as Latin-1 characters.
-- With -gnatW8, the UTF-8 will be properly loaded and converted to Latin-1 for the in-memory representation.
Bye : constant String (1 .. 5) := "Adiós";
-- This is proper for a Latin-1-encoded file and compiling without -gnatW8.
-- With -gnatW8 and an UTF-8 file, it will work properly and also end as a 5-byte Latin-1 string.
-- With -gnatW8 but with Latin-1-encoded file, it will fail as the sequence won't be proper UTF-8.
-- Without -gnatW8 but with an UTF-8 file it will fail as the sequence will be 6 bytes long.
Euro : constant String := "€";
-- Without -gnatW8 and a UTF-8-encoded file, this results in a 3-byte UTF-8 string.
-- With -gnatW8 and a UTF-8-encoded file, this results in an error as '€' is outside of Character.
[1] http://www.ada-auth.org/standards/rm12_w_tc1/html/RM-2-1.html [2] https://docs.adacore.com/live/wave/gnat_ugn/html/gnat_ugn/gnat_ugn/building_executable_programs_with_gnat.html#character-set-control [3] http://www.ada-auth.org/standards/rm12_w_tc1/html/RM-3-5-2.html