• Articles about computing
• Articles about software development
UTF-8 and the problem of over-long characters
What is the problem?
Developers with an interest in IT security have been aware of
concerns about "over-long"
UTF-8 character encoding for at least ten years,
but the problem seems to have resurfaced recently. This article
describes what an over-long encoding is, and discusses whether developers
should be concerned it. However, to understand the
problem properly, we must first know something about UTF-8 encoding,
and how it works.
How UTF-8 works
The Unicode standard assigns a numerical value (called a
code point) to each character in a wide range of alphabets.
The first 127 code points correspond loosely to the traditional
ASCII character set, so the symbol "A" is number 65 in both ASCII
and Unicode. There are some technical complications regarding
the meanings of code points with values less than 32 (the
ASCII "control characters") but they need not concern us here.
The majority of written
symbols in common use have code points that will fit into a
16-bit number (1-65535) but,
if we want to represent less common scripts such as Takri or Cuneiform,
larger numbers are necessary. The largest code point in current use
is about 125 000; the Unicode specification limit is 1 112 064.
Even a number this large will certainly fit into a 32-bit number,
but representing every symbol in a text
as a 32-bit number is likely to be hopelessly inefficient where there are large
amounts of text.
Encoding is the process of writing the numeric value of a Unicode code
point in some agreed format. UTF-32 encoding is the nearest thing we have
to simply writing out the code point in binary. UTF-32 is easy to
read, write, and understand; but since the majority of all characters
in common use will fit into 16 bits, 50% or so of
the storage or transmission capacity is wasted, with most alphabets.
UTF-16 is an encoding
of at least 16 bits but, again, for Western alphabets, most of which
are based on latin characters, it's still rather inefficient.
UTF-8 encoding uses a simple 8-bit representation for the symbols that
are the same as traditional ASCII (code points lower than 127 —
most characters in Western
alphabets), and a variable-length encoding for all other code points.
Broadly speaking, the length of a UTF-8 character encoding increases
with the code point number; in practice few characters need
more than four bytes, and most need fewer than two. It is thus a
very storage-efficient encoding, but one that is fiddly to encode and
In a UTF-8-encoded character, if the first (most significant)
bit of the first (most significant)
byte is a zero, then this will be a one-byte sequence, interpreted
broadly the same as ASCII. Otherwise the first few bits of the
first byte indicate the length of the sequence (that's almost
true — see below). For example, if the first byte begins
110, then this is a two byte sequence; 1110 is a three-byte sequence,
and so on. This first byte is called the leading byte.
All other bytes (continuation bytes) start with binary 10.
It follows that the amount of bits available to encode the character
will vary according to the leading byte, and will never (except
in the simple ASCII case) amount to more than six bits in each byte. It is this
irregular distribution of data bits between framing bits that makes
UTF-8 so fiddly to encode and decode.
An example — and why the problem exists
Consider the symbol "A", code point 65, or 0x41 in hexadecimal. Because
this is a number less than 128, it can be represented as a one-byte
sequence, just as in ASCII. This is the natural, and preferred
Now consider an alternative way of encoding this symbol in UTF-8. It
turns out that it can be encoded as the sequence 0xC1, 0x81. Why?
In binary, this sequence is:
11 00 00 01 10 00 00 01
(I'm implicitly assuming a big-endian way of writing binary, here. Apologies
in advance to those who like to crack their boiled eggs at the other end.)
The first binary byte starts with 110, so this is a two-byte sequence.
The second byte begins with 10, marking this as a continuation byte.
These bits are removed when the sequence is decoded, leaving
0 00 01 00 00 01
These bits are interpreted as four-bit groups, starting from the
right (least significant bit) end of the bit pattern. Rewriting with this
grouping gives us:
0 00 01 00 00 01
So we have 0100 (4) and 0001 (1), giving 0x41, or code point 65.
So we have the symbol "A" legitimately encoded using two bytes, rather
It gets worse — the same symbol can be written as three bytes:
0xE0,0x81,0x81. Again in binary we have:
(1110) 0000 (10) [00 00] [01 (10) 00] 
I've put round brackets () around the framing bits, which are ignored,
and grouped the actual data using square brackets. The number encoded
here is 0000 0100 0001, or 0x041 or, again, 65. There isn't any difference
between 0x041 and 0x41, or even 0x0000000041 for that matter — they are
The problem of over-long encoding
Problems arise when an application processes UTF-8 data from some untrusted
source, and the developer assumes that the encoding will always be in the
shortest form. Consider a situation where the data specifies a filename,
and the application wants to ensure that this filename is constrained to be
within some particular directory and its subdirectories.
This problem commonly arises in web
browsers, where we want to ensure that the user cannot
specify a URL that maps to some arbitrary filename.
In a case like this, we will want to remove any characters that the
operating system will interpret as "higher-level directory." If we allow a
UTF-8 string like this as input:
there could well be trouble. We need at least to remove
the leading "/". This is easy enough if we can be sure that the
characters will always be in shortest form — it's just a matter
of looking for the byte 0x2F. You might wonder whether this byte
could legitimately appear in a UTF-8 multi-byte sequence; the answer
is 'no' because only one-byte encodings have a zero in their
most significant bit. So this is a safe thing to do.
Unfortunately, "/" can also be encoded as the over-long sequence
0xC0,0xAF. This is harder to spot in a string of bytes
than 0x2F, particularly as this sequence might conceivably
occur legitimately in some input. So the UTF-8 input might remain
un-sanitized, and the directory separator "/" eventually passed to
the operating system in a filename.
Worse, there are many other potential over-long encodings of the
"/" character, and of other characters for which might need to check.
The uncomfortable reality is that sanitizing UTF-8 is much more difficult
than it might first appear, if we don't somehow reject over-long
The scope of the problem
So how likely is it, that over-long characters might be used to
create a security
The problem I described above arose because we treated UTF-8 data as if it were ASCII. We are likely to do this — if we do — because, in many cases,
UTF-8 can be manipulated like ASCII. Because the null (0) byte cannot occur in a UTF-8
string except when encoding the null character, many of the same C functions
that work on ASCII work equally well on UTF-8. These functions all rely
on a text
string being represented as a sequence of bytes terminated with a null.
We can copy UTF-8 strings in memory using
strcmp(), compare them (badly) using
strcmp(), etc. We can even use
strcmp() to give the length of the
string, in terms of the amount of storage it occupies. It won't tell us
the number of characters in a UTF-8 string because some characters will
take more than one byte — but it's a common (incorrect) programming
practice that actually works a lot of the time. This kind of sloppy
programming — which is all it is — doesn't arise when handling
encodings like UTF-16 because in practice the data is full of zeros. These
zeros are treated as null terminators by the C library functions that
handle strings, and so the program just fails completely at an early
stage. With UTF-8, however, we can continue to treat UTF-8 like ASCII
a lot of the time without noticing a problem.
Manipulating UTF-8 as if it were ASCII is a very common (bad) thing to
do in C/C++ programming, because C does not have any standard, low-level
representation of a Unicode string. Various C++ libraries exist that
do provide such representations, but they are only helpful if developers
actually use them, and are not tempted to fall back on the familiar
built-in routines, that don't know the difference between UTF-8 and
In code that we write ourselves, it's easy to be careful about how we
process UTF-8; it's less obvious what to do when using somebody else's
code, or a library. Suppose you are given some code that implements a
function that removes certain character sequences from the
beginning of a string:
char *removeAllAtStart (char *inputString, char *substring);
You want to call it like this, to remove any "/" at the start
of your filename:
char *sanitizedFilename = removeAllAtStart (filename, "/");
Even if the function is fully UTF-8 aware
(and how can you be sure of this?) it's still going to fail here.
Why? Because you have specified the seaarch string as
which the compiler will treat as null-terminated ASCII, not unicode.
removeAtStart would have to be very clever
indeed to search for all the possible over-long representations, not
only in the input to be modified, but also of the substring the
In short, in any situation where UTF-8 strings are processed using
functions designed for ASCII, there is a possibility that over-long
characters can create a security weakness. The only solution is to
remove all over-long characters from the input, using a processor
that is fully aware of the subtleties of UTF-8 encoding.
Problems like the above are far less likely to arise in Java, because
Unicode strings and characters are first-class language elements. In Java,
when I write
String remove = "/";
both the literal "/" and the
String object to which it is
assigned are unicode. Consequently, even naive use of Java is unlikely to
expose a weakness.
It's not impossible, however. Suppose we are using an XML parser, that exposes the low-level byte stream for pre-processing. We want to filter out and replace certain characters or substrings ("/", "..", etc) as part of this
pre-processing. Again, if the input is UTF-8, and if we assume wrongly that
each byte can be checked independently, the we have the same problem
that the C/C++ programmer has.
The solution is simple, at least in principle — never, ever work on the
byte representations of text strings in Java unless you are really, 100%
sure you know what you are doing. Once the text has been converted to Java's
internal representation, it's perfectly safe to use Java's built in APIs to
As we have seen, over-long UTF-8 encoding can be a problem in any
development scenario where we need to sanitize text input, and
we use methods that are appropriate for ASCII strings, rather than
unicode. Any language is potentially at risk, but languages that
treat text strings as byte arrays are most vulnerable. This includes
PhP and Perl — although these languages do have support for Unicode,
there is an implicit assumption that strings are one-byte-per-character
unless special care is taken. Python, like Java, mostly uses Unicode
Even in a programming language that has native Unicode support, we still
have to be careful about the use of external libraries, particular those
written in C.
Some time back, over-long UTF-8 encodings were a real problem for
web browsers and web applications. The problem was solved simply by
removing them and replacing them with an error marker. This caused
further processing to fail, but in a relatively safe way. There is
not, and probably never has been, a legitimate use of
over-long encodings, so stripping them is not usually a hostile thing
Recently, over-long encodings have become a cause for concern again,
perhaps because enough time has elapsed for developers to forget the
risk they pose. The risk is smallest in languages like Java, that make
it quite difficult to manipulate the individual bytes in a text
string, but all langauges are potentially vulnerable if care is not taken.
The only definitive solution, when working with UTF-8 input, is to convert
it to a reliable unicode representation at the earliest possible
stage, and never work on the underlying bytes.