Handling UTF-8 Vulnerabilities in Python and Rust

The other day, I came across slides from a talk on hacking Unicode, that is, using UTF-8 encoding edge cases to inject unexpected code. This post discusses one of the issues mentioned in the talk, and explores how Rust and Python handle it.

As mentioned in the talk, the UTF-8 spec seems to allow multiple representations of the same code point, which leaves open the possibility of an attack if protective code looks for a particular byte sequence rather than the represented value. To share the example given in the talk, a period (U+002E) is canonically represented in UTF-8 by the single byte 0x2E (binary 0010 1110), but a loose UTF-8 parser might get the same character from the two-byte sequence 0xC0, 0xAE (binary 1100 0000 1010 1110), or the three-byte sequence 0xE0, 0x80, 0xAE (binary 1110 0000 1000 0000 1010 1110). Similar productions are possible for four, five, or six bytes.
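To make the bit layout concrete, here is a minimal Python 3 sketch that packs a code point into a sequence of a chosen length. The helper name encode_with_length is my own invention for illustration, not part of any library, and it deliberately skips the shortest-form check so that it can produce the non-minimal sequences described above. It assumes the original UTF-8 layout that allowed up to six bytes.

```python
def encode_with_length(code_point: int, length: int) -> bytes:
    """Pack code_point into a UTF-8-style sequence of exactly `length` bytes.

    Hypothetical helper for illustration only: it applies the bit layout and
    skips the shortest-form check, so it can emit non-minimal sequences.
    """
    if length == 1:
        if code_point > 0x7F:
            raise ValueError("code point does not fit in one byte")
        return bytes([code_point])
    if not 2 <= length <= 6:
        raise ValueError("length must be between 1 and 6")
    # The lead byte holds (7 - length) payload bits; each continuation
    # byte holds six more.
    payload_bits = (7 - length) + 6 * (length - 1)
    if code_point >= 1 << payload_bits:
        raise ValueError("code point does not fit in that many bytes")
    # Build continuation bytes from the lowest six bits upward.
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (code_point & 0x3F))
        code_point >>= 6
    # Lead byte: `length` one bits, a zero bit, then the remaining payload.
    lead_marker = (0xFF << (8 - length)) & 0xFF
    return bytes([lead_marker | code_point] + tail[::-1])


# The period from the example above, in one to four bytes.
for n in range(1, 5):
    print(n, encode_with_length(0x2E, n).hex(" "))
```

Only the one-byte form is valid UTF-8 for a period. The longer ones are what a loose parser might still accept.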

You might notice that I said UTF-8 seems to allow these multiple representations. The spec actually mandates that only the shortest possible encoding of a given character is valid. The problem comes about when UTF-8 parsing tools apply the production rules without validating that the input is minimal.
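A rough sketch of the difference, in Python 3: the hypothetical decode_one below handles a single sequence of one to four bytes. With strict=False it only applies the production rules, which is the mistake described above; with strict=True it also rejects non-minimal input. It is not a full validator (it ignores surrogates and the U+10FFFF ceiling, for instance), just enough to show where the check belongs.

```python
def decode_one(data: bytes, strict: bool = True) -> int:
    """Decode one UTF-8-style sequence (1 to 4 bytes) into a code point.

    Hypothetical helper for illustration, not a complete UTF-8 validator.
    """
    if not data:
        raise ValueError("empty input")
    lead = data[0]
    if lead < 0x80:
        length, code_point = 1, lead
    elif lead >> 5 == 0b110:
        length, code_point = 2, lead & 0x1F
    elif lead >> 4 == 0b1110:
        length, code_point = 3, lead & 0x0F
    elif lead >> 3 == 0b11110:
        length, code_point = 4, lead & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(data) != length:
        raise ValueError("wrong number of bytes")
    for byte in data[1:]:
        if byte >> 6 != 0b10:
            raise ValueError("invalid continuation byte")
        code_point = (code_point << 6) | (byte & 0x3F)
    # Smallest code point that genuinely needs `length` bytes.
    minimum = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}[length]
    if strict and code_point < minimum:
        raise ValueError("non-minimal encoding")
    return code_point


# A byte-level filter sees no ".." in this input...
payload = b"\xc0\xae\xc0\xae/secret"
print(b".." in payload)
# ...but a loose decoder turns each two-byte pair into a period.
print(chr(decode_one(b"\xc0\xae", strict=False)))
```

The single comparison against minimum is the whole defense: without it, the filter and the decoder disagree about what the bytes mean.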

I've recently been playing with Rust, which uses UTF-8 exclusively for its string types. As a learning exercise, I decided to see how it would handle these malformed strings, so I wrote up a quick Rust program to explore the issue:

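fn main() {
    // The canonical one-byte encoding of '.', followed by non-minimal
    // two-, three-, and four-byte encodings of the same code point.
    let dot: &[u8] = &[0x2E];
    let dot2: &[u8] = &[0xC0, 0xAE];
    let dot3: &[u8] = &[0xE0, 0x80, 0xAE];
    let dot4: &[u8] = &[0xF0, 0x80, 0x80, 0xAE];

    // Current Rust returns a Result from from_utf8; .ok() converts it to an
    // Option. Slices and Options are printed with {:?} rather than {}.
    println!("Converting {:?}: {:?}", dot, std::str::from_utf8(dot).ok());
    println!("Converting {:?}: {:?}", dot2, std::str::from_utf8(dot2).ok());
    println!("Converting {:?}: {:?}", dot3, std::str::from_utf8(dot3).ok());
    println!("Converting {:?}: {:?}", dot4, std::str::from_utf8(dot4).ok());
}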

This program creates byte slices (&[u8]) and tries to convert them to string slices (&str) as UTF-8 sequences. The result of this conversion is what Rust calls an Option type: it is Some(&str) if the conversion succeeds, or None if it fails.

The results:

$ ./utf-hacking
Converting [46]: Some(.)
Converting [192, 174]: None
Converting [224, 128, 174]: None
Converting [240, 128, 128, 174]: None

Non-minimal UTF-8 sequences are treated as invalid input by Rust's UTF-8 parsing facilities. Happily, Rust guards its users against this vulnerability.

What about good old Python?

>>> '\x2e'.decode('utf-8')
u'.'
>>> '\xc0\xae'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jcdyer/.virtualenvs/harkablog/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc0 in position 0: invalid start byte
>>> '\xe0\x80\xae'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jcdyer/.virtualenvs/harkablog/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid continuation byte
>>> '\xf0\x80\x80\xae'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jcdyer/.virtualenvs/harkablog/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: invalid continuation byte
>>>

Python also does the right thing, rejecting the malformed UTF-8 with a UnicodeDecodeError.
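The session above is Python 2.7. For reference, here is a small Python 3 sketch of the same check, where byte strings are written with a b prefix and decoded with bytes.decode. The try_decode wrapper is just a convenience of mine that mirrors the Some/None shape of the Rust output.

```python
samples = [b"\x2e", b"\xc0\xae", b"\xe0\x80\xae", b"\xf0\x80\x80\xae"]


def try_decode(raw: bytes):
    """Return the decoded text, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


for raw in samples:
    print(raw, "->", try_decode(raw))
```

Only the single-byte period decodes; the three non-minimal sequences come back as None.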

It looks like Pythonistas and Rustaceans can rest easy, knowing that their languages are doing the right thing when it comes to handling non-minimal UTF-8.

This doesn't mean that your code is protected from every sort of attack just because you're using Rust or Python. There are other issues that require careful handling, including nefarious uses of RTL (right-to-left) direction marks, decomposed characters, and Unicode normalization, but at least it's nice to know that your language of choice won't be fighting against you on this one.
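As a small illustration of the decomposed-character issue, here is a Python 3 sketch using the standard unicodedata module. Both strings are perfectly valid UTF-8 once encoded, so a strict decoder won't help; they simply compare unequal until normalized. Normalizing to NFC before comparing is one possible policy, assumed here for the example, and same_text is a hypothetical helper name.

```python
import unicodedata

composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent


def same_text(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization (an assumed policy)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)           # raw comparison
print(same_text(composed, decomposed))  # comparison after normalization
```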

Do you work with languages other than Rust and Python? Let me know how they stack up in the comments below.
