iso8859_15. Notice the
U+FFFD Unicode character (�) that appears in the file.
Since we are using VSCode's default decoder (UTF-8), unrecognizable byte sequences are substituted with this symbol.
In this article, we will take a look at the details behind Python strings: what they are and how they are represented in memory. The whole content of this article can be summarized in a sentence: a Python string is a sequence of characters, which in turn are sequences of bytes, and depending on which encoding was used to transform a particular character into bytes, that byte sequence may represent one character or another.
Code units, code points and encodings
If you were around when Python 2 was all the rage, you may remember that the str type was a little weird.
As you can see in the code snippet above (and test with this online compiler), in Python 3 you would be able
to define s2 but you wouldn't in Python 2. In particular, Python 2 (to be precise: 2.7.18) will output this error:
So, what does this error mean? To understand it, we have to take a closer look at the problem of representing characters in memory.
A character to us (humans) is a symbol that carries meaning contingent on the context in which it is presented.
For example, the dollar sign $ may mean many different things: if it's preceded by a number, it may represent
the price of something in dollars; if it's followed by a series of letters, it may be interpreted, by some, as a ticker symbol; and so on. These are different meanings
we can assign to the dollar sign, but they are not the definition of the character $, which we could define,
for example, as a letter S superimposed with a bar |. This is what we refer to when
talking about a character in this article
(a character in this sense is also referred to as a grapheme).
Now, since computers only understand bits, in order to define a character we have to agree on a mapping from
a sequence of bits to a character. One such agreement is the Unicode standard; there are others, but this is the most
widely used.
In Unicode, each character is assigned a code point, which is just a number. For example, the character ò is assigned code point 242, and it is this number that ends up being represented in memory. OK,
but how do we represent this number in memory?
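You can check this mapping yourself in Python 3 with the built-in ord and chr functions, which convert between a character and its code point:

```python
>>> ord('ò')          # character to code point
242
>>> hex(ord('ò'))     # the same code point in hexadecimal, i.e. U+00F2
'0xf2'
>>> chr(242)          # code point back to character
'ò'
```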
The standard by which you represent a code point in memory is also called its encoding. Unicode defines three encodings: UTF-8, UTF-16 and UTF-32. Each of these encodings has a different code unit, that is, the minimal number of bits that are meaningful to that particular encoding. In UTF-8 the code unit is 8 bits, and you can stack up to 4 groups of 8 bits to represent a particular code point.
To understand this last point better, let's go back to ò: we want to represent its code point 242, which is less than 2^8 = 256, so one byte should be enough
in UTF-8, right? No. UTF-8 actually uses two bytes to represent ò, and uses a single byte only for characters whose code points are below
128. One of the reasons for this is that UTF-8 was designed to maintain backward compatibility with ASCII, which has a code unit of 7 bits
(2^7 = 128). Hence, this is the byte representation of the string andrò:
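In Python 3, one quick way to inspect these bytes is to encode the string explicitly (str.encode returns the raw bytes produced by the chosen codec):

```python
>>> 'andrò'.encode('utf-8')
b'andr\xc3\xb2'
>>> [hex(b) for b in 'andrò'.encode('utf-8')]
['0x61', '0x6e', '0x64', '0x72', '0xc3', '0xb2']
```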
So, ò is represented by UTF-8 using two bytes: 11000011 10110010.
The first byte should be read like this: 110xxxxx, where the first three bits 110 encode
the information that this is the start of a two-byte sequence; the second byte should be read like this: 10xxxxxx, where 10 encodes the information that the byte is a continuation byte
(if you want to know how longer byte sequences are encoded in UTF-8, look here). Finally, if you want to recover
the code point from this byte sequence, just take the remaining bits and put them together:
11000011 10110010
00011 110010 = 11110010, which is binary for 242
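Here is a small sketch of the same bit manipulation in Python; the masks 0b00011111 and 0b00111111 simply strip the 110 and 10 prefixes described above:

```python
>>> first, second = 'ò'.encode('utf-8')    # iterating over bytes yields integers: 195 and 178
>>> bin(first), bin(second)
('0b11000011', '0b10110010')
>>> code_point = ((first & 0b00011111) << 6) | (second & 0b00111111)
>>> code_point, chr(code_point)
(242, 'ò')
```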
Going back to the question we started with: in the code snippet at the top of the article, there are two things happening.
First, in Python 2 str is an alias of bytes, so when you write ò, your text editor
encodes it using its own encoder; in the case of VSCode you are probably using UTF-8, which translates it into these two bytes: C3 B2.
Second, when the interpreter creates the variable s2, it takes C3 and looks for the character matching that code point
(195); since you have not specified any codec, it defaults to ASCII, and since the highest code point in ASCII is
127, it returns an error.
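Python 2 is gone, but you can reproduce essentially the same failure in Python 3 by asking the ASCII codec to decode those two bytes explicitly (a rough analogue of what Python 2 was doing implicitly; the exact error message in Python 2 differs):

```python
>>> b'\xc3\xb2'.decode('ascii')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>> b'\xc3\xb2'.decode('utf-8')   # with the right codec the two bytes become a single character
'ò'
```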
Endianness
Let's now try to encode the example we started with using another encoding:
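A minimal way to do this in Python 3 is to pick the utf-16 codec; on a little-endian machine (the common case, and the reason for the output below) you get:

```python
>>> 'andrò'.encode('utf-16')
b'\xff\xfea\x00n\x00d\x00r\x00\xf2\x00'
```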
The byte sequence representing the string is: FF FE a 00 n 00 d 00 r 00 F2 00 (note that Python displays bytes in the printable range 32 to 126 as their ASCII characters, so a stands for 61, n for 6E, and so on), which in decimal is: 255 254 97 0 110 0 100 0 114 0 242 0.
Now, if we ignore the first two bytes, everything makes sense: since UTF-16 has a code unit of 16 bits, all code points below
2^16 are represented using a single code unit, made of 16 bits (2 bytes). So what are those first two bytes?
FF FE represents the Unicode character U+FEFF (a special invisible character),
which in UTF-16 is used to indicate endianness,
that is, the order in which you should read the bytes of a multibyte value. In a little-endian machine,
the least significant byte is stored at the smallest address, so the first byte read, FF, is the low byte and FE is the high byte:
the sequence is therefore read as the value FEFF (65,279 in decimal), which is exactly that special
invisible character.
These two bytes are called the Byte Order Mark (BOM) and inform the decoder of the byte order in which it should read all the bytes
that follow.
Note that if you read these two bytes in big-endian order, that is FFFE, the machine will read the Unicode character U+FFFE,
which is a noncharacter; this signals to the decoder that it is reading the bytes in the wrong order.
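To see the BOM at work, here is a small sketch comparing the generic utf-16 codec, which writes a BOM, with the explicit utf-16-le and utf-16-be codecs, which do not; the generic codec's output assumes a little-endian machine:

```python
>>> s = 'andrò'
>>> s.encode('utf-16-le')            # explicit little-endian codec: no BOM
b'a\x00n\x00d\x00r\x00\xf2\x00'
>>> s.encode('utf-16-be')            # explicit big-endian codec: no BOM, bytes swapped
b'\x00a\x00n\x00d\x00r\x00\xf2'
>>> with_bom = s.encode('utf-16')    # generic codec: BOM followed by the native byte order
>>> with_bom.decode('utf-16')        # the decoder uses the BOM to pick the order, then strips it
'andrò'
>>> with_bom.decode('utf-16-be')[0]  # forcing the wrong order turns the BOM into U+FFFE
'\ufffe'
```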