What is len("😎")?
TLDR
# python
len("😎") == 1

// golang
len("😎") == 4

Why Can’t We All Just Take One Byte?
There was a time when this was the case… a simpler time. In a valid ASCII string, 1 char is equivalent to 1 byte. When everything was in English, the entire digital universe could be defined within 7 bits: 128 characters. Then came other languages with different alphabets, and even emojis 😅. We need more bytes!!! So, long story short, our universe got much bigger, and some characters now take 2, 3 or even 4 bytes. Read this classic post if you’re interested in greater elucidation on this point.
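You can see the variable widths directly by encoding a few characters to UTF-8 (a quick Python sketch; the specific characters are just illustrative):

```python
# Number of UTF-8 bytes needed for progressively "wider" characters.
for ch in ["A", "é", "€", "😎"]:
    print(ch, len(ch.encode("utf-8")))  # 1, 2, 3 and 4 bytes respectively
```

Plain ASCII still fits in a single byte, which is exactly why UTF-8 is backwards compatible with it.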
Motivation
Recently I got caught out using len() on a string literal in Go because of the Pythonic
assumptions I had. len() exists in both languages, but, at least for strings, the underlying
representations are quite different.
Python String Representation
typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    Py_hash_t hash;
    struct {
        unsigned int interned:2;
        unsigned int kind:2;
        unsigned int compact:1;
        unsigned int ascii:1;
        unsigned int ready:1;
    } state;
    wchar_t *wstr;
} PyASCIIObject;

typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;
    char *utf8;
    Py_ssize_t wstr_length;
} PyCompactUnicodeObject;

typedef struct {
    PyCompactUnicodeObject _base;
    union {
        void *any;
        Py_UCS1 *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
    } data;
} PyUnicodeObject;

Unicode structs defined in CPython as per PEP 3931
A string in Python, as defined in the CPython interpreter, is implemented with what’s called flexible
string representation. It’s well worth reading the full PEP write-up, but for our purposes each
str has an internal kind flag that decides the storage strategy based on the widest character
it contains. It’s a bit more complicated than that, but essentially it corresponds to 4 different
encodings: uninitialised, 1 byte (Latin-1), 2 bytes (UCS-2) and 4 bytes (UCS-4). So a
Python string is an object that holds a sequence of Unicode code points plus metadata like encoding
type and length (hello, len()).
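You can observe the flexible representation from Python itself: sys.getsizeof reports a much bigger footprint once a string contains a wide character, because every code point is then stored at the widest width. A small sketch (exact byte counts vary by CPython version and build, so only the inequality is asserted):

```python
import sys

ascii_str = "a" * 10          # stored with 1 byte per code point
wide_str = "a" * 10 + "😎"    # one emoji forces 4 bytes per code point

# Similar lengths, very different storage footprints.
print(len(ascii_str), sys.getsizeof(ascii_str))
print(len(wide_str), sys.getsizeof(wide_str))
```

One wide character is enough to widen the storage of every other character in the string.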
Go String Representation
type stringStruct struct {
    str unsafe.Pointer
    len int
}

Internal runtime implementation of a string in Go2
As you can see it’s much simpler: just a pointer plus a length. The Go spec just says a string is
an immutable slice of bytes. Interpretation is up to the user, as the string type can hold
arbitrary bytes. I suppose there’s not much to say here, except that we can see exactly how len
differs between the two languages.
A Difference in Philosophy
I believe this comes down to the assumptions made about the users of both languages. We could frame it more specifically as the higher- vs lower-level language distinction. By default, Python hides the complexity of the Unicode world, whereas Go assumes you’re closer to the machine in some sense.
This is not to say that you’re stuck with only one way; support exists for both visions. In Python,
you can explicitly deal with bytes through the bytes type and transition between the two with
encode/decode. I should say Python 3, because Python 2 had the opposite situation: its
str type was defined as a sequence of bytes, and the unicode type referred to a true Unicode string.
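A quick sketch of that round-trip in Python 3:

```python
s = "😎"                # str: a sequence of code points
b = s.encode("utf-8")   # bytes: the raw UTF-8 encoding

print(len(s))  # 1 code point
print(len(b))  # 4 bytes, which is what Go's len() reports for the same literal
assert b.decode("utf-8") == s  # decode brings us back to the same str
```

So the Go behaviour is always one encode() away in Python, and vice versa.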
In Go, if we want to think about strings as a sequence of code points, then we have to look at the
rune type (type rune = int32). A rune is just a shorter term for a code point, as apparently the
latter term was “a bit of a mouthful”. Anyway, if we want a string as a sequence of code points,
then we would define the following.
coolGlasses := []rune{'😎'}

So len() in this case is straightforward, since it’s just the length of the slice, and what is a
string but a slice? In my opinion it looks inelegant, because if you want a word it looks
like
hello := []rune{'h', 'e', 'l', 'l', 'o'}

Basically, you should still always use the string type, and if you’re not sure the string contains
only ASCII, use the standard library utf8 package or convert to and from rune slices.
Conclusion
Defaults matter.
By default, we have the following for len():
- Python: counts code points
- Go: counts bytes
- Swift: counts grapheme clusters3
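To see why the third option exists, take an emoji built from several code points joined by a zero-width joiner. Python counts the pieces, not the single glyph you see on screen (a small sketch; the family emoji is just one example of a multi-code-point cluster):

```python
family = "👩\u200d👦"  # woman + zero-width joiner + boy: renders as one glyph

print(len(family))                  # 3 code points (Python's len)
print(len(family.encode("utf-8")))  # 11 bytes (what Go's len would count)
```

Swift’s grapheme-aware count would report 1 for this string: one visible character.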
Wait… what the hell are grapheme clusters?