What is len("👨👩👧👦")?
As my previous post on this matter shows, it isn’t easy to answer this question in a simple manner - there are layers of complexity here. Before going any further I would be somewhat negligent in not mentioning that I sourced a lot of details from this excellent write-up by Mitchell Hashimoto.
So grapheme clusters… what are they? Well, this is a grapheme cluster: 👨👩👧👦. It’s made up of a series of code points that resolve to a single grapheme. A grapheme itself is defined as a user-perceived character, so it’s actually what we thought code points were in the previous post. We can see this in the following comparison:
// swift
"👨👩👧👦".count == 1
# python
len("👨👩👧👦") == 7
Swift defines the value of a Character type to be a grapheme cluster, so its count matches what
we perceive. Python's len(), on the other hand, counts code points, so we apparently have 7 of
them. If we render the cluster as a stream of code points instead we get the following:
👨U+200D👩U+200D👧U+200D👦. We have found our 7 code points, but what is U+200D? These
are called zero-width joiners (ZWJ) and they tell
the renderer to show the code points in their connected form.
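To make the code point count concrete, here is a minimal sketch in Rust that spells the family emoji out as escapes so the joiners are visible in the source. The helper name code_points is made up for this illustration. Rust's chars() iterates over code points, much like Python's len() counts them.

```rust
/// Format every code point (Unicode scalar value) in `s` as U+XXXX.
/// `code_points` is a hypothetical helper written for this example.
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

fn main() {
    // man, ZWJ, woman, ZWJ, girl, ZWJ, boy
    let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}";

    let points = code_points(family);
    println!("{}", points.join(" "));
    println!("code points: {}", points.len());
    println!("UTF-8 bytes: {}", family.len());
}
```

The standard library stops at code points. Getting Swift's answer of 1 needs grapheme segmentation on top, for example from the unicode-segmentation crate.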
Perceiving Graphemes Through Helix
I came across a rather topical issue: how graphemes are rendered in text editors - I use helix btw. It grew out of an annoyance with how helix displayed the following line in the editor:
What is `len("👨👩👧👦 ")`?
So the grapheme cluster itself is rendered correctly, however the column width isn’t - it’s very distracting. Quite quickly I found an umbrella issue for Correct Grapheme Width, and it’s been open since 2023 🙃.
Relevant File Search >> FAST FORWARD - The following is the offending function in question:
// helix/helix-core/src/graphemes.rs
pub fn grapheme_width(g: &str) -> usize {
    ...
    } else {
        // We use max(1) here because all grapeheme clusters--even illformed
        // ones--should have at least some width so they can be edited
        // properly.
        // TODO properly handle unicode width for all codepoints
        // example of where unicode width is currently wrong: 🤦🏼♂️ (taken from https://hsivonen.fi/string-length/)
        UnicodeWidthStr::width(g).max(1)
    }
}
The unicode_width crate, which provides the UnicodeWidthStr width function, doesn’t handle column
width correctly, as per the comments. Fortuitously, the solution is given within the issue itself:
"assume libc wcwidth unless the terminal responds to mode 2027 and then use Unicode standard
character width."
He is EVERYWHERE. Also, what is mode 2027?
Enter Mode 2027
It’s part of the broader Unicode Core spec for terminals. The specific specification details specifically how adhering terminals should standardise specific Unicode features - this is specificity. Terminal program authors, though, are more interested in the mode switching and detection aspects of the spec. This is an emerging standard so only a couple of terminals comply with it at the moment. There are some reservations about using the standard at all; I know, for instance, that Kitty isn’t planning to jump on board. I am for it, like I’m for browsers having web standards, because such standards allow for a less broken and more consistent experience for terminal users and developers alike.
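For a feel of what "mode switching and detection" looks like on the wire, here is a rough sketch of the three escape sequences involved, assuming mode 2027 follows the usual DEC private mode conventions (DECRQM to query, DECSET to enable, DECRST to disable). The function names are hypothetical and not taken from helix or termina.

```rust
/// DEC private mode number for grapheme clustering.
const GRAPHEME_CLUSTERING: u16 = 2027;

/// DECRQM: ask the terminal to report the state of a DEC private mode.
fn query_mode(mode: u16) -> String {
    format!("\x1b[?{mode}$p")
}

/// DECSET: ask the terminal to enable a DEC private mode.
fn enable_mode(mode: u16) -> String {
    format!("\x1b[?{mode}h")
}

/// DECRST: ask the terminal to disable a DEC private mode.
fn disable_mode(mode: u16) -> String {
    format!("\x1b[?{mode}l")
}

fn main() {
    // escape_default() keeps the ESC byte visible instead of sending it
    // to the terminal we are printing on.
    for seq in [
        query_mode(GRAPHEME_CLUSTERING),
        enable_mode(GRAPHEME_CLUSTERING),
        disable_mode(GRAPHEME_CLUSTERING),
    ] {
        println!("{}", seq.escape_default());
    }
}
```

A real program would write these bytes to the terminal and then read the reply from stdin, which is the part a library like termina handles.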
Implementation
As to the actual grapheme width calculation I use the recommended fish library and account for variation selectors separately.
I was able to get the correct grapheme width calculation with the following:
// 64KB table, created once on first call.
// O(1) lookups for codepoints <= 0xFFFF, falls back to binary search for others.
static WC_TABLE: LazyLock<WcLookupTable> = LazyLock::new(|| WcLookupTable::new());
...
pub fn grapheme_width(g: &str) -> usize {
    ...
    } else {
        let mut width = 1usize; // minimum 1 so it's always editable
        for c in g.chars() {
            // VS15 - text presentation - force width 1
            if c == '\u{FE0E}' {
                width = 1;
                break;
            }
            // VS16 - emoji presentation - force width 2
            if c == '\u{FE0F}' {
                width = 2;
                break;
            }
            let w = WC_TABLE.classify(c).width_unicode_9_or_later() as usize;
            width = width.max(w);
        }
        width
    }
}
Given our promise to adhere to mode 2027, the following code sends a
DECRQM query to determine whether the underlying
terminal supports mode 2027, which basically outputs a string like CSI ? 2027 $ p.
// helix/helix-tui/src/backend/termina.rs
Csi::Mode(csi::Mode::QueryDecPrivateMode(csi::DecPrivateMode::Code(
    csi::DecPrivateModeCode::GraphemeClustering
))),
The following code handles the terminal's reply, a
DECRPM response - here the terminal would output a
string like CSI ? 2027 ; 1 $ y.
// helix/helix-tui/src/backend/termina.rs
Event::Csi(Csi::Mode(csi::Mode::ReportDecPrivateMode {
    mode: csi::DecPrivateMode::Code(csi::DecPrivateModeCode::GraphemeClustering),
    setting: csi::DecModeSetting::Set | csi::DecModeSetting::Reset,
})) => {
    capabilities.grapheme_clustering = true;
}
Can’t Be Rusted
It was quite the relief to get this all set up because I couldn’t wait to see helix render the column width correctly. Slightly strange to get excited for something relatively small but it is what it is. I compiled my version of the editor and…
What is `len("👨👩👧👦 ")`?
NOTHING CHANGED!
Does anything work the first time round? I began retracing my steps, so I had a look at the
DECRQM query and it seemed to be sent correctly. The villain had to be the reading of the
response, because I also found that the terminal was responding correctly. A
clearer hypothesis was forming - a bug with the Event enum provided by the
termina crate. Given that the response would be something
like CSI ? 2027 ; 1 $ y, I was immediately drawn to src/parse.rs.
// termina/src/parse.rs
fn parse_csi(buffer: &[u8]) -> Result<Option<Event>> {
    assert!(buffer.starts_with(b"\x1B["));
    if buffer.len() == 2 {
        return Ok(None);
    }
    let maybe_event = match buffer[2] {
        b'[' => match buffer.get(3) {
            None => None,
            Some(b @ b'A'..=b'E') => Some(Event::Key(KeyCode::Function(1 + b - b'A').into())),
            Some(_) => bail!(),
        },
        ...
        b'?' => match buffer[buffer.len() - 1] {
            b'u' => return parse_csi_keyboard_enhancement_flags(buffer),
            b'c' => return parse_csi_primary_device_attributes(buffer),
            b'n' => return parse_csi_theme_mode(buffer),
            b'y' => return parse_csi_synchronized_output_mode(buffer),
            _ => None,
        },
    }
    ...
}
Do you see the issue here? Of course, I provided a convenient visual aid. I assumed the termina crate could parse mode 2027 DECRPM responses, but as you can see it can only parse a single mode - synchronised output mode. Indeed, this was a blocker, so I dealt with it first, which you can have a look at here. I half expected the PR to sit there for a couple of weeks but Michael Davis approved, merged and released a patch within a couple of days… No more blockers 🚀🚀🚀.
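To illustrate the shape of the fix, here is a minimal sketch of a DECRPM parser that accepts any mode number rather than a single hard-coded one. This is not the code from the termina PR; ModeSetting and parse_decrpm are hypothetical names, and it assumes the whole response is already sitting in the buffer.

```rust
/// The state a terminal reports for a mode (second DECRPM parameter).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ModeSetting {
    NotRecognized,    // 0
    Set,              // 1
    Reset,            // 2
    PermanentlySet,   // 3
    PermanentlyReset, // 4
}

/// Parse `ESC [ ? <mode> ; <setting> $ y` into (mode, setting).
/// Returns None for anything that is not a complete, well-formed response.
fn parse_decrpm(buffer: &[u8]) -> Option<(u16, ModeSetting)> {
    // Peel off the fixed framing, leaving only "<mode>;<setting>".
    let body = buffer.strip_prefix(b"\x1b[?")?.strip_suffix(b"$y")?;
    let body = std::str::from_utf8(body).ok()?;
    let (mode, setting) = body.split_once(';')?;

    let setting = match setting.parse::<u8>().ok()? {
        0 => ModeSetting::NotRecognized,
        1 => ModeSetting::Set,
        2 => ModeSetting::Reset,
        3 => ModeSetting::PermanentlySet,
        4 => ModeSetting::PermanentlyReset,
        _ => return None,
    };
    Some((mode.parse().ok()?, setting))
}

fn main() {
    // The same function handles grapheme clustering (2027) and
    // synchronised output (2026) responses.
    println!("{:?}", parse_decrpm(b"\x1b[?2027;1$y"));
    println!("{:?}", parse_decrpm(b"\x1b[?2026;2$y"));
}
```

Keeping the mode number as data instead of baking it into the function name means a new mode only needs a new match arm on the caller's side.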
What Have We Learned?
I have a newfound appreciation for all the developer hours dedicated to graphemes. I emerge out of the rabbit hole with some dedicated developer hours myself which should marginally improve my experience of using helix.