The Tale of the Whitespace


⚠️

Work in progress

I’m actively working on this post. If you want to solve your Unicode whitespace woes, read on!

Can you tell the difference between " " and " "? I certainly can’t. But the computer can.

" ".ord
# => 32
"\u00A0".ord # the look-alike, written as an escape so it survives copy and paste
# => 160

This resulted in one of the more confusing bugs that I’ve seen as a developer, and it leads into the topic of Unicode whitespace. The two codepoints above are 32, the ASCII space, and 160, the non-breaking space (U+00A0). Let’s say that I wanted to transliterate a string. Transliteration swaps each character for an ASCII character that looks similar; it does not take sound into account. For example, transliterating é gives e.

I18n.transliterate("hëllo wōrld")
# => "hello world"

I18n.transliterate("hëllo\u00A0wōrld") # non-breaking space between the words
# => "hello?world"
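When a stray ? shows up like this, it helps to see which character caused it. The following is a minimal sketch in plain Ruby; the helper name non_ascii_codepoints is made up for illustration and is not part of I18n. It lists every non-ASCII character in a string with its index and codepoint.

```ruby
# Hypothetical helper (not part of I18n): report the position and codepoint
# of every non-ASCII character in a string.
def non_ascii_codepoints(str)
  str.each_char.with_index
     .reject { |char, _index| char.ascii_only? }
     .map { |char, index| [index, format("U+%04X", char.ord)] }
end

non_ascii_codepoints("hello\u00A0world")
# => [[5, "U+00A0"]]
```

U+00A0 is codepoint 160 in hexadecimal, the non-breaking space from earlier.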

You might want to transliterate to remove any characters you can’t handle (or don’t want to handle). In the second call above, the transliteration picks up the non-breaking space and treats it as an unknown character, something it can’t transliterate, so it substitutes a ?. We can solve this with a regex. But we can’t use the regex shorthand \s, because shorthand classes like it only match ASCII characters (https://ruby-doc.org/core-3.0.2/Regexp.html).

/\s/.match("hello\u00A0world") # non-breaking space
# => nil

/\s/.match("hello world") # ASCII space
# => #<MatchData " ">

But we can use the POSIX bracket expressions, which also cover non-ASCII characters:

# Hard to tell in the output, but the first match is a non-breaking space!
/[[:space:]]/.match("hello\u00A0world")
# => #<MatchData " ">

/[[:space:]]/.match("hello world")
# => #<MatchData " ">
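The non-breaking space is not the only character affected. As a rough sketch, the hypothetical helper space_matches below (my own name, not a Ruby or I18n API) checks a single character against both patterns, so other Unicode spaces can be compared the same way.

```ruby
# Hypothetical helper: does a character match the ASCII-only \s shorthand,
# and does it match the Unicode-aware [[:space:]] bracket expression?
def space_matches(char)
  { shorthand: char.match?(/\s/), posix: char.match?(/[[:space:]]/) }
end

space_matches(" ")       # ASCII space: both patterns match
space_matches("\u00A0")  # non-breaking space: only [[:space:]] matches
space_matches("\u2003")  # em space: only [[:space:]] matches
space_matches("\u3000")  # ideographic space: only [[:space:]] matches
```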

We can then use the bracket expression to substitute the non-breaking space with an ASCII space:

I18n.transliterate("hëllo\u00A0wōrld")
# => "hello?world"

I18n.transliterate("hëllo\u00A0wōrld".gsub(/[[:space:]]/, " "))
# => "hello world"
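One caveat: [[:space:]] also matches tabs and newlines, so the substitution above flattens multi-line text onto one line. If that matters, a possible variant (a sketch, not what this post originally used) is to match only the Unicode "Separator: Space" category with \p{Zs}. The helper name normalize_spaces is hypothetical.

```ruby
# Hypothetical helper: replace Unicode space separators (category Zs, which
# includes the non-breaking space) with an ASCII space. Tabs and newlines are
# not in Zs, so they are left untouched.
def normalize_spaces(str)
  str.gsub(/\p{Zs}/, " ")
end

normalize_spaces("hello\u00A0world\n\tnext\u3000line")
# => "hello world\n\tnext line"
```

The result can then be passed to I18n.transliterate as before.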