mirror of https://github.com/jbranchaud/til synced 2026-01-03 15:18:01 +00:00

Add Decompose Unicode Character With Diacritic Mark as a Ruby TIL

jbranchaud
2025-07-21 17:38:41 -05:00
parent ae2974e3b8
commit cb94142042
2 changed files with 57 additions and 1 deletions


@@ -10,7 +10,7 @@ pairing with smart people at Hashrocket.
For a steady stream of TILs, [sign up for my newsletter](https://crafty-builder-6996.ck.page/e169c61186).
-_1652 TILs and counting..._
+_1653 TILs and counting..._
See some of the other learning resources I work on:
- [Get Started with Vimium](https://egghead.io/courses/get-started-with-vimium~3t5f7)
@@ -1314,6 +1314,7 @@ If you've learned something here, support my efforts writing daily TILs by
- [Create Listing Of All Middleman Pages](ruby/create-listing-of-all-middleman-pages.md)
- [Create Named Structs With Struct.new](ruby/create-named-structs-with-struct-new.md)
- [Create Thumbnail Image For A PDF](ruby/create-thumbnail-image-for-a-pdf.md)
- [Decompose Unicode Character With Diacritic Mark](ruby/decompose-unicode-character-with-diacritic-mark.md)
- [Defaulting To Frozen String Literals](ruby/defaulting-to-frozen-string-literals.md)
- [Define A Custom RSpec Matcher](ruby/define-a-custom-rspec-matcher.md)
- [Define A Method On A Struct](ruby/define-a-method-on-a-struct.md)


@@ -0,0 +1,55 @@
# Decompose Unicode Character With Diacritic Mark
A character like `ñ` is typically represented by the single Unicode codepoint
`U+00F1`. However, it can also be represented with two Unicode codepoints: the
letter `n` (`U+006E`) followed by the combining tilde (`U+0303`).

We can see that by comparing a typed `ñ` with one that we split apart into its
separate codepoints. We can do that with
[`#unicode_normalize`](https://apidock.com/ruby/v2_5_5/String/unicode_normalize)
and the `:nfd` argument, which stands for _Normalization Form D_ (canonical
decomposition).
```ruby
> "ñ" == "ñ".unicode_normalize(:nfd)
=> false
> "ñ".unicode_normalize(:nfd).length
=> 2
> "ñ".length
=> 1
```
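As a small sketch of the same comparison, the decomposition can be reversed with the `:nfc` form (canonical composition), and `#unicode_normalized?` reports whether a string is already in a given form. The variable names here are illustrative, and the `\u` escapes are used only so the two forms cannot be confused in an editor.

```ruby
composed   = "\u00F1"                          # precomposed ñ, one codepoint
decomposed = composed.unicode_normalize(:nfd)  # n + combining tilde

composed.unicode_normalized?(:nfd)    # => false
decomposed.unicode_normalized?(:nfd)  # => true

# Composing the decomposed string gives back the original single codepoint.
recomposed = decomposed.unicode_normalize(:nfc)
recomposed == composed                # => true
recomposed.length                     # => 1
```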
We can inspect the exact codepoints by iterating over each character and
printing out the codepoint value.
```ruby
"ñ".each_char.with_index do |char, i|
  puts "#{i}: '#{char}' -> U+#{char.ord.to_s(16).upcase.rjust(4, '0')}"
end
# 0: 'ñ' -> U+00F1
# => "ñ"
"ñ".unicode_normalize(:nfd).each_char.with_index do |char, i|
  puts "#{i}: '#{char}' -> U+#{char.ord.to_s(16).upcase.rjust(4, '0')}"
end
# 0: 'n' -> U+006E
# 1: '̃' -> U+0303
#=> "ñ"
```
Notice the difference: after decomposition, the diacritic is its own codepoint,
separate from the base letter.

The same works for other characters that decompose into a base letter plus
combining marks.
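For instance, here is a minimal sketch that decomposes a few other accented letters and lists their codepoints. The helper name `codepoints_after_nfd` is made up for this example; it only combines `#unicode_normalize`, `#codepoints`, and `format`.

```ruby
# Hypothetical helper: decompose a string and list its codepoints as U+XXXX.
def codepoints_after_nfd(str)
  str.unicode_normalize(:nfd).codepoints.map { |cp| format("U+%04X", cp) }
end

codepoints_after_nfd("\u00E9")  # é => ["U+0065", "U+0301"] (e + combining acute)
codepoints_after_nfd("\u00FC")  # ü => ["U+0075", "U+0308"] (u + combining diaeresis)
codepoints_after_nfd("\u00E7")  # ç => ["U+0063", "U+0327"] (c + combining cedilla)
```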
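We can also go the other direction and build each form of the string from its
codepoints with
[`#pack`](https://ruby-doc.org/core-3.0.1/Array.html#method-i-pack). The two
results render the same but are still not equal.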
```ruby
> [0x006E, 0x0303].pack("U*")
=> "ñ"
> [0x00F1].pack("U*")
=> "ñ"
> [0x006E, 0x0303].pack("U*") == [0x00F1].pack("U*")
=> false
```
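If the goal is to treat the two forms as the same text, one option is to normalize both sides to a common form before comparing. The following is a sketch under that assumption; `same_text?` is a hypothetical helper name, not part of Ruby.

```ruby
# Hypothetical helper: compare two strings after normalizing both to NFC.
def same_text?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

decomposed = [0x006E, 0x0303].pack("U*")  # n + combining tilde
composed   = [0x00F1].pack("U*")          # precomposed ñ

decomposed == composed            # => false
same_text?(decomposed, composed)  # => true
```

Normalizing to `:nfd` on both sides would work equally well here; what matters is that both strings end up in the same form.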