From cb941420422df384ace4e88a2dfc6346879efb19 Mon Sep 17 00:00:00 2001
From: jbranchaud
Date: Mon, 21 Jul 2025 17:38:41 -0500
Subject: [PATCH] Add Decompose Unicode Character With Diacritic Mark as a Ruby TIL

---
 README.md                                     |  3 +-
 ...e-unicode-character-with-diacritic-mark.md | 55 +++++++++++++++++++
 2 files changed, 57 insertions(+), 1 deletion(-)
 create mode 100644 ruby/decompose-unicode-character-with-diacritic-mark.md

diff --git a/README.md b/README.md
index 65e00fb..673c909 100644
--- a/README.md
+++ b/README.md
@@ -10,7 +10,7 @@ pairing with smart people at Hashrocket.
 
 For a steady stream of TILs, [sign up for my newsletter](https://crafty-builder-6996.ck.page/e169c61186).
 
-_1652 TILs and counting..._
+_1653 TILs and counting..._
 
 See some of the other learning resources I work on:
 - [Get Started with Vimium](https://egghead.io/courses/get-started-with-vimium~3t5f7)
@@ -1314,6 +1314,7 @@ If you've learned something here, support my efforts writing daily TILs by
 - [Create Listing Of All Middleman Pages](ruby/create-listing-of-all-middleman-pages.md)
 - [Create Named Structs With Struct.new](ruby/create-named-structs-with-struct-new.md)
 - [Create Thumbnail Image For A PDF](ruby/create-thumbnail-image-for-a-pdf.md)
+- [Decompose Unicode Character With Diacritic Mark](ruby/decompose-unicode-character-with-diacritic-mark.md)
 - [Defaulting To Frozen String Literals](ruby/defaulting-to-frozen-string-literals.md)
 - [Define A Custom RSpec Matcher](ruby/define-a-custom-rspec-matcher.md)
 - [Define A Method On A Struct](ruby/define-a-method-on-a-struct.md)
diff --git a/ruby/decompose-unicode-character-with-diacritic-mark.md b/ruby/decompose-unicode-character-with-diacritic-mark.md
new file mode 100644
index 0000000..d040697
--- /dev/null
+++ b/ruby/decompose-unicode-character-with-diacritic-mark.md
@@ -0,0 +1,55 @@
+# Decompose Unicode Character With Diacritic Mark
+
+A character like `ñ` is typically represented by the single Unicode codepoint
+`U+00F1`. However, it can also be represented with two Unicode codepoints --
+the base letter `n` (`U+006E`) and the combining tilde `˜`
+(`U+0303`).
+
+We can see that by comparing a typed `ñ` with one that has been split into its
+separate codepoints using
+[`#unicode_normalize`](https://apidock.com/ruby/v2_5_5/String/unicode_normalize)
+with `:nfd`, short for _Normalization Form D_ (canonical decomposition).
+
+```ruby
+> "ñ" == "ñ".unicode_normalize(:nfd)
+=> false
+> "ñ".unicode_normalize(:nfd).length
+=> 2
+> "ñ".length
+=> 1
+```
+
+We can inspect the exact codepoints by iterating over each character and
+printing out the codepoint value.
+
+```ruby
+"ñ".each_char.with_index do |char, i|
+  puts "#{i}: '#{char}' -> U+#{char.ord.to_s(16).upcase.rjust(4, '0')}"
+end
+# 0: 'ñ' -> U+00F1
+# => "ñ"
+
+"ñ".unicode_normalize(:nfd).each_char.with_index do |char, i|
+  puts "#{i}: '#{char}' -> U+#{char.ord.to_s(16).upcase.rjust(4, '0')}"
+end
+# 0: 'n' -> U+006E
+# 1: '̃' -> U+0303
+# => "ñ"
+```
+
+Once the character has been decomposed, the diacritic is its own codepoint,
+separate from the base letter.
+
+The same decomposition works for other characters that carry diacritics.
+
+We can also go the other direction, building each form from codepoints with
+[`#pack`](https://ruby-doc.org/core-3.0.1/Array.html#method-i-pack).
+
+```ruby
+> [0x006E, 0x0303].pack("U*")
+=> "ñ"
+> [0x00F1].pack("U*")
+=> "ñ"
+> [0x006E, 0x0303].pack("U*") == [0x00F1].pack("U*")
+=> false
+```
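
As a rough sketch of the same idea applied to a few other accented characters, the snippet below decomposes each one with `:nfd` and then recomposes it with `:nfc`. The sample characters and the `codepoints_of` helper are illustrative assumptions, not part of the TIL itself. The characters are written as `\u` escapes so the starting form is unambiguous regardless of how an editor stores them.

```ruby
# Precomposed characters, written as escapes so the starting form is certain.
composed = ["\u00E9", "\u00FC", "\u00F1"] # é, ü, ñ

# Hypothetical helper: format each codepoint of a string as U+XXXX.
def codepoints_of(str)
  str.codepoints.map { |cp| format("U+%04X", cp) }
end

# :nfd splits each character into a base letter plus a combining mark.
decomposed = composed.map { |char| char.unicode_normalize(:nfd) }

composed.zip(decomposed).each do |before, after|
  puts "#{codepoints_of(before).join(' ')} -> #{codepoints_of(after).join(' ')}"
end
# U+00E9 -> U+0065 U+0301
# U+00FC -> U+0075 U+0308
# U+00F1 -> U+006E U+0303

# :nfc goes back the other way, recombining base letter and mark.
recomposed = decomposed.map { |char| char.unicode_normalize(:nfc) }
puts recomposed == composed
# true
```

Normalizing both sides to the same form before comparing is one way to treat the one-codepoint and two-codepoint spellings as equal.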