Add Count The Number Of Words On A Webpage as a Unix TIL

2026-07-06 09:10:34 +00:00 · 2025-02-05 11:28:06 -06:00
parent 96c394c198
commit 633c1fa0a5
2 changed files with 27 additions and 1 deletions
@@ -10,7 +10,7 @@ pairing with smart people at Hashrocket.

 For a steady stream of TILs, [sign up for my newsletter](https://crafty-builder-6996.ck.page/e169c61186).

-_1584 TILs and counting..._
+_1585 TILs and counting..._

 See some of the other learning resources I work on:
 - [Ruby Operator Lookup](https://www.visualmode.dev/ruby-operators)
@@ -1498,6 +1498,7 @@ See some of the other learning resources I work on:
 - [Count The Lines In A CSV Where A Column Is Empty](unix/count-the-lines-in-a-csv-where-a-column-is-empty.md)
 - [Count The Number Of Matches In A Grep](unix/count-the-number-of-matches-in-a-grep.md)
 - [Count The Number Of ripgrep Pattern Matches](unix/count-the-number-of-ripgrep-pattern-matches.md)
+- [Count The Number Of Words On A Webpage](unix/count-the-number-of-words-on-a-webpage.md)
 - [Create A File Descriptor with Process Substitution](unix/create-a-file-descriptor-with-process-substitution.md)
 - [Create A Sequence Of Values With A Step](unix/create-a-sequence-of-values-with-a-step.md)
 - [Curl With Cookies](unix/curl-with-cookies.md)
@@ -0,0 +1,25 @@
+# Count The Number Of Words On A Webpage
+
+I was reading through a couple of sections of the `postfix` documentation and
+was astounded at how large the webpage is, and that page covers only the
+`main.cf` file format.
+
+Curiosity got the best of me and I wanted to get a sense of the magnitude of
+the page. A word count seemed like a good measure.
+
+Using `pandoc` and a couple of other Unix utilities (`curl` and `wc`), I was
+able to get that number quickly.
+
+```bash
+curl -s http://www.postfix.org/postconf.5.html\#virtual_mailbox_maps | pandoc -f html -t plain | wc -w
+   88383
+```
+
+In general, for any URL, that is:
+
+```bash
+curl -s url | pandoc -f html -t plain | wc -w
+```
+
+`curl -s` fetches the HTML without printing progress output, `pandoc` converts
+it to plain text, and `wc -w` counts the words in the result.