1
0
mirror of https://github.com/jbranchaud/til synced 2026-01-02 22:58:01 +00:00

Add Count The Number Of Words On A Webpage as a Unix TIL

This commit is contained in:
jbranchaud
2025-02-05 11:28:06 -06:00
parent 96c394c198
commit 633c1fa0a5
2 changed files with 27 additions and 1 deletions

View File

@@ -10,7 +10,7 @@ pairing with smart people at Hashrocket.
For a steady stream of TILs, [sign up for my newsletter](https://crafty-builder-6996.ck.page/e169c61186).
_1584 TILs and counting..._
_1585 TILs and counting..._
See some of the other learning resources I work on:
- [Ruby Operator Lookup](https://www.visualmode.dev/ruby-operators)
@@ -1498,6 +1498,7 @@ See some of the other learning resources I work on:
- [Count The Lines In A CSV Where A Column Is Empty](unix/count-the-lines-in-a-csv-where-a-column-is-empty.md)
- [Count The Number Of Matches In A Grep](unix/count-the-number-of-matches-in-a-grep.md)
- [Count The Number Of ripgrep Pattern Matches](unix/count-the-number-of-ripgrep-pattern-matches.md)
- [Count The Number Of Words On A Webpage](unix/count-the-number-of-words-on-a-webpage.md)
- [Create A File Descriptor with Process Substitution](unix/create-a-file-descriptor-with-process-substitution.md)
- [Create A Sequence Of Values With A Step](unix/create-a-sequence-of-values-with-a-step.md)
- [Curl With Cookies](unix/curl-with-cookies.md)

View File

@@ -0,0 +1,25 @@
# Count The Number Of Words On A Webpage
I was reading through a couple sections of the `postfix` documentation and I
was astounded at how large the webpage is, and that is just for the `main.cf`
file format.
Curiosity got the best of me and I wanted to get a sense of the magnitude of
the page. A word count seemed like a good measure.
Using `pandoc` and a couple other unix utilities, I was able to quickly get
that number.
```bash
curl -s http://www.postfix.org/postconf.5.html\#virtual_mailbox_maps | pandoc -f html -t plain | wc -w
88383
```
Generically, that is:
```bash
curl -s url | pandoc -f html -t plain | wc -w
```
Pandoc produces a plain-text version of the HTML page that was pulled in by
`curl` and then we use `wc` to get a word (`-w`) count.