diff --git a/README.md b/README.md
index 2a5a699..b3ff05b 100644
--- a/README.md
+++ b/README.md
@@ -10,7 +10,7 @@ working across different projects via [VisualMode](https://www.visualmode.dev/).
 
 For a steady stream of TILs, [sign up for my newsletter](https://visualmode.kit.com/newsletter).
 
-_1749 TILs and counting..._
+_1750 TILs and counting..._
 
 See some of the other learning resources I work on:
 
@@ -1648,6 +1648,7 @@ If you've learned something here, support my efforts writing daily TILs by
 - [Check The Current Working Directory](unix/check-the-current-working-directory.md)
 - [Check The Installed OpenSSL Version](unix/check-the-installed-openssl-version.md)
 - [Clear The Screen](unix/clear-the-screen.md)
+- [Combine All My TILs Into A Single File](unix/combine-all-my-tils-into-a-single-file.md)
 - [Command Line Length Limitations](unix/command-line-length-limitations.md)
 - [Compare Two Variables In A Bash Script](unix/compare-two-variables-in-a-bash-script.md)
 - [Configure cd To Behave Like pushd In Zsh](unix/configure-cd-to-behave-like-pushd-in-zsh.md)
diff --git a/unix/combine-all-my-tils-into-a-single-file.md b/unix/combine-all-my-tils-into-a-single-file.md
new file mode 100644
index 0000000..7941353
--- /dev/null
+++ b/unix/combine-all-my-tils-into-a-single-file.md
@@ -0,0 +1,35 @@
+# Combine All My TILs Into A Single File
+
+In [Build A Small Text-based Training
+Dataset](https://www.visualmode.dev/build-a-small-text-training-dataset), I went
+over my need for a sizeable and interesting corpus of text that I could use as a
+training dataset to run against [my own naive Byte Pair Encoding
+implementation](https://github.com/jbranchaud/build-an-llm-from-scratch/blob/main/chapter-02/bpe_tokenizer.py).
+My repo of hand-written TILs is a great candidate, but I need those all smashed
+into one file.
+
+Here is a formatted version of the one-liner I ended up with:
+
+```bash
+{
+  cat README.md; \
+  find */ -name '*.md' -print0 \
+  | sort -z \
+  | xargs -0 -I{} sh -c 'echo "<|endoftext|>"; cat "$1"' _ {}; \
+} > combined.md
+```
+
+This combines all 1700+ of my TILs into a single file separated by the
+`<|endoftext|>` delimiter.
+
+The two things I find most interesting about this command are:
+
+1. The use of a null byte (`\0`) separator between the filenames in case there
+   is anything weird (like spaces) in those filenames. This starts with
+   `-print0`. The `-z` of `sort` maintains that null byte separator. And then
+   `xargs` knows to expect it because of the `-0` flag.
+
+2. We can coerce `xargs` into running multiple commands by having it spawn one
+   shell process per file that runs each of those commands. To reliably pass the
+   filename into that shell, `xargs` substitutes it where `{}` appears, after the
+   `_` placeholder that fills `$0`, so it arrives as positional parameter `$1`.
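Review note: the two points in the new TIL can be tried on a throwaway directory without touching the repo. The sketch below is only an illustration. `combine_md`, `demo_dir`, and the sample filenames are hypothetical names introduced here, and it assumes a `sort` that supports `-z` and an `xargs` that supports `-0`.

```bash
#!/usr/bin/env bash
# Sketch only: combine_md is a hypothetical helper, not part of the TIL repo.

# combine_md DIR: print every *.md file under DIR in sorted path order,
# each preceded by a line containing the <|endoftext|> delimiter.
combine_md() {
  find "$1" -name '*.md' -print0 \
    | sort -z \
    | xargs -0 -I{} sh -c 'echo "<|endoftext|>"; cat "$1"' _ {}
}

# Throwaway demo directory; one filename contains a space on purpose.
demo_dir="$(mktemp -d)"
printf 'first\n' > "$demo_dir/a file.md"
printf 'second\n' > "$demo_dir/b.md"

combined="$(combine_md "$demo_dir")"
printf '%s\n' "$combined"

# Show how `sh -c` assigns what follows the command string:
# `_` becomes $0 and the filename becomes $1.
params="$(sh -c 'printf "%s|%s" "$0" "$1"' _ 'a file.md')"
printf '%s\n' "$params"

rm -rf "$demo_dir"
```

Run as a script, this should print the delimiter before each file's contents, with the space-containing filename handled as a single argument, followed by `_|a file.md`.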