mirror of
https://github.com/jbranchaud/til
synced 2026-03-03 22:48:45 +00:00
Add Combine All My TILs Into A Single File as a Unix TIL
@@ -10,7 +10,7 @@ working across different projects via [VisualMode](https://www.visualmode.dev/).
 For a steady stream of TILs, [sign up for my newsletter](https://visualmode.kit.com/newsletter).

-_1749 TILs and counting..._
+_1750 TILs and counting..._

 See some of the other learning resources I work on:
@@ -1648,6 +1648,7 @@ If you've learned something here, support my efforts writing daily TILs by
 - [Check The Current Working Directory](unix/check-the-current-working-directory.md)
 - [Check The Installed OpenSSL Version](unix/check-the-installed-openssl-version.md)
 - [Clear The Screen](unix/clear-the-screen.md)
+- [Combine All My TILs Into A Single File](unix/combine-all-my-tils-into-a-single-file.md)
 - [Command Line Length Limitations](unix/command-line-length-limitations.md)
 - [Compare Two Variables In A Bash Script](unix/compare-two-variables-in-a-bash-script.md)
 - [Configure cd To Behave Like pushd In Zsh](unix/configure-cd-to-behave-like-pushd-in-zsh.md)
35
unix/combine-all-my-tils-into-a-single-file.md
Normal file
@@ -0,0 +1,35 @@
# Combine All My TILs Into A Single File

In [Build A Small Text-based Training
Dataset](https://www.visualmode.dev/build-a-small-text-training-dataset), I went
over my need for a sizeable and interesting corpus of text that I could use as a
training dataset to run against [my own naive Byte Pair Encoding
implementation](https://github.com/jbranchaud/build-an-llm-from-scratch/blob/main/chapter-02/bpe_tokenizer.py).
My repo of hand-written TILs is a great candidate, but I need them all smashed
into one file.

Here is a formatted version of the one-liner I ended up with:

```bash
{
  cat README.md; \
  find */ -name '*.md' -print0 \
    | sort -z \
    | xargs -0 -I{} sh -c 'echo "<|endoftext|>"; cat "$1"' _ {}; \
} > combined.md
```

This combines the README and all 1700+ of my TILs into a single file, with each
TIL preceded by the `<|endoftext|>` delimiter.
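
A quick way to sanity-check the result is to compare the number of source files with the number of delimiter lines. The sketch below is not part of the original one-liner: it builds a tiny scratch directory (the file names are made up for illustration), runs an equivalent pipeline, and counts both.

```bash
set -eu

# Scratch directory standing in for the TIL repo (hypothetical file names).
workdir="$(mktemp -d)"
cd "$workdir"
mkdir unix vim
printf '# README\n' > README.md
printf '# Clear The Screen\n' > unix/clear-the-screen.md
printf '# Undo\n' > vim/undo.md

# Equivalent to the pipeline above, without the one-liner's "; \" joins.
{
  cat README.md
  find */ -name '*.md' -print0 \
    | sort -z \
    | xargs -0 -I{} sh -c 'echo "<|endoftext|>"; cat "$1"' _ {}
} > combined.md

# Count the source files and the delimiter lines; the two should match.
file_count="$(find */ -name '*.md' | wc -l | tr -d ' ')"
delimiter_count="$(grep -cF '<|endoftext|>' combined.md)"

echo "files: $file_count, delimiters: $delimiter_count"
```

Note that `grep -c` counts matching lines, so this check assumes no TIL contains the delimiter string in its own text.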

The two things I find most interesting about this command are:

1. The use of a null byte (`\0`) separator between the filenames in case there
   is anything weird (like spaces) in those filenames. This starts with
   `-print0`. The `-z` flag of `sort` maintains that null byte separator. And
   then `xargs` knows to handle it because of the `-0` flag.

2. We can coerce `xargs` into running multiple commands by having it spawn a
   shell process for each filename that runs those commands in sequence. To
   reliably pass the filename into that shell process, `xargs` substitutes it
   in where `{}` appears, right after the `_` placeholder that fills `$0`, so
   it arrives as the first positional parameter (`$1`).
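
Both points can be seen in isolation with a small experiment. This is a minimal sketch using a scratch directory and made-up file names, one of which contains a space. It first shows how `sh -c` assigns the arguments that follow the command string, then runs a null-delimited pipeline of the same shape as the one above.

```bash
set -eu

# Scratch directory with one awkward file name (hypothetical example data).
workdir="$(mktemp -d)"
cd "$workdir"
mkdir notes
printf 'first\n' > 'notes/a file.md'
printf 'second\n' > 'notes/b.md'

# sh -c assigns the arguments after the command string to $0, $1, ...
# The placeholder "_" fills $0, so the file name arrives as $1.
params="$(sh -c 'echo "0=$0 1=$1"' _ 'notes/a file.md')"
echo "$params"   # prints: 0=_ 1=notes/a file.md

# Null-delimited names survive the whole pipeline, spaces included.
# Because of -I{}, xargs starts one sh process per file name.
output="$(
  find notes -name '*.md' -print0 \
    | sort -z \
    | xargs -0 -I{} sh -c 'echo "--- $1"; cat "$1"' _ {}
)"
echo "$output"
```

Passing the name as `$1` rather than splicing `{}` directly into the command string keeps the shell from re-parsing the file name, which is what makes spaces and quotes in names safe.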