Combine All My TILs Into A Single File
In Build A Small Text-based Training Dataset, I went over my need for a sizeable and interesting corpus of text to use as a training dataset for my own naive Byte Pair Encoding implementation. My repo of hand-written TILs is a great candidate, but I need them all smashed into one file.
Here is a formatted version of the one-liner I ended up with:
{
  cat README.md; \
  find */ -name '*.md' -print0 \
    | sort -z \
    | xargs -0 -I{} sh -c 'echo "<|endoftext|>"; cat "$1"' _ {}; \
} > combined.md
This combines all 1700+ of my TILs into a single file separated by the <|endoftext|> delimiter.
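One way to sanity-check a combined file like this is to count the delimiter lines in it. The following is a minimal sketch that uses a small hypothetical stand-in file rather than the real combined.md; the variable names sample and delimiter_count are made up for illustration.

```shell
# Hypothetical stand-in for combined.md: a README followed by two delimited files.
sample=$(mktemp)
printf '%s\n' '# README' '<|endoftext|>' 'first TIL' '<|endoftext|>' 'second TIL' > "$sample"

# -F treats the pattern as a fixed string (so the | characters are literal),
# -x only matches whole lines, and -c prints the number of matching lines.
delimiter_count=$(grep -cxF '<|endoftext|>' "$sample")
echo "$delimiter_count"
```

Against the real output, that count should match the number of files that find picked up, assuming none of the files contains a line consisting solely of the delimiter.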
The two things I find most interesting about this command are:
- The use of a null byte (\0) separator between the filenames in case there is anything weird (like spaces) in those filenames. This starts with -print0. The -z flag of sort maintains that null byte separator. And then xargs knows to handle it because of the -0 flag.
- We can coerce xargs into running multiple commands by having it spawn a shell process (one per filename) that runs each of those commands. To reliably pass the filename into that shell process, we have xargs supply it as the first positional parameter ($1) by substituting in the filename where {} appears. The _ just before it fills $0.
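Both points can be seen in a throwaway directory. This is a minimal sketch, not part of the original one-liner: the directory layout, the filenames (one deliberately containing a space), and the names demo_dir and demo_out are all hypothetical.

```shell
# Scratch directory with hypothetical files, one with a space in its name.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/git" "$demo_dir/vim"
printf 'first\n' > "$demo_dir/git/a note.md"
printf 'second\n' > "$demo_dir/vim/b.md"

# Same pipeline shape as the one-liner, run from inside the scratch directory.
# Each filename arrives null-terminated, and xargs starts one sh per filename
# with "_" as $0 and the filename as $1.
demo_out=$(
  cd "$demo_dir" &&
  find */ -name '*.md' -print0 \
    | sort -z \
    | xargs -0 -I{} sh -c 'echo "<|endoftext|>"; cat "$1"' _ {}
)
printf '%s\n' "$demo_out"
```

The file with the space in its name comes through intact because nothing in the pipeline splits on whitespace. Passing the filename as "$1" rather than pasting {} directly into the command string also means the shell never interprets the filename as code.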