diff --git a/README.md b/README.md
index 92ec913..dba0297 100644
--- a/README.md
+++ b/README.md
@@ -10,7 +10,7 @@ working across different projects via [VisualMode](https://www.visualmode.dev/).
 
 For a steady stream of TILs, [sign up for my newsletter](https://visualmode.kit.com/newsletter).
 
-_1772 TILs and counting..._
+_1773 TILs and counting..._
 
 See some of the other learning resources I work on:
 
@@ -716,6 +716,7 @@ If you've learned something here, support my efforts writing daily TILs by
 
 ### LLM
 
+- [Count Number Of Tokens In A File](llm/count-number-of-tokens-in-a-file.md)
 - [Send cURL To Claude Text Completion API](llm/send-curl-to-claude-text-completion-api.md)
 - [Use The llm CLI With Claude Models](llm/use-the-llm-cli-with-claude-models.md)
 
diff --git a/llm/count-number-of-tokens-in-a-file.md b/llm/count-number-of-tokens-in-a-file.md
new file mode 100644
index 0000000..5d5476b
--- /dev/null
+++ b/llm/count-number-of-tokens-in-a-file.md
@@ -0,0 +1,26 @@
+# Count Number Of Tokens In A File
+
+Over time you have accumulated a bunch of small directives, corrections, and
+project details in your `CLAUDE.md` or `AGENTS.md` file. The file doesn't seem
+too big, but you are mindful that it is being included in every prompt. How many
+tokens is it eating from the context window?
+
+OpenAI's BPE (Byte Pair Encoding) tokenization library,
+[`tiktoken`](https://github.com/openai/tiktoken), is an open-source Python
+package. If it is installed on our machine, then we can use it as part of the
+following one-liner to check a file:
+
+```bash
+❯ python -c "import tiktoken, sys; print(len(tiktoken.encoding_for_model('gpt-4o').encode(open(sys.argv[1], 'r', encoding='utf-8').read())))" \
+  AGENTS.md
+1018
+```
+
+I ran this against the `AGENTS.md` file in a team project I'm on. It came out to
+1018 tokens. This is a very good approximation based on the tokenizer trained
+for `gpt-4o`. The tokenizers may vary a little from model to model, but the
+differences for our purposes here are going to be negligible.
+
+This one-liner gets the "first" argument to the command, reads it in, and runs
+that string against the tokenizer. The length of the tokenized encoding is then
+printed.
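
If the one-liner gets reused often, the same steps (read the file named by the first argument, encode it, print the length) could be unpacked into a small script. The following is only a sketch of that idea, not part of the TIL itself: the helper names `count_tokens`, `count_file_tokens`, and `load_gpt4o_encoder` are hypothetical, and the only `tiktoken` calls used are the two that already appear in the one-liner (`encoding_for_model` and `encode`). The encoder is passed in as a plain callable, so the counting logic can be exercised without `tiktoken` installed.

```python
import sys
from typing import Callable, Sequence

# Hypothetical type alias: anything that turns a string into a sequence of tokens.
Encoder = Callable[[str], Sequence]


def count_tokens(text: str, encode: Encoder) -> int:
    """Return how many tokens `encode` produces for `text`."""
    return len(encode(text))


def count_file_tokens(path: str, encode: Encoder) -> int:
    """Read a UTF-8 text file and count its tokens with the given encoder."""
    with open(path, "r", encoding="utf-8") as f:
        return count_tokens(f.read(), encode)


def load_gpt4o_encoder() -> Encoder:
    """Return the encode function the one-liner uses (assumes tiktoken is installed)."""
    # Imported lazily so the helpers above work without tiktoken.
    import tiktoken

    return tiktoken.encoding_for_model("gpt-4o").encode


if __name__ == "__main__":
    paths = sys.argv[1:]
    if paths:
        encode = load_gpt4o_encoder()
        for path in paths:
            # One "count<TAB>path" line per file given on the command line.
            print(f"{count_file_tokens(path, encode)}\t{path}")
```

Saved under a name of your choosing (say, a hypothetical `count_tokens.py`), it would be invoked as `python count_tokens.py AGENTS.md CLAUDE.md`. As with the one-liner, the result is an approximation for any model that does not use the `gpt-4o` tokenizer.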