mirror of https://github.com/jbranchaud/til synced 2026-07-02 23:58:25 +00:00

Add Count Number Of Tokens In A File as an LLM TIL

jbranchaud
2026-04-03 09:23:11 -05:00
parent b873f86f5b
commit 0c4702be97
2 changed files with 28 additions and 1 deletions
@@ -0,0 +1,26 @@
# Count Number Of Tokens In A File

Over time you have accumulated a bunch of small directives, corrections, and
project details in your `CLAUDE.md` or `AGENTS.md` file. The file doesn't seem
too big, but you are mindful that it is being included in every prompt. How many
tokens is it eating from the context window?

OpenAI's BPE (Byte Pair Encoding) tokenization library,
[`tiktoken`](https://github.com/openai/tiktoken), is an open-source Python
package. If it is installed on your machine (e.g. with `pip install tiktoken`),
you can use it in the following one-liner to check a file:

```bash
python -c "import tiktoken, sys; print(len(tiktoken.encoding_for_model('gpt-4o').encode(open(sys.argv[1], 'r', encoding='utf-8').read())))" \
AGENTS.md
1018
```

I ran this against the `AGENTS.md` file in a team project I'm on. It came out to
1018 tokens. That count is exact for `gpt-4o`, since it comes from that model's
own tokenizer. Tokenizers differ from model to model (Claude, for instance, does
not use OpenAI's encodings), so for other models treat the number as an
estimate. For gauging how much of the context window the file takes up, that
estimate is close enough.
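
To see how much the count actually moves between tokenizers, the same text can be run through more than one encoding. The following is a minimal sketch, not part of the original TIL: the helper names `compare_token_counts` and `load_tiktoken_encoders` are hypothetical, and the two encoding names are simply two that `tiktoken` ships (`o200k_base` is the one `gpt-4o` resolves to).

```python
from typing import Callable, Dict, Sequence


def compare_token_counts(
    text: str, encoders: Dict[str, Callable[[str], Sequence]]
) -> Dict[str, int]:
    """Return the token count of `text` under each named encode function."""
    return {name: len(encode(text)) for name, encode in encoders.items()}


def load_tiktoken_encoders(names=("cl100k_base", "o200k_base")):
    """Build a name -> encode-function mapping from tiktoken encodings.

    Hypothetical helper. tiktoken is imported lazily so that
    compare_token_counts can be used without it installed.
    """
    import tiktoken

    return {name: tiktoken.get_encoding(name).encode for name in names}
```

Calling `compare_token_counts(open("AGENTS.md", encoding="utf-8").read(), load_tiktoken_encoders())` returns one count per encoding, which gives a feel for the spread on your own file. It says nothing about tokenizers that `tiktoken` does not include.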

This one-liner reads the file named by the first command-line argument
(`sys.argv[1]`), encodes the resulting string with the tokenizer, and prints the
length of the encoded token list.
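
If the one-liner gets unwieldy, the same steps can live in a small script. This is a sketch under a few assumptions: the file name `count_tokens.py` and the function names are made up for illustration, and `gpt-4o` is kept as the default model only because the one-liner uses it.

```python
import sys
from pathlib import Path


def count_tokens(text, encoding):
    """Return how many tokens `encoding` produces for `text`."""
    return len(encoding.encode(text))


def count_file_tokens(path, model="gpt-4o"):
    """Read `path` as UTF-8 and count its tokens with `model`'s tokenizer."""
    # Imported lazily so count_tokens can be used without tiktoken installed.
    import tiktoken

    encoding = tiktoken.encoding_for_model(model)
    text = Path(path).read_text(encoding="utf-8")
    return count_tokens(text, encoding)


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: python count_tokens.py FILE", file=sys.stderr)
    else:
        print(count_file_tokens(sys.argv[1]))
```

Saved as `count_tokens.py`, it would be run as `python count_tokens.py AGENTS.md` and print a single number, the same way the one-liner does.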