mirror of
https://github.com/jbranchaud/til
synced 2026-07-02 23:58:25 +00:00
Add Count Number Of Tokens In A File as an LLM TIL
@@ -10,7 +10,7 @@ working across different projects via [VisualMode](https://www.visualmode.dev/).
 
 For a steady stream of TILs, [sign up for my newsletter](https://visualmode.kit.com/newsletter).
 
-_1772 TILs and counting..._
+_1773 TILs and counting..._
 
 See some of the other learning resources I work on:
 
@@ -716,6 +716,7 @@ If you've learned something here, support my efforts writing daily TILs by
 
 ### LLM
 
+- [Count Number Of Tokens In A File](llm/count-number-of-tokens-in-a-file.md)
 - [Send cURL To Claude Text Completion API](llm/send-curl-to-claude-text-completion-api.md)
 - [Use The llm CLI With Claude Models](llm/use-the-llm-cli-with-claude-models.md)
 
@@ -0,0 +1,26 @@
+# Count Number Of Tokens In A File
+
+Over time you have accumulated a bunch of small directives, corrections, and
+project details in your `CLAUDE.md` or `AGENTS.md` file. The file doesn't seem
+too big, but you are mindful that it is being included in every prompt. How many
+tokens is it eating from the context window?
+
+OpenAI's BPE (Byte Pair Encoding) tokenization library,
+[`tiktoken`](https://github.com/openai/tiktoken), is an open-source Python
+package. If it is installed on our machine, then we can use it as part of the
+following one-liner to check a file:
+
+```bash
+❯ python -c "import tiktoken, sys; print(len(tiktoken.encoding_for_model('gpt-4o').encode(open(sys.argv[1], 'r', encoding='utf-8').read())))" \
+  AGENTS.md
+1018
+```
+
+I ran this against the `AGENTS.md` file in a team project I'm on. It came out to
+1018 tokens. This is a very good approximation based on the tokenizer trained
+for `gpt-4o`. The tokenizers may vary a little from model to model, but the
+differences for our purposes here are going to be negligible.
+
+This one-liner gets the "first" argument to the command, reads it in, and runs
+that string against the tokenizer. The length of the tokenized encoding is then
+printed.
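For readers who would rather keep this around as a script, the one-liner above could be unpacked roughly as follows. This is only a sketch: the helper names `count_tokens`, `count_file_tokens`, and `gpt4o_encoder` are hypothetical and not part of `tiktoken`; the only `tiktoken` calls used are the same `encoding_for_model` and `encode` that appear in the one-liner.

```python
import sys
from pathlib import Path
from typing import Callable, Sequence

# Hypothetical alias: anything that turns a string into a sequence of tokens.
Encoder = Callable[[str], Sequence]


def count_tokens(text: str, encode: Encoder) -> int:
    """Return how many tokens `encode` produces for `text`."""
    return len(encode(text))


def count_file_tokens(path: str, encode: Encoder) -> int:
    """Read `path` as UTF-8 and count its tokens with `encode`."""
    return count_tokens(Path(path).read_text(encoding="utf-8"), encode)


def gpt4o_encoder() -> Encoder:
    """Build the encoder used by the one-liner (assumes `tiktoken` is installed)."""
    import tiktoken  # imported lazily so the helpers above work without it

    return tiktoken.encoding_for_model("gpt-4o").encode


if __name__ == "__main__" and len(sys.argv) > 1:
    # Same behavior as the one-liner: the first argument is the file to measure.
    print(count_file_tokens(sys.argv[1], gpt4o_encoder()))
```

Saved under a name of your choosing, it would be invoked with the file path as the first argument, just like the one-liner. Passing the encoder in as a parameter is a design choice of this sketch: it makes it easy to swap in a different tokenizer if you want to compare counts.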