contextf

intelligent context builder. fewer tokens, same quality.


// overview

contextf builds relevant context from document collections using search patterns and token-aware processing. instead of stuffing entire documents into a prompt, it extracts only the sections that matter.

two approaches: generate search patterns from your query with an llm, or supply patterns manually.

// installation

pip install contextf
pip install contextf[pdf]    # with pdf parsing support

// quick start

from contextf import ContextBuilder

# with llm-generated patterns
builder = ContextBuilder(
    docs_path="./documents",
    max_context_tokens=200000,
    openai_api_key="your-key"
)

result = builder.build_context(
    query="what are the key findings on hallucination detection?"
)

print(f"tokens used: {result['context_tokens']}")
print(f"files matched: {len(result['files_used'])}")
print(result['context'])

# with manual patterns
result = builder.build_context(
    patterns=["hallucination", "detection method", "semantic entropy"],
    file_patterns=["*.md"]
)

// configuration

configure via a json file or direct parameters:

{
  "search": {
    "docs_path": "./documents",
    "file_patterns": ["*.md", "*.txt"],
    "max_patterns_per_query": 3,
    "max_matches_per_file": 5,
    "case_sensitive": false
  },
  "tokens": {
    "max_context_tokens": 200000,
    "context_window_tokens": 8000,
    "max_file_tokens": 50000,
    "encoding": "cl100k_base"
  },
  "llm": {
    "enabled": true,
    "model": "gpt-4.1-mini",
    "temperature": 0.7
  }
}
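a minimal sketch of turning the config file into constructor kwargs. this assumes each key under "search" and "tokens" maps one-to-one onto a ContextBuilder parameter name (a hypothetical helper; the library may ship its own loader):

```python
import json

def load_builder_kwargs(path):
    """Flatten the nested config json into a flat kwargs dict.

    Assumes keys under "search" and "tokens" match ContextBuilder
    parameter names one-to-one (hypothetical mapping).
    """
    with open(path) as f:
        cfg = json.load(f)
    kwargs = {}
    kwargs.update(cfg.get("search", {}))
    kwargs.update(cfg.get("tokens", {}))
    return kwargs
```

the flattened dict can then be splatted straight into the constructor: `ContextBuilder(**load_builder_kwargs("config.json"), openai_api_key="your-key")`.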

// api reference

ContextBuilder

docs_path str — directory containing documents (default: "./documents")
max_context_tokens int — maximum total context tokens (default: 200000)
context_window_tokens int — window size around matches (default: 8000)
max_patterns_per_query int — max search patterns to generate (default: 5)
max_matches_per_file int — max matches per file (default: 5)
case_sensitive bool — case-sensitive search (default: false)
encoding str — tokenizer encoding (default: "cl100k_base")
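context_window_tokens controls how much text around each match survives. a simplified, character-based sketch of the idea (the real library counts tokens via the configured encoding, not characters; `window_around` is a hypothetical name):

```python
def window_around(text, match_start, match_end, window=80):
    """Return a slice of text centered on a match.

    Character-based stand-in for token-based windowing; `window`
    plays the role of context_window_tokens.
    """
    half = window // 2
    start = max(0, match_start - half)   # clamp at the file start
    end = min(len(text), match_end + half)  # clamp at the file end
    return text[start:end]
```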

build_context(query, patterns, docs_path, file_patterns)

pass either query (patterns are then generated by the llm) or patterns directly; docs_path and file_patterns override the builder defaults.

returns a dictionary:

context str — merged context text from all matches
context_tokens int — total token count
files_used list — file details and contribution metadata
matches list — all pattern matches with locations
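since `context` is the merged text of all match windows, windows from nearby matches presumably collapse into a single span. a sketch of that merge step over (start, end) character offsets (a hypothetical helper, not the library's actual code):

```python
def merge_spans(spans):
    """Merge overlapping or touching (start, end) spans so that
    overlapping match windows aren't duplicated in the context."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # overlaps or touches the previous span: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```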

// eval results

evaluated against naive full-document context using 7 research papers and 10 queries. scored by gpt-4.1 as judge across accuracy, completeness, relevance, and clarity (1-10 each).

efficiency and quality charts: see the eval repo for plots and raw numbers.
when to use what

contextf works best for: focused queries, latency-sensitive apps, cost optimization, targeted deep dives.

full context works best for: comprehensive literature synthesis, cross-document comparison, exhaustive coverage.

full evaluation code and results: contextf-eval

// utilities

PDFParser

from contextf.utils import PDFParser

# single pdf
PDFParser.convert_pdf_to_markdown("paper.pdf", "paper.md")

# batch conversion
PDFParser.convert_pdfs_to_markdown("./pdfs/", "./markdown/")

TokenCounter

from contextf.utils import TokenCounter

count = TokenCounter.count_tokens_in_file("document.md")
summary = TokenCounter.get_directory_summary("./documents/")
TokenCounter.print_directory_report("./documents/")
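a rough stand-in for get_directory_summary using a whitespace word count instead of a real tokenizer (the library counts tokens with the cl100k_base encoding, so real numbers will differ; `approx_directory_summary` is a hypothetical name):

```python
from pathlib import Path

def approx_directory_summary(root, patterns=("*.md", "*.txt")):
    """Approximate per-file token counts with a whitespace split.

    Simplified stand-in for TokenCounter.get_directory_summary.
    """
    summary = {}
    for pattern in patterns:
        for path in Path(root).rglob(pattern):
            text = path.read_text(encoding="utf-8", errors="ignore")
            summary[str(path)] = len(text.split())
    return summary
```

this is handy for a quick sanity check that a docs directory fits under max_context_tokens before wiring up the builder.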