why catch-cap?

llms hallucinate. they generate fluent, confident responses that are factually wrong. catch-cap detects these hallucinations (or confabulations) before they reach your end-users.

multi-method detection

combines multiple detection signals (semantic entropy, log probabilities, web grounding, llm judge) for reliable hallucination detection

near production-ready

automatic retries, rate limiting, graceful degradation

confidence scores

every detection includes a 0-1 confidence score

auto-correction

provides web-grounded corrections when a hallucination is detected

how it works

catch-cap uses multiple signals to detect hallucinations:

  • semantic entropy: does the model give consistent answers when asked multiple times?
  • log probabilities: is the model uncertain about specific tokens?
  • web grounding: do real-world sources support the claim?
  • llm judge: does another model think it's accurate?

all signals are combined into a single confidence score, making it easy to decide whether to trust the response.
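
one simple way to picture the combination is a weighted average of the individual signals. the sketch below only illustrates that idea; the weights and the combine_signals function are assumptions for illustration, not part of catch-cap's api or internals.

def combine_signals(entropy_signal, logprob_signal, web_signal, judge_signal):
    # each input is a 0-1 score from one detection signal;
    # the weights here are made up for illustration, not catch-cap's internals.
    weights = (0.3, 0.2, 0.3, 0.2)
    signals = (entropy_signal, logprob_signal, web_signal, judge_signal)
    score = sum(w * s for w, s in zip(weights, signals))
    # clamp to the 0-1 range used throughout the library
    return min(1.0, max(0.0, score))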

installation

pip install catch-cap

api keys

set your api keys as environment variables:

export OPENAI_API_KEY="your-key"
export GEMINI_API_KEY="your-key"
export TAVILY_API_KEY="your-key"

or create a .env file and catch-cap will load it automatically.
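
for example, a .env file in the project root:

OPENAI_API_KEY=your-key
GEMINI_API_KEY=your-key
TAVILY_API_KEY=your-key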

quick start

import asyncio
from catch_cap import CatchCap, CatchCapConfig, ModelConfig

async def main():
    config = CatchCapConfig(
        generator=ModelConfig(provider="openai", name="gpt-4.1-mini")
    )
    
    detector = CatchCap(config)
    result = await detector.run("how many r's are in strawberry?")
    
    print(f"hallucination detected: {result.confabulation_detected}")
    print(f"confidence: {result.metadata['confidence_level']}")
    
    if result.corrected_answer:
        print(f"corrected: {result.corrected_answer}")

asyncio.run(main())

understanding results

every detection returns a CatchCapResult with:

  • confabulation_detected: true if a hallucination was found
  • responses: all generated responses
  • corrected_answer: web-grounded correction, when available
  • metadata: detection details, including confidence_score (0-1), confidence_level ("very high" through "very low"), timing, methods used, and any errors

semantic entropy

when a model is confident, it gives consistent answers. when it's hallucinating, responses vary wildly.

semantic entropy measures this consistency by:

  1. generating multiple responses to the same query
  2. converting responses to embeddings (vector representations)
  3. calculating how similar the embeddings are
  4. high similarity = low entropy = confident model
  5. low similarity = high entropy = likely hallucinating
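
in code, the core of this check can be sketched as follows. this is an illustration of the idea only, not catch-cap's internal implementation; the embed callable is assumed to wrap whatever embeddings api you use.

import itertools

import numpy as np

def consistency_entropy(responses, embed):
    # embed(text) is assumed to return a 1-d vector (e.g. from an embeddings api);
    # it is not part of catch-cap's public interface. assumes >= 2 responses.
    vectors = [np.asarray(embed(text), dtype=float) for text in responses]
    similarities = [
        float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        for a, b in itertools.combinations(vectors, 2)
    ]
    # high average similarity -> low entropy (consistent, confident model);
    # low average similarity -> high entropy (likely hallucinating)
    return 1.0 - sum(similarities) / len(similarities)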

configuration

config = CatchCapConfig(
    semantic_entropy=SemanticEntropyConfig(
        n_responses=5,      # generate 5 responses
        threshold=0.25      # flag if entropy > 0.25
    )
)

tuning tip: use 3-5 responses for real-time apps, 7-10 for critical applications where accuracy matters most.

log probabilities

every token the model generates has a probability. low probabilities mean the model is uncertain—often a sign of hallucination.

catch-cap flags responses where:

  • a large fraction of the response's tokens fall below the log-probability threshold
  • the absolute number of flagged tokens exceeds a minimum count

configuration

config = CatchCapConfig(
    logprobs=LogProbConfig(
        min_logprob=-4.5,           # flag tokens below this
        fraction_threshold=0.2,      # or if >20% tokens suspicious
        min_flagged_tokens=5         # or if >=5 tokens flagged
    )
)
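
as a toy illustration of how these three thresholds interact (a simplification for clarity, not catch-cap's exact logic):

def should_flag(token_logprobs, min_logprob=-4.5, fraction_threshold=0.2,
                min_flagged_tokens=5):
    # tokens below min_logprob are the ones the model was unusually unsure about
    flagged = [lp for lp in token_logprobs if lp < min_logprob]
    fraction = len(flagged) / max(len(token_logprobs), 1)
    # flag if a large share of tokens is uncertain, or the absolute count is high
    return fraction > fraction_threshold or len(flagged) >= min_flagged_tokens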

provider support: openai has full support. gemini and groq have limited log probability support.

web grounding

validate responses against real-world information by searching the web and comparing results.

how it works

  1. search the web for relevant information (tavily or searxng)
  2. synthesize search results into a coherent answer
  3. compare model response with web-grounded answer
  4. flag discrepancies

configuration

config = CatchCapConfig(
    web_search=WebSearchConfig(
        provider="tavily",
        max_results=5,
        synthesizer_model=ModelConfig(
            provider="openai",
            name="gpt-4.1-nano"  # cheap model for synthesis
        )
    )
)

cost optimization: use cheaper models (gpt-4.1-nano, gemini-flash) for synthesis since it's a simple task.

confidence scoring

all detection signals are combined into a single 0-1 confidence score.

confidence levels

  • very high (0.9+): extremely confident in verdict
  • high (0.7-0.9): strong evidence
  • medium (0.5-0.7): moderate evidence
  • low (0.3-0.5): weak evidence
  • very low (0-0.3): insufficient evidence
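
catch-cap already exposes the level via result.metadata['confidence_level']; the helper below only shows how the bands above map a raw score to a level.

def confidence_level(score):
    # bands taken from the list above
    if score >= 0.9:
        return "very high"
    if score >= 0.7:
        return "high"
    if score >= 0.5:
        return "medium"
    if score >= 0.3:
        return "low"
    return "very low"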

using confidence in production

async def handle_query(detector, query):
    result = await detector.run(query)
    confidence = result.metadata['confidence_score']

    if result.confabulation_detected and confidence >= 0.7:
        # high confidence - use corrected answer
        return result.corrected_answer
    elif result.confabulation_detected and confidence >= 0.4:
        # medium confidence - flag for human review (your own hook)
        flag_for_review(result)
    else:
        # low confidence or no detection - use original
        return result.responses[0].text

configuration guide

minimal config

config = CatchCapConfig(
    generator=ModelConfig(provider="openai", name="gpt-4.1-mini")
)

production config

config = CatchCapConfig(
    generator=ModelConfig(
        provider="openai",
        name="gpt-4.1-mini",
        temperature=0.6
    ),
    semantic_entropy=SemanticEntropyConfig(
        n_responses=5,
        threshold=0.25
    ),
    logprobs=LogProbConfig(
        min_logprob=-4.5,
        fraction_threshold=0.15
    ),
    web_search=WebSearchConfig(
        provider="tavily",
        synthesizer_model=ModelConfig(
            provider="openai",
            name="gpt-4.1-nano"
        )
    ),
    judge=JudgeConfig(
        model=ModelConfig(provider="openai", name="gpt-4.1-nano")
    ),
    enable_correction=True,
    rate_limit_rpm=60
)

fast & cheap config

config = CatchCapConfig(
    generator=ModelConfig(provider="gemini", name="gemini-2.0-flash"),
    semantic_entropy=SemanticEntropyConfig(n_responses=3),
    logprobs=LogProbConfig(enabled=False),
    web_search=WebSearchConfig(provider="none")
)

model providers

openai

best overall support. full log probability support, batched embeddings.

recommended models: gpt-4.1-mini (balanced), gpt-4.1-nano (fast/cheap), gpt-4.1 (highest quality)

gemini

very fast and cost-effective. limited log probability support.

recommended models: gemini-2.0-flash, gemini-1.5-pro

groq

extremely fast inference. limited log probability support, no native embeddings.

recommended models: llama-3.3-70b-versatile

mixing providers: use different providers for different components (e.g., gemini for generation, openai for embeddings).
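
for example, a mixed setup using only the configuration fields shown earlier (gemini for generation, openai for synthesis and judging) might look like:

config = CatchCapConfig(
    generator=ModelConfig(provider="gemini", name="gemini-2.0-flash"),
    web_search=WebSearchConfig(
        provider="tavily",
        synthesizer_model=ModelConfig(provider="openai", name="gpt-4.1-nano")
    ),
    judge=JudgeConfig(
        model=ModelConfig(provider="openai", name="gpt-4.1-nano")
    )
)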

examples

batch processing

async def process_batch(queries):
    config = CatchCapConfig(
        generator=ModelConfig(provider="openai", name="gpt-4.1-mini"),
        rate_limit_rpm=30
    )
    detector = CatchCap(config)
    
    results = []
    for query in queries:
        result = await detector.run(query)
        results.append({
            'query': query,
            'hallucination': result.confabulation_detected,
            'confidence': result.metadata['confidence_score']
        })
    
    return results

error handling

from catch_cap.exceptions import CatchCapError

try:
    result = await detector.run(query)
    
    # check for component failures
    if result.metadata.get('errors'):
        print(f"warnings: {result.metadata['errors']}")
    
except CatchCapError as e:
    print(f"detection failed: {e}")

custom judge prompts

config = CatchCapConfig(
    judge=JudgeConfig(
        model=ModelConfig(provider="openai", name="gpt-4"),
        instructions="""compare responses for numerical accuracy.
        numbers must match exactly. even 1 digit difference = INCONSISTENT.
        return only CONSISTENT or INCONSISTENT."""
    )
)

api reference

catchcap class

# constructor
detector = CatchCap(config: CatchCapConfig)

# detect hallucinations
result = await detector.run(query: str) -> CatchCapResult

result object

  • confabulation_detected (bool): true if hallucination found
  • responses (list): all generated responses
  • corrected_answer (str): web-grounded correction
  • metadata (dict): confidence, timing, errors

metadata fields

  • confidence_score: 0-1 confidence in the verdict
  • confidence_level: "very high", "high", "medium", "low", or "very low"
  • detection_time_seconds: total pipeline execution time
  • detection_methods: list of methods used
  • errors: component failures, if any
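
for example, to log timing and which methods ran (inside an async context, and assuming detection_methods is a list of strings):

result = await detector.run(query)
meta = result.metadata
print(f"took {meta['detection_time_seconds']:.2f}s via {', '.join(meta['detection_methods'])}")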

key features

  • automatic retries: 3 attempts with exponential backoff
  • batched embeddings: 10x faster, 10x cheaper
  • graceful degradation: continues if components fail
  • rate limiting: prevent quota exhaustion
  • structured logging: full pipeline observability

catch-cap | built by axon labs
