why catch-cap?
llms hallucinate. they generate fluent, confident responses that are factually wrong. catch-cap detects these hallucinations (or confabulations) before they reach your end-users.
multi-method detection
combines multiple detection methods for reliable hallucination detection
near production ready
automatic retries, rate limiting, graceful degradation
confidence scores
every detection includes a 0-1 confidence score
auto-correction
provides web-grounded corrections when a hallucination is detected
how it works
catch-cap uses multiple signals to detect hallucinations:
- semantic entropy: does the model give consistent answers when asked multiple times?
- log probabilities: is the model uncertain about specific tokens?
- web grounding: do real-world sources support the claim?
- llm judge: does another model think it's accurate?
all signals are combined into a single confidence score, making it easy to decide whether to trust the response.
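purely as an illustration of the idea (not catch-cap's actual aggregation logic), several per-signal scores can be pooled into one 0-1 value as simply as averaging them:

```python
# illustration only: one simple way to pool per-signal scores into a single
# 0-1 confidence value; catch-cap's internal weighting may differ
def combine_signals(entropy: float, logprob: float, web: float, judge: float) -> float:
    """each argument is a 0-1 score where higher means 'more likely hallucinating'."""
    return (entropy + logprob + web + judge) / 4
```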
installation
```bash
pip install catch-cap
```
api keys
set your api keys as environment variables:
```bash
export OPENAI_API_KEY="your-key"
export GEMINI_API_KEY="your-key"
export TAVILY_API_KEY="your-key"
```
or create a .env file and catch-cap will load it automatically.
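for example, a .env file in your project root with the same keys:

```
OPENAI_API_KEY=your-key
GEMINI_API_KEY=your-key
TAVILY_API_KEY=your-key
```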
quick start
```python
import asyncio
from catch_cap import CatchCap, CatchCapConfig, ModelConfig

async def main():
    config = CatchCapConfig(
        generator=ModelConfig(provider="openai", name="gpt-4.1-mini")
    )
    detector = CatchCap(config)

    result = await detector.run("how many r's are in strawberry?")

    print(f"hallucination detected: {result.confabulation_detected}")
    print(f"confidence: {result.metadata['confidence_level']}")
    if result.corrected_answer:
        print(f"corrected: {result.corrected_answer}")

asyncio.run(main())
```
understanding results
every detection returns a CatchCapResult with:
- confabulation_detected: true if hallucination found
- confidence_score: 0-1 confidence in the verdict
- confidence_level: human-readable ("high", "medium", "low")
- corrected_answer: web-grounded correction if available
- metadata: detection time, methods used, errors
semantic entropy
when a model is confident, it gives consistent answers. when it's hallucinating, responses vary wildly.
semantic entropy measures this consistency as follows (a code sketch comes after the list):
- generating multiple responses to the same query
- converting responses to embeddings (vector representations)
- calculating how similar the embeddings are
- high similarity = low entropy = confident model
- low similarity = high entropy = likely hallucinating
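a minimal sketch of the consistency check, assuming you already have one embedding per response (this shows the underlying idea, not catch-cap's exact entropy formula):

```python
import numpy as np

def dispersion_score(embeddings: list[np.ndarray]) -> float:
    """higher value = responses disagree more = more likely hallucinating."""
    vecs = np.stack([e / np.linalg.norm(e) for e in embeddings])
    sims = vecs @ vecs.T                          # pairwise cosine similarities
    n = len(vecs)
    mean_sim = (sims.sum() - n) / (n * (n - 1))   # average, excluding self-similarity
    return 1.0 - mean_sim                         # low similarity -> high score
```

a score above the configured threshold (0.25 in the example below) would be treated as high entropy.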
configuration
```python
config = CatchCapConfig(
    semantic_entropy=SemanticEntropyConfig(
        n_responses=5,   # generate 5 responses
        threshold=0.25   # flag if entropy > 0.25
    )
)
```
tuning tip: use 3-5 responses for real-time apps, 7-10 for critical applications where accuracy matters most.
log probabilities
every token the model generates has a probability. low probabilities mean the model is uncertain—often a sign of hallucination.
catch-cap flags responses where either of the following holds (sketched in code after the list):
- many tokens have low probabilities (suspicious pattern)
- absolute count of uncertain tokens exceeds threshold
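the rule can be sketched as follows, with defaults taken from the configuration example below; the (token, logprob) pairs would come from your provider's logprobs output:

```python
def is_suspicious(token_logprobs: list[tuple[str, float]],
                  min_logprob: float = -4.5,
                  fraction_threshold: float = 0.2,
                  min_flagged_tokens: int = 5) -> bool:
    """flag a response when too many of its tokens look uncertain."""
    flagged = [tok for tok, lp in token_logprobs if lp < min_logprob]
    fraction = len(flagged) / max(len(token_logprobs), 1)
    # either a large share of tokens is uncertain, or many tokens in absolute terms
    return fraction > fraction_threshold or len(flagged) >= min_flagged_tokens
```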
configuration
```python
config = CatchCapConfig(
    logprobs=LogProbConfig(
        min_logprob=-4.5,        # flag tokens below this
        fraction_threshold=0.2,  # or if >20% of tokens are suspicious
        min_flagged_tokens=5     # or if >=5 tokens are flagged
    )
)
```
provider support: openai has full support. gemini and groq have limited log probability support.
web grounding
validate responses against real-world information by searching the web and comparing results.
how it works
- search the web for relevant information (tavily or searxng)
- synthesize search results into a coherent answer (sketched in code below)
- compare model response with web-grounded answer
- flag discrepancies
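as a rough sketch of the synthesis step, assuming the openai python client and search snippets already returned by your provider; catch-cap's internal prompts and wiring will differ:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def synthesize(query: str, snippets: list[str]) -> str:
    """combine raw search snippets into one coherent, web-grounded answer."""
    context = "\n\n".join(snippets)
    resp = client.chat.completions.create(
        model="gpt-4.1-nano",  # a cheap model is enough for synthesis
        messages=[
            {"role": "system", "content": "answer using only the provided sources."},
            {"role": "user", "content": f"sources:\n{context}\n\nquestion: {query}"},
        ],
    )
    return resp.choices[0].message.content
```

the grounded answer is then compared against the model's original response, and discrepancies are flagged.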
configuration
```python
config = CatchCapConfig(
    web_search=WebSearchConfig(
        provider="tavily",
        max_results=5,
        synthesizer_model=ModelConfig(
            provider="openai",
            name="gpt-4.1-nano"  # cheap model for synthesis
        )
    )
)
```
cost optimization: use cheaper models (gpt-4.1-nano, gemini-flash) for synthesis since it's a simple task.
confidence scoring
all detection signals are combined into a single 0-1 confidence score.
confidence levels
- very high (0.9+): extremely confident in verdict
- high (0.7-0.9): strong evidence
- medium (0.5-0.7): moderate evidence
- low (0.3-0.5): weak evidence
- very low (0-0.3): insufficient evidence
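catch-cap already returns the label in result.metadata['confidence_level']; if you need to derive it from a raw score yourself, the bands above translate directly:

```python
def confidence_level(score: float) -> str:
    # thresholds mirror the bands listed above
    if score >= 0.9:
        return "very high"
    if score >= 0.7:
        return "high"
    if score >= 0.5:
        return "medium"
    if score >= 0.3:
        return "low"
    return "very low"
```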
using confidence in production
```python
result = await detector.run(query)
confidence = result.metadata['confidence_score']

if result.confabulation_detected and confidence >= 0.7:
    # high confidence - use corrected answer
    return result.corrected_answer
elif result.confabulation_detected and confidence >= 0.4:
    # medium confidence - flag for human review
    flag_for_review(result)
else:
    # low confidence or no detection - use original
    return result.responses[0].text
```
configuration guide
minimal config
```python
config = CatchCapConfig(
    generator=ModelConfig(provider="openai", name="gpt-4.1-mini")
)
```
production config
```python
config = CatchCapConfig(
    generator=ModelConfig(
        provider="openai",
        name="gpt-4.1-mini",
        temperature=0.6
    ),
    semantic_entropy=SemanticEntropyConfig(
        n_responses=5,
        threshold=0.25
    ),
    logprobs=LogProbConfig(
        min_logprob=-4.5,
        fraction_threshold=0.15
    ),
    web_search=WebSearchConfig(
        provider="tavily",
        synthesizer_model=ModelConfig(
            provider="openai",
            name="gpt-4.1-nano"
        )
    ),
    judge=JudgeConfig(
        model=ModelConfig(provider="openai", name="gpt-4.1-nano")
    ),
    enable_correction=True,
    rate_limit_rpm=60
)
```
fast & cheap config
```python
config = CatchCapConfig(
    generator=ModelConfig(provider="gemini", name="gemini-2.0-flash"),
    semantic_entropy=SemanticEntropyConfig(n_responses=3),
    logprobs=LogProbConfig(enabled=False),
    web_search=WebSearchConfig(provider="none")
)
```
model providers
openai
best overall support: full log probabilities and batched embeddings.
recommended models: gpt-4.1-mini (balanced), gpt-4.1-nano (fast/cheap), gpt-4.1 (highest quality)
gemini
very fast and cost-effective. limited log probability support.
recommended models: gemini-2.0-flash, gemini-1.5-pro
groq
extremely fast inference. limited log probability support, no native embeddings.
recommended models: llama-3.3-70b-versatile
mixing providers: use different providers for different components (e.g., gemini for generation, openai for embeddings).
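for example, using only the config fields shown elsewhere in this guide (and assuming the config classes are imported as in the earlier examples), gemini can generate while openai handles synthesis and judging; treat this as a sketch, not a prescribed setup:

```python
config = CatchCapConfig(
    generator=ModelConfig(provider="gemini", name="gemini-2.0-flash"),
    web_search=WebSearchConfig(
        provider="tavily",
        synthesizer_model=ModelConfig(provider="openai", name="gpt-4.1-nano")
    ),
    judge=JudgeConfig(
        model=ModelConfig(provider="openai", name="gpt-4.1-nano")
    )
)
```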
examples
batch processing
```python
async def process_batch(queries):
    config = CatchCapConfig(
        generator=ModelConfig(provider="openai", name="gpt-4.1-mini"),
        rate_limit_rpm=30
    )
    detector = CatchCap(config)

    results = []
    for query in queries:
        result = await detector.run(query)
        results.append({
            'query': query,
            'hallucination': result.confabulation_detected,
            'confidence': result.metadata['confidence_score']
        })
    return results
```
error handling
```python
from catch_cap.exceptions import CatchCapError

try:
    result = await detector.run(query)

    # check for component failures
    if result.metadata.get('errors'):
        print(f"warnings: {result.metadata['errors']}")
except CatchCapError as e:
    print(f"detection failed: {e}")
```
custom judge prompts
```python
config = CatchCapConfig(
    judge=JudgeConfig(
        model=ModelConfig(provider="openai", name="gpt-4"),
        instructions="""compare responses for numerical accuracy.
        numbers must match exactly. even 1 digit difference = INCONSISTENT.
        return only CONSISTENT or INCONSISTENT."""
    )
)
```
api reference
catchcap class
```python
# constructor
detector = CatchCap(config: CatchCapConfig)

# detect hallucinations
result = await detector.run(query: str) -> CatchCapResult
```
result object
| attribute | type | description |
|---|---|---|
| confabulation_detected | bool | true if hallucination found |
| responses | list | all generated responses |
| corrected_answer | str | web-grounded correction |
| metadata | dict | confidence, timing, errors |
metadata fields
| field | description |
|---|---|
| confidence_score | 0-1 confidence in verdict |
| confidence_level | "very high", "high", "medium", "low", "very low" |
| detection_time_seconds | total pipeline execution time |
| detection_methods | list of methods used |
| errors | component failures (if any) |
key features
- automatic retries: 3 attempts with exponential backoff
- batched embeddings: 10x faster, 10x cheaper
- graceful degradation: continues if components fail
- rate limiting: prevent quota exhaustion
- structured logging: full pipeline observability
catch-cap | built by axon labs