chunk-it-pro
semantic double-pass chunking for rag. boundaries by meaning, not character count.
// overview
traditional chunking splits documents at fixed token intervals, breaking semantic coherence. chunk-it-pro uses embedding similarity to find natural boundaries, then a second pass merges related chunks that were separated by dissimilar content (like code blocks between explanations).
- semantic boundaries — chunks split where meaning changes, not at arbitrary positions
- double-pass merging — lookahead logic finds related content separated by dissimilar material
- multi-format — pdf, docx, txt, markdown
- visual analytics — generates plots of similarity patterns
// installation
pip install chunk-it-pro
requires at least one embedding provider: an openai api key, or local sentence transformers models (no key needed).
// quick start
from chunk_it_pro import chunk_document

chunks = chunk_document(
    file_path="paper.pdf",
    embedding_provider="openai",
    threshold_method="percentile"
)

for chunk in chunks:
    print(f"chunk {chunk.index}: {len(chunk.text)} chars")
    print(chunk.text[:200])
    print("---")
# using the pipeline class for more control
from chunk_it_pro import SemanticChunkingPipeline

pipeline = SemanticChunkingPipeline(
    embedding_provider="sentence_transformers",
    threshold_method="gradient",
    max_chunk_length=1000
)
chunks = pipeline.process("document.md")
// how it works
pass 1: semantic chunking
splits text into sentences, computes embedding similarity between adjacent sentences, and splits at points where similarity drops below the threshold. this creates many small, semantically coherent units.
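an illustrative sketch of pass 1 (not the library's internals): group sentences, and start a new chunk whenever the cosine similarity between adjacent sentence embeddings drops below the threshold. the toy 2-d "embeddings" here stand in for real model output.

```python
import math

def cosine(a, b):
    """cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def semantic_split(sentences, embeddings, threshold):
    """start a new chunk when adjacent similarity dips below threshold."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append(current)
            current = []
        current.append(sentences[i])
    chunks.append(current)
    return chunks

# first two sentences point one way, the third another
sents = ["cats purr.", "cats nap.", "gdp rose 3%."]
embs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]
print(semantic_split(sents, embs, threshold=0.8))
# → [['cats purr.', 'cats nap.'], ['gdp rose 3%.']]
```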
pass 2: lookahead merging
examines non-adjacent chunks for semantic similarity. if chunk A and chunk C are related but separated by chunk B (e.g. a code block between two paragraphs of explanation), they get merged. this keeps equations, code snippets, and their explanations together.
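the lookahead step can be sketched roughly as follows (a simplified sketch, assuming one intervening chunk and a fixed threshold; the real pass may be more general): if chunk i and chunk i+2 are similar enough, fold the dissimilar middle chunk in with them.

```python
import math

def cosine(a, b):
    """cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def lookahead_merge(chunks, embeddings, threshold=0.8):
    """merge chunk i with chunk i+2 when similar, keeping the middle chunk."""
    merged, i = [], 0
    while i < len(chunks):
        if i + 2 < len(chunks) and cosine(embeddings[i], embeddings[i + 2]) >= threshold:
            # A and C belong together; the dissimilar B (e.g. a code block)
            # stays between them instead of stranding a tiny chunk
            merged.append("\n".join(chunks[i:i + 3]))
            i += 3
        else:
            merged.append(chunks[i])
            i += 1
    return merged

texts = ["sorting explained", "def sort(xs): ...", "sorting, continued", "unrelated topic"]
vecs = [[1.0, 0.0], [0.0, 1.0], [0.95, 0.1], [0.2, 0.9]]
result = lookahead_merge(texts, vecs)
```

here the first and third chunks are near-parallel vectors, so all three merge into one chunk and the unrelated fourth stays separate.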
// configuration
threshold methods
- percentile — splits at similarity values below a percentile threshold
- gradient — splits where the rate of similarity change is steepest
- local maxima — splits at local maxima in the sentence-distance curve (i.e. local minima in similarity)
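the percentile method above can be sketched like this (a minimal sketch; the percentile default of 25 and the nearest-rank formula are assumptions, not the library's documented behavior): take a low percentile of the adjacent-sentence similarities as the cut-off, so only the weakest links become boundaries.

```python
import math

def percentile_threshold(similarities, pct=25):
    """nearest-rank percentile of the similarity values (pct=25 is a guess)."""
    ranked = sorted(similarities)
    k = max(0, math.ceil(pct / 100 * len(ranked)) - 1)
    return ranked[k]

sims = [0.91, 0.22, 0.85, 0.88, 0.31]
t = percentile_threshold(sims)
# gaps whose similarity falls below the cut-off become chunk boundaries
boundaries = [i for i, s in enumerate(sims) if s < t]
```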
embedding providers
- openai — text-embedding-3-small/large (requires api key)
- sentence_transformers — local models, default uses wasserstoff-ai's legal-embed
// api reference
chunk_document(file_path, embedding_provider, threshold_method)
convenience wrapper; returns a list of chunk objects with .index and .text.
SemanticChunkingPipeline
same parameters as chunk_document(), plus max_chunk_length.