chunk-it-pro

semantic double-pass chunking for rag. boundaries by meaning, not character count.


// overview

fixed-size chunking splits documents at fixed token or character intervals, which can cut a passage mid-idea and break semantic coherence. chunk-it-pro uses embedding similarity to find natural boundaries, then runs a second pass that merges related chunks separated by dissimilar content (like a code block between two explanations).

// installation

pip install chunk-it-pro

requires at least one embedding provider: openai (needs an api key) or local sentence transformers (runs locally, no api key).

// quick start

from chunk_it_pro import chunk_document

chunks = chunk_document(
    file_path="paper.pdf",
    embedding_provider="openai",
    threshold_method="percentile"
)

for chunk in chunks:
    print(f"chunk {chunk.index}: {len(chunk.text)} chars")
    print(chunk.text[:200])
    print("---")

# using the pipeline class for more control
from chunk_it_pro import SemanticChunkingPipeline

pipeline = SemanticChunkingPipeline(
    embedding_provider="sentence_transformers",
    threshold_method="gradient",
    max_chunk_length=1000
)

chunks = pipeline.process("document.md")

// how it works

pass 1: semantic chunking

splits text into sentences, computes embedding similarity between adjacent sentences, and splits at points where similarity drops below the threshold. this creates many small, semantically coherent units.
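the steps in pass 1 can be sketched in a few lines. this is a minimal illustration, not chunk-it-pro's actual implementation: `split_sentences`, `toy_embed`, `cosine`, and `semantic_split` are hypothetical names, and a bag-of-words count vector stands in for a real embedding model so the example runs without any provider.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # naive splitter (assumption): break after ., !, or ? followed by whitespace
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def toy_embed(sentence):
    # stand-in for a real embedding model: a bag-of-words count vector
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    # cosine similarity between two sparse count vectors
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_split(sentences, threshold):
    # start a new chunk wherever similarity to the previous sentence
    # drops below the threshold
    if not sentences:
        return []
    vectors = [toy_embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            chunks.append([sentences[i]])
        else:
            chunks[-1].append(sentences[i])
    return chunks


text = ("cats chase mice. cats chase birds. "
        "stocks fell today. stocks fell sharply.")
for chunk in semantic_split(split_sentences(text), threshold=0.3):
    print(chunk)
```

with a real embedding model the similarity values would differ, but the splitting loop stays the same: only the comparison between neighbouring sentences decides where a boundary goes.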

pass 2: lookahead merging

examines non-adjacent chunks for semantic similarity. if chunk A and chunk C are related but separated by chunk B (e.g. a code block between two paragraphs of explanation), they get merged. this keeps equations, code snippets, and their explanations together.
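a rough sketch of the lookahead idea follows. it is an assumption about how such a merge could work, not the library's code: `lookahead_merge` and its `lookahead` window are hypothetical, and the same toy bag-of-words vectors replace real embeddings. when a later chunk is similar enough to the current one, everything in between is folded into the merged chunk.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    # stand-in for a real embedding model: a bag-of-words count vector
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    # cosine similarity between two sparse count vectors
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lookahead_merge(chunks, threshold, lookahead=2):
    merged = []
    i = 0
    while i < len(chunks):
        end = i  # index of the last chunk folded into chunk i
        anchor = toy_embed(chunks[i])
        # skip the direct neighbour (pass 1 already judged it dissimilar)
        # and compare against the next `lookahead` chunks after it
        for j in range(i + 2, min(i + 2 + lookahead, len(chunks))):
            if cosine(anchor, toy_embed(chunks[j])) >= threshold:
                end = j
        # merge chunk i with everything up to the furthest related chunk
        merged.append(" ".join(chunks[i:end + 1]))
        i = end + 1
    return merged


chunks = [
    "the loss function measures prediction error.",
    "x = y + 1",
    "minimizing the loss function reduces prediction error.",
    "bananas are yellow.",
]
for chunk in lookahead_merge(chunks, threshold=0.5):
    print(chunk)
```

here the first and third chunks share most of their vocabulary, so the snippet between them is carried along into one merged chunk, while the unrelated last chunk stays separate.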

// configuration

threshold methods

"percentile", "gradient", or "local_maxima", selected with threshold_method.
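as an illustration of what a percentile threshold could look like, the sketch below picks the cutoff as a percentile of the adjacent-sentence similarities and marks every position below it as a breakpoint. this is an assumption about the general technique, not a description of chunk-it-pro's internals: `percentile`, `find_breakpoints`, and the `q` parameter are hypothetical names, and the "gradient" and "local_maxima" methods are not covered here.

```python
import math


def percentile(values, q):
    # linearly interpolated percentile, q in [0, 100]
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    low, high = math.floor(pos), math.ceil(pos)
    return ordered[low] + (ordered[high] - ordered[low]) * (pos - low)


def find_breakpoints(similarities, q=25):
    # index i means "split between sentence i and sentence i + 1"
    threshold = percentile(similarities, q)
    return [i for i, s in enumerate(similarities) if s < threshold]


# similarities between adjacent sentences (made-up values for illustration)
sims = [0.9, 0.8, 0.2, 0.85, 0.1]
print(find_breakpoints(sims, q=40))  # [2, 4]
```

a percentile cutoff adapts to each document: a text whose sentences are all closely related still gets split at its relatively weakest links, instead of relying on one fixed similarity value.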

embedding providers

"openai" or "sentence_transformers", selected with embedding_provider.

// api reference

chunk_document()

file_path str — path to the document
embedding_provider str — "openai" or "sentence_transformers"
threshold_method str — "percentile", "gradient", or "local_maxima"
max_chunk_length int — maximum characters per chunk

SemanticChunkingPipeline

same parameters as above, plus:

plot_similarity bool — generate similarity visualization