chunk-it-pro
semantic double-pass chunking for rag. boundaries by meaning, not character count.
// overview
traditional chunking splits documents at fixed token intervals, breaking semantic coherence. chunk-it-pro uses embedding similarity to find natural boundaries, then a second pass merges related chunks that were separated by dissimilar content (like code blocks between explanations).
- semantic boundaries — chunks split where meaning changes, not at arbitrary positions
- double-pass merging — lookahead logic finds related content separated by dissimilar material
- multi-format — pdf, docx, txt, markdown
- visual analytics — generates plots of similarity patterns
// installation
pip install chunk-it-pro
requires at least one embedding provider: an openai api key, or local sentence transformers models (no key needed).
// quick start
from chunk_it_pro import chunk_document

chunks = chunk_document(
    file_path="paper.pdf",
    embedding_provider="openai",
    threshold_method="percentile"
)

for chunk in chunks:
    print(f"chunk {chunk.index}: {len(chunk.text)} chars")
    print(chunk.text[:200])
    print("---")
# using the pipeline class for more control
from chunk_it_pro import SemanticChunkingPipeline

pipeline = SemanticChunkingPipeline(
    embedding_provider="sentence_transformers",
    threshold_method="gradient",
    max_chunk_length=1000
)
chunks = pipeline.process("document.md")
// how it works
pass 1: semantic chunking
splits text into sentences, computes embedding similarity between adjacent sentences, and splits at points where similarity drops below the threshold. this creates many small, semantically coherent units.
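an illustrative sketch of pass 1 (not the library's internals): group sentences, and start a new chunk whenever the cosine similarity between adjacent sentence embeddings drops below the threshold. the toy 2-d "embeddings" here stand in for real model output.

```python
import math

def cosine(a, b):
    """cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def semantic_split(sentences, embeddings, threshold):
    """start a new chunk when adjacent similarity dips below threshold."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append(current)
            current = []
        current.append(sentences[i])
    chunks.append(current)
    return chunks

# first two sentences point one way, the third another
sents = ["cats purr.", "cats nap.", "gdp rose 3%."]
embs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]
print(semantic_split(sents, embs, threshold=0.8))
# → [['cats purr.', 'cats nap.'], ['gdp rose 3%.']]
```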
pass 2: lookahead merging
examines non-adjacent chunks for semantic similarity. if chunk A and chunk C are related but separated by chunk B (e.g. a code block between two paragraphs of explanation), they get merged. this keeps equations, code snippets, and their explanations together.
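the lookahead step can be sketched roughly as follows (a simplified sketch, assuming one intervening chunk and a fixed threshold; the real pass may be more general): if chunk i and chunk i+2 are similar enough, fold the dissimilar middle chunk in with them.

```python
import math

def cosine(a, b):
    """cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def lookahead_merge(chunks, embeddings, threshold=0.8):
    """merge chunk i with chunk i+2 when similar, keeping the middle chunk."""
    merged, i = [], 0
    while i < len(chunks):
        if i + 2 < len(chunks) and cosine(embeddings[i], embeddings[i + 2]) >= threshold:
            # A and C belong together; the dissimilar B (e.g. a code block)
            # stays between them instead of stranding a tiny chunk
            merged.append("\n".join(chunks[i:i + 3]))
            i += 3
        else:
            merged.append(chunks[i])
            i += 1
    return merged

texts = ["sorting explained", "def sort(xs): ...", "sorting, continued", "unrelated topic"]
vecs = [[1.0, 0.0], [0.0, 1.0], [0.95, 0.1], [0.2, 0.9]]
result = lookahead_merge(texts, vecs)
```

here the first and third chunks are near-parallel vectors, so all three merge into one chunk and the unrelated fourth stays separate.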
// configuration
threshold methods
- percentile — splits at similarity values below a percentile threshold
- gradient — splits where the rate of similarity change is steepest
- local maxima — splits at local maxima in the sentence-distance curve (i.e. local minima in similarity)
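the percentile method above can be sketched like this (a minimal sketch; the percentile default of 25 and the nearest-rank formula are assumptions, not the library's documented behavior): take a low percentile of the adjacent-sentence similarities as the cut-off, so only the weakest links become boundaries.

```python
import math

def percentile_threshold(similarities, pct=25):
    """nearest-rank percentile of the similarity values (pct=25 is a guess)."""
    ranked = sorted(similarities)
    k = max(0, math.ceil(pct / 100 * len(ranked)) - 1)
    return ranked[k]

sims = [0.91, 0.22, 0.85, 0.88, 0.31]
t = percentile_threshold(sims)
# gaps whose similarity falls below the cut-off become chunk boundaries
boundaries = [i for i, s in enumerate(sims) if s < t]
```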
embedding providers
- openai — text-embedding-3-small/large (requires api key)
- sentence_transformers — local models, default uses wasserstoff-ai's legal-embed
// api reference
chunk_document(file_path, embedding_provider, threshold_method)
convenience wrapper; returns a list of chunk objects with .index and .text.
SemanticChunkingPipeline
same parameters as chunk_document(), plus max_chunk_length.