Updated 6 months ago

I have markdown files to be vectorized

At a glance

The community member has markdown files that they want to vectorize, but the current parser, MarkdownReader, is splitting the markdown based on headings and code blocks, resulting in small chunks that lack context. The community member wants to change the strategy of dividing the document chunks. In the comments, another community member suggests using a normal text splitter or the community member's own parsing strategy. Another community member provides a reference, suggesting the use of SentenceSplitter from llama_index.core.node_parser with a chunk size of 1024 and a chunk overlap of 128.

I have markdown files to be vectorized. The current parser, MarkdownReader, splits the markdown based on headings and code blocks (e.g. `#`). I want to change the strategy for dividing the document into chunks, because in my use case the extracted chunks are too small to carry enough context.
3 comments
You can always use a normal text splitter, or your own parsing strategy
Can you provide a reference?
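from llama_index.core.node_parser import SentenceSplitter

# chunk_size and chunk_overlap are measured in tokens
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=128)
# `documents` is the list of Document objects you have already loaded
nodes = splitter(documents)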
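If you'd rather go the "own parsing strategy" route mentioned above, here is a rough sketch in plain Python (no llama_index calls). It still splits on headings, but then packs neighbouring sections back together until a size limit is reached, so each chunk keeps more context. The function names `split_markdown_sections` and `merge_sections` and the `max_chars` parameter are made up for this example, and the limit is counted in characters, not tokens.

```python
import re

FENCE = "`" * 3  # marker that opens/closes a fenced code block


def split_markdown_sections(text):
    """Split markdown at ATX headings, ignoring '#' lines inside code fences."""
    sections, current, in_fence = [], [], False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        is_heading = not in_fence and re.match(r"#{1,6}\s", line)
        if is_heading and current:
            # A new heading starts: close the section collected so far.
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]


def merge_sections(sections, max_chars=2000):
    """Greedily pack consecutive sections into chunks of at most max_chars.

    A single section longer than max_chars is kept whole (hypothetical policy).
    """
    chunks, buf = [], ""
    for sec in sections:
        candidate = f"{buf}\n\n{sec}" if buf else sec
        if buf and len(candidate) > max_chars:
            chunks.append(buf)
            buf = sec
        else:
            buf = candidate
    if buf:
        chunks.append(buf)
    return chunks
```

Each resulting chunk could then be wrapped as `Document(text=chunk)` (from `llama_index.core`) before indexing. Whether this beats SentenceSplitter depends on how your files are structured, so treat it as a starting point.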