I have markdown files to be vectorized

At a glance

The community member has markdown files that they want to vectorize, but the current parser, MarkdownReader, is splitting the markdown based on headings and code blocks, resulting in small chunks that lack context. The community member wants to change the strategy of dividing the document chunks. In the comments, another community member suggests using a normal text splitter or the community member's own parsing strategy. Another community member provides a reference, suggesting the use of SentenceSplitter from llama_index.core.node_parser with a chunk size of 1024 and a chunk overlap of 128.

payload

I have markdown files to be vectorized. The current parser, MarkdownReader, splits the markdown based on headings and code blocks (e.g. `#`). I want to change the strategy for dividing the documents into chunks, because in my use case the extracted documents lack context due to the small chunks.

3 comments

Logan M

You can always use a normal text splitter, or your own parsing strategy

payload

Can you provide a reference?

Logan M

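from llama_index.core.node_parser import SentenceSplitter

# Split on sentence boundaries into chunks of up to 1024 tokens, with 128 tokens of overlap
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=128)
# `documents` is the list of Document objects loaded earlier
nodes = splitter(documents)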
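
For the "own parsing strategy" route mentioned above, here is a minimal sketch of one possible approach: split the markdown on headings, then greedily merge neighbouring sections back together until a size limit is reached, so short sections keep their surrounding context. The helper names (`split_sections`, `merge_sections`, `chunk_markdown`) and the character-based `max_chars` limit are assumptions made for illustration. They are not part of llama_index, and this is not how MarkdownReader or SentenceSplitter work internally.

```python
import re

# Hypothetical helpers for illustration only; not part of llama_index.
HEADING = re.compile(r"^#{1,6}\s")      # ATX headings such as "# Title" or "## Sub"
FENCE_MARKERS = ("`" * 3, "~" * 3)      # opening/closing markers of fenced code blocks


def split_sections(markdown_text):
    """Split on headings, ignoring '#' lines that sit inside fenced code blocks."""
    sections, current, in_fence = [], [], False
    for line in markdown_text.splitlines():
        if line.strip().startswith(FENCE_MARKERS):
            in_fence = not in_fence
        # A heading outside a code fence starts a new section.
        if HEADING.match(line) and not in_fence and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]


def merge_sections(sections, max_chars=1024):
    """Greedily pack consecutive sections into chunks of at most max_chars.

    A single section longer than max_chars is kept whole as its own chunk.
    """
    chunks, buffer = [], ""
    for section in sections:
        candidate = f"{buffer}\n\n{section}" if buffer else section
        if buffer and len(candidate) > max_chars:
            chunks.append(buffer)
            buffer = section
        else:
            buffer = candidate
    if buffer:
        chunks.append(buffer)
    return chunks


def chunk_markdown(markdown_text, max_chars=1024):
    """Heading-aware chunking that merges small sections for more context."""
    return merge_sections(split_sections(markdown_text), max_chars)
```

Each returned chunk is a plain string, so it could then be wrapped in whatever document or node type the indexing pipeline expects. The limit here counts characters rather than tokens, which is a simplification. A token-based limit would need a tokenizer.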
