Updated 6 months ago

Hello all,
I’m looking to expand past XML to PDFs, and the one big issue is the one issue everyone has—tables. Is there a recommended OSS way to read them? Specifically something you’d recommend be used with LlamaIndex?
15 comments
probably unstructured will be the best OSS solution
but overall tables are hard
marker is another OSS library that does ok-ish
Is OCR an acceptable solution?
OCR is really only half of the solution
Sure you can get the text -- but then you need to make sure it's formatted nicely
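To illustrate the "formatted nicely" half: a minimal sketch, assuming the OCR or PDF parsing step has already produced the table as a list of rows (lists of cell strings), with the first row as the header. The helper name `rows_to_markdown` is hypothetical and not part of unstructured, marker, or LlamaIndex; it only shows one way to turn extracted cells into a Markdown table before indexing.

```python
def rows_to_markdown(rows):
    """Render rows (first row is the header) as a Markdown table string.

    Hypothetical helper: assumes cells were already extracted by some
    OCR/PDF step and only handles the formatting half of the problem.
    """
    if not rows:
        return ""

    # Extracted tables are often ragged, so pad every row to the widest one.
    width = max(len(row) for row in rows)
    padded = [
        [str(cell).strip() for cell in row] + [""] * (width - len(row))
        for row in rows
    ]

    def clean(cell):
        # Escape pipes and flatten newlines so a cell cannot break the table.
        return cell.replace("|", "\\|").replace("\n", " ")

    def render(row):
        return "| " + " | ".join(clean(cell) for cell in row) + " |"

    header, *body = padded
    separator = "| " + " | ".join("---" for _ in range(width)) + " |"
    return "\n".join([render(header), separator] + [render(row) for row in body])


if __name__ == "__main__":
    print(rows_to_markdown([["Item", "Amount"], ["Base pay", "1000"], ["Bonus"]]))
```

The resulting string can be used as the text of whatever document object you index, so the row and column structure survives as plain text.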
Oh of course yeah
And then there’s the issue of hyperlinks
God I hate PDFs
it really is the worst file format possible lol
and the most used
@isaackogan mind sharing the pdf file you're trying to read?
no sorry I’m testing with my employee pay statement 💀