Not have this program do any OCR #1
Reference: varia/go-sh-manymanuals#1
Just use https://github.com/ledongthuc/pdf and read the contents of the already OCRed PDF.
I use a program called ocrmypdf and it works pretty well. Having this program also do OCR is maybe asking for too much, since that functionality is handled well by other programs.
Also: how do we know whether a PDF has text in it already or is just a bunch of scans?
We might be making a bad OCR file of a good PDF.
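One rough way to answer the "text or scans?" question is to run the PDF through `pdftotext` (from poppler-utils) and see whether anything meaningful comes out. The sketch below is only an illustration under assumptions: the helper names `hasText` and `extractText` and the threshold of 20 characters are made up for this example, and a scanned PDF with a poor existing OCR layer would still pass the check.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"unicode"
)

// hasText reports whether s contains at least minChars letters or digits.
// Hypothetical helper: whitespace and form feeds alone do not count as text.
func hasText(s string, minChars int) bool {
	n := 0
	for _, r := range s {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			n++
		}
	}
	return n >= minChars
}

// extractText runs `pdftotext <path> -`, which writes the extracted text
// to stdout. Assumes pdftotext is installed and on PATH.
func extractText(path string) (string, error) {
	out, err := exec.Command("pdftotext", path, "-").Output()
	if err != nil {
		return "", fmt.Errorf("pdftotext %s: %w", path, err)
	}
	return string(out), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: hastext FILE.pdf")
		return
	}
	text, err := extractText(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	// The threshold of 20 characters is an arbitrary assumption.
	if hasText(text, 20) {
		fmt.Println("has a text layer, probably no OCR needed")
	} else {
		fmt.Println("no usable text, probably just scans")
	}
}
```

A PDF that fails this check would be a candidate for running through ocrmypdf by hand, rather than having this program OCR everything blindly.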
Title changed from "[Idea] Not have this program do any OCR" to "Not have this program do any OCR".
Hmmm yeh 🤔 Maybe we can call out to an external file also? Fine to leave it out altogether. It does seem a little messy... reading contents for now seems fine!
Assume we'd have some UI to `xdg-open foo.pdf` to have a look & then choose an OCR tool to OCR it. So it is not an automated thing, just semi-automated.
As it turns out, https://github.com/ledongthuc/pdf cannot really parse that many of the PDF files from https://vvvvvvaria.org/~crunk/datasheets.zip, so it's not that good. Instead, following `pdf2text` (https://poppler.freedesktop.org), I wired up https://pkg.go.dev/github.com/kyoushuu/go-poppler and it can parse a lot more, but not all!
So, OCR may be required to actually get the contents of those exceptional / edge case PDFs which have weird content. When the program boots up, we could run the parsing and then look for ones which failed. Then we could OCR those in the background and stick them in a cache.
For now, let's test what `go-poppler` can parse and maybe it's good enough.
Aha, it's `apt install poppler-utils` that gives us `pdftotext`!
`go-poppler` is fine for now it seems.