I use a program called ocrmypdf and it works pretty well. The idea to have this program also do OCR is maybe asking for too much functionality handled well by other programs.
also: how do we know whether a PDF has text in it already or is just a bunch of scans?
We might end up making a bad OCR file of a good PDF.
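One rough way to answer the "text or scans?" question automatically is to extract whatever text the PDF already has and count the real characters. A minimal sketch, assuming poppler's `pdftotext` CLI is on the PATH; the names `extractText` and `looksScanned` and the threshold are made up for illustration, not something this program has:

```go
package main

import (
	"os/exec"
	"unicode"
)

// extractText shells out to pdftotext; passing "-" as the output file
// makes it write the extracted text to stdout.
func extractText(path string) (string, error) {
	out, err := exec.Command("pdftotext", path, "-").Output()
	return string(out), err
}

// looksScanned reports whether text contains fewer than minChars letters
// or digits. Scanned PDFs tend to yield little more than page breaks.
// minChars is an arbitrary, hypothetical threshold.
func looksScanned(text string, minChars int) bool {
	n := 0
	for _, r := range text {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			n++
		}
	}
	return n < minChars
}
```

Something like `looksScanned(text, 20)` on the output of `extractText("foo.pdf")` would flag the likely scans. It is only a heuristic: a PDF with a bad existing text layer would still pass.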
Just use https://github.com/ledongthuc/pdf and read the contents of the already OCRed PDF.
> maybe asking for too much functionality handled well by other programs.

Hmmm yeh 🤔 Maybe we can call out to an external tool instead? Fine to leave it out altogether. It does seem a little messy... reading contents for now seems fine!

> also: how do we know whether a PDF has text in it already or is just a bunch of scans?

Assume we'd have some UI to `xdg-open foo.pdf` to have a look & then choose an OCR tool to OCR it. So it is not an automated thing, just semi-automated.
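The semi-automated flow could look roughly like this. A sketch only: `Runner`, `ExecRunner` and `ReviewThenOCR` are hypothetical names, the `approve` callback stands in for whatever UI asks the question, and using ocrmypdf with `--skip-text` (which leaves pages that already contain text alone) is my assumption about which OCR tool and flag we'd want:

```go
package main

import "os/exec"

// Runner executes an external command. Hypothetical seam so the flow can be
// exercised without xdg-open or ocrmypdf installed.
type Runner func(name string, args ...string) error

// ExecRunner runs the command for real and waits for it to exit.
func ExecRunner(name string, args ...string) error {
	return exec.Command(name, args...).Run()
}

// ReviewThenOCR opens in for inspection, asks approve for a decision, and
// only then writes an OCRed copy to out.
func ReviewThenOCR(in, out string, approve func() bool, run Runner) error {
	if err := run("xdg-open", in); err != nil {
		return err
	}
	if !approve() {
		return nil // the PDF looked fine, leave it alone
	}
	// --skip-text asks ocrmypdf not to OCR pages that already have text.
	return run("ocrmypdf", "--skip-text", in, out)
}
```

In the program this would be called as `ReviewThenOCR("foo.pdf", "foo.ocr.pdf", askUser, ExecRunner)`, where `askUser` is whatever prompt the UI provides. Nothing gets OCRed unless a person says so.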
As it turns out, https://github.com/ledongthuc/pdf can't parse many of the PDF files from https://vvvvvvaria.org/~crunk/datasheets.zip, so it's not that good. Instead, following `pdftotext` (https://poppler.freedesktop.org), I wired up https://pkg.go.dev/github.com/kyoushuu/go-poppler and it can parse a lot more, but not all!
So, OCR may be required to actually get the contents of those exceptional / edge-case PDFs which have weird content. When the program boots up, we could run the parsing and then look for the ones that failed. Then we could OCR those in the background and stick the results in a cache.
For now, let's test what `go-poppler` can parse; maybe it's good enough.
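To make the boot-time idea a bit more concrete, here is a rough sketch of "parse everything, collect the failures, OCR those in the background, cache the results". All names here (`Extractor`, `Cache`, `Index`, `OCRInBackground`) are hypothetical; the primary extractor would wrap the PDF parser and the fallback would wrap an OCR tool, neither of which is shown. Treating whitespace-only output as a failure is also my assumption:

```go
package main

import (
	"errors"
	"strings"
	"sync"
)

// ErrNoText marks a PDF that parsed without error but yielded no usable text.
var ErrNoText = errors.New("no text extracted")

// Extractor turns a PDF path into text (parser or OCR, caller's choice).
type Extractor func(path string) (string, error)

// Cache stores extracted text by path and is safe for concurrent use.
type Cache struct {
	mu   sync.Mutex
	text map[string]string
}

func NewCache() *Cache { return &Cache{text: make(map[string]string)} }

func (c *Cache) Put(path, text string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.text[path] = text
}

func (c *Cache) Get(path string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	t, ok := c.text[path]
	return t, ok
}

// extract runs e and treats whitespace-only output as a failure.
func extract(e Extractor, path string) (string, error) {
	text, err := e(path)
	if err != nil {
		return "", err
	}
	if strings.TrimSpace(text) == "" {
		return "", ErrNoText
	}
	return text, nil
}

// Index parses every path, caches the successes and returns the failures.
func Index(paths []string, primary Extractor, cache *Cache) []string {
	var failed []string
	for _, p := range paths {
		text, err := extract(primary, p)
		if err != nil {
			failed = append(failed, p)
			continue
		}
		cache.Put(p, text)
	}
	return failed
}

// OCRInBackground retries the failed paths with fallback in a goroutine.
// The returned channel is closed once every path has been attempted.
func OCRInBackground(failed []string, fallback Extractor, cache *Cache) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		for _, p := range failed {
			if text, err := extract(fallback, p); err == nil {
				cache.Put(p, text)
			}
		}
	}()
	return done
}
```

The cache here is in-memory only; persisting it to disk so OCR does not rerun on every boot would be a separate step.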
Title changed from "[Idea] Not have this program do any OCR" to "Not have this program do any OCR" 2 years ago.
Aha, it's `apt install poppler-utils` that gives us `pdftotext`! `go-poppler` is fine for now it seems.