Not have this program do any OCR #1

crunk · 2022-04-26T09:51:23+02:00

crunk commented

2022-04-26 09:51:23 +02:00

Just use https://github.com/ledongthuc/pdf and read the contents of the already OCRed PDF.

I use a program called ocrmypdf and it works pretty well. Having this program also do OCR is maybe asking for too much functionality that is already handled well by other programs.

Also: how do we know whether a PDF already has text in it or is just a bunch of scans?
We might end up making a bad OCR file of a good PDF.
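One rough way to approach the "text or scans?" question is to run whatever extractor is in use and check whether it returned a meaningful amount of text. The sketch below is only a heuristic under stated assumptions: `looksLikeText` and the `minRunes` threshold are hypothetical names and values, not part of this project or of any PDF library.

```go
package main

import (
	"fmt"
	"unicode"
)

// looksLikeText reports whether text extracted from a PDF plausibly comes
// from a real text layer rather than from image-only scans. It counts
// letters and digits and compares the count against minRunes, a
// hypothetical threshold that would need tuning on real files.
func looksLikeText(extracted string, minRunes int) bool {
	n := 0
	for _, r := range extracted {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			n++
		}
	}
	return n >= minRunes
}

func main() {
	// Typical datasheet text: plenty of letters and digits.
	fmt.Println(looksLikeText("LM317 adjustable regulator, 1.5 A", 20)) // true
	// What an extractor tends to return for a scanned page: whitespace only.
	fmt.Println(looksLikeText(" \n\f ", 20)) // false
}
```

A PDF that fails this check is only a candidate for OCR; a human look at the file is still the safer final call.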


crunk added the discussion label 2022-04-26 09:53:12 +02:00

crunk changed title from ~~[Idea] Not have this program do any OCR~~ to Not have this program do any OCR 2022-04-26 09:53:21 +02:00

decentral1se commented

2022-04-26 19:27:35 +02:00

> maybe asking for too much functionality handled well by other programs.

Hmmm yeh 🤔 Maybe we can call out to an external file also? Fine to leave it out altogether. It does seem a little messy... reading contents for now seems fine!

> also: how do we know whether a PDF has text in it already or is just a bunch of scans?

I assume we'd have some UI to `xdg-open foo.pdf` to have a look and then choose an OCR tool to OCR it. So it is not an automated thing, just semi-automated.


decentral1se commented

2023-05-10 15:07:25 +02:00

As it turns out, https://github.com/ledongthuc/pdf cannot parse many of the PDF files in https://vvvvvvaria.org/~crunk/datasheets.zip, so it's not that good. Instead, following `pdftotext` (https://poppler.freedesktop.org), I wired up https://pkg.go.dev/github.com/kyoushuu/go-poppler and it can parse a lot more, but not all!

So OCR may be required to actually get the contents of those edge-case PDFs which have weird content. When the program boots up, we could run the parsing and then look for the files which failed. Then we could OCR those in the background and stick the results in a cache.

For now, let's test what `go-poppler` can parse; maybe it's good enough.


decentral1se commented

2023-05-10 15:18:00 +02:00

Aha, it's `apt install poppler-utils` that gives us `pdftotext`!
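For comparing against the Go bindings, shelling out to `pdftotext` from Go is also possible. The sketch below assumes `pdftotext` is on `PATH`; `pdftotextArgs` and `extractText` are hypothetical helper names, and `-layout` is just one option that might suit datasheets.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// pdftotextArgs builds the argument list for pdftotext. The trailing "-"
// asks pdftotext to write the extracted text to stdout instead of a file.
func pdftotextArgs(pdfPath string) []string {
	return []string{"-layout", pdfPath, "-"}
}

// extractText runs pdftotext (from poppler-utils) and returns its stdout.
func extractText(pdfPath string) (string, error) {
	out, err := exec.Command("pdftotext", pdftotextArgs(pdfPath)...).Output()
	if err != nil {
		return "", fmt.Errorf("pdftotext %s: %w", pdfPath, err)
	}
	return string(out), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: extract FILE.pdf")
		return
	}
	text, err := extractText(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Print(text)
}
```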


decentral1se commented

2023-05-15 21:17:17 +02:00

`go-poppler` seems fine for now.


decentral1se closed this issue

2023-05-15 21:17:17 +02:00
