Not have this program do any OCR #1

Closed
opened 2022-04-26 09:51:23 +02:00 by crunk · 4 comments
Owner

Just use https://github.com/ledongthuc/pdf and read the contents of the already OCRed PDF.

I use a program called ocrmypdf and it works pretty well. Having this program do OCR as well is maybe asking for too much functionality that other programs already handle well.

Also: how do we know whether a PDF already has text in it or is just a bunch of scans?
We might end up making a bad OCR file of a good PDF.
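
One rough way to answer the "text or scans?" question could be to try extracting text from the first few pages and see whether anything meaningful comes back. The sketch below assumes poppler's `pdftotext` is on the `PATH`; the names `looksLikeText` and `hasTextLayer` and the threshold of 50 characters are made up for illustration and would need tuning against real files.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
	"unicode"
)

// minTextRunes is an assumed threshold, not a measured one: fewer letters or
// digits than this and we treat the PDF as "probably just scans".
const minTextRunes = 50

// looksLikeText reports whether the extracted output contains at least
// minTextRunes letters or digits. Whitespace and form feeds do not count.
func looksLikeText(extracted string) bool {
	n := 0
	for _, r := range extracted {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			n++
			if n >= minTextRunes {
				return true
			}
		}
	}
	return false
}

// hasTextLayer runs `pdftotext -l 3 <path> -` (first three pages, output to
// stdout) and applies the heuristic above to whatever comes back.
func hasTextLayer(path string) (bool, error) {
	var out bytes.Buffer
	cmd := exec.Command("pdftotext", "-l", "3", path, "-")
	cmd.Stdout = &out
	if err := cmd.Run(); err != nil {
		return false, fmt.Errorf("pdftotext %s: %w", path, err)
	}
	return looksLikeText(out.String()), nil
}
```

If `hasTextLayer("foo.pdf")` comes back false, that would be the cue to hand the file to ocrmypdf rather than OCR everything blindly. A heuristic like this can still be fooled, for example by a scan with a junk text layer.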

crunk added the discussion label 2022-04-26 09:53:12 +02:00
crunk changed title from [Idea] Not have this program do any OCR to Not have this program do any OCR 2022-04-26 09:53:21 +02:00
Owner

> maybe asking for too much functionality handled well by other programs.

Hmmm yeh 🤔 Maybe we can call out to an external program also? Fine to leave it out altogether. It does seem a little messy... reading contents for now seems fine!

> also: how do we know whether a PDF has text in it already or is just a bunch of scans?

Assume we'd have some UI to `xdg-open foo.pdf` to have a look and then choose an OCR tool to OCR it. So it's not an automated thing, just semi-automated.

Owner

As it turns out, https://github.com/ledongthuc/pdf can't parse many of the PDF files in https://vvvvvvaria.org/~crunk/datasheets.zip, so it's not that good. Instead, following `pdftotext` (https://poppler.freedesktop.org), I wired up https://pkg.go.dev/github.com/kyoushuu/go-poppler and it can parse a lot more, but not all!

So, OCR may be required to actually get the contents of those edge-case PDFs which have weird content. When the program boots up, we could run the parsing and then look for the ones which failed. Then we could OCR those in the background and stick the results in a cache.

For now, let's test what `go-poppler` can parse and maybe it's good enough.

Owner

Aha, it's `apt install poppler-utils` that gives us `pdftotext`!

Owner

`go-poppler` is fine for now, it seems.

Reference: varia/go-sh-manymanuals#1