Pedro Sá Couto
5 years ago
2 changed files with 103 additions and 0 deletions
Binary file not shown.
@ -0,0 +1,103 @@ |
<h1 align="center">DIY Book Scanner Workflow</h1>

## Getting started

This set of scripts was written for the Text Laundrette workshop. It is a workflow that turns the pictures from the DIY Book Scanner into a final OCRed PDF.

To skip any of the scripts, comment out the corresponding line in the shell script <em>workshop_stream.sh</em>.

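
As a rough illustration of what skipping a step amounts to, the sketch below lists the scripts in the order this README describes them and leaves out the ones you name. `STEPS`, `plan`, `run` and the `skip` parameter are hypothetical names, and the order is assumed to match <em>workshop_stream.sh</em>, whose contents are not reproduced here.

```python
import os
import subprocess

# Steps in the order this README describes them
# (assumed, not copied from workshop_stream.sh).
STEPS = [
    ["./merge_scans.sh"],
    ["python3", "burstpdf.py"],
    ["python3", "rotation.py"],
    ["python3", "bounding_box.py"],
    ["python3", "mirror_crop.py"],
    ["python3", "tesseract_ocr.py"],
    ["./merge_files.sh"],
]


def plan(skip=()):
    """Return the commands to run, leaving out scripts whose file name is in `skip`."""
    return [cmd for cmd in STEPS if os.path.basename(cmd[-1]) not in skip]


def run(skip=()):
    """Run the remaining steps in order, stopping at the first one that fails."""
    for cmd in plan(skip):
        subprocess.run(cmd, check=True)
```

For example, `run(skip={"mirror_crop.py"})` would do the same as commenting out that one line in the shell script.
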
## Dependencies

### Brew (macOS) or apt-get (Linux)

<p>On macOS, you’ll need the command-line tools for Xcode installed.</p>

```bash
xcode-select --install
```

<p>Afterwards, install Homebrew.</p>

```bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```

<p>Run the following command once you’re done to ensure Homebrew is installed and working properly:</p>

```bash
brew doctor
```

<p>Then install the packages. pdfunite ships with poppler, and pytesseract needs the Tesseract binary. On Linux:</p>

```bash
sudo apt-get install python3 python3-pip imagemagick poppler-utils tesseract-ocr
```

<p>On macOS:</p>

```bash
brew install python3 imagemagick poppler tesseract
```

### PIP3

`sudo pip3 install pdf2image Pillow opencv-python pytesseract`

<p><code>time</code> and <code>logging</code> are part of the Python standard library and do not need to be installed.</p>

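
Before running the workflow, it can help to check that everything is actually installed. This is a minimal sketch, not part of the repository: `TOOLS`, `MODULES`, `missing_tools` and `missing_modules` are hypothetical names, and the `tesseract` entry is an assumption based on pytesseract being a wrapper around that binary.

```python
import importlib.util
import shutil

# Command-line tools the later steps are assumed to call.
TOOLS = ["python3", "pdfunite", "tesseract"]
# Import names of the pip packages listed above (Pillow imports as PIL,
# opencv-python imports as cv2).
MODULES = ["pdf2image", "PIL", "cv2", "pytesseract"]


def missing_tools(tools=TOOLS):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def missing_modules(modules=MODULES):
    """Return the Python modules that cannot be imported."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Missing tools:", missing_tools() or "none")
    print("Missing modules:", missing_modules() or "none")
```
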
## How to use

<p>Add your pictures from the book scanner to the folder "scans".</p>

<p>Make the shell scripts executable.</p>

```bash
chmod +x merge_scans.sh workshop_stream.sh merge_files.sh
```

<p>Run ./workshop_stream.sh</p>

<p>Wait :)</p>

## Additional information

### Create 5 directories

```bash
mkdir split
mkdir rotated
mkdir ocred
mkdir bounding_box
mkdir cropped
```

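
The same five working directories could also be created from Python, which has the small advantage of not failing when a directory already exists. A sketch under that assumption; `WORK_DIRS` and `make_work_dirs` are hypothetical names.

```python
from pathlib import Path

# The five working directories named in this README.
WORK_DIRS = ["split", "rotated", "ocred", "bounding_box", "cropped"]


def make_work_dirs(base="."):
    """Create the working directories under `base` and return their paths."""
    created = []
    for name in WORK_DIRS:
        path = Path(base) / name
        # exist_ok=True makes the call safe to repeat between runs.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```
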
### Merge the files in the directory <em>scans</em>

<p>All the scans will be appended to one PDF called out.pdf.</p>

```bash
./merge_scans.sh
```

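
For readers who want to see the idea in Python, here is one way the merge could look using Pillow. It is a sketch, not what <em>merge_scans.sh</em> does: it assumes the scans are `.jpg` files and that out.pdf is written into the scans folder, and `natural_key` and `merge_scans` are hypothetical names.

```python
import re
from pathlib import Path


def natural_key(path):
    """Sort key that orders 'scan2.jpg' before 'scan10.jpg'."""
    parts = re.split(r"(\d+)", Path(path).name)
    return [int(part) if part.isdigit() else part.lower() for part in parts]


def merge_scans(folder="scans", output="scans/out.pdf"):
    """Append every .jpg in `folder` to a single PDF, in natural file-name order."""
    from PIL import Image  # Pillow, imported lazily so the helper above has no dependency

    files = sorted(Path(folder).glob("*.jpg"), key=natural_key)
    if not files:
        raise FileNotFoundError(f"no .jpg files in {folder}")
    pages = [Image.open(f).convert("RGB") for f in files]
    # save_all with append_images writes a multi-page PDF.
    pages[0].save(output, save_all=True, append_images=pages[1:])
    return output
```

Sorting with `natural_key` matters because a plain alphabetical sort would put page 10 before page 2.
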
### Burst the PDF in <em>scans</em>

<p>Burst this PDF into single pages, renaming the files so they can be iterated over later.</p>

```bash
python3 burstpdf.py
```

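
A possible shape for this step, using pdf2image from the dependency list. This is a sketch and not the contents of <em>burstpdf.py</em>: the input path, the JPEG output format, the `dpi` value and the names `page_name` and `burst` are all assumptions.

```python
from pathlib import Path


def page_name(index, total):
    """Zero-padded file name, so that sorting by name matches page order."""
    width = max(3, len(str(total)))
    return f"page_{index:0{width}d}.jpg"


def burst(pdf="scans/out.pdf", out_dir="split", dpi=300):
    """Write each page of `pdf` to `out_dir` as a separate image."""
    from pdf2image import convert_from_path  # requires poppler to be installed

    pages = convert_from_path(pdf, dpi=dpi)
    Path(out_dir).mkdir(exist_ok=True)
    for i, page in enumerate(pages, start=1):
        page.save(Path(out_dir) / page_name(i, len(pages)), "JPEG")
    return len(pages)
```
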
### Rotate the PDFs

<p>The book scanner photographs the pages rotated, so this script iterates through the odd and even pages, rotating them back to their original position.</p>

```bash
python3 rotation.py
```

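
The odd/even logic can be sketched as follows. The angles are placeholders (which way each camera is turned depends on the scanner), the pages are assumed to be `.jpg` files, and `angle_for` and `rotate_pages` are hypothetical names rather than functions from <em>rotation.py</em>.

```python
from pathlib import Path


def angle_for(page_number, odd_angle=90, even_angle=-90):
    """Pick the rotation for a page; positive angles turn counter-clockwise in Pillow."""
    return odd_angle if page_number % 2 else even_angle


def rotate_pages(src="split", dst="rotated"):
    """Rotate every page in `src` and save it under the same name in `dst`."""
    from PIL import Image

    Path(dst).mkdir(exist_ok=True)
    for number, path in enumerate(sorted(Path(src).glob("*.jpg")), start=1):
        with Image.open(path) as img:
            # expand=True enlarges the canvas so a 90-degree turn is not clipped.
            img.rotate(angle_for(number), expand=True).save(Path(dst) / path.name)
```
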
### Cropping the bounding boxes

<p>The pages are now in their original position, but they still need to be cropped to their bounding box. This script iterates through them and crops each page to the highest-contrast area found.</p>

```bash
python3 bounding_box.py
```

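
One common way to find such an area with OpenCV is to threshold the image and keep the largest contour. The sketch below shows that approach; it is not necessarily what <em>bounding_box.py</em> does, and `largest_box` and `crop_to_page` are hypothetical names.

```python
def largest_box(boxes):
    """Pick the (x, y, w, h) box with the greatest area."""
    return max(boxes, key=lambda box: box[2] * box[3])


def crop_to_page(src, dst):
    """Crop the image at `src` to its largest bright region and save it to `dst`."""
    import cv2  # opencv-python

    image = cv2.imread(str(src))
    if image is None:
        raise FileNotFoundError(src)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Otsu's method picks the threshold; this assumes a bright page on a darker background.
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # [-2] selects the contour list in both OpenCV 3 and OpenCV 4.
    contours = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
    if not contours:
        raise ValueError(f"no contours found in {src}")
    x, y, w, h = largest_box([cv2.boundingRect(c) for c in contours])
    cv2.imwrite(str(dst), image[y:y + h, x:x + w])
```
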
### Cropping the mirror

<p>The pages are now cropped, but the mirror is still visible in the middle. This script crops it out.</p>

```bash
python3 mirror_crop.py
```

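
A simple way to picture this step is to trim a fixed strip from the spine side of each page. The sketch assumes odd pages have the spine on the left and even pages on the right, and that a fixed width in pixels is good enough; both are guesses, and `mirror_crop_box`, `crop_mirror` and `strip` are hypothetical names.

```python
def mirror_crop_box(width, height, page_number, strip=100):
    """Return a (left, upper, right, lower) box that drops `strip` pixels on the spine side."""
    if page_number % 2:
        # Odd page: spine assumed on the left edge.
        return (strip, 0, width, height)
    # Even page: spine assumed on the right edge.
    return (0, 0, width - strip, height)


def crop_mirror(src, dst, page_number, strip=100):
    """Crop the mirror strip from the image at `src` and save the result to `dst`."""
    from PIL import Image

    with Image.open(src) as img:
        box = mirror_crop_box(img.width, img.height, page_number, strip)
        img.crop(box).save(dst)
```
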
### OCR

<p>This step runs OCR on the JPGs, turning them into PDFs.</p>

```bash
python3 tesseract_ocr.py
```

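
With pytesseract, a searchable PDF can be produced per image through `image_to_pdf_or_hocr`. The sketch below shows that call in a loop; the folder names, the `lang` default and the names `pdf_path_for` and `ocr_folder` are assumptions, not taken from <em>tesseract_ocr.py</em>.

```python
from pathlib import Path


def pdf_path_for(image_path, out_dir="ocred"):
    """Map 'cropped/page_001.jpg' to 'ocred/page_001.pdf'."""
    return Path(out_dir) / (Path(image_path).stem + ".pdf")


def ocr_folder(src="cropped", out_dir="ocred", lang="eng"):
    """OCR every .jpg in `src` and write one searchable PDF per page to `out_dir`."""
    import pytesseract  # needs the tesseract binary on the PATH
    from PIL import Image

    Path(out_dir).mkdir(exist_ok=True)
    for image_path in sorted(Path(src).glob("*.jpg")):
        with Image.open(image_path) as img:
            # Returns the PDF as bytes, with the recognised text layered over the image.
            pdf_bytes = pytesseract.image_to_pdf_or_hocr(img, lang=lang, extension="pdf")
        pdf_path_for(image_path, out_dir).write_bytes(pdf_bytes)
```
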
### Merge all the files and create the PDF

<p>The OCRed pages are now joined into their final PDF. Your book is ready :)</p>

```bash
./merge_files.sh
```

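
Since pdfunite is in the dependency list, the final merge could be sketched as a call to it from Python. The output name `book.pdf` and the names `pdfunite_command` and `merge_pages` are hypothetical, and <em>merge_files.sh</em> may work differently.

```python
import subprocess
from pathlib import Path


def pdfunite_command(pdfs, output):
    """Build the pdfunite call: all source files first, the destination last."""
    sources = [str(p) for p in pdfs]
    if not sources:
        raise ValueError("no PDF files to merge")
    return ["pdfunite", *sources, str(output)]


def merge_pages(src="ocred", output="book.pdf"):
    """Join every PDF in `src`, in file-name order, into `output`."""
    pages = sorted(Path(src).glob("*.pdf"))
    subprocess.run(pdfunite_command(pages, output), check=True)
    return output
```
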
## License

The package is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).