Scripts used to process a book scanned through the DIY Book Scanner.

Pedro Sá Couto f1bbfc0c24 Update 'readme.md'		5 years ago

# DIY Book Scanner Workflow

## Getting started

This set of scripts was written for the Text Laundrette workshop.
It is a workflow to turn the pictures from the DIY Book Scanner into a final OCRed PDF.

To skip one of the scripts, comment out the corresponding line in the shell script workshop_stream.sh.
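The chain of steps can also be sketched in Python. This is only an illustration: the contents of workshop_stream.sh are not shown in this readme, so the step order below is assumed from the sections further down, and `build_plan`, `run_plan` and the `skip` parameter are hypothetical names.

```python
import os
import subprocess

# Step order assumed from the sections of this readme,
# not read from workshop_stream.sh itself.
STEPS = [
    ["./merge_scans.sh"],
    ["python3", "burstpdf.py"],
    ["python3", "rotation.py"],
    ["python3", "bounding_box.py"],
    ["python3", "mirror_crop.py"],
    ["python3", "tesseract_ocr.py"],
    ["./merge_files.sh"],
]


def build_plan(skip=()):
    """Return the commands to run, leaving out any script named in `skip`."""
    return [cmd for cmd in STEPS if os.path.basename(cmd[-1]) not in skip]


def run_plan(plan):
    """Run each command in order and stop at the first failure."""
    for cmd in plan:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Comparable to commenting out the rotation line in the shell script.
    for cmd in build_plan(skip={"rotation.py"}):
        print(" ".join(cmd))
```

Passing a set of script names to `skip` plays the same role as commenting out lines in the shell script; `run_plan(build_plan())` would execute everything.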

## Dependencies

### Brew (macOS) or apt-get (Linux)

You’ll need the command-line tools for Xcode installed.

xcode-select --install

Then install Homebrew. The Ruby-based installer has been deprecated; the current install command is:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Run the following command once you’re done to ensure Homebrew is installed and working properly:

brew doctor

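On Linux, install the dependencies with apt-get. pdfunite ships with poppler-utils, and pytesseract needs the Tesseract binary:

sudo apt-get install python3 python3-pip imagemagick poppler-utils tesseract-ocr

On macOS, install them with Homebrew. pip3 comes with the python formula, and pdfunite comes with poppler:

brew install python imagemagick poppler tesseract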

### PIP3

sudo pip3 install pdf2image Pillow opencv-python pytesseract

The time and logging modules are part of the Python standard library and do not need to be installed.
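Before running the workflow, it can help to check that everything is installed. The following is a minimal sketch using only the standard library. The names `REQUIRED_TOOLS`, `REQUIRED_MODULES` and `missing_dependencies` are hypothetical, and the tool list is an assumption: `convert` is taken to be the ImageMagick command the scripts call, and `tesseract` is included because pytesseract is a wrapper around that binary.

```python
import importlib.util
import shutil

# Command-line tools and Python modules assumed to be needed by the scripts.
REQUIRED_TOOLS = ["convert", "pdfunite", "tesseract"]
# Import names: Pillow is imported as PIL, opencv-python as cv2.
REQUIRED_MODULES = ["pdf2image", "PIL", "cv2", "pytesseract"]


def missing_dependencies(tools=REQUIRED_TOOLS, modules=REQUIRED_MODULES):
    """Return (missing_tools, missing_modules) without importing anything."""
    missing_tools = [t for t in tools if shutil.which(t) is None]
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    return missing_tools, missing_modules


if __name__ == "__main__":
    tools, modules = missing_dependencies()
    print("missing tools:", tools or "none")
    print("missing modules:", modules or "none")
```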

## How to use

Add your pictures from the book scanner to the folder "scans" inside this repository.

Make all the files executable.

chmod +x merge_scans.sh workshop_stream.sh merge_files.sh

Run ./workshop_stream.sh

Wait :)

## Additional information

### Create 5 directories

mkdir split
mkdir rotated
mkdir ocred
mkdir bounding_box
mkdir cropped
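The same directories can be created from Python in a way that is safe to repeat. This is a small sketch, not part of the repository; `create_work_dirs` and its `base` parameter are hypothetical names.

```python
from pathlib import Path

# Directory names taken from the mkdir commands above.
WORK_DIRS = ["split", "rotated", "ocred", "bounding_box", "cropped"]


def create_work_dirs(base="."):
    """Create the working directories under `base`, skipping existing ones."""
    created = []
    for name in WORK_DIRS:
        path = Path(base) / name
        # exist_ok=True makes a second run a no-op instead of an error.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `create_work_dirs()` from the repository root has the same effect as the five mkdir commands, except that it does not fail when a directory already exists.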

### Merge the files in the directory scans

All the scans will be appended to one PDF called out.pdf.

```bash
./merge_scans.sh
```

### Burst the pdf in scans

Burst this PDF into single pages, renaming the files so they can be iterated over later.

```bash
python3 burstpdf.py
```
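Renaming matters because later steps loop over the pages in order. The naming scheme used by burstpdf.py is not shown here, so the following is only a sketch of one common approach: zero-padded page numbers, so that plain alphabetical sorting matches page order. The `page_` prefix, the `jpg` extension and both function names are assumptions.

```python
def page_filename(index, total, prefix="page", ext="jpg"):
    """Name for page `index` (1-based), zero-padded so names sort in page order."""
    width = max(len(str(total)), 1)
    return f"{prefix}_{index:0{width}d}.{ext}"


def burst_names(total, **kwargs):
    """Names for every page of a `total`-page document, in order."""
    return [page_filename(i, total, **kwargs) for i in range(1, total + 1)]


if __name__ == "__main__":
    names = burst_names(12)
    print(names[0], names[-1])
    # Plain string sorting keeps the page order thanks to the padding.
    print(sorted(names) == names)
```

Without the padding, a name ending in 10 would sort before one ending in 2, and the odd/even logic of the following steps would be applied to the wrong pages.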

### Rotate the pdfs

The book scanner photographs the pages rotated. This script iterates through the odd and even pages, rotating them back to their original orientation.

```bash
python3 rotation.py
```
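The odd/even logic can be sketched as a mapping from page number to correction angle. The angles below are assumptions, since the right values depend on how the cameras are mounted on a given scanner, and `rotation_for` and `rotation_plan` are hypothetical names rather than functions from rotation.py.

```python
# Assumed angles: which way each camera is turned depends on the scanner build.
ODD_PAGE_ANGLE = 90    # degrees counter-clockwise, hypothetical value
EVEN_PAGE_ANGLE = 270  # degrees counter-clockwise, hypothetical value


def rotation_for(page_number):
    """Return the correction angle for a 1-based page number."""
    return ODD_PAGE_ANGLE if page_number % 2 == 1 else EVEN_PAGE_ANGLE


def rotation_plan(page_count):
    """Map every page number to the angle it should be rotated by."""
    return {n: rotation_for(n) for n in range(1, page_count + 1)}


if __name__ == "__main__":
    for page, angle in rotation_plan(4).items():
        print(page, angle)
```

The resulting angle could be passed to Pillow's `Image.rotate(angle, expand=True)`, which rotates counter-clockwise and enlarges the canvas so nothing is cut off.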

### Cropping the bounding boxes

The pages are now in their original position, but they still have a bounding box around them. This script iterates through them and crops each page to the highest-contrast area found.

```bash
python3 bounding_box.py
```
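To illustrate the idea, here is a pure-Python sketch that finds the box around all pixels brighter than a threshold and crops to it. It stands in for whatever bounding_box.py actually does (presumably with OpenCV, which is among the dependencies); the fixed threshold, the list-of-rows image format and the function names are all assumptions.

```python
def bounding_box(gray, threshold=128):
    """Return (left, top, right, bottom) around pixels brighter than `threshold`.

    `gray` is a list of rows of 0-255 values; right and bottom are exclusive.
    Returns None when no pixel passes the threshold.
    """
    rows = [y for y, row in enumerate(gray) if any(v > threshold for v in row)]
    if not rows:
        return None
    cols = [
        x for x in range(len(gray[0]))
        if any(row[x] > threshold for row in gray)
    ]
    return cols[0], rows[0], cols[-1] + 1, rows[-1] + 1


def crop(gray, box):
    """Cut the (left, top, right, bottom) box out of the pixel grid."""
    left, top, right, bottom = box
    return [row[left:right] for row in gray[top:bottom]]


if __name__ == "__main__":
    # A bright page, 2 rows by 3 columns, on a dark background.
    image = [
        [10, 10, 10, 10, 10],
        [10, 200, 210, 220, 10],
        [10, 205, 215, 225, 10],
        [10, 10, 10, 10, 10],
    ]
    box = bounding_box(image)
    print(box)
    print(crop(image, box))
```

A fixed threshold only works when the page is clearly brighter than the background; real scans usually need a more robust way of separating the page from its surroundings.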

### Cropping the mirror

The pages are now cropped, but the mirror is still visible in the middle.

```bash
python3 mirror_crop.py
```
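One way to remove the mirror is to trim a fixed margin from the inner edge of each page, on opposite sides for odd and even pages. The sketch below assumes that odd pages are right-hand pages with the mirror on their left edge and even pages are left-hand pages with the mirror on their right edge; this, the `margin` value and the function name are assumptions, not taken from mirror_crop.py.

```python
def mirror_crop_box(width, height, page_number, margin):
    """Return a (left, top, right, bottom) box that drops `margin` pixels
    on the side of the page where the mirror is assumed to be."""
    if not 0 <= margin < width:
        raise ValueError("margin must be non-negative and smaller than the width")
    if page_number % 2 == 1:
        # Assumed: odd pages are right-hand pages, mirror on the left edge.
        return margin, 0, width, height
    # Assumed: even pages are left-hand pages, mirror on the right edge.
    return 0, 0, width - margin, height


if __name__ == "__main__":
    print(mirror_crop_box(1000, 1500, page_number=1, margin=80))
    print(mirror_crop_box(1000, 1500, page_number=2, margin=80))
```

The returned tuple uses the same (left, upper, right, lower) order that Pillow's `Image.crop` expects.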

### OCR

In this step we OCR the JPGs, turning them into PDFs.

```bash
python3 tesseract_ocr.py
```
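A single page can be OCRed into a searchable PDF with pytesseract's `image_to_pdf_or_hocr`. The sketch below is not the code of tesseract_ocr.py: the input path, the `ocred` output directory (taken from the directory list), the `eng` language and both function names are assumptions.

```python
from pathlib import Path


def ocr_output_path(image_path, out_dir="ocred"):
    """Map e.g. cropped/page_01.jpg to ocred/page_01.pdf."""
    return Path(out_dir) / (Path(image_path).stem + ".pdf")


def ocr_page(image_path, out_dir="ocred", lang="eng"):
    """OCR one image and write the result as a searchable PDF."""
    # Imported here so the path helper above works without Tesseract installed.
    import pytesseract
    from PIL import Image

    pdf_bytes = pytesseract.image_to_pdf_or_hocr(
        Image.open(image_path), lang=lang, extension="pdf"
    )
    target = ocr_output_path(image_path, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(pdf_bytes)
    return target
```

For a book in another language, `lang` would need to match an installed Tesseract language pack.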

### Merge all the files and create the pdf

The OCRed pages are now joined into the final PDF. Your book is ready :)

```bash
./merge_files.sh
```
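The merge relies on pdfunite, which takes the input PDFs followed by the output file. A minimal sketch of building that command in page order follows; the contents of merge_files.sh are not shown here, so the numeric sorting, the `book.pdf` output name and the function names are assumptions.

```python
import re
import subprocess


def page_number(name):
    """Extract the last run of digits in a file name, or 0 if there is none."""
    digits = re.findall(r"\d+", name)
    return int(digits[-1]) if digits else 0


def pdfunite_command(pages, output="book.pdf"):
    """Build a `pdfunite in1.pdf in2.pdf ... out.pdf` command in page order."""
    ordered = sorted(pages, key=page_number)
    return ["pdfunite", *ordered, output]


def merge(pages, output="book.pdf"):
    """Run pdfunite; raises CalledProcessError if the merge fails."""
    subprocess.run(pdfunite_command(pages, output), check=True)


if __name__ == "__main__":
    print(pdfunite_command(["page_10.pdf", "page_2.pdf", "page_1.pdf"]))
```

Sorting by the number in the file name rather than alphabetically keeps page 10 after page 2 even when the names are not zero-padded.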

## License

The package is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).