Scripts used to process a book scanned through the DIY Book Scanner.

<h1 align="center">DIY Book Scanner Workflow</h1>
## Getting started
This set of scripts was written for the Text Laundrette workshop.<br>It is a workflow to turn the pictures from the DIY Book Scanner into a final OCRed PDF.
To skip any of the scripts, comment out the corresponding line in the shell script <em>workshop_stream.sh</em>.
## Dependencies
### Brew (macOS) or apt-get (Linux)
<p>On macOS, you’ll need the command-line tools for Xcode installed.</p>
```bash
xcode-select --install
```
<p>Then install Homebrew.</p>
```bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```
<p>Run the following command once you’re done to ensure Homebrew is installed and working properly:</p>
```bash
brew doctor
```
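<p>Then install the system packages with Homebrew (<em>pdfunite</em> ships with poppler, and pytesseract needs the tesseract binary):</p>
```bash
brew install python3 imagemagick poppler tesseract
```
<p>On Linux, install the equivalent packages with apt-get:</p>
```bash
sudo apt-get install python3 python3-pip imagemagick poppler-utils tesseract-ocr
```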
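### PIP3
<p>Install the Python packages (<em>time</em> and <em>logging</em> are part of the Python standard library and do not need to be installed):</p>
<p><code>pip3 install pdf2image Pillow opencv-python pytesseract</code></p>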
## How to use
<p>Add your pictures from the book scanner to the folder <em>scans</em>.</p>
<p>Make all the files executable.</p>
```bash
chmod +x merge_scans.sh workshop_stream.sh merge_files.sh
```
<p>Run ./workshop_stream.sh</p>
<p>Wait :)</p>
## Additional information
### Create 5 directories
```bash
mkdir split
mkdir rotated
mkdir ocred
mkdir bounding_box
mkdir cropped
```
### Merge the files in the directory <em>scans</em>
<p>All the scans are appended to a single PDF called out.pdf.</p>
```bash
./merge_scans.sh
```
### Burst the PDF in <em>scans</em>
<p>Burst this PDF into one file per page, renaming the files so they can be iterated over later.</p>
```bash
python3 burstpdf.py
```
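
<p>A minimal sketch of what a burst step like this could look like, assuming <em>out.pdf</em> sits in <em>scans</em> and the pages are written to <em>split</em> as JPEGs. It uses <em>convert_from_path</em> from pdf2image; the <em>page_NNN.jpg</em> naming scheme and the function names are hypothetical and may differ from what <em>burstpdf.py</em> actually does.</p>

```python
from pathlib import Path


def page_filename(index, total, ext="jpg"):
    """Return a zero-padded name such as 'page_007.jpg' (hypothetical scheme)."""
    # Pad to at least 3 digits, or more when the book has 1000+ pages.
    width = max(3, len(str(total)))
    return f"page_{index:0{width}d}.{ext}"


def burst_pdf(pdf_path="scans/out.pdf", out_dir="split", dpi=300):
    """Render every page of pdf_path to a numbered JPEG in out_dir."""
    # Imported here so page_filename() can be used without pdf2image installed.
    from pdf2image import convert_from_path

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # convert_from_path returns one PIL image per PDF page.
    pages = convert_from_path(str(pdf_path), dpi=dpi)

    written = []
    for number, page in enumerate(pages, start=1):
        target = out / page_filename(number, len(pages))
        page.save(target, "JPEG")
        written.append(target)
    return written
```

<p>Zero-padding the page number keeps the files in page order when they are sorted by name, which is convenient for any later step that iterates over the directory.</p>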
### Rotate the pages
<p>The book scanner takes pictures of the pages. This script iterates through the odd and even pages, rotating them back to their original orientation.</p>
```bash
python3 rotation.py
```
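
<p>The odd/even rotation can be sketched as follows. This is an illustration only: the angles (90 degrees counter-clockwise for odd pages, 90 degrees clockwise for even pages), the <em>split</em> and <em>rotated</em> directory roles, and the function names are assumptions, not taken from <em>rotation.py</em>.</p>

```python
from pathlib import Path


def rotation_angle(page_number, odd_angle=90, even_angle=-90):
    """Return the counter-clockwise angle for a 1-based page number.

    The default angles are assumptions; swap them if your scanner's
    cameras are mounted the other way round.
    """
    return odd_angle if page_number % 2 == 1 else even_angle


def rotate_pages(in_dir="split", out_dir="rotated"):
    """Rotate every JPEG in in_dir and save it under the same name in out_dir."""
    # Imported here so rotation_angle() can be used without Pillow installed.
    from PIL import Image

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Sorting by name gives page order when the names are zero-padded.
    for number, path in enumerate(sorted(Path(in_dir).glob("*.jpg")), start=1):
        with Image.open(path) as img:
            # expand=True grows the canvas so the rotated page is not clipped.
            rotated = img.rotate(rotation_angle(number), expand=True)
            rotated.save(out / path.name)
```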
### Cropping the bounding boxes
<p>The pages are now in their original orientation, but they still include the area around the page. This script iterates through them and crops each one to the highest-contrast area found.</p>
```bash
python3 bounding_box.py
```
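
<p>One common way to find such a region with OpenCV is to threshold the image and crop to the bounding rectangle of the largest contour. The sketch below uses Otsu thresholding for this, which is a stand-in technique and not necessarily how <em>bounding_box.py</em> picks its area; the function names and the input and output paths are hypothetical.</p>

```python
def largest_box(boxes):
    """Return the (x, y, w, h) box with the largest area."""
    return max(boxes, key=lambda box: box[2] * box[3])


def crop_to_page(in_path, out_path):
    """Crop the image at in_path to its largest bright region."""
    # Imported here so largest_box() can be used without OpenCV installed.
    import cv2

    image = cv2.imread(str(in_path))
    if image is None:
        raise FileNotFoundError(in_path)

    # Otsu picks the threshold that best separates bright from dark pixels;
    # this assumes a light page photographed against a darker background.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # [-2] selects the contour list on both OpenCV 3 and OpenCV 4.
    contours = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
    if not contours:
        raise ValueError(f"no bright region found in {in_path}")

    x, y, w, h = largest_box([cv2.boundingRect(c) for c in contours])
    cv2.imwrite(str(out_path), image[y:y + h, x:x + w])
    return x, y, w, h
```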
### Cropping the mirror
<p>The pages are now cropped, but the mirror is still visible in the middle. This script crops it out.</p>
```bash
python3 mirror_crop.py
```
### OCR
<p>This step runs OCR on the JPG files, turning them into PDFs.</p>
```bash
python3 tesseract_ocr.py
```
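
<p>With pytesseract, turning one image into a searchable PDF can be done with <em>image_to_pdf_or_hocr</em>. The sketch below assumes the input images live in <em>cropped</em> and the PDFs go to <em>ocred</em>; those directory roles, the English language setting, and the function names are assumptions, not details confirmed by <em>tesseract_ocr.py</em>.</p>

```python
from pathlib import Path


def ocr_output_path(image_path, out_dir="ocred"):
    """Map 'cropped/page_007.jpg' to 'ocred/page_007.pdf'."""
    return Path(out_dir) / (Path(image_path).stem + ".pdf")


def ocr_page(image_path, out_dir="ocred", lang="eng"):
    """Run Tesseract on one image and write a searchable PDF."""
    # Imported here so ocr_output_path() can be used without pytesseract installed.
    import pytesseract

    target = ocr_output_path(image_path, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)

    # Returns the PDF as bytes: the page image with an invisible text layer.
    pdf_bytes = pytesseract.image_to_pdf_or_hocr(
        str(image_path), lang=lang, extension="pdf"
    )
    target.write_bytes(pdf_bytes)
    return target
```

<p>Keeping the page's file name and changing only the extension preserves the page order for the final merge.</p>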
### Merge all the files and create the PDF
<p>The OCRed pages are now joined into the final PDF. Your book is ready :)</p>
```bash
./merge_files.sh
```
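
<p>For reference, the same kind of merge can be driven from Python by calling <em>pdfunite</em>, which takes the input PDFs followed by the output file. This is a sketch, not the contents of <em>merge_files.sh</em>: the output name <em>book.pdf</em>, the <em>ocred</em> input directory, and the function names are assumptions.</p>

```python
from pathlib import Path


def pdfunite_command(pages, output="book.pdf"):
    """Build the pdfunite argument list: inputs in name order, output last."""
    ordered = sorted(str(page) for page in pages)
    if not ordered:
        raise ValueError("no PDF pages to merge")
    return ["pdfunite", *ordered, output]


def merge_pages(in_dir="ocred", output="book.pdf"):
    """Join every PDF in in_dir into a single file using pdfunite."""
    # Imported here because only this function starts an external process.
    import subprocess

    command = pdfunite_command(Path(in_dir).glob("*.pdf"), output)
    # check=True raises CalledProcessError if pdfunite exits with an error.
    subprocess.run(command, check=True)
    return output
```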
## License
The package is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).