Metadata-Version: 1.0
Name: pypdfocr
Version: 0.7.4
Summary: Converts a scanned PDF into an OCR'ed pdf using Tesseract-OCR and Ghostscript
Home-page: UNKNOWN
Author: Virantha N. Ekanayake
Author-email: virantha@gmail.com
License: ASL 2.0
Description: PyPDFOCR - Tesseract-OCR based PDF filing
        =========================================
        
        |image0| |image1| |Coverage Status|
        
        This program will help manage your scanned PDFs by doing the following:
        
        -  Take a scanned PDF file and run OCR on it (using the Tesseract OCR
           software from Google), generating a searchable PDF
        -  Optionally, watch a folder for incoming scanned PDFs and
           automatically run OCR on them
        -  Optionally, file the scanned PDFs into directories based on simple
           keyword matching that you specify
        -  Evernote auto-upload and filing based on keyword search
        -  Email status when it files your PDF
        
        More links:
        
        -  `Blog @ virantha.com <http://virantha.com/category/pypdfocr.html>`__
        -  `Documentation @
           documentup.com <http://documentup.com/virantha/pypdfocr>`__
        -  `Source @ github <https://www.github.com/virantha/pypdfocr>`__
        -  `API docs @ gitpages <http://virantha.github.com/pypdfocr/html>`__
        
        Usage:
        ------
        
        Single conversion:
        ~~~~~~~~~~~~~~~~~~
        
        ::
        
            pypdfocr filename.pdf
        
            --> filename_ocr.pdf will be generated
        
        If you have a language pack installed, then you can specify it with the
        ``-l`` option:
        
        ::
        
            pypdfocr -l spa filename.pdf
        
        Folder monitoring:
        ~~~~~~~~~~~~~~~~~~
        
        ::
        
            pypdfocr -w watch_directory
        
            --> Every time a pdf file is added to `watch_directory` it will be OCR'ed
        
        Automatic filing:
        ~~~~~~~~~~~~~~~~~
        
        To automatically move the OCR'ed pdf to a directory based on a keyword,
        use the -f option and specify a configuration file (described below):
        
        ::
        
            pypdfocr filename.pdf -f -c config.yaml
        
        You can also do this in folder monitoring mode:
        
        ::
        
            pypdfocr -w watch_directory -f -c config.yaml
        
        Filing based on filename match:
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        
        If no keywords match the contents of the filename, you can optionally
        allow it to fallback to trying to find keyword matches with the PDF
        filename using the -n option. For example, you may have receipts always
        named as ``receipt_2013_12_2.pdf`` by your scanner, and you want to move
        this to a folder called 'receipts'. Assuming you have a keyword
        ``receipt`` matching to folder ``receipts`` in your configuration file
        as described below, you can run the following and have this filed even
        if the content of the pdf does not contain the text 'receipt':
        
        ::
        
            pypdfocr filename.pdf -f -c config.yaml -n
        
        Configuration file for automatic PDF filing
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        
        The config.yaml file above is a simple folder to keyword matching text
        file. It determines where your OCR'ed PDFs (and optionally, the original
        scanned PDF) are placed after processing. An example is given below:
        
        ::
        
            target_folder: "docs/filed"
            default_folder: "docs/filed/manual_sort"
            original_move_folder: "docs/originals"
        
            folders:
                finances:
                    - american express
                    - chase card
                    - internal revenue service
                travel:
                    - boarding pass
                    - airlines
                    - expedia
                    - orbitz
                receipts:
                    - receipt
        
        The ``target_folder`` is the root of your filing cabinet. Any PDF moving
        will happen in sub-directories under this directory.
        
        The ``folders`` section defines your filing directories and the keywords
        associated with them. In this example, we have three filing directories
        (finances, travl, receipts), and some associated keywords for each
        filing directory. For example, if your OCR'ed PDF contains the phrase
        "american express" (in any upper/lower case), it will be filed into
        ``docs/filed/finances``
        
        The ``default_folder`` is where the OCR'ed PDF is moved to if there is
        no keyword match.
        
        The ``original_move_folder`` is optional (you can comment it out with
        ``#`` in front of that line), but if specified, the original scanned PDF
        is moved into this directory after OCR is done. Otherwise, if this field
        is not present or commented out, your original PDF will stay where it
        was found.
        
        If there is any naming conflict during filing, the program will add an
        underscore followed by a number to each filename, in order to avoid
        overwriting files that may already be present.
        
        Evernote upload:
        ~~~~~~~~~~~~~~~~
        
        Evernote authentication token
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        
        To enable Evernote support, you will need to `get a developer token for
        your Evernote
        account. <https://www.evernote.com/api/DeveloperToken.action>`__. You
        should note that this script will never delete or modify existing notes
        in your account, and limits itself to creating new Notebooks and Notes.
        Once you get that token, you copy and paste it into your configuration
        file as shown below
        
        Evernote filing usage
        ^^^^^^^^^^^^^^^^^^^^^
        
        To automatically upload the OCR'ed pdf to a folder based on a keyword,
        use the ``-e`` option instead of the ``-f`` auto filing option.
        
        ::
        
            pypdfocr filename.pdf -e -c config.yaml
        
        Similarly, you can also do this in folder monitoring mode:
        
        ::
        
            pypdfocr -w watch_directory -e -c config.yaml
        
        Evernote filing configuration file
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        
        The config file shown above only needs to change slightly. The folders
        section is completely unchanged, but note that ``target_folder`` is the
        name of your "Notebook stack" in Evernote, and the ``default_folder``
        should just be the default Evernote upload notebook name.
        
        ::
        
            target_folder: "evernote_stack"
            default_folder: "default"
            original_move_folder: "docs/originals"
            evernote_developer_token: "YOUR_TOKEN"
        
            folders:
                finances:
                    - american express
                    - chase card
                    - internal revenue service
                travel:
                    - boarding pass
                    - airlines
                    - expedia
                    - orbitz
                receipts:
                    - receipt
        
        Auto email
        ~~~~~~~~~~
        
        You can have PyPDFOCR email you everytime it converts a file and files
        it. You need to first specify the following lines in the configuration
        file and then use the ``-m`` option when invoking ``pypdfocr``:
        
        ::
        
            mail_smtp_server: "smtp.gmail.com:587"
            mail_smtp_login: "virantha@gmail.com"
            mail_smtp_password: "PASSWORD"
            mail_from_addr: "virantha@gmail.com"
            mail_to_list: 
                - "virantha@gmail.com"
                - "person2@gmail.com"
        
        Fine-tuning Tesseract/Ghostscript
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        
        At the moment, the only options allowed for Tesseract and Ghostscript
        are specifying their executable locations manually. Use the following in
        your configuration file:
        
        ::
        
            tesseract:
                binary: "/usr/bin/tesseract"
        
            ghostscript:
                binary: "/usr/local/bin/gs"
        
        Installation
        ------------
        
        Using pip
        ~~~~~~~~~
        
        PyPDFOCR is available in PyPI, so you can just run:
        
        ::
        
            pip install pypdfocr
        
        For those on **Windows**, because it's such a pain to get all the PIL
        and PDF dependencies installed, I've gone ahead and made an executable
        called
        `pypdfocr.exe <https://github.com/virantha/pypdfocr/blob/master/dist/pypdfocr.exe?raw=true>`__
        
        You still need to install Tesseract and GhostScript as detailed below in
        the external dependencies list.
        
        Manual install
        ~~~~~~~~~~~~~~
        
        Clone the source directly from github (you need to have git installed):
        
        ::
        
            git clone https://github.com/virantha/pypdfocr.git
        
        Then, install the following third-party python libraries:
        
        -  PIL (Python Imaging Library) http://www.pythonware.com/products/pil/
        -  ReportLab (PDF generation library)
           http://www.reportlab.com/software/opensource/
        -  Watchdog (Cross-platform fhlesystem events monitoring)
           https://pypi.python.org/pypi/watchdog
        -  PyPDF2 (Pure python pdf library)
        
        These can all be installed via pip:
        
        ::
        
            pip install pil
            pip install reportlab
            pip install watchdog
            pip install pypdf2
        
        You will also need to install the external dependencies listed below.
        
        External Dependencies
        ---------------------
        
        PyPDFOCR relies on the following (free) programs being installed and in
        the path:
        
        -  Tesseract OCR software https://code.google.com/p/tesseract-ocr/
        -  GhostScript http://www.ghostscript.com/
        
        In addition, if you want it to figure out the original PDF resolution
        automatically, you need to have pdfimages in your path, which is part of
        the `xpdf <http://www.foolabs.com/xpdf/download.html>`__ or poppler
        packages.
        
        On Mac OS X, you can install these using homebrew:
        
        ::
        
            brew install tesseract
            brew install ghostscript
            brew install poppler
        
        On Windows, please use the installers provided on their download pages.
        
        \*\* Important \*\* Tesseract version 3.02.02 or newer required
        (apparently 3.02.01-6 and possibly others do not work due to a hocr
        output format change that I'm not planning to address). On Ubuntu, you
        may need to compile and install it manually by following `these
        instructions <http://miphol.com/muse/2013/05/install-tesseract-ocr-on-ubunt.html>`__
        
        Also note that if you want Tesseract to recognize rotated documents (upside down, or rotated 90 degrees)
        then you need to find your tessdata directory and do the following:
        
        ::
        
            cd /usr/local/share/tessdata 
            cp eng.traineddata osd.traineddata 
        
        ``osd`` stands for Orientation and Script Detection, so you need to copy the .traineddata
        for whatever language you want to scan in as ``osd.traineddata``.  If you don't do this step, 
        then any landscape document will produce garbage
        
        Disclaimer
        ----------
        
        While test coverage is at 90% right now, Sphinx docs generation is at an
        early stage. The software is distributed on an "AS IS" BASIS, WITHOUT
        WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        
        .. |image0| image:: https://badge.fury.io/py/pypdfocr.png
           :target: https://pypi.python.org/pypi/pypdfocr
        .. |image1| image:: https://pypip.in/d/pypdfocr/badge.png
        .. |Coverage Status| image:: https://coveralls.io/repos/virantha/pypdfocr/badge.png?branch=develop
           :target: https://coveralls.io/r/virantha/pypdfocr
        
        =======  ========   ======
        Version  Date       Changes
        -------  --------   ------
        
        v0.7.4   3/28/14    Bug fix on pdf assembly
        v0.7.3   3/27/14    Modified internals to use single image per page (instead of multipage tiff). Also enabled orientation detection
        v0.7.2   3/26/14    Switched from Pil to Pillow. Now uses original images from PDF in output pdf (no dpi/color/quality changes!)
        v0.7.1   3/25/14    OCR Language is now an option
        v0.7.0   3/25/14    Now honors original pdf resolution
        v0.6.1   2/16/14    Bug fix for pdfs with only numbers in the filename
        v0.6.0   1/16/14    Added filing based on filename match as fallback, added tesseract version check
        v0.5.4   1/12/14    Fixed bug with reordering of text pages on certain platforms(glob)
        v0.5.3   12/12/13   Fix to evernote server specification
        v0.5.2   12/08/13   Fix to lowercase keywords
        v0.5.1   11/02/13   Fixed a bunch of windows critical path handling issues
        v0.5.0   10/30/13   Email status added, 90% test coverage
        v0.4.1   10/28/13   Made HOCR parsing more robust
        v0.4.0   10/28/13   Added early Evernote upload support
        v0.3.1   10/24/13   Path fix on windows
        v0.3.0   10/23/13   Added filing of converted pdfs using a configuration file to specify target directories based on keyword matches in the pdf text
        v0.2.2   10/22/13   Added a console script to put the pypdfocr script into your bin
        v0.2.1   10/22/13   Fix to initial packaging problem.
        v0.2.0   10/21/13   Initial release.
        =======  ========   ======
        
        Todo list
        =========
        
        - Switch to page-by-page pnm conversion so we start adding support for unpaper
        - Make tesseract run multiprocess
        - Run more robustness tests for watching networked shares
        - Add more docstrings
        - Add more option specifiers to tesseract and ghostscript
        
Platform: UNKNOWN
