Merge pull request #203 from danielquinn/feature/reminders

Feature: Reminders
chore: update the changelog for reminders
2025-12-22 11:01:23 +00:00 · 2017-03-25 16:27:28 +00:00 · 2017-03-25 16:22:04 +00:00 · 2017-03-25 16:21:46 +00:00 · 2017-03-25 16:20:59 +00:00 · 2017-03-25 16:18:34 +00:00
42 changed files with 703 additions and 363 deletions
--- a/.travis.yml
+++ b/.travis.yml
@@ -8,7 +8,9 @@ matrix:
          env: TOXENV=py34
        - python: 3.5
          env: TOXENV=py35
-        - python: 3.5
+        - python: 3.6
+          env: TOXENV=py36
+        - python: 3.6
          env: TOXENV=pep8
 install:
--- a/docs/changelog.rst
+++ b/docs/changelog.rst
@@ -1,6 +1,26 @@
 Changelog
 #########
 * 0.4.0
  * Introducing reminders.  See `#199`_ for more information, but the short
    explanation is that you can now attach simple notes & times to documents
    which are made available via the API.  Currently, the default API
    (basically just the Django admin) doesn't really make use of this, but
    `Thomas Brueggemann`_ over at `Paperless Desktop`_ has said that he would
    like to make use of this feature in his project.
 * 0.3.6
  * Fix for `#200`_ (!!) where the API wasn't configured to allow updating the
    correspondent or the tags for a document.
  * The ``content`` field is now optional, to allow for the edge case of a
    purely graphical document.
  * You can no longer add documents via the admin.  This never worked in the
    first place, so all I've done here is remove the link to the broken form.
  * The consumer code has been heavily refactored to support a pluggable
    interface.  Install a paperless consumer via pip and tell paperless about
    it with an environment variable, and you're good to go.  Proper
    documentation is on its way.
 * 0.3.5
  * A serious facelift for the documents listing page wherein we drop the
    tabular layout in favour of a tiled interface.
@@ -161,6 +181,8 @@ Changelog
 .. _Tim White: https://github.com/timwhite
 .. _Florian Harr: https://github.com/evils
 .. _Justin Snyman: https://github.com/stringlytyped
 .. _Thomas Brueggemann: https://github.com/thomasbrueggemann
 .. _Paperless Desktop: https://github.com/thomasbrueggemann/paperless-desktop
 .. _#20: https://github.com/danielquinn/paperless/issues/20
 .. _#44: https://github.com/danielquinn/paperless/issues/44
@@ -187,3 +209,5 @@ Changelog
 .. _#171: https://github.com/danielquinn/paperless/issues/171
 .. _#172: https://github.com/danielquinn/paperless/issues/172
 .. _#179: https://github.com/danielquinn/paperless/pull/179
 .. _#199: https://github.com/danielquinn/paperless/issues/199
 .. _#200: https://github.com/danielquinn/paperless/issues/200
--- a/src/documents/admin.py
+++ b/src/documents/admin.py
@@ -62,15 +62,19 @@ class DocumentAdmin(CommonAdmin):
    list_filter = ("tags", "correspondent", MonthListFilter)
    ordering = ["-created", "correspondent"]
    def has_add_permission(self, request):
        return False
    def created_(self, obj):
        return obj.created.date().strftime("%Y-%m-%d")
    created_.short_description = "Created"
    def thumbnail(self, obj):
        png_img = self._html_tag(
            "img",
            src="/fetch/thumb/{}".format(obj.id),
            width=180,
-            alt="thumbnail",
+            alt="Thumbnail of {}".format(obj.file_name),
            title=obj.file_name
        )
        return self._html_tag("a", png_img, href=obj.download_url)
--- a/src/documents/consumer.py
+++ b/src/documents/consumer.py
@@ -1,35 +1,21 @@
+import datetime
+import hashlib
+import logging
 import os
 import re
 import uuid
-import shutil
-import hashlib
-import logging
-import datetime
-import tempfile
-import itertools
-import subprocess
-from multiprocessing.pool import Pool
-import pyocr
-import langdetect
-from PIL import Image
 from django.conf import settings
 from django.utils import timezone
 from paperless.db import GnuPG
-from pyocr.tesseract import TesseractError
-from pyocr.libtesseract.tesseract_raw import \
-    TesseractError as OtherTesseractError
-from .models import Tag, Document, FileInfo
+from .models import Document, FileInfo, Tag
+from .parsers import ParseError
 from .signals import (
-    document_consumption_started,
-    document_consumption_finished
+    document_consumer_declaration,
+    document_consumption_finished,
+    document_consumption_started
 )
-from .languages import ISO639
-class OCRError(Exception):
-    pass
 class ConsumerError(Exception):
@@ -47,13 +33,7 @@ class Consumer(object):
    """
    SCRATCH = settings.SCRATCH_DIR
    CONVERT = settings.CONVERT_BINARY
    UNPAPER = settings.UNPAPER_BINARY
    CONSUME = settings.CONSUMPTION_DIR
    THREADS = int(settings.OCR_THREADS) if settings.OCR_THREADS else None
    DENSITY = settings.CONVERT_DENSITY if settings.CONVERT_DENSITY else 300
    DEFAULT_OCR_LANGUAGE = settings.OCR_LANGUAGE
    def __init__(self):
@@ -78,6 +58,16 @@ class Consumer(object):
            raise ConsumerError(
                "Consumption directory {} does not exist".format(self.CONSUME))
        self.parsers = []
        for response in document_consumer_declaration.send(self):
            self.parsers.append(response[1])
        if not self.parsers:
            raise ConsumerError(
                "No parsers could be found, not even the default.  "
                "This is a problem."
            )
    def log(self, level, message):
        getattr(self.logger, level)(message, extra={
            "group": self.logging_group
@@ -109,6 +99,13 @@ class Consumer(object):
                self._ignore.append(doc)
                continue
            parser_class = self._get_parser_class(doc)
            if not parser_class:
                self.log(
                    "info", "No parsers could be found for {}".format(doc))
                self._ignore.append(doc)
                continue
            self.logging_group = uuid.uuid4()
            self.log("info", "Consuming {}".format(doc))
@@ -119,25 +116,26 @@ class Consumer(object):
                logging_group=self.logging_group
            )
-            tempdir = tempfile.mkdtemp(prefix="paperless", dir=self.SCRATCH)
-            imgs = self._get_greyscale(tempdir, doc)
-            thumbnail = self._get_thumbnail(tempdir, doc)
+            parsed_document = parser_class(doc)
+            thumbnail = parsed_document.get_thumbnail()
            try:
-
-                document = self._store(self._get_ocr(imgs), doc, thumbnail)
-
-            except OCRError as e:
+                document = self._store(
+                    parsed_document.get_text(),
+                    doc,
+                    thumbnail
+                )
+            except ParseError as e:
                self._ignore.append(doc)
-                self.log("error", "OCR FAILURE for {}: {}".format(doc, e))
-                self._cleanup_tempdir(tempdir)
+                self.log("error", "PARSE FAILURE for {}: {}".format(doc, e))
+                parsed_document.cleanup()
                continue
            else:
-                self._cleanup_tempdir(tempdir)
+                parsed_document.cleanup()
                self._cleanup_doc(doc)
                self.log(
@@ -151,142 +149,20 @@ class Consumer(object):
                    logging_group=self.logging_group
                )
-    def _get_greyscale(self, tempdir, doc):
+    def _get_parser_class(self, doc):
        """
-        Greyscale images are easier for Tesseract to OCR
+        Determine the appropriate parser class based on the file
        """
-        self.log("info", "Generating greyscale image from {}".format(doc))
+        options = []
+        for parser in self.parsers:
+            result = parser(doc)
+            if result:
+                options.append(result)
-        # Convert PDF to multiple PNMs
-        pnm = os.path.join(tempdir, "convert-%04d.pnm")
-        run_convert(
-            self.CONVERT,
-            "-density", str(self.DENSITY),
-            "-depth", "8",
-            "-type", "grayscale",
-            doc, pnm,
-        )
+        # Return the parser with the highest weight.
+        return sorted(
+            options, key=lambda _: _["weight"], reverse=True)[0]["parser"]
        # Get a list of converted images
        pnms = []
        for f in os.listdir(tempdir):
            if f.endswith(".pnm"):
                pnms.append(os.path.join(tempdir, f))
        # Run unpaper in parallel on converted images
        with Pool(processes=self.THREADS) as pool:
            pool.map(run_unpaper, itertools.product([self.UNPAPER], pnms))
        # Return list of converted images, processed with unpaper
        pnms = []
        for f in os.listdir(tempdir):
            if f.endswith(".unpaper.pnm"):
                pnms.append(os.path.join(tempdir, f))
        return sorted(filter(lambda __: os.path.isfile(__), pnms))
    def _get_thumbnail(self, tempdir, doc):
        """
        The thumbnail of a PDF is just a 500px wide image of the first page.
        """
        self.log("info", "Generating the thumbnail")
        run_convert(
            self.CONVERT,
            "-scale", "500x5000",
            "-alpha", "remove",
            doc, os.path.join(tempdir, "convert-%04d.png")
        )
        return os.path.join(tempdir, "convert-0000.png")
    def _guess_language(self, text):
        try:
            guess = langdetect.detect(text)
            self.log("debug", "Language detected: {}".format(guess))
            return guess
        except Exception as e:
            self.log("warning", "Language detection error: {}".format(e))
    def _get_ocr(self, imgs):
        """
        Attempts to do the best job possible OCR'ing the document based on
        simple language detection trial & error.
        """
        if not imgs:
            raise OCRError("No images found")
        self.log("info", "OCRing the document")
        # Since the division gets rounded down by int, this calculation works
        # for every edge-case, i.e. 1
        middle = int(len(imgs) / 2)
        raw_text = self._ocr([imgs[middle]], self.DEFAULT_OCR_LANGUAGE)
        guessed_language = self._guess_language(raw_text)
        if not guessed_language or guessed_language not in ISO639:
            self.log("warning", "Language detection failed!")
            if settings.FORGIVING_OCR:
                self.log(
                    "warning",
                    "As FORGIVING_OCR is enabled, we're going to make the "
                    "best with what we have."
                )
                raw_text = self._assemble_ocr_sections(imgs, middle, raw_text)
                return raw_text
            raise OCRError("Language detection failed")
        if ISO639[guessed_language] == self.DEFAULT_OCR_LANGUAGE:
            raw_text = self._assemble_ocr_sections(imgs, middle, raw_text)
            return raw_text
        try:
            return self._ocr(imgs, ISO639[guessed_language])
        except pyocr.pyocr.tesseract.TesseractError:
            if settings.FORGIVING_OCR:
                self.log(
                    "warning",
                    "OCR for {} failed, but we're going to stick with what "
                    "we've got since FORGIVING_OCR is enabled.".format(
                        guessed_language
                    )
                )
                raw_text = self._assemble_ocr_sections(imgs, middle, raw_text)
                return raw_text
            raise OCRError(
                "The guessed language is not available in this instance of "
                "Tesseract."
            )
    def _assemble_ocr_sections(self, imgs, middle, text):
        """
        Given a `middle` value and the text that middle page represents, we OCR
        the remainder of the document and return the whole thing.
        """
        text = self._ocr(imgs[:middle], self.DEFAULT_OCR_LANGUAGE) + text
        text += self._ocr(imgs[middle + 1:], self.DEFAULT_OCR_LANGUAGE)
        return text
    def _ocr(self, imgs, lang):
        """
        Performs a single OCR attempt.
        """
        if not imgs:
            return ""
        self.log("info", "Parsing for {}".format(lang))
        with Pool(processes=self.THREADS) as pool:
            r = pool.map(image_to_string, itertools.product(imgs, [lang]))
            r = " ".join(r)
        # Strip out excess white space to allow matching to go smoother
        return strip_excess_whitespace(r)
    def _store(self, text, doc, thumbnail):
@@ -332,10 +208,6 @@ class Consumer(object):
        return document
    def _cleanup_tempdir(self, d):
        self.log("debug", "Deleting directory {}".format(d))
        shutil.rmtree(d)
    def _cleanup_doc(self, doc):
        self.log("debug", "Deleting document {}".format(doc))
        os.unlink(doc)
@@ -361,41 +233,3 @@ class Consumer(object):
        with open(doc, "rb") as f:
            checksum = hashlib.md5(f.read()).hexdigest()
        return Document.objects.filter(checksum=checksum).exists()
 def strip_excess_whitespace(text):
    collapsed_spaces = re.sub(r"([^\S\r\n]+)", " ", text)
    no_leading_whitespace = re.sub(
        "([\n\r]+)([^\S\n\r]+)", '\\1', collapsed_spaces)
    no_trailing_whitespace = re.sub("([^\S\n\r]+)$", '', no_leading_whitespace)
    return no_trailing_whitespace
 def image_to_string(args):
    img, lang = args
    ocr = pyocr.get_available_tools()[0]
    with Image.open(os.path.join(Consumer.SCRATCH, img)) as f:
        if ocr.can_detect_orientation():
            try:
                orientation = ocr.detect_orientation(f, lang=lang)
                f = f.rotate(orientation["angle"], expand=1)
            except (TesseractError, OtherTesseractError):
                pass
        return ocr.image_to_string(f, lang=lang)
 def run_unpaper(args):
    unpaper, pnm = args
    subprocess.Popen(
        (unpaper, pnm, pnm.replace(".pnm", ".unpaper.pnm"))).wait()
 def run_convert(*args):
    environment = os.environ.copy()
    if settings.CONVERT_MEMORY_LIMIT:
        environment["MAGICK_MEMORY_LIMIT"] = settings.CONVERT_MEMORY_LIMIT
    if settings.CONVERT_TMPDIR:
        environment["MAGICK_TMPDIR"] = settings.CONVERT_TMPDIR
    subprocess.Popen(args, env=environment).wait()
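The new `_get_parser_class` above indexes `[0]` on the sorted list, which raises `IndexError` when no declared parser accepts the file, even though the consume loop goes on to check `if not parser_class`. A minimal standalone sketch of the same highest-weight selection that returns `None` in that case might look like this. `pick_parser_class`, the two parser classes, and the two test callables are hypothetical names for illustration and are not part of this PR.

```python
def pick_parser_class(parsers, doc):
    """Return the parser class with the highest weight, or None.

    `parsers` is a list of callables like those collected from the
    document_consumer_declaration signal: each takes a path and returns
    either None or a dict with "parser" and "weight" keys.
    """
    options = [result for result in (test(doc) for test in parsers) if result]
    if not options:
        # Nothing claimed this file; let the caller log and skip it.
        return None
    return max(options, key=lambda option: option["weight"])["parser"]


# Hypothetical declarations, for illustration only.
class PdfParser(object):
    pass


class TextParser(object):
    pass


def pdf_test(doc):
    if doc.lower().endswith(".pdf"):
        return {"parser": PdfParser, "weight": 0}
    return None


def text_test(doc):
    if doc.lower().endswith((".txt", ".pdf")):
        return {"parser": TextParser, "weight": 10}
    return None
```

With both callables registered, a `.pdf` path resolves to `TextParser` because its weight is higher, and a `.jpg` path resolves to `None` instead of raising.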
--- a/src/documents/filters.py
+++ b/src/documents/filters.py
@@ -8,7 +8,7 @@ class CorrespondentFilterSet(FilterSet):
    class Meta(object):
        model = Correspondent
        fields = {
-            'name': [
+            "name": [
                "startswith", "endswith", "contains",
                "istartswith", "iendswith", "icontains"
            ],
@@ -21,7 +21,7 @@ class TagFilterSet(FilterSet):
    class Meta(object):
        model = Tag
        fields = {
-            'name': [
+            "name": [
                "startswith", "endswith", "contains",
                "istartswith", "iendswith", "icontains"
            ],
--- a/src/documents/migrations/0001_initial.py
+++ b/src/documents/migrations/0001_initial.py
@@ -3,6 +3,7 @@
 from __future__ import unicode_literals
 from django.db import migrations, models
 from django.conf import settings
 class Migration(migrations.Migration):
@@ -19,7 +20,7 @@ class Migration(migrations.Migration):
                ('id', models.AutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')),
                ('sender', models.CharField(blank=True, db_index=True, max_length=128)),
                ('title', models.CharField(blank=True, db_index=True, max_length=128)),
-                ('content', models.TextField(db_index=True)),
+                ('content', models.TextField(db_index=("mysql" not in settings.DATABASES["default"]["ENGINE"]))),
                ('created', models.DateTimeField(auto_now_add=True)),
                ('modified', models.DateTimeField(auto_now=True)),
            ],
--- a/src/documents/migrations/0003_sender.py
+++ b/src/documents/migrations/0003_sender.py
@@ -47,7 +47,11 @@ class Migration(migrations.Migration):
            ],
        ),
        migrations.RunPython(move_sender_strings_to_sender_model),
-        migrations.AlterField(
+        migrations.RemoveField(
            model_name='document',
            name='sender',
        ),
        migrations.AddField(
            model_name='document',
            name='sender',
            field=models.ForeignKey(blank=True, on_delete=django.db.models.deletion.CASCADE, to='documents.Sender'),
--- a/src/documents/migrations/0016_auto_20170325_1558.py
+++ b/src/documents/migrations/0016_auto_20170325_1558.py
@@ -0,0 +1,20 @@
 # -*- coding: utf-8 -*-
 # Generated by Django 1.10.5 on 2017-03-25 15:58
 from __future__ import unicode_literals
 from django.db import migrations, models
 class Migration(migrations.Migration):
    dependencies = [
        ('documents', '0015_add_insensitive_to_match'),
    ]
    operations = [
        migrations.AlterField(
            model_name='document',
            name='content',
            field=models.TextField(blank=True, db_index=True, help_text='The raw, text-only data of the document.  This field is primarily used for searching.'),
        ),
    ]
--- a/src/documents/mixins.py
+++ b/src/documents/mixins.py
@@ -1,8 +1,3 @@
 from django.contrib.auth.mixins import AccessMixin
 from django.contrib.auth import authenticate, login
 import base64
 class Renderable(object):
    """
    A handy mixin to make it easier/cleaner to print output based on a
@@ -12,46 +7,3 @@ class Renderable(object):
    def _render(self, text, verbosity):
        if self.verbosity >= verbosity:
            print(text)
 class SessionOrBasicAuthMixin(AccessMixin):
    """
    Session or Basic Authentication mixin for Django.
    It determines if the requester is already logged in or if they have
    provided proper http-authorization and returning the view if all goes
    well, otherwise responding with a 401.
    Base for mixin found here: https://djangosnippets.org/snippets/3073/
    """
    def dispatch(self, request, *args, **kwargs):
        # check if user is authenticated via the session
        if request.user.is_authenticated:
            # Already logged in, just return the view.
            return super(SessionOrBasicAuthMixin, self).dispatch(
                request, *args, **kwargs
            )
        # apparently not authenticated via session, maybe via HTTP Basic?
        if 'HTTP_AUTHORIZATION' in request.META:
            auth = request.META['HTTP_AUTHORIZATION'].split()
            if len(auth) == 2:
                # NOTE: Support for only basic authentication
                if auth[0].lower() == "basic":
                    authString = base64.b64decode(auth[1]).decode('utf-8')
                    uname, passwd = authString.split(':')
                    user = authenticate(username=uname, password=passwd)
                    if user is not None:
                        if user.is_active:
                            login(request, user)
                            request.user = user
                            return super(
                                SessionOrBasicAuthMixin, self
                            ).dispatch(
                                request, *args, **kwargs
                            )
        # nope, really not authenticated
        return self.handle_no_permission()
--- a/src/documents/models.py
+++ b/src/documents/models.py
@@ -158,13 +158,22 @@ class Document(models.Model):
    correspondent = models.ForeignKey(
        Correspondent, blank=True, null=True, related_name="documents")
    title = models.CharField(max_length=128, blank=True, db_index=True)
-    content = models.TextField(db_index=True)
+
+    content = models.TextField(
+        db_index=True,
+        blank=True,
+        help_text="The raw, text-only data of the document.  This field is "
+                  "primarily used for searching."
+    )
    file_type = models.CharField(
        max_length=4,
        editable=False,
        choices=tuple([(t, t.upper()) for t in TYPES])
    )
    tags = models.ManyToManyField(
        Tag, related_name="documents", blank=True)
--- a/src/documents/parsers.py
+++ b/src/documents/parsers.py
@@ -0,0 +1,45 @@
 import logging
 import shutil
 import tempfile
 from django.conf import settings
 class ParseError(Exception):
    pass
 class DocumentParser(object):
    """
    Subclass this to make your own parser.  Have a look at
    `paperless_tesseract.parsers` for inspiration.
    """
    SCRATCH = settings.SCRATCH_DIR
    def __init__(self, path):
        self.document_path = path
        self.tempdir = tempfile.mkdtemp(prefix="paperless", dir=self.SCRATCH)
        self.logger = logging.getLogger(__name__)
        self.logging_group = None
    def get_thumbnail(self):
        """
        Returns the path to a file we can use as a thumbnail for this document.
        """
        raise NotImplementedError()
    def get_text(self):
        """
        Returns the text from the document and only the text.
        """
        raise NotImplementedError()
    def log(self, level, message):
        getattr(self.logger, level)(message, extra={
            "group": self.logging_group
        })
    def cleanup(self):
        self.log("debug", "Deleting directory {}".format(self.tempdir))
        shutil.rmtree(self.tempdir)
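The base class above leaves `get_thumbnail()` and `get_text()` to subclasses. As a rough sketch of what a third-party parser could look like, the following defines a plain-text parser. Because the real `documents.parsers` module needs Django settings, the `DocumentParser` here is a trimmed stand-in that uses the system temp directory in place of `settings.SCRATCH_DIR`; `PlainTextParser` is a hypothetical name and is not part of this PR.

```python
import logging
import os
import shutil
import tempfile


class ParseError(Exception):
    pass


class DocumentParser(object):
    """Stand-in for documents.parsers.DocumentParser, without Django."""

    # Assumption: replaces settings.SCRATCH_DIR so the sketch runs anywhere.
    SCRATCH = tempfile.gettempdir()

    def __init__(self, path):
        self.document_path = path
        self.tempdir = tempfile.mkdtemp(prefix="paperless", dir=self.SCRATCH)
        self.logger = logging.getLogger(__name__)
        self.logging_group = None

    def get_thumbnail(self):
        raise NotImplementedError()

    def get_text(self):
        raise NotImplementedError()

    def cleanup(self):
        shutil.rmtree(self.tempdir)


class PlainTextParser(DocumentParser):
    """Hypothetical parser for .txt files.

    get_thumbnail() is deliberately left to the base class; a real parser
    would have to return the path of an image it rendered into self.tempdir.
    """

    def get_text(self):
        try:
            with open(self.document_path, encoding="utf-8") as f:
                return f.read().strip()
        except (OSError, UnicodeDecodeError) as e:
            # The consumer only catches ParseError, so wrap everything else.
            raise ParseError(e)


# Usage: parse a throwaway file, then remove the scratch directory.
workdir = tempfile.mkdtemp()
sample = os.path.join(workdir, "note.txt")
with open(sample, "w", encoding="utf-8") as f:
    f.write("  hello paperless \n")

parser = PlainTextParser(sample)
text = parser.get_text()
scratch = parser.tempdir
parser.cleanup()
shutil.rmtree(workdir)
```

Wrapping I/O failures in `ParseError` matters because the consume loop in `consumer.py` catches only that exception type before calling `cleanup()`.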
--- a/src/documents/serialisers.py
+++ b/src/documents/serialisers.py
@@ -18,12 +18,21 @@ class TagSerializer(serializers.HyperlinkedModelSerializer):
            "id", "slug", "name", "colour", "match", "matching_algorithm")
 class CorrespondentField(serializers.HyperlinkedRelatedField):
    def get_queryset(self):
        return Correspondent.objects.all()
 class TagsField(serializers.HyperlinkedRelatedField):
    def get_queryset(self):
        return Tag.objects.all()
 class DocumentSerializer(serializers.ModelSerializer):
-    correspondent = serializers.HyperlinkedRelatedField(
-        read_only=True, view_name="drf:correspondent-detail", allow_null=True)
-    tags = serializers.HyperlinkedRelatedField(
-        read_only=True, view_name="drf:tag-detail", many=True)
+    correspondent = CorrespondentField(
+        view_name="drf:correspondent-detail", allow_null=True)
+    tags = TagsField(view_name="drf:tag-detail", many=True)
    class Meta(object):
        model = Document
--- a/src/documents/signals/__init__.py
+++ b/src/documents/signals/__init__.py
@@ -2,3 +2,4 @@ from django.dispatch import Signal
 document_consumption_started = Signal(providing_args=["filename"])
 document_consumption_finished = Signal(providing_args=["document"])
 document_consumer_declaration = Signal(providing_args=[])
--- a/src/documents/signals/handlers.py
+++ b/src/documents/signals/handlers.py
@@ -1,6 +1,5 @@
 import logging
 import os
 from subprocess import Popen
 from django.conf import settings
--- a/src/documents/static/paperless.css
+++ b/src/documents/static/paperless.css
@@ -10,3 +10,14 @@ td a.tag {
  margin: 1px;
  display: inline-block;
 }
 #result_list th.column-note {
  text-align: right;
 }
 #result_list td.field-note {
  text-align: right;
 }
 #result_list td textarea {
  width: 90%;
  height: 5em;
 }
--- a/src/documents/templates/admin/documents/document/change_list_results.html
+++ b/src/documents/templates/admin/documents/document/change_list_results.html
@@ -158,7 +158,7 @@
 <script>
-  // We nee to re-build the select-all functionality as the old logic pointed
+  // We need to re-build the select-all functionality as the old logic pointed
  // to a table and we're using divs now.
  django.jQuery("#action-toggle").on("change", function(){
    django.jQuery(".grid .box .result .checkbox input")
--- a/src/documents/tests/test_consumer.py
+++ b/src/documents/tests/test_consumer.py
@@ -1,13 +1,6 @@
 import os
 from unittest import mock, skipIf
 import pyocr
 from django.test import TestCase
 from pyocr.libtesseract.tesseract_raw import \
    TesseractError as OtherTesseractError
 from ..models import FileInfo
 from ..consumer import image_to_string, strip_excess_whitespace
 class TestAttributes(TestCase):
@@ -308,71 +301,3 @@ class TestFieldPermutations(TestCase):
                        }
                        self._test_guessed_attributes(
                            template.format(**spec), **spec)
 class FakeTesseract(object):
    @staticmethod
    def can_detect_orientation():
        return True
    @staticmethod
    def detect_orientation(file_handle, lang):
        raise OtherTesseractError("arbitrary status", "message")
    @staticmethod
    def image_to_string(file_handle, lang):
        return "This is test text"
 class FakePyOcr(object):
    @staticmethod
    def get_available_tools():
        return [FakeTesseract]
 class TestOCR(TestCase):
    text_cases = [
        ("simple     string", "simple string"),
        (
            "simple    newline\n   testing string",
            "simple newline\ntesting string"
        ),
        (
            "utf-8   строка с пробелами в конце  ",
            "utf-8 строка с пробелами в конце"
        )
    ]
    SAMPLE_FILES = os.path.join(os.path.dirname(__file__), "samples")
    TESSERACT_INSTALLED = bool(pyocr.get_available_tools())
    def test_strip_excess_whitespace(self):
        for source, result in self.text_cases:
            actual_result = strip_excess_whitespace(source)
            self.assertEqual(
                result,
                actual_result,
                "strip_exceess_whitespace({}) != '{}', but '{}'".format(
                    source,
                    result,
                    actual_result
                )
            )
    @skipIf(not TESSERACT_INSTALLED, "Tesseract not installed. Skipping")
    @mock.patch("documents.consumer.Consumer.SCRATCH", SAMPLE_FILES)
    @mock.patch("documents.consumer.pyocr", FakePyOcr)
    def test_image_to_string_with_text_free_page(self):
        """
        This test is sort of silly, since it's really just reproducing an odd
        exception thrown by pyocr when it encounters a page with no text.
        Actually running this test against an installation of Tesseract results
        in a segmentation fault rooted somewhere deep inside pyocr where I
        don't care to dig.  Regardless, if you run the consumer normally,
        text-free pages are now handled correctly so long as we work around
        this weird exception.
        """
        image_to_string(["no-text.png", "en"])
--- a/src/documents/views.py
+++ b/src/documents/views.py
@@ -2,15 +2,16 @@ from django.http import HttpResponse
 from django.views.decorators.csrf import csrf_exempt
 from django.views.generic import DetailView, FormView, TemplateView
 from django_filters.rest_framework import DjangoFilterBackend
-from rest_framework.filters import SearchFilter, OrderingFilter
 from paperless.db import GnuPG
+from paperless.mixins import SessionOrBasicAuthMixin
+from paperless.views import StandardPagination
+from rest_framework.filters import OrderingFilter, SearchFilter
 from rest_framework.mixins import (
    DestroyModelMixin,
    ListModelMixin,
    RetrieveModelMixin,
    UpdateModelMixin
 )
-from rest_framework.pagination import PageNumberPagination
 from rest_framework.permissions import IsAuthenticated
 from rest_framework.viewsets import (
    GenericViewSet,
@@ -27,7 +28,6 @@ from .serialisers import (
    LogSerializer,
    TagSerializer
 )
-from .mixins import SessionOrBasicAuthMixin
 class IndexView(TemplateView):
@@ -92,12 +92,6 @@ class PushView(SessionOrBasicAuthMixin, FormView):
        return HttpResponse("0")
-class StandardPagination(PageNumberPagination):
-    page_size = 25
-    page_size_query_param = "page-size"
-    max_page_size = 100000
 class CorrespondentViewSet(ModelViewSet):
    model = Correspondent
    queryset = Correspondent.objects.all()
--- a/src/paperless/mixins.py
+++ b/src/paperless/mixins.py
@@ -0,0 +1,46 @@
 from django.contrib.auth.mixins import AccessMixin
 from django.contrib.auth import authenticate, login
 import base64
 class SessionOrBasicAuthMixin(AccessMixin):
    """
    Session or Basic Authentication mixin for Django.
    It determines if the requester is already logged in or if they have
    provided proper http-authorization and returning the view if all goes
    well, otherwise responding with a 401.
    Base for mixin found here: https://djangosnippets.org/snippets/3073/
    """
    def dispatch(self, request, *args, **kwargs):
        # check if user is authenticated via the session
        if request.user.is_authenticated:
            # Already logged in, just return the view.
            return super(SessionOrBasicAuthMixin, self).dispatch(
                request, *args, **kwargs
            )
        # apparently not authenticated via session, maybe via HTTP Basic?
        if 'HTTP_AUTHORIZATION' in request.META:
            auth = request.META['HTTP_AUTHORIZATION'].split()
            if len(auth) == 2:
                # NOTE: Support for only basic authentication
                if auth[0].lower() == "basic":
                    authString = base64.b64decode(auth[1]).decode('utf-8')
                    uname, passwd = authString.split(':')
                    user = authenticate(username=uname, password=passwd)
                    if user is not None:
                        if user.is_active:
                            login(request, user)
                            request.user = user
                            return super(
                                SessionOrBasicAuthMixin, self
                            ).dispatch(
                                request, *args, **kwargs
                            )
        # nope, really not authenticated
        return self.handle_no_permission()
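The mixin above unpacks `authString.split(':')` into exactly two names, so a password containing a colon raises `ValueError`, and a malformed base64 payload raises inside `b64decode`. A small helper along the following lines could make the header parsing more forgiving. `parse_basic_auth` is a hypothetical name and this is only a sketch, not something this PR contains.

```python
import base64


def parse_basic_auth(header):
    """Return (username, password) from an Authorization header, or None."""
    parts = header.split()
    if len(parts) != 2 or parts[0].lower() != "basic":
        return None
    try:
        decoded = base64.b64decode(parts[1], validate=True).decode("utf-8")
    except ValueError:
        # Covers binascii.Error (bad base64) and UnicodeDecodeError,
        # both of which are ValueError subclasses.
        return None
    # Split on the first colon only: the password may itself contain colons.
    username, separator, password = decoded.partition(":")
    if not separator:
        return None
    return username, password
```

In `dispatch()`, a `None` result would fall through to `handle_no_permission()`, and a tuple would be passed on to `authenticate()`.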
--- a/src/paperless/settings.py
+++ b/src/paperless/settings.py
@@ -61,6 +61,8 @@ INSTALLED_APPS = [
    "django_extensions",
    "documents.apps.DocumentsConfig",
    "reminders.apps.RemindersConfig",
    "paperless_tesseract.apps.PaperlessTesseractConfig",
    "flat_responsive",
    "django.contrib.admin",
@@ -70,6 +72,9 @@ INSTALLED_APPS = [
 ]
 if os.getenv("PAPERLESS_INSTALLED_APPS"):
    INSTALLED_APPS += os.getenv("PAPERLESS_INSTALLED_APPS").split(",")
 MIDDLEWARE_CLASSES = [
    'django.middleware.security.SecurityMiddleware',
    'django.contrib.sessions.middleware.SessionMiddleware',
--- a/src/paperless/urls.py
+++ b/src/paperless/urls.py
@@ -24,12 +24,14 @@ from documents.views import (
    IndexView, FetchView, PushView,
    CorrespondentViewSet, TagViewSet, DocumentViewSet, LogViewSet
 )
 from reminders.views import ReminderViewSet
 router = DefaultRouter()
-router.register(r'correspondents', CorrespondentViewSet)
-router.register(r'tags', TagViewSet)
-router.register(r'documents', DocumentViewSet)
-router.register(r'logs', LogViewSet)
+router.register(r"correspondents", CorrespondentViewSet)
+router.register(r"documents", DocumentViewSet)
+router.register(r"logs", LogViewSet)
+router.register(r"reminders", ReminderViewSet)
+router.register(r"tags", TagViewSet)
 urlpatterns = [
--- a/src/paperless/version.py
+++ b/src/paperless/version.py
@@ -1 +1 @@
-__version__ = (0, 3, 5)
+__version__ = (0, 3, 6)
--- a/src/paperless/views.py
+++ b/src/paperless/views.py
@@ -0,0 +1,7 @@
 from rest_framework.pagination import PageNumberPagination
 class StandardPagination(PageNumberPagination):
    page_size = 25
    page_size_query_param = "page-size"
    max_page_size = 100000
--- a/src/paperless_tesseract/__init__.py
+++ b/src/paperless_tesseract/__init__.py
--- a/src/paperless_tesseract/apps.py
+++ b/src/paperless_tesseract/apps.py
@@ -0,0 +1,16 @@
 from django.apps import AppConfig
 class PaperlessTesseractConfig(AppConfig):
    name = "paperless_tesseract"
    def ready(self):
        from documents.signals import document_consumer_declaration
        from .signals import ConsumerDeclaration
        document_consumer_declaration.connect(ConsumerDeclaration.handle)
        AppConfig.ready(self)
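`ready()` connects `ConsumerDeclaration.handle` to `document_consumer_declaration`, and `Consumer.__init__` keeps `response[1]` from each `(receiver, response)` pair that `send()` returns. The body of `ConsumerDeclaration` does not appear in this part of the diff, so the following is a plain-Python sketch of how such a declaration could work. `MiniSignal` is a stand-in for `django.dispatch.Signal`, and the class body, the regular expression, and the weight are all assumptions.

```python
import re


class MiniSignal(object):
    """Stand-in for django.dispatch.Signal.

    Like the real class, send() returns a list of (receiver, response) pairs.
    """

    def __init__(self):
        self._receivers = []

    def connect(self, receiver):
        self._receivers.append(receiver)

    def send(self, sender, **kwargs):
        return [
            (receiver, receiver(signal=self, sender=sender, **kwargs))
            for receiver in self._receivers
        ]


class RasterisedDocumentParser(object):
    """Placeholder for the real parser class."""


class ConsumerDeclaration(object):
    # Assumption: the file extensions this parser is willing to claim.
    MATCHING_FILES = re.compile(r"^.*\.(pdf|jpe?g|png|tiff?)$")

    @classmethod
    def handle(cls, sender, **kwargs):
        # The signal response is the test callable itself.
        return cls.test

    @classmethod
    def test(cls, doc):
        if cls.MATCHING_FILES.match(doc.lower()):
            return {"parser": RasterisedDocumentParser, "weight": 0}
        return None


document_consumer_declaration = MiniSignal()
document_consumer_declaration.connect(ConsumerDeclaration.handle)

# Mirrors Consumer.__init__: keep only the response half of each pair.
parsers = [response[1] for response in document_consumer_declaration.send(None)]
```

Returning a callable from the handler keeps file matching out of the consumer: each app decides for itself which paths it claims and how strongly.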
--- a/src/paperless_tesseract/languages.py
+++ b/src/paperless_tesseract/languages.py
--- a/src/paperless_tesseract/parsers.py
+++ b/src/paperless_tesseract/parsers.py
@@ -0,0 +1,214 @@
 import itertools
 import os
 import re
 import subprocess
 from multiprocessing.pool import Pool
 import langdetect
 import pyocr
 from django.conf import settings
 from documents.parsers import DocumentParser, ParseError
 from PIL import Image
 from pyocr.libtesseract.tesseract_raw import \
    TesseractError as OtherTesseractError
 from pyocr.tesseract import TesseractError
 from .languages import ISO639
 class OCRError(Exception):
    pass
 class RasterisedDocumentParser(DocumentParser):
    """
    This parser uses Tesseract to try and get some text out of a rasterised
    image, whether it's a PDF, or other graphical format (JPEG, TIFF, etc.)
    """
    CONVERT = settings.CONVERT_BINARY
    DENSITY = settings.CONVERT_DENSITY if settings.CONVERT_DENSITY else 300
    THREADS = int(settings.OCR_THREADS) if settings.OCR_THREADS else None
    UNPAPER = settings.UNPAPER_BINARY
    DEFAULT_OCR_LANGUAGE = settings.OCR_LANGUAGE
    def get_thumbnail(self):
        """
        The thumbnail of a PDF is just a 500px wide image of the first page.
        """
        run_convert(
            self.CONVERT,
            "-scale", "500x5000",
            "-alpha", "remove",
            self.document_path, os.path.join(self.tempdir, "convert-%04d.png")
        )
        return os.path.join(self.tempdir, "convert-0000.png")
    def get_text(self):
        images = self._get_greyscale()
        try:
            return self._get_ocr(images)
        except OCRError as e:
            raise ParseError(e)
    def _get_greyscale(self):
        """
        Greyscale images are easier for Tesseract to OCR
        """
        # Convert PDF to multiple PNMs
        pnm = os.path.join(self.tempdir, "convert-%04d.pnm")
        run_convert(
            self.CONVERT,
            "-density", str(self.DENSITY),
            "-depth", "8",
            "-type", "grayscale",
            self.document_path, pnm,
        )
        # Get a list of converted images
        pnms = []
        for f in os.listdir(self.tempdir):
            if f.endswith(".pnm"):
                pnms.append(os.path.join(self.tempdir, f))
        # Run unpaper in parallel on converted images
        with Pool(processes=self.THREADS) as pool:
            pool.map(run_unpaper, itertools.product([self.UNPAPER], pnms))
        # Return list of converted images, processed with unpaper
        pnms = []
        for f in os.listdir(self.tempdir):
            if f.endswith(".unpaper.pnm"):
                pnms.append(os.path.join(self.tempdir, f))
        return sorted(filter(lambda __: os.path.isfile(__), pnms))
    def _guess_language(self, text):
        try:
            guess = langdetect.detect(text)
            self.log("debug", "Language detected: {}".format(guess))
            return guess
        except Exception as e:
            self.log("warning", "Language detection error: {}".format(e))
    def _get_ocr(self, imgs):
        """
        Attempts to do the best job possible OCR'ing the document based on
        simple language detection trial & error.
        """
        if not imgs:
            raise OCRError("No images found")
        self.log("info", "OCRing the document")
        # Since the division gets rounded down by int, this calculation works
        # for every edge-case, i.e. 1
        middle = int(len(imgs) / 2)
        raw_text = self._ocr([imgs[middle]], self.DEFAULT_OCR_LANGUAGE)
        guessed_language = self._guess_language(raw_text)
        if not guessed_language or guessed_language not in ISO639:
            self.log("warning", "Language detection failed!")
            if settings.FORGIVING_OCR:
                self.log(
                    "warning",
                    "As FORGIVING_OCR is enabled, we're going to make the "
                    "best with what we have."
                )
                raw_text = self._assemble_ocr_sections(imgs, middle, raw_text)
                return raw_text
            raise OCRError("Language detection failed")
        if ISO639[guessed_language] == self.DEFAULT_OCR_LANGUAGE:
            raw_text = self._assemble_ocr_sections(imgs, middle, raw_text)
            return raw_text
        try:
            return self._ocr(imgs, ISO639[guessed_language])
        except pyocr.pyocr.tesseract.TesseractError:
            if settings.FORGIVING_OCR:
                self.log(
                    "warning",
                    "OCR for {} failed, but we're going to stick with what "
                    "we've got since FORGIVING_OCR is enabled.".format(
                        guessed_language
                    )
                )
                raw_text = self._assemble_ocr_sections(imgs, middle, raw_text)
                return raw_text
            raise OCRError(
                "The guessed language is not available in this instance of "
                "Tesseract."
            )
    def _ocr(self, imgs, lang):
        """
        Performs a single OCR attempt.
        """
        if not imgs:
            return ""
        self.log("info", "Parsing for {}".format(lang))
        with Pool(processes=self.THREADS) as pool:
            r = pool.map(image_to_string, itertools.product(imgs, [lang]))
            r = " ".join(r)
        # Strip out excess white space to allow matching to go smoother
        return strip_excess_whitespace(r)
    def _assemble_ocr_sections(self, imgs, middle, text):
        """
        Given a `middle` value and the text that middle page represents, we OCR
        the remainder of the document and return the whole thing.
        """
        text = self._ocr(imgs[:middle], self.DEFAULT_OCR_LANGUAGE) + text
        text += self._ocr(imgs[middle + 1:], self.DEFAULT_OCR_LANGUAGE)
        return text
 def run_convert(*args):
    environment = os.environ.copy()
    if settings.CONVERT_MEMORY_LIMIT:
        environment["MAGICK_MEMORY_LIMIT"] = settings.CONVERT_MEMORY_LIMIT
    if settings.CONVERT_TMPDIR:
        environment["MAGICK_TMPDIR"] = settings.CONVERT_TMPDIR
    subprocess.Popen(args, env=environment).wait()
 def run_unpaper(args):
    unpaper, pnm = args
    subprocess.Popen(
        (unpaper, pnm, pnm.replace(".pnm", ".unpaper.pnm"))).wait()
 def strip_excess_whitespace(text):
    collapsed_spaces = re.sub(r"([^\S\r\n]+)", " ", text)
    no_leading_whitespace = re.sub(
        "([\n\r]+)([^\S\n\r]+)", '\\1', collapsed_spaces)
    no_trailing_whitespace = re.sub("([^\S\n\r]+)$", '', no_leading_whitespace)
    return no_trailing_whitespace
 def image_to_string(args):
    img, lang = args
    ocr = pyocr.get_available_tools()[0]
    with Image.open(os.path.join(RasterisedDocumentParser.SCRATCH, img)) as f:
        if ocr.can_detect_orientation():
            try:
                orientation = ocr.detect_orientation(f, lang=lang)
                f = f.rotate(orientation["angle"], expand=1)
            except (TesseractError, OtherTesseractError):
                pass
        return ocr.image_to_string(f, lang=lang)
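The comment in `_get_ocr` about the rounded-down division is easier to check with concrete numbers. The following is a minimal sketch, using a hypothetical helper `split_around_middle` (not part of this commit) that mirrors the slicing done in `_get_ocr` and `_assemble_ocr_sections`:

```python
def split_around_middle(imgs):
    """Return (pages before, sample page, pages after) for a non-empty list."""
    # Same calculation as _get_ocr: int() truncates, so one page gives index 0
    middle = int(len(imgs) / 2)
    return imgs[:middle], imgs[middle], imgs[middle + 1:]


for pages in (["p0"], ["p0", "p1"], ["p0", "p1", "p2", "p3", "p4"]):
    before, sample, after = split_around_middle(pages)
    print(len(pages), before, sample, after)
```

For a single page both slices are empty. Since `_ocr` returns an empty string for an empty list, the text of the sample page comes back unchanged in that case.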
--- a/src/paperless_tesseract/signals.py
+++ b/src/paperless_tesseract/signals.py
@@ -0,0 +1,23 @@
 import re
 from .parsers import RasterisedDocumentParser
 class ConsumerDeclaration(object):
    MATCHING_FILES = re.compile("^.*\.(pdf|jpg|gif|png|tiff|pnm|bmp)$")
    @classmethod
    def handle(cls, sender, **kwargs):
        return cls.test
    @classmethod
    def test(cls, doc):
        if cls.MATCHING_FILES.match(doc):
            return {
                "parser": RasterisedDocumentParser,
                "weight": 0
            }
        return None
--- a/src/paperless_tesseract/tests/__init__.py
+++ b/src/paperless_tesseract/tests/__init__.py
--- a/src/paperless_tesseract/tests/samples/no-text.png
+++ b/src/paperless_tesseract/tests/samples/no-text.png
--- a/src/paperless_tesseract/tests/test_ocr.py
+++ b/src/paperless_tesseract/tests/test_ocr.py
@@ -0,0 +1,80 @@
 import os
 from unittest import mock, skipIf
 import pyocr
 from django.test import TestCase
 from pyocr.libtesseract.tesseract_raw import \
    TesseractError as OtherTesseractError
 from ..parsers import image_to_string, strip_excess_whitespace
 class FakeTesseract(object):
    @staticmethod
    def can_detect_orientation():
        return True
    @staticmethod
    def detect_orientation(file_handle, lang):
        raise OtherTesseractError("arbitrary status", "message")
    @staticmethod
    def image_to_string(file_handle, lang):
        return "This is test text"
 class FakePyOcr(object):
    @staticmethod
    def get_available_tools():
        return [FakeTesseract]
 class TestOCR(TestCase):
    text_cases = [
        ("simple     string", "simple string"),
        (
            "simple    newline\n   testing string",
            "simple newline\ntesting string"
        ),
        (
            "utf-8   строка с пробелами в конце  ",
            "utf-8 строка с пробелами в конце"
        )
    ]
    SAMPLE_FILES = os.path.join(os.path.dirname(__file__), "samples")
    TESSERACT_INSTALLED = bool(pyocr.get_available_tools())
    def test_strip_excess_whitespace(self):
        for source, result in self.text_cases:
            actual_result = strip_excess_whitespace(source)
            self.assertEqual(
                result,
                 actual_result,
                 "strip_excess_whitespace({}) != '{}', but '{}'".format(
                    source,
                    result,
                    actual_result
                )
            )
    @skipIf(not TESSERACT_INSTALLED, "Tesseract not installed. Skipping")
    @mock.patch(
        "paperless_tesseract.parsers.RasterisedDocumentParser.SCRATCH",
        SAMPLE_FILES
    )
    @mock.patch("paperless_tesseract.parsers.pyocr", FakePyOcr)
    def test_image_to_string_with_text_free_page(self):
        """
        This test is sort of silly, since it's really just reproducing an odd
        exception thrown by pyocr when it encounters a page with no text.
        Actually running this test against an installation of Tesseract results
        in a segmentation fault rooted somewhere deep inside pyocr where I
        don't care to dig.  Regardless, if you run the consumer normally,
        text-free pages are now handled correctly so long as we work around
        this weird exception.
        """
        image_to_string(["no-text.png", "en"])
--- a/src/reminders/__init__.py
+++ b/src/reminders/__init__.py
--- a/src/reminders/admin.py
+++ b/src/reminders/admin.py
@@ -0,0 +1,20 @@
 from django.conf import settings
 from django.contrib import admin
 from .models import Reminder
 class ReminderAdmin(admin.ModelAdmin):
    class Media:
        css = {
            "all": ("paperless.css",)
        }
    list_per_page = settings.PAPERLESS_LIST_PER_PAGE
    list_display = ("date", "document", "note")
    list_filter = ("date",)
    list_editable = ("note",)
 admin.site.register(Reminder, ReminderAdmin)
--- a/src/reminders/apps.py
+++ b/src/reminders/apps.py
@@ -0,0 +1,5 @@
 from django.apps import AppConfig
 class RemindersConfig(AppConfig):
    name = "reminders"
--- a/src/reminders/filters.py
+++ b/src/reminders/filters.py
@@ -0,0 +1,14 @@
 from django_filters.rest_framework import CharFilter, FilterSet
 from .models import Reminder
 class ReminderFilterSet(FilterSet):
    class Meta(object):
        model = Reminder
        fields = {
            "document": ["exact"],
            "date": ["gt", "lt", "gte", "lte", "exact"],
            "note": ["istartswith", "iendswith", "icontains"]
        }
--- a/src/reminders/migrations/0001_initial.py
+++ b/src/reminders/migrations/0001_initial.py
@@ -0,0 +1,27 @@
 # -*- coding: utf-8 -*-
 # Generated by Django 1.10.5 on 2017-03-25 15:58
 from __future__ import unicode_literals
 from django.db import migrations, models
 import django.db.models.deletion
 class Migration(migrations.Migration):
    initial = True
    dependencies = [
        ('documents', '0016_auto_20170325_1558'),
    ]
    operations = [
        migrations.CreateModel(
            name='Reminder',
            fields=[
                ('id', models.AutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')),
                ('date', models.DateTimeField()),
                ('note', models.TextField(blank=True)),
                ('document', models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, to='documents.Document')),
            ],
        ),
    ]
--- a/src/reminders/migrations/__init__.py
+++ b/src/reminders/migrations/__init__.py
--- a/src/reminders/models.py
+++ b/src/reminders/models.py
@@ -0,0 +1,8 @@
 from django.db import models
 class Reminder(models.Model):
    document = models.ForeignKey("documents.Document")
    date = models.DateTimeField()
    note = models.TextField(blank=True)
--- a/src/reminders/serialisers.py
+++ b/src/reminders/serialisers.py
@@ -0,0 +1,14 @@
 from documents.models import Document
 from rest_framework import serializers
 from .models import Reminder
 class ReminderSerializer(serializers.HyperlinkedModelSerializer):
    document = serializers.HyperlinkedRelatedField(
        view_name="drf:document-detail", queryset=Document.objects)
    class Meta(object):
        model = Reminder
        fields = ("id", "document", "date", "note")
--- a/src/reminders/tests.py
+++ b/src/reminders/tests.py
@@ -0,0 +1,3 @@
 from django.test import TestCase
 # Create your tests here.
--- a/src/reminders/views.py
+++ b/src/reminders/views.py
@@ -0,0 +1,22 @@
 from django_filters.rest_framework import DjangoFilterBackend
 from rest_framework.filters import OrderingFilter
 from rest_framework.permissions import IsAuthenticated
 from rest_framework.viewsets import (
    ModelViewSet,
 )
 from .filters import ReminderFilterSet
 from .models import Reminder
 from .serialisers import ReminderSerializer
 from paperless.views import StandardPagination
 class ReminderViewSet(ModelViewSet):
    model = Reminder
    queryset = Reminder.objects
    serializer_class = ReminderSerializer
    pagination_class = StandardPagination
    permission_classes = (IsAuthenticated,)
    filter_backends = (DjangoFilterBackend, OrderingFilter)
    filter_class = ReminderFilterSet
    ordering_fields = ("date", "document")
--- a/src/tox.ini
+++ b/src/tox.ini
@@ -5,7 +5,7 @@
 [tox]
 skipsdist = True
-envlist = py34, py35, pep8
+envlist = py34, py35, py36, pep8
 [testenv]
 commands = {envpython} manage.py test
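The `ReminderFilterSet`, `OrderingFilter` and `StandardPagination` pieces above together determine which query parameters the new endpoint accepts. Here is a rough sketch of building such a request URL. It assumes the router is mounted under `/api/` on a local development server (the `urlpatterns` entry is cut off in this diff), and that django-filter's usual `field__lookup` naming and DRF's default `ordering` parameter apply. `BASE_URL` and `reminders_url` are illustrative names only.

```python
from urllib.parse import urlencode

# Assumption: local dev server with the DRF router mounted at /api/
BASE_URL = "http://localhost:8000/api/reminders/"


def reminders_url(**params):
    """Build a query URL; keys are sorted so the result is deterministic."""
    return "{}?{}".format(BASE_URL, urlencode(sorted(params.items())))


url = reminders_url(**{
    "date__gte": "2017-04-01T00:00:00Z",  # from the "date" lookups
    "note__icontains": "tax",             # from the "note" lookups
    "ordering": "date",                   # one of ordering_fields
    "page-size": 50,                      # page_size_query_param
})
print(url)
```

Because the view set uses `IsAuthenticated`, a request to this URL needs credentials. How they are supplied is not shown in this diff.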
Author	SHA1	Message	Date
Daniel Quinn	5b88ebf0e7	Merge pull request #203 from danielquinn/feature/reminders Feature: Reminders	2017-03-25 16:27:28 +00:00
Daniel Quinn	a0edc7d54d	chore: update the changelog for reminders	2017-03-25 16:22:04 +00:00
Daniel Quinn	b876a0d0df	feat: add the new reminders app	2017-03-25 16:21:46 +00:00
Daniel Quinn	27db4f7e51	refactor: code cleanup I hate single quotes.	2017-03-25 16:20:59 +00:00
Daniel Quinn	426919fa9f	refactor: break document-only stuff into the paperless app The `SessionOrBasicAuthMixin` and `StandardPagination` classes were living in the documents app and I needed them in the new `reminders` app, so this commit breaks them out of `documents` and puts them in the central `paperless` app instead.	2017-03-25 16:18:34 +00:00
Daniel Quinn	e47c152b81	feat: migration for changes in 0.3.6	2017-03-25 16:01:59 +00:00
Daniel Quinn	b7cb708053	Merge pull request #197 from danielquinn/pluggable-consumers Pluggable consumers	2017-03-25 15:20:48 +00:00
Daniel Quinn	7611c2b3d5	fix: pep8 + travis & tox env updates	2017-03-25 15:10:51 +00:00
Daniel Quinn	5f964830aa	version bump	2017-03-25 15:10:51 +00:00
Daniel Quinn	7ec4f906af	feat: make the content field optional	2017-03-25 15:10:25 +00:00
Daniel Quinn	b5f6c06b8b	fix: a little cleanup	2017-03-25 15:10:25 +00:00
Daniel Quinn	55e81ca4bb	feat: refactor for pluggable consumers I've broken out the OCR-specific code from the consumers and dumped it all into its own app, `paperless_tesseract`. This new app should serve as a sample of how to create one's own consumer for different file types. Documentation for how to do this isn't ready yet, but for the impatient: * Create a new app * containing a `parsers.py` for your parser modelled after `paperless_tesseract.parsers.RasterisedDocumentParser` * containing a `signals.py` with a handler modelled after `paperless_tesseract.signals.ConsumerDeclaration` * connect the signal handler to `documents.signals.document_consumer_declaration` in `your_app.apps` * Install the app into Paperless by declaring `PAPERLESS_INSTALLED_APPS=your_app`. Additional apps should be separated with commas. * Restart the consumer	2017-03-25 15:10:25 +00:00
Daniel Quinn	0f7bfc547a	Merge pull request #202 from danielquinn/fix/api-should-allow-writes Fix/api should allow writes	2017-03-25 15:08:40 +00:00
Daniel Quinn	9525725c28	chore: update the changelog	2017-03-25 15:07:58 +00:00
Daniel Quinn	2a2196fa4d	fix: #200 allow edits of correspondent & tags	2017-03-25 15:01:01 +00:00
Daniel Quinn	237efbcaa0	Merge branch 'master' of github.com:danielquinn/paperless	2017-03-05 12:15:22 +00:00
Daniel Quinn	351cd06ef7	Disable adding through the admin	2017-03-05 12:15:18 +00:00
Daniel Quinn	8b37160953	Merge pull request #194 from philippeowagner/master Better alt-text for thumbnails.	2017-03-01 09:12:10 +00:00
Philippe O. Wagner	db64478d9f	Better alt-text for thumbnails.	2017-03-01 00:50:53 +01:00
Daniel Quinn	8bc2dfe4c6	Django migrations doesn't account for PostgreSQL completely This was a weird bug to run into. Basically I changed a CharField into a ForeignKey field and ran `makemigrations` to get the job done. However, rather than doing a `RemoveField` and an `AddField`, migrations created a single `AlterField` which worked just fine in SQLite, but blew up in PostgreSQL with: psycopg2.ProgrammingError: operator class "varchar_pattern_ops" does not accept data type integer The fix was to rewrite the single migration into the two separate steps.	2017-02-18 17:55:52 +00:00
Daniel Quinn	3a427c9130	Allow for MariaDB/MySQL MariaDB/MySQL doesn't handle indexes on TextFields well and for some reason, Django's migrations opts to blow up rather than handle this in a more user-friendly way. The fix here isn't ideal, but should be sufficient should anyone try to use Paperless with MySQL.	2017-02-18 17:53:43 +00:00
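The steps listed in commit 55e81ca4bb can be sketched roughly as follows. This is only an illustration modelled on `paperless_tesseract.signals.ConsumerDeclaration`. `PlainTextParser` and `PlainTextConsumerDeclaration` are hypothetical names, and the parser here is a plain stand-in so the snippet runs without Django. A real parser would subclass `documents.parsers.DocumentParser`.

```python
import re


class PlainTextParser(object):
    """Hypothetical stand-in for a DocumentParser subclass."""

    def __init__(self, path):
        self.document_path = path

    def get_text(self):
        with open(self.document_path, encoding="utf-8") as f:
            return f.read()


class PlainTextConsumerDeclaration(object):

    # File types this hypothetical consumer claims
    MATCHING_FILES = re.compile(r"^.*\.(txt|md)$")

    @classmethod
    def handle(cls, sender, **kwargs):
        # As in the Tesseract app, the handler returns the test callable
        return cls.test

    @classmethod
    def test(cls, doc):
        if cls.MATCHING_FILES.match(doc):
            # Weight copied from the Tesseract declaration; how weights
            # are compared is not shown in this diff
            return {"parser": PlainTextParser, "weight": 0}
        return None


print(PlainTextConsumerDeclaration.test("notes.txt"))
print(PlainTextConsumerDeclaration.test("scan.pdf"))
```

In a real app, `handle` would be connected to `documents.signals.document_consumer_declaration` from the app config's `ready()` method, as `paperless_tesseract/apps.py` does. The app would then be enabled with `PAPERLESS_INSTALLED_APPS=your_app`.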