Mirror of https://github.com/paperless-ngx/paperless-ngx.git, synced 2025-12-22 19:11:22 +00:00
Compare commits
16 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| | 2ef2bf873e | | |
| | 0bb7d27269 | | |
| | ce5e8b2658 | | |
| | 3f572afb8b | | |
| | 5c3cb1e4ab | | |
| | c7f4bfe4f3 | | |
| | 65d6599964 | | |
| | 5d32e89c44 | | |
| | 750ab5bf85 | | |
| | 2a3f766b93 | | |
| | 14bb52b6a4 | | |
| | b5176d207e | | |
| | e4044d0df9 | | |
| | bacdd51fd7 | | |
| | 8010d72f18 | | |
| | 9dd76f1b87 | | |
@@ -2,7 +2,7 @@ language: python
|
|||||||
|
|
||||||
before_install:
|
before_install:
|
||||||
- sudo apt-get update -qq
|
- sudo apt-get update -qq
|
||||||
- sudo apt-get install -qq libpoppler-cpp-dev unpaper tesseract-ocr tesseract-ocr-eng
|
- sudo apt-get install -qq libpoppler-cpp-dev unpaper tesseract-ocr tesseract-ocr-eng tesseract-ocr-cat
|
||||||
|
|
||||||
sudo: false
|
sudo: false
|
||||||
|
|
||||||
|
|||||||
11
Dockerfile
@@ -1,4 +1,4 @@
|
|||||||
FROM alpine:3.7
|
FROM alpine:3.8
|
||||||
|
|
||||||
LABEL maintainer="The Paperless Project https://github.com/danielquinn/paperless" \
|
LABEL maintainer="The Paperless Project https://github.com/danielquinn/paperless" \
|
||||||
contributors="Guy Addadi <addadi@gmail.com>, Pit Kleyersburg <pitkley@googlemail.com>, \
|
contributors="Guy Addadi <addadi@gmail.com>, Pit Kleyersburg <pitkley@googlemail.com>, \
|
||||||
@@ -12,11 +12,10 @@ COPY scripts/docker-entrypoint.sh /sbin/docker-entrypoint.sh
|
|||||||
ENV PAPERLESS_EXPORT_DIR=/export \
|
ENV PAPERLESS_EXPORT_DIR=/export \
|
||||||
PAPERLESS_CONSUMPTION_DIR=/consume
|
PAPERLESS_CONSUMPTION_DIR=/consume
|
||||||
|
|
||||||
# Install dependencies
|
|
||||||
RUN apk --no-cache --update add \
|
RUN apk update --no-cache && apk add python3 gnupg libmagic bash shadow curl \
|
||||||
python3 gnupg libmagic bash shadow curl \
|
sudo poppler tesseract-ocr imagemagick ghostscript unpaper optipng && \
|
||||||
sudo poppler tesseract-ocr imagemagick ghostscript unpaper && \
|
apk add --virtual .build-dependencies \
|
||||||
apk --no-cache add --virtual .build-dependencies \
|
|
||||||
python3-dev poppler-dev gcc g++ musl-dev zlib-dev jpeg-dev && \
|
python3-dev poppler-dev gcc g++ musl-dev zlib-dev jpeg-dev && \
|
||||||
# Install python dependencies
|
# Install python dependencies
|
||||||
python3 -m ensurepip && \
|
python3 -m ensurepip && \
|
||||||
|
|||||||
@@ -1,6 +1,49 @@
|
|||||||
Changelog
|
Changelog
|
||||||
#########
|
#########
|
||||||
|
|
||||||
|
2.5.0
|
||||||
|
=====
|
||||||
|
|
||||||
|
* **New dependency**: Paperless now optimises thumbnail generation with
|
||||||
|
`optipng`_, so you'll need to install that somewhere in your PATH or declare
|
||||||
|
its location in ``PAPERLESS_OPTIPNG_BINARY``. The Docker image has already
|
||||||
|
been updated on the Docker Hub, so you just need to pull the latest one from
|
||||||
|
there if you're a Docker user.
|
||||||
|
|
||||||
|
* "Login free" instances of Paperless were breaking whenever you tried to edit
|
||||||
|
objects in the admin: adding/deleting tags or correspondents, or even fixing
|
||||||
|
spelling. This was due to the "user hack" we were applying to sessions that
|
||||||
|
weren't using a login, as that hack user didn't have a valid id. The fix was
|
||||||
|
to attribute the first user id in the system to this hack user. `#394`_
|
||||||
|
|
||||||
|
* A problem in how we handle slug values on Tags and Correspondents required a
|
||||||
|
few changes to how we handle this field `#393`_:
|
||||||
|
|
||||||
|
1. Slugs are no longer editable. They're derived from the name of the tag or
|
||||||
|
correspondent at save time, so if you want to change the slug, you have to
|
||||||
|
change the name, and even then you're restricted to the rules of the
|
||||||
|
``slugify()`` function. The slug value is still visible in the admin
|
||||||
|
though.
|
||||||
|
2. I've added a migration to go over all existing tags & correspondents and
|
||||||
|
rewrite the ``.slug`` values to ones conforming to the ``slugify()``
|
||||||
|
rules.
|
||||||
|
3. The consumption process now uses the same rules as ``.save()`` in
|
||||||
|
determining a slug and using that to check for an existing
|
||||||
|
tag/correspondent.
|
||||||
|
|
||||||
|
* An annoying bug in the date capture code was causing some bogus dates to be
|
||||||
|
attached to documents, which in turn busted the UI. Thanks to `Andrew Peng`_
|
||||||
|
for reporting this. `#414`_.
|
||||||
|
|
||||||
|
* A bug in the Dockerfile meant that Tesseract language files weren't being
|
||||||
|
installed correctly. `euri10`_ was quick to provide a fix: `#406`_, `#413`_.
|
||||||
|
|
||||||
|
* Document consumption is now wrapped in a transaction as per an old ticket
|
||||||
|
`#262`_.
|
||||||
|
|
||||||
|
* The ``get_date()`` functionality of the parsers has been consolidated onto
|
||||||
|
the ``DocumentParser`` class since much of that code was redundant anyway.
|
||||||
|
|
||||||
2.4.0
|
2.4.0
|
||||||
=====
|
=====
|
||||||
|
|
||||||
@@ -525,6 +568,8 @@ bulk of the work on this big change.
|
|||||||
.. _ahyear: https://github.com/ahyear
|
.. _ahyear: https://github.com/ahyear
|
||||||
.. _jonaswinkler: https://github.com/jonaswinkler
|
.. _jonaswinkler: https://github.com/jonaswinkler
|
||||||
.. _thepill: https://github.com/thepill
|
.. _thepill: https://github.com/thepill
|
||||||
|
.. _Andrew Peng: https://github.com/pengc99
|
||||||
|
.. _euri10: https://github.com/euri10
|
||||||
|
|
||||||
.. _#20: https://github.com/danielquinn/paperless/issues/20
|
.. _#20: https://github.com/danielquinn/paperless/issues/20
|
||||||
.. _#44: https://github.com/danielquinn/paperless/issues/44
|
.. _#44: https://github.com/danielquinn/paperless/issues/44
|
||||||
@@ -590,6 +635,7 @@ bulk of the work on this big change.
|
|||||||
.. _#322: https://github.com/danielquinn/paperless/pull/322
|
.. _#322: https://github.com/danielquinn/paperless/pull/322
|
||||||
.. _#328: https://github.com/danielquinn/paperless/pull/328
|
.. _#328: https://github.com/danielquinn/paperless/pull/328
|
||||||
.. _#253: https://github.com/danielquinn/paperless/issues/253
|
.. _#253: https://github.com/danielquinn/paperless/issues/253
|
||||||
|
.. _#262: https://github.com/danielquinn/paperless/issues/262
|
||||||
.. _#323: https://github.com/danielquinn/paperless/issues/323
|
.. _#323: https://github.com/danielquinn/paperless/issues/323
|
||||||
.. _#344: https://github.com/danielquinn/paperless/pull/344
|
.. _#344: https://github.com/danielquinn/paperless/pull/344
|
||||||
.. _#351: https://github.com/danielquinn/paperless/pull/351
|
.. _#351: https://github.com/danielquinn/paperless/pull/351
|
||||||
@@ -606,13 +652,19 @@ bulk of the work on this big change.
|
|||||||
.. _#391: https://github.com/danielquinn/paperless/pull/391
|
.. _#391: https://github.com/danielquinn/paperless/pull/391
|
||||||
.. _#390: https://github.com/danielquinn/paperless/pull/390
|
.. _#390: https://github.com/danielquinn/paperless/pull/390
|
||||||
.. _#392: https://github.com/danielquinn/paperless/issues/392
|
.. _#392: https://github.com/danielquinn/paperless/issues/392
|
||||||
|
.. _#393: https://github.com/danielquinn/paperless/issues/393
|
||||||
.. _#395: https://github.com/danielquinn/paperless/pull/395
|
.. _#395: https://github.com/danielquinn/paperless/pull/395
|
||||||
|
.. _#394: https://github.com/danielquinn/paperless/issues/394
|
||||||
.. _#396: https://github.com/danielquinn/paperless/pull/396
|
.. _#396: https://github.com/danielquinn/paperless/pull/396
|
||||||
.. _#399: https://github.com/danielquinn/paperless/pull/399
|
.. _#399: https://github.com/danielquinn/paperless/pull/399
|
||||||
.. _#400: https://github.com/danielquinn/paperless/pull/400
|
.. _#400: https://github.com/danielquinn/paperless/pull/400
|
||||||
.. _#401: https://github.com/danielquinn/paperless/pull/401
|
.. _#401: https://github.com/danielquinn/paperless/pull/401
|
||||||
.. _#405: https://github.com/danielquinn/paperless/pull/405
|
.. _#405: https://github.com/danielquinn/paperless/pull/405
|
||||||
|
.. _#406: https://github.com/danielquinn/paperless/issues/406
|
||||||
.. _#412: https://github.com/danielquinn/paperless/issues/412
|
.. _#412: https://github.com/danielquinn/paperless/issues/412
|
||||||
|
.. _#413: https://github.com/danielquinn/paperless/pull/413
|
||||||
|
.. _#414: https://github.com/danielquinn/paperless/issues/414
|
||||||
|
|
||||||
.. _pipenv: https://docs.pipenv.org/
|
.. _pipenv: https://docs.pipenv.org/
|
||||||
.. _a new home on Docker Hub: https://hub.docker.com/r/danielquinn/paperless/
|
.. _a new home on Docker Hub: https://hub.docker.com/r/danielquinn/paperless/
|
||||||
|
.. _optipng: http://optipng.sourceforge.net/
|
||||||
|
|||||||
@@ -213,3 +213,23 @@ PAPERLESS_DEBUG="false"
|
|||||||
# The number of years for which a correspondent will be included in the recent
|
# The number of years for which a correspondent will be included in the recent
|
||||||
# correspondents filter.
|
# correspondents filter.
|
||||||
#PAPERLESS_RECENT_CORRESPONDENT_YEARS=1
|
#PAPERLESS_RECENT_CORRESPONDENT_YEARS=1
|
||||||
|
|
||||||
|
###############################################################################
|
||||||
|
#### Third-Party Binaries ####
|
||||||
|
###############################################################################
|
||||||
|
|
||||||
|
# There are a few external software packages that Paperless expects to find on
|
||||||
|
# your system when it starts up. Unless you've done something creative with
|
||||||
|
# their installation, you probably won't need to edit any of these. However,
|
||||||
|
# if you've installed these programs somewhere where simply typing the name of
|
||||||
|
# the program doesn't automatically execute it (i.e. the program isn't in your
|
||||||
|
# $PATH), then you'll need to specify the literal path for that program here.
|
||||||
|
|
||||||
|
# Convert (part of the ImageMagick suite)
|
||||||
|
#PAPERLESS_CONVERT_BINARY=/usr/bin/convert
|
||||||
|
|
||||||
|
# Unpaper
|
||||||
|
#PAPERLESS_UNPAPER_BINARY=/usr/bin/unpaper
|
||||||
|
|
||||||
|
# Optipng (for optimising thumbnail sizes)
|
||||||
|
#PAPERLESS_OPTIPNG_BINARY=/usr/bin/optipng
|
||||||
|
|||||||
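The comments added to the example configuration describe how Paperless locates external programs: by bare name on `$PATH`, unless an explicit location is configured. A minimal sketch of that lookup follows, assuming a hypothetical helper `resolve_binary` (not part of Paperless) and using only the standard library; the variable name `PAPERLESS_OPTIPNG_BINARY` and the `optipng` default are taken from this change.

```python
import os
import shutil


def resolve_binary(env_var, default, environ=None):
    """Return (configured_name, resolved_path_or_None) for an external program.

    Hypothetical helper for illustration only; Paperless itself reads the
    variable with os.getenv() in its settings module.
    """
    environ = os.environ if environ is None else environ
    # Fall back to the bare program name when the variable is unset.
    configured = environ.get(env_var, default)
    # shutil.which() searches $PATH for bare names and also accepts
    # explicit paths, returning None when nothing executable is found.
    return configured, shutil.which(configured)


if __name__ == "__main__":
    name, path = resolve_binary("PAPERLESS_OPTIPNG_BINARY", "optipng")
    print(name, "->", path or "not found; set PAPERLESS_OPTIPNG_BINARY")
```

If the second element is `None`, the program is either not installed or not reachable through `$PATH`, which is the situation the configuration comments ask you to fix by setting the literal path.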
@@ -125,6 +125,8 @@ class CorrespondentAdmin(CommonAdmin):
|
|||||||
list_filter = ("matching_algorithm",)
|
list_filter = ("matching_algorithm",)
|
||||||
list_editable = ("match", "matching_algorithm")
|
list_editable = ("match", "matching_algorithm")
|
||||||
|
|
||||||
|
readonly_fields = ("slug",)
|
||||||
|
|
||||||
def get_queryset(self, request):
|
def get_queryset(self, request):
|
||||||
qs = super(CorrespondentAdmin, self).get_queryset(request)
|
qs = super(CorrespondentAdmin, self).get_queryset(request)
|
||||||
qs = qs.annotate(
|
qs = qs.annotate(
|
||||||
@@ -149,6 +151,8 @@ class TagAdmin(CommonAdmin):
|
|||||||
list_filter = ("colour", "matching_algorithm")
|
list_filter = ("colour", "matching_algorithm")
|
||||||
list_editable = ("colour", "match", "matching_algorithm")
|
list_editable = ("colour", "match", "matching_algorithm")
|
||||||
|
|
||||||
|
readonly_fields = ("slug",)
|
||||||
|
|
||||||
def get_queryset(self, request):
|
def get_queryset(self, request):
|
||||||
qs = super(TagAdmin, self).get_queryset(request)
|
qs = super(TagAdmin, self).get_queryset(request)
|
||||||
qs = qs.annotate(document_count=models.Count("documents"))
|
qs = qs.annotate(document_count=models.Count("documents"))
|
||||||
@@ -167,7 +171,7 @@ class DocumentAdmin(CommonAdmin):
|
|||||||
}
|
}
|
||||||
|
|
||||||
search_fields = ("correspondent__name", "title", "content", "tags__name")
|
search_fields = ("correspondent__name", "title", "content", "tags__name")
|
||||||
readonly_fields = ("added",)
|
readonly_fields = ("added", "file_type", "storage_type",)
|
||||||
list_display = ("title", "created", "added", "thumbnail", "correspondent",
|
list_display = ("title", "created", "added", "thumbnail", "correspondent",
|
||||||
"tags_")
|
"tags_")
|
||||||
list_filter = (
|
list_filter = (
|
||||||
|
|||||||
@@ -1,3 +1,4 @@
|
|||||||
|
from django.db import transaction
|
||||||
import datetime
|
import datetime
|
||||||
import hashlib
|
import hashlib
|
||||||
import logging
|
import logging
|
||||||
@@ -111,8 +112,11 @@ class Consumer:
|
|||||||
if not self.try_consume_file(file):
|
if not self.try_consume_file(file):
|
||||||
self._ignore.append((file, mtime))
|
self._ignore.append((file, mtime))
|
||||||
|
|
||||||
|
@transaction.atomic
|
||||||
def try_consume_file(self, file):
|
def try_consume_file(self, file):
|
||||||
"Return True if file was consumed"
|
"""
|
||||||
|
Return True if file was consumed
|
||||||
|
"""
|
||||||
|
|
||||||
if not re.match(FileInfo.REGEXES["title"], file):
|
if not re.match(FileInfo.REGEXES["title"], file):
|
||||||
return False
|
return False
|
||||||
@@ -145,7 +149,7 @@ class Consumer:
|
|||||||
parsed_document = parser_class(doc)
|
parsed_document = parser_class(doc)
|
||||||
|
|
||||||
try:
|
try:
|
||||||
thumbnail = parsed_document.get_thumbnail()
|
thumbnail = parsed_document.get_optimised_thumbnail()
|
||||||
date = parsed_document.get_date()
|
date = parsed_document.get_date()
|
||||||
document = self._store(
|
document = self._store(
|
||||||
parsed_document.get_text(),
|
parsed_document.get_text(),
|
||||||
|
|||||||
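Decorating `try_consume_file` with `transaction.atomic` means the database writes made while consuming a file are committed together, or rolled back together if an exception escapes the method. The sketch below illustrates that all-or-nothing behaviour with the standard library's `sqlite3` module rather than Django; the table names and the `make_db` and `consume` functions are hypothetical.

```python
import sqlite3


def make_db():
    """Create an in-memory database with two illustrative tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE document (title TEXT)")
    conn.execute("CREATE TABLE log (message TEXT)")
    return conn


def consume(conn, title, fail=False):
    """Store a document and its log entry as one unit of work."""
    try:
        # The connection context manager commits on success and rolls
        # back if an exception leaves the block.
        with conn:
            conn.execute("INSERT INTO document (title) VALUES (?)", (title,))
            if fail:
                raise RuntimeError("simulated failure mid-consumption")
            conn.execute(
                "INSERT INTO log (message) VALUES (?)", ("consumed " + title,)
            )
        return True
    except RuntimeError:
        return False


if __name__ == "__main__":
    db = make_db()
    print(consume(db, "invoice", fail=True))
    print(db.execute("SELECT COUNT(*) FROM document").fetchone()[0])
```

Note that a rollback only happens when the exception leaves the transactional block; errors that are caught and handled inside it do not undo earlier writes.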
@@ -1,4 +1,4 @@
|
|||||||
from django_filters.rest_framework import CharFilter, FilterSet, BooleanFilter, ModelChoiceFilter
|
from django_filters.rest_framework import BooleanFilter, FilterSet
|
||||||
|
|
||||||
from .models import Correspondent, Document, Tag
|
from .models import Correspondent, Document, Tag
|
||||||
|
|
||||||
|
|||||||
52
src/documents/migrations/0022_auto_20181007_1420.py
Normal file
@@ -0,0 +1,52 @@
|
|||||||
|
# Generated by Django 2.0.8 on 2018-10-07 14:20
|
||||||
|
|
||||||
|
from django.db import migrations, models
|
||||||
|
from django.utils.text import slugify
|
||||||
|
|
||||||
|
|
||||||
|
def re_slug_all_the_things(apps, schema_editor):
|
||||||
|
"""
|
||||||
|
Rewrite all slug values to make sure they're actually slugs before we brand
|
||||||
|
them as uneditable.
|
||||||
|
"""
|
||||||
|
|
||||||
|
Tag = apps.get_model("documents", "Tag")
|
||||||
|
Correspondent = apps.get_model("documents", "Correspondent")
|
||||||
|
|
||||||
|
for klass in (Tag, Correspondent):
|
||||||
|
for instance in klass.objects.all():
|
||||||
|
klass.objects.filter(
|
||||||
|
pk=instance.pk
|
||||||
|
).update(
|
||||||
|
slug=slugify(instance.slug)
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
class Migration(migrations.Migration):
|
||||||
|
|
||||||
|
dependencies = [
|
||||||
|
('documents', '0021_document_storage_type'),
|
||||||
|
]
|
||||||
|
|
||||||
|
operations = [
|
||||||
|
migrations.AlterModelOptions(
|
||||||
|
name='tag',
|
||||||
|
options={'ordering': ('name',)},
|
||||||
|
),
|
||||||
|
migrations.AlterField(
|
||||||
|
model_name='correspondent',
|
||||||
|
name='slug',
|
||||||
|
field=models.SlugField(blank=True, editable=False),
|
||||||
|
),
|
||||||
|
migrations.AlterField(
|
||||||
|
model_name='document',
|
||||||
|
name='file_type',
|
||||||
|
field=models.CharField(choices=[('pdf', 'PDF'), ('png', 'PNG'), ('jpg', 'JPG'), ('gif', 'GIF'), ('tiff', 'TIFF'), ('txt', 'TXT'), ('csv', 'CSV'), ('md', 'MD')], editable=False, max_length=4),
|
||||||
|
),
|
||||||
|
migrations.AlterField(
|
||||||
|
model_name='tag',
|
||||||
|
name='slug',
|
||||||
|
field=models.SlugField(blank=True, editable=False),
|
||||||
|
),
|
||||||
|
migrations.RunPython(re_slug_all_the_things, migrations.RunPython.noop)
|
||||||
|
]
|
||||||
@@ -11,6 +11,7 @@ from django.conf import settings
|
|||||||
from django.db import models
|
from django.db import models
|
||||||
from django.template.defaultfilters import slugify
|
from django.template.defaultfilters import slugify
|
||||||
from django.utils import timezone
|
from django.utils import timezone
|
||||||
|
from django.utils.text import slugify
|
||||||
from fuzzywuzzy import fuzz
|
from fuzzywuzzy import fuzz
|
||||||
|
|
||||||
from .managers import LogManager
|
from .managers import LogManager
|
||||||
@@ -37,7 +38,7 @@ class MatchingModel(models.Model):
|
|||||||
)
|
)
|
||||||
|
|
||||||
name = models.CharField(max_length=128, unique=True)
|
name = models.CharField(max_length=128, unique=True)
|
||||||
slug = models.SlugField(blank=True)
|
slug = models.SlugField(blank=True, editable=False)
|
||||||
|
|
||||||
match = models.CharField(max_length=256, blank=True)
|
match = models.CharField(max_length=256, blank=True)
|
||||||
matching_algorithm = models.PositiveIntegerField(
|
matching_algorithm = models.PositiveIntegerField(
|
||||||
@@ -147,8 +148,6 @@ class MatchingModel(models.Model):
|
|||||||
def save(self, *args, **kwargs):
|
def save(self, *args, **kwargs):
|
||||||
|
|
||||||
self.match = self.match.lower()
|
self.match = self.match.lower()
|
||||||
|
|
||||||
if not self.slug:
|
|
||||||
self.slug = slugify(self.name)
|
self.slug = slugify(self.name)
|
||||||
|
|
||||||
models.Model.save(self, *args, **kwargs)
|
models.Model.save(self, *args, **kwargs)
|
||||||
@@ -452,7 +451,7 @@ class FileInfo:
|
|||||||
r = []
|
r = []
|
||||||
for t in tags.split(","):
|
for t in tags.split(","):
|
||||||
r.append(Tag.objects.get_or_create(
|
r.append(Tag.objects.get_or_create(
|
||||||
slug=t.lower(),
|
slug=slugify(t),
|
||||||
defaults={"name": t}
|
defaults={"name": t}
|
||||||
)[0])
|
)[0])
|
||||||
return tuple(r)
|
return tuple(r)
|
||||||
|
|||||||
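The switch from `t.lower()` to `slugify(t)` in `FileInfo` matters because saved tags now always carry a slugified value, so a lowercased name containing spaces would no longer match an existing tag. A rough sketch of the difference, using a simplified stand-in for Django's `slugify` (the `approx_slugify` and `get_or_create_tag` functions below are written for this example and are not Django's or Paperless's implementations):

```python
import re
import unicodedata


def approx_slugify(value):
    """Approximate Django's slugify(): ASCII-fold, drop punctuation,
    lowercase, and collapse whitespace and hyphens into single hyphens."""
    value = unicodedata.normalize("NFKD", str(value))
    value = value.encode("ascii", "ignore").decode("ascii")
    value = re.sub(r"[^\w\s-]", "", value).strip().lower()
    return re.sub(r"[-\s]+", "-", value)


def get_or_create_tag(tags_by_slug, name, make_slug):
    """Mimic a get-or-create lookup keyed on slug, using a plain dict."""
    slug = make_slug(name)
    created = slug not in tags_by_slug
    tags_by_slug.setdefault(slug, name)
    return slug, created


if __name__ == "__main__":
    existing = {"bank-statements": "Bank Statements"}
    # Old rule: lowercasing alone misses the stored slug and creates a duplicate.
    print(get_or_create_tag(dict(existing), "Bank Statements", str.lower))
    # New rule: the same function used at save time finds the existing tag.
    print(get_or_create_tag(dict(existing), "Bank Statements", approx_slugify))
```

Using one slug function for both saving and lookup is what keeps consumption from creating near-duplicate tags.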
@@ -1,9 +1,13 @@
|
|||||||
import logging
|
import logging
|
||||||
import shutil
|
import os
|
||||||
import tempfile
|
|
||||||
import re
|
import re
|
||||||
|
import shutil
|
||||||
|
import subprocess
|
||||||
|
import tempfile
|
||||||
|
|
||||||
|
import dateparser
|
||||||
from django.conf import settings
|
from django.conf import settings
|
||||||
|
from django.utils import timezone
|
||||||
|
|
||||||
# This regular expression will try to find dates in the document at
|
# This regular expression will try to find dates in the document at
|
||||||
# hand and will match the following formats:
|
# hand and will match the following formats:
|
||||||
@@ -32,6 +36,8 @@ class DocumentParser:
|
|||||||
"""
|
"""
|
||||||
|
|
||||||
SCRATCH = settings.SCRATCH_DIR
|
SCRATCH = settings.SCRATCH_DIR
|
||||||
|
DATE_ORDER = settings.DATE_ORDER
|
||||||
|
OPTIPNG = settings.OPTIPNG_BINARY
|
||||||
|
|
||||||
def __init__(self, path):
|
def __init__(self, path):
|
||||||
self.document_path = path
|
self.document_path = path
|
||||||
@@ -45,6 +51,19 @@ class DocumentParser:
|
|||||||
"""
|
"""
|
||||||
raise NotImplementedError()
|
raise NotImplementedError()
|
||||||
|
|
||||||
|
def optimise_thumbnail(self, in_path):
|
||||||
|
|
||||||
|
out_path = os.path.join(self.tempdir, "optipng.png")
|
||||||
|
|
||||||
|
args = (self.OPTIPNG, "-o5", in_path, "-out", out_path)
|
||||||
|
if not subprocess.Popen(args).wait() == 0:
|
||||||
|
raise ParseError("Optipng failed at {}".format(args))
|
||||||
|
|
||||||
|
return out_path
|
||||||
|
|
||||||
|
def get_optimised_thumbnail(self):
|
||||||
|
return self.optimise_thumbnail(self.get_thumbnail())
|
||||||
|
|
||||||
def get_text(self):
|
def get_text(self):
|
||||||
"""
|
"""
|
||||||
Returns the text from the document and only the text.
|
Returns the text from the document and only the text.
|
||||||
@@ -55,7 +74,52 @@ class DocumentParser:
|
|||||||
"""
|
"""
|
||||||
Returns the date of the document.
|
Returns the date of the document.
|
||||||
"""
|
"""
|
||||||
raise NotImplementedError()
|
|
||||||
|
date = None
|
||||||
|
date_string = None
|
||||||
|
|
||||||
|
try:
|
||||||
|
text = self.get_text()
|
||||||
|
except ParseError:
|
||||||
|
return None
|
||||||
|
|
||||||
|
next_year = timezone.now().year + 5 # Arbitrary 5 year future limit
|
||||||
|
|
||||||
|
# Iterate through all regex matches and try to parse the date
|
||||||
|
for m in re.finditer(DATE_REGEX, text):
|
||||||
|
|
||||||
|
date_string = m.group(0)
|
||||||
|
|
||||||
|
try:
|
||||||
|
date = dateparser.parse(
|
||||||
|
date_string,
|
||||||
|
settings={
|
||||||
|
"DATE_ORDER": self.DATE_ORDER,
|
||||||
|
"PREFER_DAY_OF_MONTH": "first",
|
||||||
|
"RETURN_AS_TIMEZONE_AWARE": True
|
||||||
|
}
|
||||||
|
)
|
||||||
|
except TypeError:
|
||||||
|
# Skip all matches that do not parse to a proper date
|
||||||
|
continue
|
||||||
|
|
||||||
|
if date is not None and next_year > date.year > 1900:
|
||||||
|
break
|
||||||
|
else:
|
||||||
|
date = None
|
||||||
|
|
||||||
|
if date is not None:
|
||||||
|
self.log(
|
||||||
|
"info",
|
||||||
|
"Detected document date {} based on string {}".format(
|
||||||
|
date.isoformat(),
|
||||||
|
date_string
|
||||||
|
)
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
self.log("info", "Unable to detect date for document")
|
||||||
|
|
||||||
|
return date
|
||||||
|
|
||||||
def log(self, level, message):
|
def log(self, level, message):
|
||||||
getattr(self.logger, level)(message, extra={
|
getattr(self.logger, level)(message, extra={
|
||||||
|
|||||||
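The consolidated `get_date()` only accepts a parsed date whose year falls strictly between 1900 and five years past the current year; anything else is discarded and the next regex match is tried. The window check in isolation might look like the following, where `is_plausible` and its parameters are names invented for this example:

```python
import datetime


def is_plausible(date, now=None, future_years=5, min_year=1900):
    """Return True if `date` falls inside the accepted year window.

    Mirrors the condition `next_year > date.year > 1900` from the diff,
    with the current time injectable so the check is easy to test.
    """
    if date is None:
        return False
    now = now or datetime.datetime.now(datetime.timezone.utc)
    return min_year < date.year < now.year + future_years


if __name__ == "__main__":
    reference = datetime.datetime(2018, 10, 7)
    for year in (590, 2017, 2350):
        candidate = datetime.datetime(year, 7, 1)
        print(year, is_plausible(candidate, now=reference))
```

Both bounds are exclusive, so with a reference year of 2018 the latest accepted year is 2022.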
@@ -76,7 +76,12 @@ def binaries_check(app_configs, **kwargs):
|
|||||||
error = "Paperless can't find {}. Without it, consumption is impossible."
|
error = "Paperless can't find {}. Without it, consumption is impossible."
|
||||||
hint = "Either it's not in your ${PATH} or it's not installed."
|
hint = "Either it's not in your ${PATH} or it's not installed."
|
||||||
|
|
||||||
binaries = (settings.CONVERT_BINARY, settings.UNPAPER_BINARY, "tesseract")
|
binaries = (
|
||||||
|
settings.CONVERT_BINARY,
|
||||||
|
settings.OPTIPNG_BINARY,
|
||||||
|
settings.UNPAPER_BINARY,
|
||||||
|
"tesseract"
|
||||||
|
)
|
||||||
|
|
||||||
check_messages = []
|
check_messages = []
|
||||||
for binary in binaries:
|
for binary in binaries:
|
||||||
|
|||||||
@@ -1,15 +1,20 @@
|
|||||||
|
from django.contrib.auth.models import User as DjangoUser
|
||||||
|
|
||||||
|
|
||||||
class User:
|
class User:
|
||||||
"""
|
"""
|
||||||
This is a dummy django User used with our middleware to disable
|
This is a dummy django User used with our middleware to disable
|
||||||
login authentication if that is configured in paperless.conf
|
login authentication if that is configured in paperless.conf
|
||||||
"""
|
"""
|
||||||
|
|
||||||
is_superuser = True
|
is_superuser = True
|
||||||
is_active = True
|
is_active = True
|
||||||
is_staff = True
|
is_staff = True
|
||||||
is_authenticated = True
|
is_authenticated = True
|
||||||
|
|
||||||
# Must be -1 to avoid colliding with real user ID's (which start at 1)
|
@property
|
||||||
id = -1
|
def id(self):
|
||||||
|
return DjangoUser.objects.order_by("pk").first().pk
|
||||||
|
|
||||||
@property
|
@property
|
||||||
def pk(self):
|
def pk(self):
|
||||||
@@ -18,8 +23,8 @@ class User:
|
|||||||
|
|
||||||
"""
|
"""
|
||||||
NOTE: These are here as a hack instead of being in the User definition
|
NOTE: These are here as a hack instead of being in the User definition
|
||||||
above due to the way pycodestyle handles lambdas.
|
NOTE: above due to the way pycodestyle handles lambdas.
|
||||||
See https://github.com/PyCQA/pycodestyle/issues/379 for more.
|
NOTE: See https://github.com/PyCQA/pycodestyle/issues/379 for more.
|
||||||
"""
|
"""
|
||||||
|
|
||||||
User.has_module_perms = lambda *_: True
|
User.has_module_perms = lambda *_: True
|
||||||
|
|||||||
@@ -247,6 +247,9 @@ CONVERT_TMPDIR = os.getenv("PAPERLESS_CONVERT_TMPDIR")
|
|||||||
CONVERT_MEMORY_LIMIT = os.getenv("PAPERLESS_CONVERT_MEMORY_LIMIT")
|
CONVERT_MEMORY_LIMIT = os.getenv("PAPERLESS_CONVERT_MEMORY_LIMIT")
|
||||||
CONVERT_DENSITY = os.getenv("PAPERLESS_CONVERT_DENSITY")
|
CONVERT_DENSITY = os.getenv("PAPERLESS_CONVERT_DENSITY")
|
||||||
|
|
||||||
|
# OptiPNG
|
||||||
|
OPTIPNG_BINARY = os.getenv("PAPERLESS_OPTIPNG_BINARY", "optipng")
|
||||||
|
|
||||||
# Unpaper
|
# Unpaper
|
||||||
UNPAPER_BINARY = os.getenv("PAPERLESS_UNPAPER_BINARY", "unpaper")
|
UNPAPER_BINARY = os.getenv("PAPERLESS_UNPAPER_BINARY", "unpaper")
|
||||||
|
|
||||||
|
|||||||
@@ -1 +1 @@
|
|||||||
__version__ = (2, 3, 0)
|
__version__ = (2, 5, 0)
|
||||||
|
|||||||
@@ -4,7 +4,6 @@ import re
|
|||||||
import subprocess
|
import subprocess
|
||||||
from multiprocessing.pool import Pool
|
from multiprocessing.pool import Pool
|
||||||
|
|
||||||
import dateparser
|
|
||||||
import langdetect
|
import langdetect
|
||||||
import pyocr
|
import pyocr
|
||||||
from django.conf import settings
|
from django.conf import settings
|
||||||
@@ -14,7 +13,7 @@ from pyocr.libtesseract.tesseract_raw import \
|
|||||||
from pyocr.tesseract import TesseractError
|
from pyocr.tesseract import TesseractError
|
||||||
|
|
||||||
import pdftotext
|
import pdftotext
|
||||||
from documents.parsers import DocumentParser, ParseError, DATE_REGEX
|
from documents.parsers import DocumentParser, ParseError
|
||||||
|
|
||||||
from .languages import ISO639
|
from .languages import ISO639
|
||||||
|
|
||||||
@@ -33,7 +32,6 @@ class RasterisedDocumentParser(DocumentParser):
|
|||||||
DENSITY = settings.CONVERT_DENSITY if settings.CONVERT_DENSITY else 300
|
DENSITY = settings.CONVERT_DENSITY if settings.CONVERT_DENSITY else 300
|
||||||
THREADS = int(settings.OCR_THREADS) if settings.OCR_THREADS else None
|
THREADS = int(settings.OCR_THREADS) if settings.OCR_THREADS else None
|
||||||
UNPAPER = settings.UNPAPER_BINARY
|
UNPAPER = settings.UNPAPER_BINARY
|
||||||
DATE_ORDER = settings.DATE_ORDER
|
|
||||||
DEFAULT_OCR_LANGUAGE = settings.OCR_LANGUAGE
|
DEFAULT_OCR_LANGUAGE = settings.OCR_LANGUAGE
|
||||||
OCR_ALWAYS = settings.OCR_ALWAYS
|
OCR_ALWAYS = settings.OCR_ALWAYS
|
||||||
|
|
||||||
@@ -46,15 +44,18 @@ class RasterisedDocumentParser(DocumentParser):
|
|||||||
The thumbnail of a PDF is just a 500px wide image of the first page.
|
The thumbnail of a PDF is just a 500px wide image of the first page.
|
||||||
"""
|
"""
|
||||||
|
|
||||||
|
out_path = os.path.join(self.tempdir, "convert.png")
|
||||||
|
|
||||||
|
# Run convert to get a decent thumbnail
|
||||||
run_convert(
|
run_convert(
|
||||||
self.CONVERT,
|
self.CONVERT,
|
||||||
"-scale", "500x5000",
|
"-scale", "500x5000",
|
||||||
"-alpha", "remove",
|
"-alpha", "remove",
|
||||||
"{}[0]".format(self.document_path),
|
"{}[0]".format(self.document_path),
|
||||||
os.path.join(self.tempdir, "convert.png")
|
out_path
|
||||||
)
|
)
|
||||||
|
|
||||||
return os.path.join(self.tempdir, "convert.png")
|
return out_path
|
||||||
|
|
||||||
def _is_ocred(self):
|
def _is_ocred(self):
|
||||||
|
|
||||||
@@ -202,40 +203,6 @@ class RasterisedDocumentParser(DocumentParser):
|
|||||||
text += self._ocr(imgs[middle + 1:], self.DEFAULT_OCR_LANGUAGE)
|
text += self._ocr(imgs[middle + 1:], self.DEFAULT_OCR_LANGUAGE)
|
||||||
return text
|
return text
|
||||||
|
|
||||||
def get_date(self):
|
|
||||||
date = None
|
|
||||||
datestring = None
|
|
||||||
|
|
||||||
try:
|
|
||||||
text = self.get_text()
|
|
||||||
except ParseError as e:
|
|
||||||
return None
|
|
||||||
|
|
||||||
# Iterate through all regex matches and try to parse the date
|
|
||||||
for m in re.finditer(DATE_REGEX, text):
|
|
||||||
datestring = m.group(0)
|
|
||||||
|
|
||||||
try:
|
|
||||||
date = dateparser.parse(
|
|
||||||
datestring,
|
|
||||||
settings={'DATE_ORDER': self.DATE_ORDER,
|
|
||||||
'PREFER_DAY_OF_MONTH': 'first',
|
|
||||||
'RETURN_AS_TIMEZONE_AWARE': True})
|
|
||||||
except TypeError:
|
|
||||||
# Skip all matches that do not parse to a proper date
|
|
||||||
continue
|
|
||||||
|
|
||||||
if date is not None:
|
|
||||||
break
|
|
||||||
|
|
||||||
if date is not None:
|
|
||||||
self.log("info", "Detected document date " + date.isoformat() +
|
|
||||||
" based on string " + datestring)
|
|
||||||
else:
|
|
||||||
self.log("info", "Unable to detect date for document")
|
|
||||||
|
|
||||||
return date
|
|
||||||
|
|
||||||
|
|
||||||
def run_convert(*args):
|
def run_convert(*args):
|
||||||
|
|
||||||
|
|||||||
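`get_thumbnail()` now returns the path of `convert.png`, which `get_optimised_thumbnail()` passes on to `optipng`. Below is a sketch of that second step using `subprocess.run` instead of `Popen(...).wait()`; the `optimise_png` function and its injectable `runner` argument are hypothetical, while the `-o5` and `-out` arguments mirror the ones used in this change.

```python
import os
import subprocess


def optimise_png(in_path, out_dir, binary="optipng", runner=subprocess.run):
    """Run optipng on `in_path` and return the path of the optimised copy.

    `runner` defaults to subprocess.run and can be swapped for a stub in
    tests, so the function can be exercised without optipng installed.
    """
    out_path = os.path.join(out_dir, "optipng.png")
    args = (binary, "-o5", in_path, "-out", out_path)
    result = runner(args)
    # A non-zero exit status means optipng could not produce the file.
    if result.returncode != 0:
        raise RuntimeError("Optipng failed at {}".format(args))
    return out_path
```

With the default runner, a missing binary surfaces as `FileNotFoundError` from `subprocess.run` rather than as a non-zero exit status, so callers may want to handle both cases.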
@@ -384,3 +384,42 @@ class TestDate(TestCase):
|
|||||||
document.get_date(),
|
document.get_date(),
|
||||||
datetime.datetime(2017, 12, 31, 0, 0, tzinfo=tz.tzutc())
|
datetime.datetime(2017, 12, 31, 0, 0, tzinfo=tz.tzutc())
|
||||||
)
|
)
|
||||||
|
|
||||||
|
@mock.patch(
|
||||||
|
"paperless_tesseract.parsers.RasterisedDocumentParser.get_text",
|
||||||
|
return_value="01-07-0590 00:00:00"
|
||||||
|
)
|
||||||
|
@mock.patch(
|
||||||
|
"paperless_tesseract.parsers.RasterisedDocumentParser.SCRATCH",
|
||||||
|
SCRATCH
|
||||||
|
)
|
||||||
|
def test_crazy_date_past(self, *args):
|
||||||
|
document = RasterisedDocumentParser("/dev/null")
|
||||||
|
document.get_text()
|
||||||
|
self.assertIsNone(document.get_date())
|
||||||
|
|
||||||
|
@mock.patch(
|
||||||
|
"paperless_tesseract.parsers.RasterisedDocumentParser.get_text",
|
||||||
|
return_value="01-07-2350 00:00:00"
|
||||||
|
)
|
||||||
|
@mock.patch(
|
||||||
|
"paperless_tesseract.parsers.RasterisedDocumentParser.SCRATCH",
|
||||||
|
SCRATCH
|
||||||
|
)
|
||||||
|
def test_crazy_date_future(self, *args):
|
||||||
|
document = RasterisedDocumentParser("/dev/null")
|
||||||
|
document.get_text()
|
||||||
|
self.assertIsNone(document.get_date())
|
||||||
|
|
||||||
|
@mock.patch(
|
||||||
|
"paperless_tesseract.parsers.RasterisedDocumentParser.get_text",
|
||||||
|
return_value="01-07-0590 00:00:00"
|
||||||
|
)
|
||||||
|
@mock.patch(
|
||||||
|
"paperless_tesseract.parsers.RasterisedDocumentParser.SCRATCH",
|
||||||
|
SCRATCH
|
||||||
|
)
|
||||||
|
def test_crazy_date_past(self, *args):
|
||||||
|
document = RasterisedDocumentParser("/dev/null")
|
||||||
|
document.get_text()
|
||||||
|
self.assertIsNone(document.get_date())
|
||||||
|
|||||||
@@ -1,11 +1,9 @@
|
|||||||
import os
|
import os
|
||||||
import re
|
|
||||||
import subprocess
|
import subprocess
|
||||||
|
|
||||||
import dateparser
|
|
||||||
from django.conf import settings
|
from django.conf import settings
|
||||||
|
|
||||||
from documents.parsers import DocumentParser, ParseError, DATE_REGEX
|
from documents.parsers import DocumentParser, ParseError
|
||||||
|
|
||||||
|
|
||||||
class TextDocumentParser(DocumentParser):
|
class TextDocumentParser(DocumentParser):
|
||||||
@@ -16,7 +14,6 @@ class TextDocumentParser(DocumentParser):
|
|||||||
CONVERT = settings.CONVERT_BINARY
|
CONVERT = settings.CONVERT_BINARY
|
||||||
THREADS = int(settings.OCR_THREADS) if settings.OCR_THREADS else None
|
THREADS = int(settings.OCR_THREADS) if settings.OCR_THREADS else None
|
||||||
UNPAPER = settings.UNPAPER_BINARY
|
UNPAPER = settings.UNPAPER_BINARY
|
||||||
DATE_ORDER = settings.DATE_ORDER
|
|
||||||
DEFAULT_OCR_LANGUAGE = settings.OCR_LANGUAGE
|
DEFAULT_OCR_LANGUAGE = settings.OCR_LANGUAGE
|
||||||
OCR_ALWAYS = settings.OCR_ALWAYS
|
OCR_ALWAYS = settings.OCR_ALWAYS
|
||||||
|
|
||||||
@@ -26,7 +23,7 @@ class TextDocumentParser(DocumentParser):
|
|||||||
|
|
||||||
def get_thumbnail(self):
|
def get_thumbnail(self):
|
||||||
"""
|
"""
|
||||||
The thumbnail of a txt is just a 500px wide image of the text
|
The thumbnail of a text file is just a 500px wide image of the text
|
||||||
rendered onto a letter-sized page.
|
rendered onto a letter-sized page.
|
||||||
"""
|
"""
|
||||||
# The below is heavily cribbed from https://askubuntu.com/a/590951
|
# The below is heavily cribbed from https://askubuntu.com/a/590951
|
||||||
@@ -35,7 +32,7 @@ class TextDocumentParser(DocumentParser):
|
|||||||
text_color = "black" # text color
|
text_color = "black" # text color
|
||||||
psize = [500, 647] # icon size
|
psize = [500, 647] # icon size
|
||||||
n_lines = 50 # number of lines to show
|
n_lines = 50 # number of lines to show
|
||||||
output_file = os.path.join(self.tempdir, "convert-txt.png")
|
out_path = os.path.join(self.tempdir, "convert.png")
|
||||||
|
|
||||||
temp_bg = os.path.join(self.tempdir, "bg.png")
|
temp_bg = os.path.join(self.tempdir, "bg.png")
|
||||||
temp_txlayer = os.path.join(self.tempdir, "tx.png")
|
temp_txlayer = os.path.join(self.tempdir, "tx.png")
|
||||||
@@ -46,9 +43,13 @@ class TextDocumentParser(DocumentParser):
|
|||||||
work_size = ",".join([str(n - 1) for n in psize])
|
work_size = ",".join([str(n - 1) for n in psize])
|
||||||
r = str(round(psize[0] / 10))
|
r = str(round(psize[0] / 10))
|
||||||
rounded = ",".join([r, r])
|
rounded = ",".join([r, r])
|
||||||
run_command(self.CONVERT, "-size ", picsize, ' xc:none -draw ',
|
run_command(
|
||||||
'"fill ', bg_color, ' roundrectangle 0,0,',
|
self.CONVERT,
|
||||||
work_size, ",", rounded, '" ', temp_bg)
|
"-size ", picsize,
|
||||||
|
' xc:none -draw ',
|
||||||
|
'"fill ', bg_color, ' roundrectangle 0,0,', work_size, ",", rounded, '" ', # NOQA: E501
|
||||||
|
temp_bg
|
||||||
|
)
|
||||||
|
|
||||||
def read_text():
|
def read_text():
|
||||||
with open(self.document_path, 'r') as src:
|
with open(self.document_path, 'r') as src:
|
||||||
@@ -57,7 +58,8 @@ class TextDocumentParser(DocumentParser):
|
|||||||
return text.replace('"', "'")
|
return text.replace('"', "'")
|
||||||
|
|
||||||
def create_txlayer():
|
def create_txlayer():
|
||||||
run_command(self.CONVERT,
|
run_command(
|
||||||
|
self.CONVERT,
|
||||||
"-background none",
|
"-background none",
|
||||||
"-fill",
|
"-fill",
|
||||||
text_color,
|
text_color,
|
||||||
@@ -65,14 +67,20 @@ class TextDocumentParser(DocumentParser):
|
|||||||
"-border 4 -bordercolor none",
|
"-border 4 -bordercolor none",
|
||||||
"-size ", txsize,
|
"-size ", txsize,
|
||||||
' caption:"', read_text(), '" ',
|
' caption:"', read_text(), '" ',
|
||||||
temp_txlayer)
|
temp_txlayer
|
||||||
|
)
|
||||||
|
|
||||||
create_txlayer()
|
create_txlayer()
|
||||||
create_bg()
|
create_bg()
|
||||||
run_command(self.CONVERT, temp_bg, temp_txlayer,
|
run_command(
|
||||||
"-background None -layers merge ", output_file)
|
self.CONVERT,
|
||||||
|
temp_bg,
|
||||||
|
temp_txlayer,
|
||||||
|
"-background None -layers merge ",
|
||||||
|
out_path
|
||||||
|
)
|
||||||
|
|
||||||
return output_file
|
return out_path
|
||||||
|
|
||||||
def get_text(self):
|
def get_text(self):
|
||||||
|
|
||||||
@@ -84,40 +92,6 @@ class TextDocumentParser(DocumentParser):
|
|||||||
|
|
||||||
return self._text
|
return self._text
|
||||||
|
|
||||||
def get_date(self):
|
|
||||||
date = None
|
|
||||||
datestring = None
|
|
||||||
|
|
||||||
try:
|
|
||||||
text = self.get_text()
|
|
||||||
except ParseError as e:
|
|
||||||
return None
|
|
||||||
|
|
||||||
# Iterate through all regex matches and try to parse the date
|
|
||||||
for m in re.finditer(DATE_REGEX, text):
|
|
||||||
datestring = m.group(0)
|
|
||||||
|
|
||||||
try:
|
|
||||||
date = dateparser.parse(
|
|
||||||
datestring,
|
|
||||||
settings={'DATE_ORDER': self.DATE_ORDER,
|
|
||||||
'PREFER_DAY_OF_MONTH': 'first',
|
|
||||||
'RETURN_AS_TIMEZONE_AWARE': True})
|
|
||||||
except TypeError:
|
|
||||||
# Skip all matches that do not parse to a proper date
|
|
||||||
continue
|
|
||||||
|
|
||||||
if date is not None:
|
|
||||||
break
|
|
||||||
|
|
||||||
if date is not None:
|
|
||||||
self.log("info", "Detected document date " + date.isoformat() +
|
|
||||||
" based on string " + datestring)
|
|
||||||
else:
|
|
||||||
self.log("info", "Unable to detect date for document")
|
|
||||||
|
|
||||||
return date
|
|
||||||
|
|
||||||
|
|
||||||
def run_command(*args):
|
def run_command(*args):
|
||||||
environment = os.environ.copy()
|
environment = os.environ.copy()
|
||||||
|
|||||||
19
src/reminders/migrations/0002_auto_20181007_1420.py
Normal file
@@ -0,0 +1,19 @@
|
|||||||
|
# Generated by Django 2.0.8 on 2018-10-07 14:20
|
||||||
|
|
||||||
|
from django.db import migrations, models
|
||||||
|
import django.db.models.deletion
|
||||||
|
|
||||||
|
|
||||||
|
class Migration(migrations.Migration):
|
||||||
|
|
||||||
|
dependencies = [
|
||||||
|
('reminders', '0001_initial'),
|
||||||
|
]
|
||||||
|
|
||||||
|
operations = [
|
||||||
|
migrations.AlterField(
|
||||||
|
model_name='reminder',
|
||||||
|
name='document',
|
||||||
|
field=models.ForeignKey(on_delete=django.db.models.deletion.PROTECT, to='documents.Document'),
|
||||||
|
),
|
||||||
|
]
|
||||||
@@ -4,7 +4,6 @@ from django.db import models
|
|||||||
class Reminder(models.Model):
|
class Reminder(models.Model):
|
||||||
|
|
||||||
document = models.ForeignKey(
|
document = models.ForeignKey(
|
||||||
"documents.Document", on_delete=models.PROTECT
|
"documents.Document", on_delete=models.PROTECT)
|
||||||
)
|
|
||||||
date = models.DateTimeField()
|
date = models.DateTimeField()
|
||||||
note = models.TextField(blank=True)
|
note = models.TextField(blank=True)
|
||||||
|
|||||||
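Note on the `get_date` method removed from `TextDocumentParser` above: it walks every regex match in the extracted text, skips matches that fail to parse, and keeps the first one that yields a date. The following is a minimal standalone sketch of that loop, not the project's implementation. It assumes a simplified, hypothetical pattern (`SIMPLE_DATE_REGEX`, day-first numeric dates only) in place of `DATE_REGEX`, and uses the standard-library `datetime` in place of `dateparser`; the function name `first_valid_date` is likewise made up for illustration.

```python
import re
from datetime import datetime

# Hypothetical, simplified stand-in for DATE_REGEX: it only matches
# day-first numeric dates such as 07.10.2018, 07/10/2018 or 07-10-2018.
SIMPLE_DATE_REGEX = re.compile(r"\b(\d{1,2})[./-](\d{1,2})[./-](\d{4})\b")


def first_valid_date(text):
    """Return (date, matched_string) for the first match that is a real date.

    Matches that look like dates but are not valid (e.g. 99.99.2018) are
    skipped, mirroring the "continue on parse failure" step in the diff.
    Returns (None, None) when no match parses.
    """
    for m in SIMPLE_DATE_REGEX.finditer(text):
        day, month, year = (int(g) for g in m.groups())
        try:
            return datetime(year, month, day), m.group(0)
        except ValueError:
            # Not a real calendar date; try the next match.
            continue
    return None, None


if __name__ == "__main__":
    date, source = first_valid_date("Ref 99.99.2018, issued 07.10.2018")
    print(date, source)
```

Returning the matched string alongside the date keeps the information the original log message used ("based on string ..."), without tying the sketch to a logger.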