Hackscript

Extracts data from invoices that follow no fixed template and arrive in multiple formats, using spaCy for machine learning in Python.

Modules Required

1_EmailFetcher.py

import imaplib
import email
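
As a rough sketch of how these two standard-library modules could work together to collect invoice files: the source of 1_EmailFetcher.py is not shown here, so the function names, the INVOICE_EXTENSIONS filter, and the choice of the INBOX folder with the UNSEEN search key are all assumptions made for illustration.

```python
import email
import imaplib
from email.message import Message
from typing import List, Tuple

# Hypothetical choice: which attachment types count as invoices.
INVOICE_EXTENSIONS = (".pdf", ".png", ".jpg", ".jpeg")


def extract_attachments(msg: Message) -> List[Tuple[str, bytes]]:
    """Return (filename, payload) pairs for invoice-like attachments."""
    found = []
    for part in msg.walk():
        filename = part.get_filename()
        if not filename or not filename.lower().endswith(INVOICE_EXTENSIONS):
            continue
        payload = part.get_payload(decode=True)  # undo base64 / quoted-printable
        if payload:
            found.append((filename, payload))
    return found


def fetch_unseen_attachments(host: str, user: str, password: str) -> List[Tuple[str, bytes]]:
    """Download attachments from unread INBOX messages over IMAP4 with SSL."""
    results: List[Tuple[str, bytes]] = []
    with imaplib.IMAP4_SSL(host) as conn:  # logs out when the block exits
        conn.login(user, password)
        conn.select("INBOX")
        status, data = conn.search(None, "UNSEEN")
        if status != "OK":
            return results
        for num in data[0].split():
            status, fetched = conn.fetch(num, "(RFC822)")
            if status != "OK":
                continue
            msg = email.message_from_bytes(fetched[0][1])
            results.extend(extract_attachments(msg))
    return results
```

Keeping the parsing step in its own function means it can be exercised on a saved .eml file without a mail server.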

2_Anmol.py

OpenCV - pip install opencv-python
pytesseract (needs PIL/Pillow)
    sudo apt update
    sudo apt install tesseract-ocr
    sudo apt install libtesseract-dev
    pip install pytesseract
pdf2image
    sudo apt install poppler-utils
    pip install pdf2image
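
Once the packages above are installed, one possible way to chain them is sketched below. This is not the code from 2_Anmol.py; the function names, the 300 dpi default, and the use of Otsu thresholding as the preprocessing step are assumptions chosen for illustration.

```python
import re
from typing import List


def normalize_ocr_text(raw: str) -> str:
    """Collapse runs of spaces/tabs and drop empty lines from OCR output."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


def ocr_pdf(path: str, dpi: int = 300) -> List[str]:
    """Return one cleaned text string per page of the PDF at `path`."""
    # Third-party imports are kept local so normalize_ocr_text works without them.
    import cv2
    import numpy as np
    import pytesseract
    from pdf2image import convert_from_path

    pages = []
    for page in convert_from_path(path, dpi=dpi):  # one PIL image per page
        gray = cv2.cvtColor(np.array(page), cv2.COLOR_RGB2GRAY)
        # Otsu's method picks the binarization threshold automatically.
        _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        pages.append(normalize_ocr_text(pytesseract.image_to_string(binary)))
    return pages
```

Calling ocr_pdf("invoice.pdf") (a placeholder file name) would return a list with one string per page, ready to be passed to the spaCy model.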

3_TrainSpacy.py

spaCy
json
scikit-learn
logging
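
The training script's input format is not documented here. As a sketch only, assume the annotations are stored as a JSON list of records, each with a "text" string and an "entities" list of [start, end, label] character offsets. The hypothetical helpers below turn such a file into the (text, {"entities": [...]}) pairs that spaCy NER training code conventionally consumes, and drop spans that are out of range, overlapping, or padded with whitespace, since spaCy rejects or misaligns those.

```python
import json
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]


def clean_spans(text: str, spans: List[Span]) -> List[Span]:
    """Keep in-bounds, non-overlapping spans; the earliest-starting span wins."""
    kept: List[Span] = []
    last_end = 0
    for start, end, label in sorted(spans, key=lambda s: (s[0], -s[1])):
        if not (0 <= start < end <= len(text)):
            continue  # out of bounds or empty
        if start < last_end:
            continue  # overlaps the previously kept span
        if text[start:end] != text[start:end].strip():
            continue  # span begins or ends with whitespace
        kept.append((start, end, label))
        last_end = end
    return kept


def load_training_data(raw_json: str) -> List[Tuple[str, Dict[str, List[Span]]]]:
    """Parse the assumed JSON layout into (text, annotations) pairs."""
    records = json.loads(raw_json)
    return [
        (rec["text"], {"entities": clean_spans(rec["text"], [tuple(e) for e in rec["entities"]])})
        for rec in records
    ]
```

Filtering the spans before training makes annotation mistakes visible early instead of surfacing as alignment errors inside spaCy.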

4_FinalOutput

spaCy
logging
json

TODO

1) Better spaCy entity annotations in the training data

2) More spaCy pipes

3) ML-based image processing (SVM, CNN)

4) Better image preprocessing for Tesseract text extraction

5) Table removal before processing image

6) Code Optimization