AnmolMajithia/InvoiceReader-ML

Hackscript

Extracts data from invoices that follow no specific template, with support for multiple formats, using spaCy for machine learning in Python.

Modules Required

1_EmailFetcher.py

import imaplib
import email
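
Both modules ship with the Python standard library, so nothing needs to be installed for this step. As a rough illustration of how they can work together, here is a minimal sketch that pulls attachments out of unread messages. The function names, the INBOX mailbox, and the UNSEEN search are assumptions made for this example, not necessarily what 1_EmailFetcher.py does.

```python
import email
import imaplib
from email.message import Message


def extract_attachments(msg: Message) -> list[tuple[str, bytes]]:
    """Return (filename, payload) pairs for every attachment in a message."""
    attachments = []
    for part in msg.walk():
        filename = part.get_filename()
        if filename and part.get_content_disposition() == "attachment":
            # decode=True undoes the base64/quoted-printable transfer encoding
            attachments.append((filename, part.get_payload(decode=True)))
    return attachments


def fetch_unseen_attachments(host: str, user: str, password: str):
    """Yield attachments from unread INBOX messages (hypothetical helper)."""
    with imaplib.IMAP4_SSL(host) as conn:
        conn.login(user, password)
        conn.select("INBOX")
        _, data = conn.search(None, "UNSEEN")
        for num in data[0].split():
            _, parts = conn.fetch(num, "(RFC822)")
            msg = email.message_from_bytes(parts[0][1])
            yield from extract_attachments(msg)
```

Keeping the parsing in extract_attachments separate from the network code means it can be tried on a saved .eml file without a mail server.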

2_Anmol.py

OpenCV
    pip install opencv-python
pytesseract (needs PIL/Pillow)
    sudo apt update
    sudo apt install tesseract-ocr
    sudo apt install libtesseract-dev
    pip install pytesseract
pdf2image
    sudo apt install poppler-utils
    pip install pdf2image
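
To show how these three packages fit together, here is a minimal sketch of a PDF-to-text step: pdf2image renders the pages, OpenCV binarizes them, and pytesseract runs the OCR. The function names, the 300 dpi default, and the Otsu thresholding are assumptions chosen for illustration, not necessarily what 2_Anmol.py does.

```python
import re


def pdf_to_text(pdf_path: str, dpi: int = 300) -> str:
    """OCR every page of a PDF and return the cleaned, concatenated text."""
    # Third-party imports are local so normalize_ocr_text works without them.
    import cv2
    import numpy as np
    import pytesseract
    from pdf2image import convert_from_path

    pages = []
    for page in convert_from_path(pdf_path, dpi=dpi):  # list of PIL images
        gray = cv2.cvtColor(np.array(page), cv2.COLOR_RGB2GRAY)
        # Otsu thresholding: an assumed preprocessing step for this sketch
        _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        pages.append(pytesseract.image_to_string(binary))
    return normalize_ocr_text("\n".join(pages))


def normalize_ocr_text(text: str) -> str:
    """Collapse runs of spaces/tabs and drop the empty lines OCR leaves behind."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

The text cleanup is pure Python, so it can be tuned on saved OCR output without re-running Tesseract.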

3_TrainSpacy.py

spaCy
json
scikit-learn
logging
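
As an illustration of the data-preparation side of training, the sketch below converts JSON-lines annotations into the (text, {"entities": [(start, end, label)]}) tuples commonly used for spaCy NER training. The input layout (a "text" field plus "entities" with "start", "end", and "label") and the function name are assumptions for this example; the repository's actual annotation format may differ. It also trims whitespace from span edges, since misaligned offsets are a frequent cause of skipped entities.

```python
import json


def load_training_data(jsonl: str):
    """Convert JSON-lines annotations into spaCy-style training tuples."""
    data = []
    for line in jsonl.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        text = record["text"]
        entities = []
        for ent in record["entities"]:
            start, end = ent["start"], ent["end"]
            # Trim leading/trailing whitespace so spans sit on token boundaries
            while start < end and text[start].isspace():
                start += 1
            while end > start and text[end - 1].isspace():
                end -= 1
            if start < end:
                entities.append((start, end, ent["label"]))
        data.append((text, {"entities": entities}))
    return data
```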

4_FinalOutput

spaCy
logging
json
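
A minimal sketch of how these three modules might combine in the output step: load a trained spaCy model, run it over the extracted text, and write the recognized entities as JSON. The function names, the logger name, and the grouped-by-label output shape are assumptions for illustration, not the repository's actual output format.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("invoice_reader")  # hypothetical logger name


def entities_to_json(entities) -> str:
    """Group (label, text) pairs by label and serialize them as JSON."""
    grouped = {}
    for label, value in entities:
        grouped.setdefault(label, []).append(value)
    log.info("extracted %d entities", sum(len(v) for v in grouped.values()))
    return json.dumps(grouped, indent=2, sort_keys=True)


def extract_invoice_fields(text: str, model_dir: str) -> str:
    """Run a trained spaCy model over OCR text and return its entities as JSON."""
    import spacy  # local import so entities_to_json works without spaCy installed

    nlp = spacy.load(model_dir)  # model_dir: path to a saved pipeline
    doc = nlp(text)
    return entities_to_json((ent.label_, ent.text) for ent in doc.ents)
```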

TODO

1) Improve the spaCy entity points in the training data

2) Add more spaCy pipes

3) ML-based image processing (SVM, CNN)

4) Better preprocessing for Tesseract text extraction

5) Remove tables before processing the image

6) Code optimization
