Hackscript

Data extraction from invoices with no specific template and with support for multiple formats, using spaCy for ML in Python.

Modules Required

1_EmailFetcher.py
    import imaplib
    import email

2_Anmol.py
    OpenCV
        pip install opencv-python
    pytesseract (needs PIL/Pillow)
        sudo apt update
        sudo apt install tesseract-ocr
        sudo apt install libtesseract-dev
        pip install pytesseract
    pdf2image
        sudo apt install poppler-utils
        pip install pdf2image

3_TrainSpacy.py
    spaCy
    json
    scikit-learn
    logging

4_FinalOutput
    spaCy
    logging
    json

TODO
1) Better spaCy entity points in training data
2) More spaCy pipes
3) Image processing with ML (SVM, CNN)
4) Better preprocessing for Tesseract text extraction
5) Table removal before processing image
6) Code optimization
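
Example: pulling invoice attachments out of an email

The first stage relies on imaplib and email, so a minimal sketch of the attachment-extraction step may help. It is not the code in 1_EmailFetcher.py. The function name extract_invoice_attachments, the INVOICE_EXTENSIONS tuple, and the sample message builder are all hypothetical names chosen for illustration. The sketch builds a message locally instead of contacting a mail server; in a real run the raw bytes would presumably come from an imaplib fetch of the "(RFC822)" message part.

```python
import email
from email.message import EmailMessage, Message
from typing import List, Tuple

# Assumption: these are the attachment types treated as invoices.
INVOICE_EXTENSIONS = (".pdf", ".png", ".jpg", ".jpeg")


def extract_invoice_attachments(msg: Message) -> List[Tuple[str, bytes]]:
    """Return (filename, content) pairs for attachments that look like invoices."""
    found = []
    for part in msg.walk():
        # Skip the body and any inline parts; keep real attachments only.
        if part.get_content_disposition() != "attachment":
            continue
        filename = part.get_filename()
        if filename and filename.lower().endswith(INVOICE_EXTENSIONS):
            # decode=True undoes the base64 transfer encoding.
            found.append((filename, part.get_payload(decode=True)))
    return found


def build_sample_message() -> bytes:
    """Build a small test email so the sketch runs without a mail server."""
    msg = EmailMessage()
    msg["Subject"] = "Invoice"
    msg.set_content("Please find the invoice attached.")
    msg.add_attachment(
        b"%PDF-1.4 sample",
        maintype="application",
        subtype="pdf",
        filename="invoice_001.pdf",
    )
    msg.add_attachment("not an invoice", filename="notes.txt")
    return msg.as_bytes()


if __name__ == "__main__":
    parsed = email.message_from_bytes(build_sample_message())
    for name, payload in extract_invoice_attachments(parsed):
        print(name, len(payload), "bytes")
```

Filtering by file extension is a simple choice for a sketch; checking the MIME type of each part would be a reasonable alternative if senders use inconsistent filenames.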