This project automates the retrieval of product information from retailer sites such as amazon.com, tesco.com, and asda.com. The retrieved data can be used to train machine learning models, for example for price approximation or image recognition. The project follows an object-oriented approach in Python, which gives it a degree of generic functionality.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
What things you need to install the software and how to install them
- Python 3.x interpreter (https://www.python.org/downloads/)
- Scrapy package (https://pypi.org/project/Scrapy/)
- Pillow package (https://pypi.org/project/Pillow/2.2.1/)
- unittest (Python standard library)
- os (Python standard library)
- python-interface package (https://pypi.org/project/python-interface/)
- enum (Python standard library)
This guide is limited to installing the project in PyCharm by JetBrains.
- Clone the master directory Tesco_Amazon_Scraper.
- Open the directory as a project in PyCharm (right-click the folder and choose Open as project in PyCharm).
- Click on File > Settings > Project: Tesco_Amazon_Scraper > Project Interpreter.
- Click on + button on right side of the dialog box.
- Search for each of the prerequisite libraries and install them using the Install Package button.
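If you prefer not to use the PyCharm dialog, the same prerequisites can be installed from the command line with pip (a sketch; `unittest`, `os`, and `enum` ship with Python 3 and need no installation):

```shell
# Install the third-party prerequisites; package names taken from the PyPI
# links in the prerequisites list above.
pip install scrapy Pillow python-interface
```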
This guide shows an example of running the project to scrape Amazon.
- Open the master directory in terminal or command prompt.
- Run one of the following commands depending on the required output format (CSV or JSON):
scrapy crawl RetailerSpider -a output=json -a site=amazon -s LOG_ENABLED=False
or
scrapy crawl RetailerSpider -a output=csv -a site=amazon -s LOG_ENABLED=False
- It is recommended to pass `-s LOG_ENABLED=False`, as it significantly speeds up the scraping process.
- You generally do not want to scrape the whole Amazon catalogue, so it is recommended to add `-s DEPTH_LIMIT=1`, `-s DEPTH_LIMIT=2`, etc., depending on how many products you require. The more products you scrape, the higher the chance of the IP being blocked by Amazon.
- Use the Ctrl + C keyboard shortcut for a graceful shutdown of the process (it takes some time to close all the files after using the shortcut).

This will create a directory of files in the master directory. Replace the argument with `-a site=tesco` to scrape tesco.com.
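Putting the recommendations above together, a typical invocation might look like this (the depth value is only an example):

```shell
# JSON output from Amazon, quiet logging, and a crawl depth of 1
scrapy crawl RetailerSpider -a output=json -a site=amazon -s LOG_ENABLED=False -s DEPTH_LIMIT=1
```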
For running a unit test on a specific product page:

- Download the HTML file and place it in the same directory as `test.py` (e.g. as `html_sample4.html`).
- Define your own unit test or re-configure the already written ones.
- Open `test.py` and change the actual parameters of the `fake_response_from_file` function to your own file name and URL.
- Open the `test.py` directory in a terminal/command prompt and type `python -m unittest`.
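A test following the steps above might look roughly like this. This is a self-contained sketch: the stand-in `fake_response_from_file` below only illustrates the call shape (file name plus URL); the real helper lives in the project's `test.py` and wraps the file in a Scrapy response, and the test case name and HTML content are invented for illustration.

```python
import os
import tempfile
import unittest


def fake_response_from_file(file_name, url):
    """Simplified stand-in for the project's helper: read a saved product
    page from disk and pair it with the URL it was downloaded from."""
    with open(file_name, "rb") as f:
        body = f.read()
    return {"url": url, "body": body}


class ProductPageTest(unittest.TestCase):
    def setUp(self):
        # Write a tiny stand-in for a downloaded product page
        # (in the real project this would be e.g. html_sample4.html).
        self.tmp = tempfile.NamedTemporaryFile(suffix=".html", delete=False)
        self.tmp.write(b"<html><body><span id='price'>9.99</span></body></html>")
        self.tmp.close()

    def tearDown(self):
        os.unlink(self.tmp.name)

    def test_response_is_loaded(self):
        response = fake_response_from_file(
            self.tmp.name, "https://www.amazon.com/dp/EXAMPLE"
        )
        self.assertIn(b"price", response["body"])
        self.assertEqual(response["url"], "https://www.amazon.com/dp/EXAMPLE")


if __name__ == "__main__":
    unittest.main()
```

Running `python -m unittest` in the directory containing the file discovers and runs such tests automatically.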
The new retailer site must not be built with JavaScript. The files `RetailerDetails.py`, `constants.py`, and `RetailerSpider.py` need to be modified to use the scraper with a new retailer.
- In `constants.py`, redefine the enums for the new retailer domain. Each new enum must have a new constant named `[SITE_NAME]` and a constant value for the specific site (the domain URL, XPath queries, and product page regex are used as constant values).
- In `RetailerDetails.py`, define a class `[SITE_NAME]Details` extending the interface `Details` (replace `[SITE_NAME]` with the name of the retailer).
- Define all functions of the `Details` interface in the new class.
- Use the function `applyrespone(response_object, xpath_query_expression)` in `utilities.py`; it returns an object that contains the data of your XPath query.
- In `RetailerSpider.py`, amend the `makeObject(spider)` function to check for equality of `spider.site == [SITE_NAME]` and return an object of the `[SITE_NAME]Details` class after importing it from `RetailerDetails`.
- Run the command for the `RetailerSpider.py` spider, but with the `-a site=[SITE_NAME]` argument.
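The extension pattern above can be sketched as follows. This is a simplified, self-contained illustration: the project uses the python-interface package, whereas this sketch uses `abc`; the enum members, XPath strings, `ExampleDetails` class, and `get_title_query` method are all invented names, and only `Details`, `makeObject(spider)`, and the `spider.site` attribute come from the steps above.

```python
from abc import ABC, abstractmethod
from enum import Enum


class ExampleSite(Enum):
    # Hypothetical constants for a new retailer, in the spirit of constants.py:
    # domain URL, XPath queries, and product page regex as constant values.
    DOMAIN = "https://www.example.com"
    TITLE_XPATH = "//h1[@class='product-name']/text()"
    PRODUCT_PAGE_REGEX = r"/product/\d+"


class Details(ABC):
    """Simplified stand-in for the Details interface in RetailerDetails.py."""

    @abstractmethod
    def get_title_query(self):
        ...


class AmazonDetails(Details):
    def get_title_query(self):
        return "//span[@id='productTitle']/text()"  # illustrative XPath only


class ExampleDetails(Details):
    """A hypothetical [SITE_NAME]Details class for a new retailer."""

    def get_title_query(self):
        return ExampleSite.TITLE_XPATH.value


def makeObject(spider):
    # Amended dispatch: compare spider.site against each supported retailer
    # and return the matching Details implementation.
    if spider.site == "amazon":
        return AmazonDetails()
    if spider.site == "example":
        return ExampleDetails()
    raise ValueError("Unsupported site: %s" % spider.site)
```

With this in place, running the spider with `-a site=example` would make `makeObject` hand back an `ExampleDetails` instance, and the spider would use its XPath queries.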
Scrapy API - A web framework for information retrieval.
Raveesh Gupta