Why is this use case interesting for this application? - To test alternative ways of retrieving datasets?
Harvard created a data pipeline (https://github.com/institutional/institutional-books-1-pipeline) and an associated tool for obtaining materials from Google Books (https://www.institutional.org/posts/grin-transfer).
We do not have access to Google Books, but we could implement the same approach against the Hugging Face dataset instead.
- GRIN Transfer: Download books
- See how well it works
- What does the output look like?
- How long does the OCR cleanup process take?
- Could it make sense to use FastAPI here, or does LangChain have a data pipeline to access this resource?
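
A first experiment for the questions above could look like the sketch below. It assumes the Hugging Face `datasets` library and the dataset ID `institutional/institutional-books-1.0`. The ID is an assumption to verify on the Hub, and the dataset may be gated and require a login. `stream_books`, `naive_cleanup` and `time_cleanup` are hypothetical helper names, and `naive_cleanup` is only a stand-in for whatever OCR cleanup step we end up testing, not the cleanup used in the Harvard pipeline.

```python
import time
from itertools import islice
from typing import Callable, Iterable, Iterator

# Assumption: dataset ID on the Hugging Face Hub. Verify before use;
# access may be gated and require `huggingface-cli login`.
DATASET_ID = "institutional/institutional-books-1.0"


def stream_books(limit: int = 5) -> Iterator[dict]:
    """Yield the first `limit` records without downloading the whole dataset."""
    # Imported lazily so the helpers below work without `datasets` installed.
    from datasets import load_dataset

    books = load_dataset(DATASET_ID, split="train", streaming=True)
    yield from islice(books, limit)


def naive_cleanup(text: str) -> str:
    """Placeholder cleanup: rejoin words hyphenated across line breaks
    and collapse runs of whitespace into single spaces."""
    joined = text.replace("-\n", "")
    return " ".join(joined.split())


def time_cleanup(
    texts: Iterable[str], cleanup: Callable[[str], str]
) -> tuple[list[str], float]:
    """Run `cleanup` over `texts` and return the results plus elapsed seconds."""
    start = time.perf_counter()
    cleaned = [cleanup(t) for t in texts]
    return cleaned, time.perf_counter() - start


if __name__ == "__main__":
    # Print the field names of a couple of records to see what the output looks like.
    for record in stream_books(limit=2):
        print(sorted(record.keys()))
```

Printing the record keys first avoids guessing the schema. Once the text field is known, its values can be passed to `time_cleanup` to get a rough number for how long cleanup takes per book.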