Question Answering System — NLP Project (Intermediate)
Build an end-to-end QA system using Haystack transformers & Streamlit 🔥
1. Introduction
Natural Language Processing (NLP) is one of the most important and exciting fields in AI and data science. NLP applications are already used in many places: chatbots, sentiment analyzers, recommender systems, translators, search engines, etc.
In this article, we will develop an end-to-end Question Answering application.
But what is Question Answering? It is the task of searching through a large collection of documents for a piece of text that answers a question. Put simply, it means answering questions using a set of documents as reference. QA systems are used for information retrieval, document search, real-time FAQs, etc.
Haystack is an open-source framework for building search systems that work intelligently over large document collections.
Streamlit is an open-source framework for building Machine Learning and Data Science web apps.
2. QA system — Overview
- Documents: Source of information. Word documents, plain text documents, PDFs, etc.
- File Converter: Converts files on your computer into the documents that can be processed by the Haystack pipeline.
- Preprocessor: Cleans and splits the text into sensible units.
- Document Store: The component in Haystack that stores the text documents and their metadata in a way that optimises retrieval time.
- Retriever: A lightweight filter that selects only the most relevant documents for the Reader to further process.
- Reader: A trained Question Answering model that does the closest reading of a document to extract the exact text which answers a question.
3. Project !!
We will develop our QA system on a book (a PDF of Think and Grow Rich by Napoleon Hill), using its pages as our documents. The system will answer our questions about the book.
Setting up the Python environment
# create a virtual environment
python3 -m venv env

# activate the env
# on Ubuntu:
source env/bin/activate
# on Windows:
env\Scripts\activate

# install required packages
pip install --upgrade pip
pip install git+https://github.com/deepset-ai/haystack.git
pip install requests streamlit pdftotext
Data
We download a PDF book from the internet. You can also use your own local documents.
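The download step can be sketched as a small helper. This is a minimal sketch and not the project's actual code: it uses only the standard library's urllib (rather than the requests package installed above) so that it runs without extra dependencies, and the function name download_file is made up for illustration.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, dest: str) -> Path:
    """Download `url` to `dest`, skipping the download if `dest` already exists.

    Hypothetical helper for illustration; not part of Haystack or the project repo.
    """
    path = Path(dest)
    if path.exists():
        return path  # reuse the local copy on repeated runs
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary ".part" file first, so an interrupted download
    # is never mistaken for a complete book on the next run.
    partial = path.with_suffix(path.suffix + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)  # stream to disk in blocks
    partial.replace(path)
    return path
```

A call would look like download_file("https://example.com/book.pdf", "data/book.pdf"), where the URL is a placeholder to be replaced with the address of your own PDF.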
Indexing pipeline
We will convert the text document into the Haystack-supported format and apply the Preprocessor to clean and split the document into sensible units. We will store these preprocessed texts in a SQL document store.
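To make the clean, split, and store steps concrete, here is a plain-Python sketch of the idea. It is not Haystack's Preprocessor or its SQL document store: the function names, the table layout, and the chunk size of 100 words with a 20-word overlap are all assumptions chosen for illustration, and the standard library's sqlite3 stands in for the document store.

```python
import re
import sqlite3
from typing import List


def clean_text(text: str) -> str:
    """Collapse runs of whitespace within each line and drop empty lines."""
    lines = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def split_into_chunks(text: str, chunk_words: int = 100, overlap: int = 20) -> List[str]:
    """Split text into overlapping windows of at most `chunk_words` words."""
    if not 0 <= overlap < chunk_words:
        raise ValueError("overlap must be >= 0 and smaller than chunk_words")
    words = text.split()
    step = chunk_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def store_pages(conn: sqlite3.Connection, pages: List[str], source: str) -> int:
    """Clean and split each page, then store the chunks with simple metadata."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "id INTEGER PRIMARY KEY, content TEXT NOT NULL, source TEXT, page INTEGER)"
    )
    rows = []
    for page_number, page in enumerate(pages, start=1):
        for chunk in split_into_chunks(clean_text(page)):
            rows.append((chunk, source, page_number))
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO documents (content, source, page) VALUES (?, ?, ?)", rows
        )
    return len(rows)
```

The pages argument is a list with one string per page, for example the pages extracted from the PDF with the pdftotext package installed earlier. Overlapping windows are one common way to avoid cutting an answer in half at a chunk boundary.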
Search pipeline
We will download our reader (a transformer model pre-trained on the QA task) and initialize our retriever, which searches the document store for the top k relevant documents.
For a given question, the retriever finds the top k documents relevant to the question, and the reader predicts answers from only those k documents instead of reading the whole document store.
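The retrieve-then-read flow can be illustrated with a toy example. This is a sketch under stated assumptions, not Haystack's retriever or reader: retrieve scores documents with a simple hand-written TF-IDF overlap, and reader is any callable you pass in (in the real system it would be the pre-trained QA model). All names here are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep only alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(question: str, documents: List[str], top_k: int = 3) -> List[Tuple[float, str]]:
    """Return up to `top_k` (score, document) pairs, best match first."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(documents)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(question))
    scored = []
    for doc, tokens in zip(documents, tokenized):
        if not tokens:
            continue
        tf = Counter(tokens)
        # Smoothed TF-IDF: rare terms shared with the question weigh more.
        score = sum(
            (tf[term] / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term in query_terms
        )
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def answer(
    question: str,
    documents: List[str],
    reader: Callable[[str, str], str],
    top_k: int = 3,
) -> List[str]:
    """Run the (expensive) reader only on the retriever's top_k candidates."""
    candidates = [doc for _, doc in retrieve(question, documents, top_k)]
    return [reader(question, doc) for doc in candidates]
```

The design point is the cost split: the retriever is cheap and sees every document, while the reader is slow and sees only k of them, so a larger k trades speed for a better chance that the passage containing the answer reaches the reader.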
Web app
We will build our web app using the Streamlit framework, which works well with Haystack and is easy to use.
Run the application using the command below, and our web app will open at http://localhost:8501/ in your browser.
streamlit run app.py
Summary
- Collected a PDF document (a book) from the internet.
- Extracted text from the pages of the book.
- Converted, preprocessed, and stored the text in a document store.
- Built a Haystack pipeline to predict answers using a retriever and a reader.
- Developed a web app to use our QA system.
Resources
Full Project Code: https://github.com/ashok49473/datascience-blogs/tree/main/QuestionAnsweringSystem-Haystack-Streamlit
Haystack Documentation: https://haystack.deepset.ai/overview/intro