Build an Open-Domain Question-Answering System With BERT in 3 Lines of Code

Open-Domain Question-Answering (QA) systems accept natural language questions as input and return exact answers from content buried within large text corpora such as Wikipedia. This is very different from standard search engines, which simply return the documents that match the keywords in a search query.

Building such systems for practical applications has historically been quite challenging and involved. This article attempts to make them more accessible and easier to apply. We will build a fully functional, end-to-end open-domain QA system in only 3 lines of code. To accomplish this, we will use ktrain, a Python library and TensorFlow wrapper that makes deep learning and AI more accessible and easier to apply. ktrain is free and open-source, and is available on GitHub.

The basic idea here will be to treat a document set covering many different topics as a kind of knowledge base to which we can submit questions and receive exact answers. In this article, we will use the 20 Newsgroups dataset as the knowledge base. As a collection of newsgroup postings which contain an abundance of opinions, debates, and arguments, the corpus is far from ideal as a knowledge base. It is generally better to use fact-based documents such as Wikipedia articles or even news articles. However, the 20 Newsgroups dataset will suffice for this example, as it makes for an interesting and easy-to-try case study. It also highlights the fact that Question-Answering (QA) systems can yield insights from any collection of documents.

Let us begin by loading the dataset into a Python list called docs using scikit-learn and importing some ktrain modules.
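A minimal sketch of this loading step, assuming scikit-learn is installed and the dataset can be downloaded on first use. The helper name load_newsgroup_docs and the choice to strip headers, footers, and quoted replies are illustrative assumptions, not requirements of ktrain.

```python
def load_newsgroup_docs():
    """Return the 20 Newsgroups postings (train + test) as a list of strings."""
    # Imported lazily so the dependency is loaded only when the data is needed.
    from sklearn.datasets import fetch_20newsgroups

    # Assumption: drop metadata and quoted replies so only body text is indexed.
    remove = ("headers", "footers", "quotes")
    train = fetch_20newsgroups(subset="train", remove=remove)
    test = fetch_20newsgroups(subset="test", remove=remove)
    return list(train.data) + list(test.data)
```

Calling docs = load_newsgroup_docs() yields the docs list referred to in the rest of the article. The ktrain modules can then be brought in with import ktrain and from ktrain import text.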

STEP 1: Create a Search Index

With a dataset in hand, we first need to create a search index. The search index allows us to quickly retrieve documents that contain words present in the question. Such documents are likely to contain the answer and can be analyzed further to extract it. We will first initialize the search index and then add the documents from the Python list to it. Since the newsgroup postings are small and fit in main memory, we will set commit_every to a large value to speed up indexing. This means that nothing is written to the index until the end. If you experience issues, try lowering this value.
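The two indexing calls can be sketched as follows, assuming ktrain is installed and docs is a list of strings. The wrapper name build_index is hypothetical, and setting commit_every to len(docs) is one way to pick the "large value" described above.

```python
def build_index(docs, index_dir):
    """Create a new search index at index_dir and add every document in docs."""
    from ktrain import text  # lazy import: ktrain pulls in TensorFlow

    # Create an empty index on disk.
    text.SimpleQA.initialize_index(index_dir)

    # Commit once at the very end, since the whole corpus fits in memory.
    text.SimpleQA.index_from_list(docs, index_dir, commit_every=len(docs))
    return index_dir
```

For example, build_index(docs, "/tmp/myindex") would index the newsgroup postings, where the path is only a placeholder for a directory that does not exist yet.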

For document sets that are too large to be loaded into a Python list, you can use SimpleQA.index_from_folder, which will crawl a folder and index all plain text documents found.
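A sketch of the folder-based variant, assuming the first two positional arguments of index_from_folder are the folder to crawl and the index directory (worth checking against your installed ktrain version). The wrapper name build_index_from_folder is hypothetical.

```python
def build_index_from_folder(folder_path, index_dir):
    """Create a new search index from the plain text files under folder_path."""
    from ktrain import text  # lazy import: ktrain pulls in TensorFlow

    # Create an empty index on disk, then let ktrain crawl the folder.
    text.SimpleQA.initialize_index(index_dir)
    text.SimpleQA.index_from_folder(folder_path, index_dir)
    return index_dir
```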

STEP 2: Create a QA Instance

Next, we will create a QA instance, which is largely a wrapper around a pretrained BertForQuestionAnswering model from the excellent transformers library.
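Creating the instance is a single constructor call that points at the index directory from STEP 1. The sketch below assumes ktrain is installed; the wrapper name create_qa is hypothetical and exists only to keep the heavy import lazy. Expect the pretrained model weights to be downloaded the first time this runs.

```python
def create_qa(index_dir):
    """Return a SimpleQA instance backed by the search index at index_dir."""
    from ktrain import text  # lazy import: ktrain pulls in TensorFlow

    # Wraps a pretrained BERT question-answering model around the index.
    return text.SimpleQA(index_dir)
```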

That’s it! In roughly 3 lines of code, we have built an end-to-end Question-Answering system that is ready to receive questions.

Ask Questions!

We will invoke the qa.ask method to issue questions to the text corpus we indexed and retrieve answers. The ask method performs the following steps:

Uses the search index to locate documents that contain words in the question
Extracts paragraphs from these documents for use as contexts and uses a BERT model pretrained on the SQuAD dataset to parse out candidate answers
Sorts and prunes candidate answers by confidence scores and returns results
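To make the three steps concrete, here is a toy retrieve-then-read pipeline in plain Python. It is an illustration only, not ktrain's implementation: word overlap stands in for the search index, and the reader argument is a hypothetical stand-in for the BERT model that returns an (answer, confidence) tuple or None.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def ask(question, docs, reader, top_docs=3, top_answers=5):
    """Toy pipeline: retrieve documents, read paragraphs, rank answers."""
    q_words = tokenize(question)

    # Step 1: rank documents by how many question words they contain.
    scored = [(len(q_words & tokenize(doc)), i) for i, doc in enumerate(docs)]
    hits = [i for overlap, i in sorted(scored, key=lambda s: -s[0]) if overlap > 0]

    # Step 2: split each retrieved document into paragraphs and read each one.
    candidates = []
    for i in hits[:top_docs]:
        for paragraph in docs[i].split("\n\n"):
            result = reader(question, paragraph)
            if result is not None:
                answer, confidence = result
                candidates.append(
                    {"answer": answer, "confidence": confidence, "reference": i}
                )

    # Step 3: sort by confidence and keep only the best few.
    candidates.sort(key=lambda c: c["confidence"], reverse=True)
    return candidates[:top_answers]
```

The real system replaces the overlap score with a proper search index and the reader with a BERT model pretrained on SQuAD, but the control flow has the same shape.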

We will also use the qa.display method to nicely format and display the top 5 results in our Jupyter notebook. Since the model is combing through paragraphs and sentences to find answers, it may take a few moments to return results.
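A sketch of the call pattern, assuming qa is the instance created in STEP 2 and that each answer is a dictionary with answer, confidence, and reference keys (the key names are an assumption to verify against your ktrain version). The helper names top_answers and format_answers are illustrative; the plain-text formatter is a stand-in for the notebook display when working outside Jupyter.

```python
def top_answers(qa, question, n=5):
    """Ask a question and return the n highest-ranked candidate answers."""
    answers = qa.ask(question)  # candidates come back sorted by confidence
    return answers[:n]


def format_answers(answers):
    """Render candidate answers as numbered plain-text lines."""
    lines = []
    for rank, a in enumerate(answers, start=1):
        lines.append(
            f"{rank}. {a['answer']} "
            f"(confidence={a['confidence']:.2f}, reference={a['reference']})"
        )
    return "\n".join(lines)
```

For example, print(format_answers(top_answers(qa, "When did the Cassini probe launch?"))) would list the five best candidates for the first question below.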

Note also that the 20 Newsgroups dataset covers events in the early to mid 1990s, so references to recent events will not exist. The dataset does, however, cover many different domain categories. For instance, there is a sci.space category that covers topics about space. Let’s begin with a space question!

Space Question:

When did the Cassini probe launch?

As you can see, the top candidate answer indicates that the Cassini space probe was launched in October of 1997, which appears to be correct. The specific answer within its context is highlighted in red under the column Context. Note that the correct answer will not always be the top answer, but it is in this case.

Since we used index_from_list to index documents, the last column (populated from the reference field in the answers dictionaries) shows the list index associated with the newsgroup posting containing the answer. This reference field can be used to peruse the entire document containing the answer with print(docs[59]). If using index_from_folder to index documents, then the reference field will be populated with the relative file path of the document instead.
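The lookup can be wrapped in a small helper so the index does not have to be copied by hand. This is a sketch that assumes the documents were indexed with index_from_list; the helper name source_document is hypothetical, and the int() conversion is a precaution in case the reference is stored as a string.

```python
def source_document(answer, docs):
    """Return the full text of the posting an answer was extracted from."""
    # With index_from_list, 'reference' is the document's position in docs.
    return docs[int(answer["reference"])]
```

When the index was built with index_from_folder, the reference is a file path, so the document would be opened from disk instead of looked up in a list.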

The 20 Newsgroups dataset also contains many posts discussing and debating Christianity. Let’s ask a question on this subject.

Religious Question:

Who was Jesus Christ?

Here, we see different views on Jesus Christ, as debated and discussed in this document set.

Finally, the 20 Newsgroups dataset also contains several newsgroup categories about computing subjects like computer graphics and PC hardware and software. Let’s ask a technical support question.

Technical Support Question:

What causes computer images to be too dark?

From the candidate answers, a lack of gamma correction seems to be at least one of the causes of the reported problem.

A Note About Deploying the QA System

To deploy this system, the only state that needs to be persisted is the search index we initialized and populated in STEP 1. Once a search index is initialized and populated, one can simply start from STEP 2 to use the QA system in a production environment.
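One way to express this in code is a startup helper that reuses the persisted index when it exists and builds it only when it does not. This is a sketch under assumptions: the helper name load_or_build_qa is hypothetical, the index is treated as present whenever its directory exists, and ktrain is assumed to be installed.

```python
import os


def load_or_build_qa(index_dir, docs=None):
    """Reuse a persisted search index if one exists; otherwise build it first."""
    from ktrain import text  # lazy import: ktrain pulls in TensorFlow

    if not os.path.isdir(index_dir):
        if docs is None:
            raise ValueError("no index at index_dir and no documents to build one")
        # STEP 1: only needed the first time, before the index is persisted.
        text.SimpleQA.initialize_index(index_dir)
        text.SimpleQA.index_from_list(docs, index_dir, commit_every=len(docs))

    # STEP 2: all a production process needs once the index is on disk.
    return text.SimpleQA(index_dir)
```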

Source Code for This Article

Source code for this article is available in two forms:

a Jupyter notebook available on our GitHub repo
a Google Colab notebook available here

Feel free to try out the ktrain QA module on your own document collections. For more information, visit our GitHub repository.
