Creating a Chat Bot using NLP and Keras in Python

Chat Bot using Python, Machine Learning, NLP, Keras and NLTK

Introduction

Chatbots are often used by businesses and organizations to automate customer service, sales, and marketing interactions, as well as to provide 24/7 support to their customers. They can also be used for personal purposes, such as entertainment, education, and productivity.

In this article we are going to create a chat bot using Python, machine learning, natural language processing (NLP), Keras, and NLTK (the Natural Language Toolkit). Before that, let us go over some basic concepts in the following sections.

What is a Chat Bot?

A chat-bot is a computer program designed to simulate conversation with human users, especially over the internet. Chat-bots can be programmed to interact with users in natural language through text-based interfaces, voice assistants, or chat windows on websites and in apps.

Chat-bots use artificial intelligence (AI) technologies such as natural language processing (NLP), machine learning, and pattern recognition to understand and interpret user inputs, and provide relevant responses based on pre-programmed scripts or learned patterns from past interactions.

What is Natural Language Processing or NLP?

Natural language processing (NLP) is the area of artificial intelligence concerned with processing and analyzing natural language data, such as text or speech, using computer algorithms and statistical models. Its goal is to make it possible for computers to comprehend, analyze, and produce human language.

NLP is widely used in a variety of applications, including virtual assistants, chatbots, search engines, speech recognition, and text analytics. As the amount of digital text data continues to grow, NLP is becoming an increasingly important tool for extracting valuable insights and knowledge from unstructured natural language data.

What is Keras?

Keras is an open-source neural network library written in Python. It is intended to be user-friendly, modular, and extensible, and it enables programmers to rapidly build deep learning models for a variety of applications.

Keras provides a high-level interface to popular deep learning frameworks such as TensorFlow, Microsoft Cognitive Toolkit (CNTK), and Theano, making it easier to build and train deep learning models. It offers a simple and intuitive API that makes it easy to build and train neural networks without requiring an in-depth knowledge of the underlying mathematics and algorithms.

Keras also includes a range of pre-built neural network layers and activation functions, as well as tools for data preprocessing and model evaluation. It supports both CPU and GPU computing, allowing users to take advantage of hardware acceleration to speed up training and inference.

Keras has become one of the most popular deep learning libraries, widely used in academia and industry for a range of applications, including image recognition, natural language processing, and time series forecasting.

What is NLTK?

NLTK (Natural Language Toolkit) is a popular open-source Python library for working with human language data. It provides a range of tools and resources for tasks such as tokenization, stemming, tagging, parsing, and semantic reasoning.

NLTK includes a wide range of language processing algorithms and models, as well as datasets and corpora for training and testing natural language models. It also includes a range of modules for working with specific language tasks, such as sentiment analysis, text classification, and named entity recognition.

NLTK is widely used in academic and industrial settings for natural language processing (NLP) research, education, and development. Its modular design makes it easy to use and extend for a wide range of NLP tasks, and its comprehensive documentation and tutorials make it a popular choice for beginners as well as experienced developers.
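
As a quick illustration of a few of these building blocks (tokenization, part-of-speech tagging, and stemming), here is a small example; the sentence and the download() calls below are just an illustration, not part of the chatbot code that follows.

# A small taste of NLTK: tokenize, tag parts of speech, and stem a word
import nltk
from nltk.stem import PorterStemmer

# nltk.download('punkt')                        # tokenizer data, needed on first use
# nltk.download('averaged_perceptron_tagger')   # POS tagger data, needed on first use

tokens = nltk.word_tokenize("NLTK makes language processing easier")
print(tokens)                              # ['NLTK', 'makes', 'language', 'processing', 'easier']
print(nltk.pos_tag(tokens))                # a part-of-speech tag for each token
print(PorterStemmer().stem("processing"))  # 'process'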

Creating a Chat Bot using Python, NLTK, Keras and NLP

There are a number of steps we need to follow to create and train this chat bot's deep learning model. Let us go through them step by step.

Import the required libraries

# Import the required libraries
import numpy as np
from keras.models import Sequential, load_model
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD
import random
import json
import pickle
import nltk
from nltk.stem import WordNetLemmatizer

# Uncomment on the first run if the NLTK tokenizer/lemmatizer data is missing
# nltk.download('punkt')
# nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

Load the data

# Load the data
file_name = 'data.json'
with open(file_name) as f:
    data_collection = json.load(f)
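
The contents of data.json are not shown in this article, so here is a minimal, hypothetical example of the structure the rest of the code expects, inferred from the keys used in the following steps ('dataroots', 'questions', 'name', and 'answers'); adapt it to your own intents and example phrasings.

# A hypothetical data.json: a list of "dataroots", each with an intent name,
# example questions, and canned answers (illustration only, not the author's data)
sample = {
    "dataroots": [
        {
            "name": "greeting",
            "questions": ["hello", "hi there", "good morning"],
            "answers": ["Hello!", "Hi, how can I help you?"]
        },
        {
            "name": "goodbye",
            "questions": ["bye", "see you later", "goodbye"],
            "answers": ["Goodbye!", "Talk to you soon."]
        }
    ]
}

with open('data.json', 'w') as f:
    json.dump(sample, f, indent=2)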

Pre-process the data

Tokenization is a fundamental task in Natural Language Processing (NLP) that involves breaking a text into individual words or meaningful sub-components, called tokens. Tokenization is typically the first step in NLP tasks such as text classification, sentiment analysis, and machine translation.

In tokenization, a text is split into individual tokens based on certain rules, such as whitespace, punctuation marks, and word boundaries. For example, the sentence “I love NLP” would be tokenized into the tokens “I”, “love”, and “NLP”.

Tokenization can be more complex in languages that use compound words, where a single word may be made up of multiple sub-words. In such cases, tokenization may involve splitting a word into sub-tokens, or combining multiple tokens to form a single token.

Tokenization is important because it allows a computer to understand the structure and meaning of a text by breaking it down into smaller, more manageable pieces. Once a text has been tokenized, it can be further analyzed and processed using a variety of NLP techniques and algorithms.
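
As a quick check, NLTK's word tokenizer splits a sentence like this (note that punctuation becomes its own token):

# Tokenizing a short sentence with NLTK
import nltk
# nltk.download('punkt')  # needed on the first run

print(nltk.word_tokenize("I love NLP!"))
# ['I', 'love', 'NLP', '!']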

# Pre-process the data: tokenize every example question and record its intent
vocabs = []   # every token seen across all questions
types = []    # the intent names (the classes to predict)
docss = []    # (token list, intent name) pairs

for dataroot in data_collection['dataroots']:
    for question in dataroot['questions']:
        wd = nltk.word_tokenize(question)      # split the question into tokens
        vocabs.extend(wd)
        docss.append((wd, dataroot['name']))
        if dataroot['name'] not in types:
            types.append(dataroot['name'])

Lemmatization is a technique in Natural Language Processing (NLP) that involves reducing words to their base or root form, called a lemma. The goal of lemmatization is to normalize words so that variations of the same word are treated as the same word, which helps improve the accuracy of NLP tasks such as text classification and sentiment analysis.

In lemmatization, a word is reduced to its base form based on its morphological analysis, such as its part of speech and inflectional endings. For example, given the right part-of-speech information, the words “walking,” “walked,” and “walks” would all be lemmatized to the base form “walk.”

Lemmatization is more complex than stemming, another text normalization technique, which involves removing the suffixes from a word to obtain a root form. Stemming can result in an incorrect root form, whereas lemmatization takes into account the context of the word and produces a correct root form based on its dictionary form.

Lemmatization can improve the accuracy of NLP tasks that rely on identifying the meaning of words and the relationships between words in a sentence. It is widely used in applications such as search engines, chatbots, and speech recognition systems to improve the accuracy of natural language processing.
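
As a quick check with NLTK's WordNetLemmatizer: by default it treats a word as a noun, so passing a part-of-speech hint (pos='v') is what maps the verb forms above to “walk”.

# Lemmatizing with and without a part-of-speech hint
from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')  # needed on the first run

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("walks"))             # 'walk' (treated as a plural noun)
print(lemmatizer.lemmatize("walking"))           # 'walking' (no change without the verb hint)
print(lemmatizer.lemmatize("walking", pos="v"))  # 'walk'
print(lemmatizer.lemmatize("walked", pos="v"))   # 'walk'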

# Normalize the vocabulary: drop punctuation, lowercase, lemmatize, and de-duplicate
ignore_vocabs = ['.', '?', '!']
vocabs = [wd.lower() for wd in vocabs if wd not in ignore_vocabs]
vocabs = [lemmatizer.lemmatize(wd) for wd in vocabs]
vocabs = sorted(set(vocabs))
types = sorted(set(types))

# Persist the vocabulary and intent names for use at prediction time
pickle.dump(vocabs, open('vocabs.pkl', 'wb'))
pickle.dump(types, open('types.pkl', 'wb'))

Create Training Data

In machine learning, it is essential to train and test the model to evaluate its performance and ensure that it can generalize well to new, unseen data. Training data is the set of labeled examples that are used to train the machine learning model, while test data is a set of labeled examples that are held back from the training process and used to evaluate the performance of the model on unseen data.

The primary reason for separating data into training and test sets is to prevent overfitting. Overfitting occurs when a model is too complex and fits the training data too closely, leading to poor performance on new, unseen data. By evaluating the model’s performance on the test set, we can get an estimate of how well the model will perform on new, unseen data and adjust the model’s complexity to avoid overfitting.

Moreover, keeping a separate test set that the model never sees during training helps avoid data leakage, where information from the test data influences training, so the model appears to perform well on the test set yet fails to generalize to genuinely new data.

In summary, training and test data are essential components of the machine learning workflow. Training data is used to build and optimize the machine learning model, while test data is used to evaluate its performance and ensure that it can generalize well to new, unseen data. By using a separate test set, we can prevent overfitting and data leakage, and ensure that the model is robust and reliable.
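
This walkthrough keeps things simple and trains on all of the available examples, but if you want to hold back a test set, a minimal sketch using scikit-learn's train_test_split (an extra dependency, not used elsewhere in this article) could look like this:

# Illustration only: split hypothetical bag-of-words features X and one-hot labels y
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randint(0, 2, size=(100, 50))       # 100 fake bag-of-words vectors of length 50
y = np.eye(4)[np.random.randint(0, 4, size=100)]  # 100 fake one-hot labels over 4 intents

# keep 20% of the examples aside for evaluating the trained model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)                # (80, 50) (20, 50)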

# create our training data: a bag-of-words vector and a one-hot intent label per document
training = []
output_var = [0] * len(types)
for doc in docss:
    # 1/0 for each vocabulary word: does it appear in this question?
    pattern_vocabs = [lemmatizer.lemmatize(wd.lower()) for wd in doc[0]]
    bag = [1 if wd in pattern_vocabs else 0 for wd in vocabs]

    # one-hot vector marking the intent this question belongs to
    output_row = list(output_var)
    output_row[types.index(doc[1])] = 1
    training.append([bag, output_row])

# shuffle and split into features (x) and labels (y)
random.shuffle(training)
training = np.array(training, dtype=object)  # dtype=object: bag and output_row have different lengths
train_x = list(training[:, 0])
train_y = list(training[:, 1])

Create the model

The Sequential API is one way of defining a model in Keras, the high-level neural networks API that runs on top of TensorFlow, Theano, or CNTK. It is meant for building straightforward, layer-by-layer models in which each layer has exactly one input tensor and one output tensor.

In the Sequential API, a model is defined as a sequence of layers, where each layer is added one at a time using the add() method. The layers can be of different types, such as Dense (fully connected), convolutional, or recurrent layers.

# Create the model: a simple feed-forward classifier over the bag-of-words input
model = Sequential()
model.add(Dense(256, input_shape=(len(train_x[0]),), activation='relu'))  # input layer sized to the vocabulary
model.add(Dropout(0.5))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(train_y[0]), activation='softmax'))  # one output per intent class
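
Optionally, print a summary to double-check the architecture before compiling; the exact parameter counts will depend on your vocabulary size and the number of intents.

# Inspect layer types, output shapes, and parameter counts
model.summary()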

Compile the model

Here the model has three Dense layers: the first has 256 neurons with ReLU activation, the second has 128 neurons with ReLU activation, and the output layer has one neuron per intent class with Softmax activation, with Dropout layers in between to reduce overfitting. The input_shape parameter specifies the shape of the input data, i.e. the length of the bag-of-words vector. The model is compiled with a categorical cross-entropy loss function, stochastic gradient descent (SGD) with Nesterov momentum as the optimizer, and the accuracy metric.

Once the model is compiled, it can be trained using the fit() method and evaluated using the evaluate() method. The Sequential API is a simple and intuitive way to build neural network models, and it is well suited for many simple classification and regression tasks.

# Stochastic gradient descent with Nesterov momentum
# (older Keras releases used lr= and decay=; recent versions use learning_rate=)
sgd = SGD(learning_rate=0.001, momentum=0.85, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])

Train the model

# Train for 100 epochs on the bag-of-words vectors and one-hot intent labels
para = model.fit(np.array(train_x), np.array(train_y), epochs=100, batch_size=10, verbose=1)

Save the Model

# Save the trained model to disk (the History object returned by fit() is not passed to save())
model.save('model.h5')

Predict using the Model

# Load the trained model and the saved vocabulary/intent lists
model = load_model('model.h5')
dataroots = json.loads(open('data.json').read())
vocabs = pickle.load(open('vocabs.pkl', 'rb'))
types = pickle.load(open('types.pkl', 'rb'))

# Create the prediction helpers
def model_response(msg):
    # Predict the intent of the message and pick a matching answer
    datar = predict(msg, model)
    if not datar:
        return "Sorry, I did not understand that."   # no intent passed the probability threshold
    res = sample_response(datar, dataroots)
    return res

def sample_response(datar, dataroots_json):
    # Return a random answer belonging to the predicted intent
    tag = datar[0]['dataroot']
    list_of_dataroots = dataroots_json['dataroots']
    result = "Sorry, I did not understand that."     # fallback if no intent name matches
    for i in list_of_dataroots:
        if i['name'] == tag:
            result = random.choice(i['answers'])
            break
    return result

def predict(sentence, model):
    # Return the intents whose predicted probability exceeds the threshold, most likely first
    p = vocabs_fn(sentence, vocabs, show_details=False)
    res = model.predict(np.array([p]))[0]
    threshold = 0.20
    results = [[i, r] for i, r in enumerate(res) if r > threshold]
    results.sort(key=lambda x: x[1], reverse=True)
    return_list = []
    for r in results:
        return_list.append({"dataroot": types[r[0]], "probability": str(r[1])})
    return return_list

def vocabs_fn(sentence, vocabs, show_details=True):
    # Convert a sentence into a bag-of-words vector over the saved vocabulary
    sentence_vocabs = clean(sentence)
    bag = [0] * len(vocabs)
    for s in sentence_vocabs:
        for i, w in enumerate(vocabs):
            if w == s:
                bag[i] = 1
                if show_details:
                    print("found in bag: %s" % w)
    return np.array(bag)

def clean(sentence):
    # Tokenize and lemmatize the input sentence
    sentence_vocabs = nltk.word_tokenize(sentence)
    sentence_vocabs = [lemmatizer.lemmatize(wd.lower()) for wd in sentence_vocabs]
    return sentence_vocabs

# Check the prediction
msg = "hello"
res = model_response(msg)
print(res)

Conclusion

In this article we discussed how to create a chat bot using Python, machine learning, natural language processing (NLP), Keras, and NLTK (the Natural Language Toolkit). I hope you liked the article and that it added some value to your knowledge in this area.

Happy Learning 🙂
