Sentiment Analysis using Deep Learning (BERT)
Sentiment analysis is one of the classic machine learning problems, with use cases across industries. For example, it can help gauge public opinion and brand perception by analyzing social media posts, or help businesses understand customer feedback to enhance their products and services. In this article, we will walk through, step by step, how to use Bidirectional Encoder Representations from Transformers (BERT), a deep learning technique, to solve a sentiment analysis problem.
Introduction
Sentiment analysis typically involves determining the emotional tone conveyed in text data. By employing machine learning techniques, sentiment analysis identifies whether a piece of text expresses positive, negative, or neutral sentiment. For example, a restaurant owner might use sentiment analysis to assess customer reviews on food delivery platforms. By analyzing the sentiment of these reviews, the owner can easily identify areas of strength or improvement, such as positive feedback on food quality but negative comments on service speed. This insight enables the owner to make data-driven decisions to enhance customer satisfaction and improve business operations.
A deep learning model learns from examples: it gets better at tasks like recognizing objects in images or understanding language by looking at lots and lots of examples. BERT (Bidirectional Encoder Representations from Transformers) [3] is a language model, meaning it is trained on a huge amount of text and understands a lot about how words and sentences work together. It is really good at capturing the context and meaning of words in a sentence, which makes it useful for tasks like answering questions, summarizing text, or translating languages.
Problem statement & Dataset
The dataset [2] is relatively straightforward, as shown below: it has a text column (tweets) and a category column. We have to create an algorithm that can predict the category of a tweet by learning from the data in the training set:
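For illustration, rows in the file have this shape (these are hypothetical examples in the dataset's format, not actual entries):

id      text                                           category
1001    Loved the new exhibit at the museum today!     happy
1002    An hour in the queue, really disappointing.    sad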

Code & Explanation
Libraries: we will mainly be using the pytorch, sklearn and transformers libraries. PyTorch handles the tensor definitions and computations needed while training the model and generating outputs from it. sklearn is an extensive ML library in its own right, but here we use it only to split the dataset and compute some metrics for model performance. Lastly, transformers provides the pre-trained BERT model on which we will build our classification model for this problem statement.
# Importing the os module to interact with the operating system
import os
# Listing the contents of the current directory
os.listdir('.')
# Importing pandas for data manipulation and analysis
import pandas as pd
# Importing numpy for numerical computations
import numpy as np
# Importing random for generating random numbers and making choices
import random
# Importing tqdm for displaying progress bars during iterations
from tqdm.notebook import tqdm
# Importing necessary functions and classes from scikit-learn for
# machine learning tasks
from sklearn.model_selection import train_test_split  # For splitting data into train and test sets
from sklearn.metrics import f1_score # For calculating F1 score
# Importing torch for building and training neural networks
import torch
# Importing transformers from Hugging Face for pre-trained models
# and tokenization
import transformers
from transformers import (BertTokenizer,  # For BERT tokenizer
                          AutoTokenizer,  # For automatic selection of tokenizer
                          BertForSequenceClassification,  # For BERT-based sequence classification model
                          AdamW,  # For AdamW optimizer
                          get_linear_schedule_with_warmup)  # For learning rate scheduling
# Importing necessary classes from torch.utils.data for handling datasets
from torch.utils.data import (TensorDataset, DataLoader,
                              RandomSampler, SequentialSampler)
Dataset & manipulation: In this step, we first fix the categories we want to limit our analysis to, by removing the others. We then convert the categories to numerical labels so they can be fed into the model.
# Reading the CSV file 'smile-annotations-final.csv' into a pandas DataFrame
# Assigning custom column names 'id', 'text', and 'category'
df = pd.read_csv('smile-annotations-final.csv',
                 names=['id', 'text', 'category'])
# Setting the 'id' column as the index of the DataFrame
df.set_index('id', inplace=True)
# Displaying the first few rows of the DataFrame using the 'head' method
display('head', df.head())
# Displaying the counts of unique values in the 'category' column
# using the 'value_counts' method
display('category counts', df.category.value_counts())
# Filtering out rows where the 'category' column contains '|'
df = df[~df.category.str.contains(r'\|')]
# Filtering out rows where the 'category' column is 'nocode'
df = df[df.category != 'nocode']
# Displaying the counts of unique values in the 'category' column
# after cleanup
display('category counts after cleanup', df.category.value_counts())
# Extracting unique categories from the 'category' column of the DataFrame
possible_labels = df.category.unique()
# Creating a dictionary to map string categories to numerical labels
label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index
# Creating a new column 'label' in the DataFrame by replacing string categories with numerical labels
df['label'] = df.category.replace(label_dict)
# Displaying the first few rows of the DataFrame with the new 'label' column
df.head()

Data after introducing numerical labels:

Splitting the data into training and validation sets: In the step below, we use sklearn to split the data into training and validation sets.
### splitting data into training and validation sets ###
X_train, X_val, y_train, y_val = train_test_split(df.index.values,
                                                  df.label.values,
                                                  test_size=0.15,
                                                  random_state=17,
                                                  stratify=df.label.values)
df['data_type'] = ['not_set']*df.shape[0]
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'
df.groupby(['category', 'label', 'data_type']).count()
Tokenization: Deep learning models require the training data (the examples from which they learn) in tensor form. Since the input is a dataframe of texts, we first split each text into individual tokens (the process of tokenization) and ensure that the resulting tokenized samples are all the same size, by padding them and capping their length. The tokenizer also returns an attention mask, one of the inputs to the BERT model, which marks which positions are real tokens and which are padding. The same processing is applied to the validation set.
# Using the BERT tokenizer from the 'bert-base-uncased' model
# and setting do_lower_case to True to ensure all text is lowercased
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
                                          do_lower_case=True)
# Encoding the text data in the training set using batch_encode_plus
# This method tokenizes and encodes a batch of sequences, adding special tokens,
# padding the sequences to the same length, and returning PyTorch tensors
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].text.values,  # Extracting text data for training
    add_special_tokens=True,  # Adding special tokens like [CLS] and [SEP]
    return_attention_mask=True,  # Returning attention masks to focus on actual tokens
    padding='max_length',  # Padding sequences to the same length
    truncation=True,  # Truncating sequences longer than max_length
    max_length=256,  # Maximum length of each sequence
    return_tensors='pt'  # Returning PyTorch tensors
)
# Encoding the text data in the validation set using batch_encode_plus
encoded_data_val = tokenizer.batch_encode_plus(
    df[df.data_type=='val'].text.values,  # Extracting text data for validation
    add_special_tokens=True,  # Adding special tokens like [CLS] and [SEP]
    return_attention_mask=True,  # Returning attention masks to focus on actual tokens
    padding='max_length',  # Padding sequences to the same length
    truncation=True,  # Truncating sequences longer than max_length
    max_length=256,  # Maximum length of each sequence
    return_tensors='pt'  # Returning PyTorch tensors
)
# Extracting input IDs, attention masks, and labels for the training set
input_ids_train = encoded_data_train['input_ids'] # Input IDs representing tokenized text
attention_masks_train = encoded_data_train['attention_mask'] # Attention masks indicating which tokens to attend to
labels_train = torch.tensor(df[df.data_type=='train'].label.values) # Labels for the training set
# Extracting input IDs, attention masks, and labels for the validation set
input_ids_val = encoded_data_val['input_ids'] # Input IDs representing tokenized text
attention_masks_val = encoded_data_val['attention_mask'] # Attention masks indicating which tokens to attend to
labels_val = torch.tensor(df[df.data_type=='val'].label.values) # Labels for the validation set
# Creating PyTorch datasets for training and validation
dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train) # Training dataset
dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val) # Validation dataset
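As a quick sanity check, one can encode a single sentence and inspect what the tokenizer returns (a minimal sketch; the sample tweet below is made up):
# Encoding one made-up tweet to inspect the tokenizer output
sample = "The service was slow but the food was amazing!"
encoded = tokenizer.encode_plus(sample,
                                add_special_tokens=True,
                                padding='max_length',
                                truncation=True,
                                max_length=32,
                                return_attention_mask=True,
                                return_tensors='pt')
print(encoded['input_ids'].shape)  # torch.Size([1, 32])
print(encoded['attention_mask'][0][:12])  # 1s for real tokens, 0s for padding
# Converting IDs back to tokens shows the added [CLS]/[SEP] markers
print(tokenizer.convert_ids_to_tokens(encoded['input_ids'][0].tolist())[:12])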
Setting up BERT and functions to estimate performance: We now set up the pre-trained BERT model and define the batch size for each training iteration, the optimizer, and the number of epochs. We also define the weighted F1 score and per-class accuracy as metrics to evaluate model performance.
# Initializing the BERT model for sequence classification from the pre-trained 'bert-base-uncased' model
# Specifying the number of labels in the output layer based on the length of the label dictionary
# Setting output_attentions and output_hidden_states to False to exclude additional outputs
# Setting resume_download to True to resume download if interrupted
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False,
                                                      resume_download=True)
# Defining the batch size for training and validation
batch_size = 32
# Creating data loaders for training and validation sets
# Using RandomSampler for training data and SequentialSampler for validation data
dataloader_train = DataLoader(dataset_train,
                              sampler=RandomSampler(dataset_train),
                              batch_size=batch_size)
dataloader_validation = DataLoader(dataset_val,
                                   sampler=SequentialSampler(dataset_val),
                                   batch_size=batch_size)
# Initializing the AdamW optimizer with the BERT model parameters
# Setting the learning rate to 2e-5 and epsilon to 1e-8
optimizer = AdamW(model.parameters(),
                  lr=2e-5,
                  eps=1e-8)
# Defining the number of epochs for training
epochs = 7
# Creating a linear scheduler with warmup for adjusting learning rates during training
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=len(dataloader_train)*epochs)
# Defining a function to calculate the F1 score
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')
# Defining a function to calculate accuracy per class
def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')
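To see what these functions expect, here is a small example for f1_score_func with made-up logits and labels (not data from the model):
# Dummy logits for four samples over three classes, and their true labels
dummy_preds = np.array([[2.0, 0.1, 0.3],
                        [0.2, 1.5, 0.1],
                        [0.1, 0.3, 1.9],
                        [1.2, 0.8, 0.4]])
dummy_labels = np.array([0, 1, 2, 1])
# argmax gives predictions [0, 1, 2, 0], so one of the four is wrong
print(f1_score_func(dummy_preds, dummy_labels))  # weighted F1 = 0.75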
Evaluation function: In the segment below, we assign the device used for computation, CPU or GPU, depending on availability. The evaluate method uses the fine-tuned model to generate predictions on the validation set.
### assigning seed to be able to reproduce results ###
seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)
# Checking for GPU availability and assigning the device accordingly
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device) # Moving the model to the selected device
print(device) # Printing the device (GPU or CPU) being used
# Defining the evaluation function for the validation set
def evaluate(dataloader_val):
    model.eval()  # Setting the model to evaluation mode
    loss_val_total = 0  # Initializing total validation loss
    predictions, true_vals = [], []  # Lists to store predictions and true values
    # Iterating through batches in the validation dataloader
    for batch in dataloader_val:
        batch = tuple(b.to(device) for b in batch)  # Moving batch tensors to the device
        inputs = {'input_ids': batch[0],  # Input token IDs
                  'attention_mask': batch[1],  # Attention masks
                  'labels': batch[2],  # Labels
                  }
        with torch.no_grad():  # Disabling gradient calculation
            outputs = model(**inputs)  # Forward pass
        loss = outputs[0]  # Extracting the loss value from the output
        logits = outputs[1]  # Predicted logits
        loss_val_total += loss.item()  # Accumulating validation loss
        logits = logits.detach().cpu().numpy()  # Detaching logits from the computation graph and moving to CPU
        label_ids = inputs['labels'].cpu().numpy()  # Moving label IDs to CPU
        predictions.append(logits)  # Appending predictions to the list
        true_vals.append(label_ids)  # Appending true values to the list
    loss_val_avg = loss_val_total/len(dataloader_val)  # Calculating average validation loss
    # Concatenating predictions and true values to form arrays
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
    return loss_val_avg, predictions, true_vals  # Returning validation loss, predictions, and true values
Training: Now, we fine-tune the pre-trained BERT model on the training data.
# Training loop for each epoch
for epoch in tqdm(range(1, epochs+1)):
    model.train()  # Setting the model to training mode
    loss_train_total = 0  # Initializing total training loss
    # Progress bar for the training epoch
    progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch),
                        leave=False, disable=False)
    for batch in progress_bar:
        model.zero_grad()  # Resetting gradients
        batch = tuple(b.to(device) for b in batch)  # Moving batch tensors to the device
        inputs = {'input_ids': batch[0],  # Input token IDs
                  'attention_mask': batch[1],  # Attention masks
                  'labels': batch[2],  # Labels
                  }
        outputs = model(**inputs)  # Forward pass
        loss = outputs[0]  # Extracting the loss value from the output
        loss_train_total += loss.item()  # Accumulating training loss
        loss.backward()  # Backpropagation
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # Clipping gradients to prevent explosion
        optimizer.step()  # Optimizer step
        scheduler.step()  # Scheduler step
        # Updating the progress bar with the current batch loss
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
    torch.save(model.state_dict(), f'finetuned_BERT_epoch_{epoch}.model')  # Saving the model after each epoch
    tqdm.write(f'\nEpoch {epoch}')  # Printing the current epoch
    loss_train_avg = loss_train_total/len(dataloader_train)  # Calculating average training loss
    tqdm.write(f'Training loss: {loss_train_avg}')  # Printing training loss
    val_loss, predictions, true_vals = evaluate(dataloader_validation)  # Evaluating on the validation set
    val_f1 = f1_score_func(predictions, true_vals)  # Calculating the F1 score
    tqdm.write(f'Validation loss: {val_loss}')  # Printing validation loss
    tqdm.write(f'F1 Score (Weighted): {val_f1}')  # Printing the F1 score
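After training, one can reload a saved checkpoint and inspect per-class accuracy on the validation set (a minimal sketch; the epoch number below is just an example):
# Loading the weights saved after an epoch (epoch 7 here, as an example)
model.load_state_dict(torch.load('finetuned_BERT_epoch_7.model',
                                 map_location=device))
# Re-running evaluation and printing per-class accuracy
_, predictions, true_vals = evaluate(dataloader_validation)
accuracy_per_class(predictions, true_vals)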

Results
Training loss is not an accurate measure of performance, as it is possible to overfit the training data during training. A better way of evaluating the model is to review its performance on unseen data, the validation set in our case. Let's check how the loss and F1 score evolved over the epochs:
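One way to chart these values is to collect them during training and plot them with matplotlib (a sketch; the three lists below are assumed to have been appended to at the end of each epoch in the loop above and are not defined in the original code):
import matplotlib.pyplot as plt
# Assumes three lists were collected during training, e.g. by adding
#   train_losses.append(loss_train_avg)
#   val_losses.append(val_loss)
#   val_f1s.append(val_f1)
# at the end of each epoch in the training loop above.
epochs_range = range(1, epochs + 1)
plt.plot(epochs_range, train_losses, label='Training loss')
plt.plot(epochs_range, val_losses, label='Validation loss')
plt.plot(epochs_range, val_f1s, label='F1 score (weighted)')
plt.xlabel('Epoch')
plt.legend()
plt.show()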

We can see that both the loss and the F1 score have plateaued, indicating that we have sufficiently fine-tuned the model. The weighted F1 score is 0.83, which is good for a model without any hyperparameter tuning. Moreover, we have not even cleaned the tweets to remove special characters and the like, which would likely have improved the results further (a simple cleaning pass is sketched below).
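A hypothetical preprocessing function for this (not part of the original pipeline) might strip handles, URLs, and special characters before tokenization:
import re

def clean_tweet(text):
    text = re.sub(r'http\S+', '', text)         # remove URLs
    text = re.sub(r'@\w+', '', text)            # remove @mentions
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)  # drop special characters
    return text.strip()

# This would be applied before tokenization, e.g.:
# df['text'] = df.text.apply(clean_tweet)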
Conclusion & Future work
A pre-trained BERT model can deliver really good results on natural language processing tasks with mere fine-tuning on custom training data related to the problem. This can easily be applied to industry or academic problems, for example performing sentiment analysis on customer review data for a restaurant. As discussed above, we can further improve the model by cleaning the data, performing hyperparameter tuning, experimenting with the maximum text length, and so on.
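For instance, the fine-tuned model could score a new restaurant review along these lines (a hedged sketch: the review text is made up, and it assumes a transformers version whose model output exposes .logits):
# Scoring one made-up review with the fine-tuned model
review = "The pasta was delicious but the waiter ignored us for twenty minutes."
encoded = tokenizer.encode_plus(review,
                                add_special_tokens=True,
                                padding='max_length',
                                truncation=True,
                                max_length=256,
                                return_tensors='pt').to(device)
model.eval()  # Inference mode
with torch.no_grad():
    logits = model(**encoded).logits
# Mapping the predicted numerical label back to its category name
label_dict_inverse = {v: k for k, v in label_dict.items()}
print(label_dict_inverse[logits.argmax(dim=1).item()])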
If you liked the explanation, follow me for more! Feel free to leave your comments if you have any queries or suggestions.
You can also check out my other articles on data science and computing on Medium. If you like my work and want to contribute to my journey, you can always buy me a coffee :)
References
[1] GitHub link to the notebook: https://github.com/girish9851/Sentiment-Analysis-with-Deep-Learning-using-BERT/blob/master/Sentiment_analysis_with_deep_learning_using_BERT.ipynb
[2] SMILE Twitter emotion dataset: https://www.kaggle.com/datasets/ashkhagan/smile-twitter-emotion-dataset
[3] BERT: https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270