Sentiment Analysis using Deep Learning (BERT)
Sentiment analysis is one of the classic machine learning problems, with use cases across industries. For example, it can help gauge public opinion and brand perception by analyzing social media posts, or help businesses understand customer feedback to enhance their products and services. In this article, we will walk through, step by step, how to use Bidirectional Encoder Representations from Transformers (BERT), a deep learning technique, to solve a sentiment analysis problem.
Introduction
Sentiment analysis typically involves determining the emotional tone conveyed in text data. By employing machine learning techniques, sentiment analysis identifies whether a piece of text expresses positive, negative, or neutral sentiment. For example, a restaurant owner might use sentiment analysis to assess customer reviews on food delivery platforms. By analyzing the sentiment of these reviews, the owner can easily identify areas of strength or improvement, such as positive feedback on food quality but negative comments on service speed. This insight enables the owner to make data-driven decisions to enhance customer satisfaction and improve business operations.
A deep learning model learns from examples: it gets better at tasks like recognizing objects in images or understanding language by looking at lots and lots of examples. BERT (Bidirectional Encoder Representations from Transformers) [3] is a language model, meaning it is trained on a huge amount of text and understands a lot about how words and sentences work together. It is really good at capturing the context and meaning of words in a sentence, which makes it useful for tasks like answering questions, summarizing text, or translating languages.
Problem statement & Dataset
The dataset [2] is relatively straightforward, as shown below: it has a text column (tweets) and a category column. We have to create an algorithm that can predict the category of a tweet by learning from the data in the training set:
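For illustration, rows in the file have this shape (these are hypothetical examples in the dataset's format, not actual entries):

id      text                                           category
1001    Loved the new exhibit at the museum today!     happy
1002    An hour in the queue, really disappointing.    sad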

Code & Explanation
Libraries: we will mainly be using the pytorch, sklearn and transformers libraries. PyTorch handles the tensor definitions and computations needed while training the model and generating outputs from it. sklearn is an extensive ML library in its own right, but here we use it only to split the dataset and compute some metrics for model performance. Lastly, transformers provides the pre-trained BERT model on which we will build our classification model for this problem statement.
# Importing the os module to interact with the operating system
import os
# Listing the contents of the current directory
os.listdir('.')
# Importing pandas for data manipulation and analysis
import pandas as pd
# Importing numpy for numerical computations
import numpy as np
# Importing random for generating random numbers and making choices
import random
# Importing tqdm for displaying progress bars during iterations
from tqdm.notebook import tqdm
# Importing necessary functions and classes from scikit-learn for
# machine learning tasks
from sklearn.model_selection import train_test_split  # For splitting data into train and test sets
from sklearn.metrics import f1_score # For calculating F1 score
# Importing torch for building and training neural networks
import torch
# Importing transformers from Hugging Face for pre-trained models
# and tokenization
import transformers
from transformers import (BertTokenizer,  # For BERT tokenizer
                          AutoTokenizer,  # For automatic selection of tokenizer
                          BertForSequenceClassification,  # For BERT-based sequence classification model
                          AdamW,  # For AdamW optimizer
                          get_linear_schedule_with_warmup)  # For learning rate scheduling
# Importing necessary classes from torch.utils.data for handling datasets
from torch.utils.data import (TensorDataset, DataLoader,
                              RandomSampler, SequentialSampler)
Dataset & manipulation: In this step, we first fix the categories we want to limit our analysis to, by removing the others. We then convert the categories to numerical labels so they can be fed into the model.
# Reading the CSV file 'smile-annotations-final.csv' into a pandas DataFrame
# Assigning custom column names 'id', 'text', and 'category'
df = pd.read_csv('smile-annotations-final.csv',
                 names=['id', 'text', 'category'])
# Setting the 'id' column as the index of the DataFrame
df.set_index('id', inplace=True)
# Displaying the first few rows of the DataFrame using the 'head' method
display('head', df.head())
# Displaying the counts of unique values in the 'category' column
# using the 'value_counts' method
display('category counts', df.category.value_counts())
# Filtering out rows where the 'category' column contains '|'
df = df[~df.category.str.contains(r'\|')]
# Filtering out rows where the 'category' column is 'nocode'
df = df[df.category != 'nocode']
# Displaying the counts of unique values in the 'category' column
# after cleanup
display('category counts after cleanup', df.category.value_counts())
# Extracting unique categories from the 'category' column of the DataFrame
possible_labels = df.category.unique()
# Creating a dictionary to map string categories to numerical labels
label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index
# Creating a new column 'label' in the DataFrame by replacing string categories with numerical labels
df['label'] = df.category.replace(label_dict)
# Displaying the first few rows of the DataFrame with the new 'label' column
df.head()

Data after introducing numerical labels:

Splitting the data into training and validation sets: In the step below, we use sklearn to split the data into training and validation sets.
### splitting data into training and validation sets ###
X_train, X_val, y_train, y_val = train_test_split(df.index.values,
                                                  df.label.values,
                                                  test_size=0.15,
                                                  random_state=17,
                                                  stratify=df.label.values)
df['data_type'] = ['not_set']*df.shape[0]
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'
df.groupby(['category', 'label', 'data_type']).count()
Tokenization: Deep learning models require the training data (the examples from which they learn) in tensor form. Since the input is a dataframe of texts, we first split each text into individual tokens (the process of tokenization) and ensure that the resulting tokenized samples are all the same size, by padding them and capping their length. The tokenizer also returns an attention mask, one of the inputs to the BERT model, which marks which positions are real tokens and which are padding. The same processing is applied to the validation set.
# Using the BERT tokenizer from the 'bert-base-uncased' model
# and setting do_lower_case to True to ensure all text is lowercased
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
                                          do_lower_case=True)
# Encoding the text data in the training set using batch_encode_plus
# This method tokenizes and encodes a batch of sequences, adding special tokens,
# padding the sequences to the same length, and returning PyTorch tensors
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].text.values,  # Extracting text data for training
    add_special_tokens=True,  # Adding special tokens like [CLS] and [SEP]
    return_attention_mask=True,  # Returning attention masks to focus on actual tokens
    padding='max_length',  # Padding sequences to the same length
    truncation=True,  # Truncating sequences longer than max_length
    max_length=256,  # Maximum length of each sequence
    return_tensors='pt'  # Returning PyTorch tensors
)
# Encoding the text data in the validation set using batch_encode_plus
encoded_data_val = tokenizer.batch_encode_plus(
    df[df.data_type=='val'].text.values,  # Extracting text data for validation
    add_special_tokens=True,  # Adding special tokens like [CLS] and [SEP]
    return_attention_mask=True,  # Returning attention masks to focus on actual tokens
    padding='max_length',  # Padding sequences to the same length
    truncation=True,  # Truncating sequences longer than max_length
    max_length=256,  # Maximum length of each sequence
    return_tensors='pt'  # Returning PyTorch tensors
)
# Extracting input IDs, attention masks, and labels for the training set
input_ids_train = encoded_data_train['input_ids'] # Input IDs representing tokenized text
attention_masks_train = encoded_data_train['attention_mask'] # Attention masks indicating which tokens to attend to
labels_train = torch.tensor(df[df.data_type=='train'].label.values) # Labels for the training set
# Extracting input IDs, attention masks, and labels for the validation set
input_ids_val = encoded_data_val['input_ids'] # Input IDs representing tokenized text
attention_masks_val = encoded_data_val['attention_mask'] # Attention masks indicating which tokens to attend to
labels_val = torch.tensor(df[df.data_type=='val'].label.values) # Labels for the validation set
# Creating PyTorch datasets for training and validation
dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train) # Training dataset
dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val) # Validation dataset
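As a quick sanity check, one can encode a single sentence and inspect what the tokenizer returns (a minimal sketch; the sample tweet below is made up):
# Encoding one made-up tweet to inspect the tokenizer output
sample = "The service was slow but the food was amazing!"
encoded = tokenizer.encode_plus(sample,
                                add_special_tokens=True,
                                padding='max_length',
                                truncation=True,
                                max_length=32,
                                return_attention_mask=True,
                                return_tensors='pt')
print(encoded['input_ids'].shape)  # torch.Size([1, 32])
print(encoded['attention_mask'][0][:12])  # 1s for real tokens, 0s for padding
# Converting IDs back to tokens shows the added [CLS]/[SEP] markers
print(tokenizer.convert_ids_to_tokens(encoded['input_ids'][0].tolist())[:12])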
Setting up BERT and functions to estimate performance: We now set up the pre-trained BERT model and define the batch size for each training iteration, the optimizer, and the number of epochs. We also define the weighted F1 score and per-class accuracy as metrics to evaluate model performance.
# Initializing the BERT model for sequence classification from the pre-trained 'bert-base-uncased' model
# Specifying the number of labels in the output layer based on the length of the label dictionary
# Setting output_attentions and output_hidden_states to False to exclude additional outputs
# Setting resume_download to True to resume download if interrupted
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False,
                                                      resume_download=True)
# Defining the batch size for training and validation
batch_size = 32
# Creating data loaders for training and validation sets
# Using RandomSampler for training data and SequentialSampler for validation data
dataloader_train = DataLoader(dataset_train,
                              sampler=RandomSampler(dataset_train),
                              batch_size=batch_size)
dataloader_validation = DataLoader(dataset_val,
                                   sampler=SequentialSampler(dataset_val),
                                   batch_size=batch_size)
# Initializing the AdamW optimizer with the BERT model parameters
# Setting the learning rate to 2e-5 and epsilon to 1e-8
optimizer = AdamW(model.parameters(),
                  lr=2e-5,
                  eps=1e-8)
# Defining the number of epochs for training
epochs = 7
# Creating a linear scheduler with warmup for adjusting learning rates during training
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=len(dataloader_train)*epochs)
# Defining a function to calculate the F1 score
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')
# Defining a function to calculate accuracy per class
def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')
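To see what these functions expect, here is a small example for f1_score_func with made-up logits and labels (not data from the model):
# Dummy logits for four samples over three classes, and their true labels
dummy_preds = np.array([[2.0, 0.1, 0.3],
                        [0.2, 1.5, 0.1],
                        [0.1, 0.3, 1.9],
                        [1.2, 0.8, 0.4]])
dummy_labels = np.array([0, 1, 2, 1])
# argmax gives predictions [0, 1, 2, 0], so one of the four is wrong
print(f1_score_func(dummy_preds, dummy_labels))  # weighted F1 = 0.75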
Evaluation function: In the segment below, we assign the device used for computation, CPU or GPU, depending on availability. The evaluate method uses the fine-tuned model to generate predictions on the validation set.
### assigning seed to be able to reproduce results ###
seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)
# Checking for GPU availability and assigning the device accordingly
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device) # Moving the model to the selected device
print(device) # Printing the device (GPU or CPU) being used
# Defining the evaluation function for the validation set
def evaluate(dataloader_val):
    model.eval()  # Setting the model to evaluation mode
    loss_val_total = 0  # Initializing total validation loss
    predictions, true_vals = [], []  # Lists to store predictions and true values
    # Iterating through batches in the validation dataloader
    for batch in dataloader_val:
        batch = tuple(b.to(device) for b in batch)  # Moving batch tensors to the device
        inputs = {'input_ids': batch[0],  # Input token IDs
                  'attention_mask': batch[1],  # Attention masks
                  'labels': batch[2],  # Labels
                  }
        with torch.no_grad():  # Disabling gradient calculation
            outputs = model(**inputs)  # Forward pass
        loss = outputs[0]  # Extracting the loss value from the output
        logits = outputs[1]  # Predicted logits
        loss_val_total += loss.item()  # Accumulating validation loss
        logits = logits.detach().cpu().numpy()  # Detaching logits from the computation graph and moving to CPU
        label_ids = inputs['labels'].cpu().numpy()  # Moving label IDs to CPU
        predictions.append(logits)  # Appending predictions to the list
        true_vals.append(label_ids)  # Appending true values to the list
    loss_val_avg = loss_val_total/len(dataloader_val)  # Calculating average validation loss
    # Concatenating predictions and true values to form arrays
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
    return loss_val_avg, predictions, true_vals  # Returning validation loss, predictions, and true values
Training: Now, we fine-tune the pre-trained BERT model on the training data.
# Training loop for each epoch
for epoch in tqdm(range(1, epochs+1)):
    model.train()  # Setting the model to training mode
    loss_train_total = 0  # Initializing total training loss
    # Progress bar for the training epoch
    progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch),
                        leave=False, disable=False)
    for batch in progress_bar:
        model.zero_grad()  # Resetting gradients
        batch = tuple(b.to(device) for b in batch)  # Moving batch tensors to the device
        inputs = {'input_ids': batch[0],  # Input token IDs
                  'attention_mask': batch[1],  # Attention masks
                  'labels': batch[2],  # Labels
                  }
        outputs = model(**inputs)  # Forward pass
        loss = outputs[0]  # Extracting the loss value from the output
        loss_train_total += loss.item()  # Accumulating training loss
        loss.backward()  # Backpropagation
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # Clipping gradients to prevent explosion
        optimizer.step()  # Optimizer step
        scheduler.step()  # Scheduler step
        # Updating the progress bar with the current batch loss
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
    torch.save(model.state_dict(), f'finetuned_BERT_epoch_{epoch}.model')  # Saving the model after each epoch
    tqdm.write(f'\nEpoch {epoch}')  # Printing the current epoch
    loss_train_avg = loss_train_total/len(dataloader_train)  # Calculating average training loss
    tqdm.write(f'Training loss: {loss_train_avg}')  # Printing training loss
    val_loss, predictions, true_vals = evaluate(dataloader_validation)  # Evaluating on the validation set
    val_f1 = f1_score_func(predictions, true_vals)  # Calculating the F1 score
    tqdm.write(f'Validation loss: {val_loss}')  # Printing validation loss
    tqdm.write(f'F1 Score (Weighted): {val_f1}')  # Printing the F1 score
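After training, one can reload a saved checkpoint and inspect per-class accuracy on the validation set (a minimal sketch; the epoch number below is just an example):
# Loading the weights saved after an epoch (epoch 7 here, as an example)
model.load_state_dict(torch.load('finetuned_BERT_epoch_7.model',
                                 map_location=device))
# Re-running evaluation and printing per-class accuracy
_, predictions, true_vals = evaluate(dataloader_validation)
accuracy_per_class(predictions, true_vals)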

Results
Training loss is not an accurate measure of performance, as it is possible to overfit the training data during training. A better way of evaluating the model is to review its performance on unseen data, the validation set in our case. Let's check how the loss and F1 score evolved over the epochs:
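One way to chart these values is to collect them during training and plot them with matplotlib (a sketch; the three lists below are assumed to have been appended to at the end of each epoch in the loop above and are not defined in the original code):
import matplotlib.pyplot as plt
# Assumes three lists were collected during training, e.g. by adding
#   train_losses.append(loss_train_avg)
#   val_losses.append(val_loss)
#   val_f1s.append(val_f1)
# at the end of each epoch in the training loop above.
epochs_range = range(1, epochs + 1)
plt.plot(epochs_range, train_losses, label='Training loss')
plt.plot(epochs_range, val_losses, label='Validation loss')
plt.plot(epochs_range, val_f1s, label='F1 score (weighted)')
plt.xlabel('Epoch')
plt.legend()
plt.show()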

We can see that both the loss and the F1 score have plateaued, indicating that we have sufficiently fine-tuned the model. The weighted F1 score is 0.83, which is good for a model without any hyperparameter tuning. Moreover, we have not even cleaned the tweets to remove special characters and the like, which would likely have improved the results further (a simple cleaning pass is sketched below).
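A hypothetical preprocessing function for this (not part of the original pipeline) might strip handles, URLs, and special characters before tokenization:
import re

def clean_tweet(text):
    text = re.sub(r'http\S+', '', text)         # remove URLs
    text = re.sub(r'@\w+', '', text)            # remove @mentions
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)  # drop special characters
    return text.strip()

# This would be applied before tokenization, e.g.:
# df['text'] = df.text.apply(clean_tweet)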
Conclusion & Future work
A pre-trained BERT model can deliver really good results on natural language processing tasks with mere fine-tuning on custom training data related to the problem. This can easily be applied to industry or academic problems, for example performing sentiment analysis on customer review data for a restaurant. As discussed above, we can further improve the model by cleaning the data, performing hyperparameter tuning, experimenting with the maximum text length, and so on.
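For instance, the fine-tuned model could score a new restaurant review along these lines (a hedged sketch: the review text is made up, and it assumes a transformers version whose model output exposes .logits):
# Scoring one made-up review with the fine-tuned model
review = "The pasta was delicious but the waiter ignored us for twenty minutes."
encoded = tokenizer.encode_plus(review,
                                add_special_tokens=True,
                                padding='max_length',
                                truncation=True,
                                max_length=256,
                                return_tensors='pt').to(device)
model.eval()  # Inference mode
with torch.no_grad():
    logits = model(**encoded).logits
# Mapping the predicted numerical label back to its category name
label_dict_inverse = {v: k for k, v in label_dict.items()}
print(label_dict_inverse[logits.argmax(dim=1).item()])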
If you liked the explanation, follow me for more! Feel free to leave your comments if you have any queries or suggestions.
You can also check out my other articles on data science and computing on Medium. If you like my work and want to contribute to my journey, you can always buy me a coffee :)
References
[1] GitHub link to the notebook: https://github.com/girish9851/Sentiment-Analysis-with-Deep-Learning-using-BERT/blob/master/Sentiment_analysis_with_deep_learning_using_BERT.ipynb
[2] SMILE Twitter emotion dataset: https://www.kaggle.com/datasets/ashkhagan/smile-twitter-emotion-dataset
[3] BERT: https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270