Sentiment Analysis using Machine Learning: A Structured Approach towards the Optimal Solution
With a range of NLP and machine learning techniques available, how do you start on a sentiment analysis problem and take the right steps towards the optimal solution? Sounds exciting! Let’s get right into it.
Introduction
Natural language processing offers plenty of tools for any given problem. With so many tools, however, comes the question of how to use them in the best possible way.
In this article, we take up the problem of detecting sentiment in tweets. As we will see, this is not a straightforward problem, and it needs a variety of NLP and machine learning tools to arrive at a reasonably good solution.
Problem Statement
The dataset contains the tweet text, its language, the number of retweets, the original author's user id, and the sentiment class.
Using the training data, we need to build a model that predicts the sentiment class from the information in the remaining columns.
Below are the libraries we will be using:
##### libraries #####
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
##### NLP libraries ######
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
import string
import re
##### dimensionality reduction #######
from sklearn.manifold import TSNE
from sklearn.decomposition import TruncatedSVD
##### machine learning ######
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.metrics import f1_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
import lightgbm as lgbm
import xgboost as xgboost
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import RandomForestClassifier
# Loading the dataset
train = pd.read_csv('train.csv')
test =pd.read_csv('test.csv')
train.head()
Data Exploration
We will begin by looking at each feature and checking its key properties:
1. Class imbalance: It’s important to look at the class imbalance because (a) it can determine the choice of metric for evaluating our model, and (b) it lets us decide whether we need to create more training samples if the classes are significantly skewed.
# Create a figure
ch = plt.figure(figsize=(15, 8))
# Plot a count of sentiment classes on the current axes
# (catplot is figure-level and does not accept an ax argument, so countplot is used)
sns.countplot(data=train, y='sentiment_class')
# Comment on class imbalance
# Approximately 50% of sentiments are neutral, while only 25% each are positive and negative.
plt.title('Count Plot of Categories')
# Save the figure
plt.savefig('cat_count.png', pad_inches=0.01)

Verdict: Since the classes are somewhat imbalanced, we will use the weighted F1 score as the metric for our model. The weighted F1 score averages the per-class F1 scores, weighting each class by its number of samples.
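To make the weighted F1 score concrete, here is a minimal sketch on hypothetical toy labels (not the challenge data). It computes the per-class F1 scores, weights them by class support, and compares the result with scikit-learn's `average='weighted'` option.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels (hypothetical): 0 = neutral, 1 = positive, -1 = negative
y_true = np.array([0, 0, 0, 0, 1, 1, -1, -1])
y_pred = np.array([0, 0, 1, 0, 1, 0, -1, 1])

labels = np.array([-1, 0, 1])
# Per-class F1 scores, in the order given by `labels`
per_class_f1 = f1_score(y_true, y_pred, labels=labels, average=None)
# Support = number of true samples in each class
support = np.array([(y_true == label).sum() for label in labels])

# Weighted F1 = support-weighted mean of the per-class scores
manual_weighted_f1 = float((per_class_f1 * support).sum() / support.sum())
sklearn_weighted_f1 = float(f1_score(y_true, y_pred, average='weighted'))

print(per_class_f1, manual_weighted_f1, sklearn_weighted_f1)
```

Because the weights come from the true labels, the order of arguments matters: the true labels go first and the predictions second.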
2. Author name: The author might not look like a useful feature, but let’s check it out anyway.
# Count the number of tweets per original author
original_authors_train = train['original_author'].value_counts()
# Display the top 320 authors with the most tweets
original_authors_train.head(320)
# Comment on authors with multiple tweets
# Approximately 300 users have more than one tweet.
# Group by author and sentiment class to see sentiment variation by original author
train.groupby(['original_author', 'sentiment_class'])['id'].count().tail(40)
# Comment on sentiment variation among authors
# Some authors have both positive and neutral tweets, but most users tweeting neutral tend to remain neutral.
# Authors could be a useful feature.

Verdict: Authors with neutral tweets tend to stay neutral when they tweet again, so this feature is likely to add some information to our model.
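One way to put a number on this observation is to measure how many repeat authors stay within a single sentiment class. The sketch below uses a hypothetical toy DataFrame whose column names mirror the ones in this article; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the ones used in this article
toy = pd.DataFrame({
    'original_author': ['a', 'a', 'b', 'b', 'b', 'c', 'd', 'd'],
    'sentiment_class': [0, 0, 0, 1, 0, -1, 1, 1],
})

# Number of tweets and number of distinct sentiment classes per author
per_author = toy.groupby('original_author')['sentiment_class'].agg(
    n_tweets='count',
    n_classes='nunique',
)
# Keep only authors who tweeted more than once
repeat_authors = per_author[per_author['n_tweets'] > 1]
# Share of repeat authors whose tweets all fall in a single class
consistent_share = float((repeat_authors['n_classes'] == 1).mean())

print(repeat_authors)
print(consistent_share)
```

A share close to 1 would support keeping the author as a feature; a share close to the chance level would suggest it adds little.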
3. Retweet count
# Flag rows whose retweet count contains letters (i.e. is not a number)
has_letters = train['retweet_count'].str.contains(r'[a-zA-Z]', na=True)
# Display the count distribution of retweet counts
train[~has_letters].groupby(['sentiment_class', 'retweet_count'])['id'].count().head(60)
# Convert and clean the retweet count data; non-numeric rows are set to 0
feature_retweet = pd.to_numeric(train['retweet_count'].where(~has_letters),
    errors='coerce').round().abs().fillna(0)
# Check for missing values in the retweet count
feature_retweet.isnull().sum()
# Group and count retweets for a specific sentiment class
train[~has_letters & (train['sentiment_class'] == 1)].\
    groupby(['sentiment_class', 'retweet_count'])['id'].count()
# Comment on the significance of retweet counts
# It's not very clear if retweet counts will play a major role, but for now, let's retain them.
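The retweet clean-up can also be sketched on a few hypothetical raw values. This version relies on `pd.to_numeric` with `errors='coerce'` instead of a regular expression, which is an alternative to the approach in this article and can differ on edge cases such as strings in scientific notation.

```python
import pandas as pd

# Hypothetical raw values: valid counts mixed with stray text and a missing entry
raw = pd.Series(['3', '0', 'en', '-2.0', None, '12.6'], name='retweet_count')

# Non-numeric strings and missing values become NaN instead of raising
numeric = pd.to_numeric(raw, errors='coerce')
# Same clean-up as in the article: round, take the absolute value, default to 0
cleaned = numeric.round().abs().fillna(0)

print(cleaned.tolist())
```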

Verdict: It is not clear whether the retweet count will be useful, since the count varies a lot within each class. As we cannot make a call yet, we will keep it in the model.
4. Language
# Display language distribution in the training dataset
print(train['lang'].value_counts())
# Display language distribution in the test dataset
print(test['lang'].value_counts())
# Comment on language data
# Despite some noise in the data, we can safely assume that all tweets are in English.

Verdict: Clearly, most of the tweets are in English, and there are some erroneous data points such as numbers in the language column. This feature does not seem to help with classification, so we can leave it out of our models.
5. Tweet text
# Create a figure for text length analysis
ch_dist = plt.figure()
# Plot histograms of text length for different sentiment classes
train[train['sentiment_class'] == 0]['original_text'].str.split().apply(len).hist(bins=20, legend=True)
train[train['sentiment_class'] == 1]['original_text'].str.split().apply(len).hist(bins=20, legend=True)
train[train['sentiment_class'] == -1]['original_text'].str.split().apply(len).hist(bins=20, legend=True)
# Comment on text length distribution
# There is a fairly even distribution of text length across different sentiment classes.
# Set titles and labels for the text length analysis
plt.title("Tweet Length Count")
plt.xlabel("Length")
plt.ylabel("Count")
# Add a legend
plt.legend(['sentiment_class - 0', 'sentiment_class - 1', 'sentiment_class - (-1)'])
# Save the figure
ch_dist.savefig('cat_distr.png')
# End of the analysis

Verdict: Tweet lengths are similarly distributed across all three classes, with one noticeable difference: the counts at some specific lengths differ significantly between classes, especially neutral vs. negative sentiment. This can add to the predictive power of our model, so we will retain it.
6. Count Vectoriser & Word Cloud: We perform this to analyse the text in the tweet body. It gives us a good idea of the words present without having to look at each tweet.
# Create a CountVectorizer
vectorizer = CountVectorizer()
# Transform the text data
X = vectorizer.fit_transform(list(train['original_text']))
# Display the shape of the resulting matrix (no dense conversion needed)
print(X.shape)
# Get the feature names from the vectorizer
feature_names = vectorizer.get_feature_names_out()
# Convert the sparse matrix to a dense matrix and then to a DataFrame
dense = X.todense()
lst1 = dense.tolist()
df = pd.DataFrame(lst1, columns=feature_names)
# Generate a Word Cloud based on word frequencies
Cloud = WordCloud(background_color="black", max_words=100).generate_from_frequencies(df.T.sum(axis=1))
# Create a figure for the Word Cloud
ch = plt.figure(figsize=(15, 10))
# Display the Word Cloud with appropriate settings
plt.imshow(Cloud, interpolation='bilinear')
plt.axis("off")
# Save the Word Cloud as an image
plt.savefig('word_cloud.png')
# Show the Word Cloud
plt.show()
# summary #
# Common Words like "happy," "mother," and "day" are frequent and may not be ideal features. #
# Consider reducing their importance or excluding them from the feature set. #

Verdict: We can see that words like “happy” have a high count. This might not be very useful, since the dataset itself is built around “Happy Mother’s Day” tweets, so we would expect “happy” to be present in nearly all of them.
One way around this problem is a TF-IDF vectoriser, which reduces the weight of terms that appear in many tweets, as they might not add much predictive power.
7. TF-IDF vectoriser and word cloud: Let’s now repeat the above steps using the TF-IDF vectoriser.
# Create a TF-IDF Vectorizer with specific settings
tfidfvectorizer = TfidfVectorizer(max_df=1.0, min_df=1, max_features=2000)
# Transform the training text data using the TF-IDF Vectorizer
X_train_tfidf = tfidfvectorizer.fit_transform(list(train['original_text']))
# Get the feature names from the TF-IDF Vectorizer
feature_names = tfidfvectorizer.get_feature_names_out()
# Convert the sparse TF-IDF matrix to a dense matrix and then to a DataFrame
dense_tfidf = X_train_tfidf.todense()
lst2 = dense_tfidf.tolist()
df_tfidf = pd.DataFrame(lst2, columns=feature_names)
# Generate a Word Cloud based on TF-IDF weighted word frequencies
Cloud_tfidf = WordCloud(background_color="black",
max_words=100).generate_from_frequencies(df_tfidf.T.sum(axis=1))
# Create a figure for the TF-IDF Word Cloud
ch = plt.figure(figsize=(15, 10))
# Display the Word Cloud with appropriate settings
plt.imshow(Cloud_tfidf, interpolation='bilinear')
plt.axis("off")
# Save the TF-IDF Word Cloud as an image (separate file name so the earlier word cloud is not overwritten)
plt.savefig('word_cloud_tfidf.png')
# Show the TF-IDF Word Cloud
plt.show()

Verdict: The word “happy” is now smaller, as we guessed above, so the TF-IDF vectoriser has indeed helped us.
8. t-SNE
Let’s now see if there is any pattern in the TF-IDF data by using t-SNE to visualise this high-dimensional data:
# Convert X_train to a dense DataFrame
X_train_tfidf = X_train_tfidf.toarray()
X_train_tfidf = pd.DataFrame(data=X_train_tfidf,
index=np.arange(X_train_tfidf.shape[0]),
columns=np.arange(X_train_tfidf.shape[1]))
# Perform t-SNE dimensionality reduction to 2 components
X_embedded = TSNE(n_components=2).fit_transform(X_train_tfidf)
# Check the shape of the resulting embedded data
X_embedded.shape
# Create a DataFrame for the embedded data
X_embedded = pd.DataFrame(data=X_embedded,
index=np.arange(X_embedded.shape[0]),
columns=np.arange(X_embedded.shape[1]))
# Add the 'sentiment_class' column to the embedded data
X_embedded['label'] = train['sentiment_class']
# Create a scatterplot to visualize the data
plt.figure(figsize=(15, 10))
sns.scatterplot(x=0,
y=1,
hue="label",
data=X_embedded)
# Save the scatterplot as an image
plt.savefig('scatter.png')
# Comment on the visualization
# This visualization suggests that there is no apparent pattern in the data after t-SNE dimensionality reduction.
# We can still proceed to model using various classifiers and evaluate their performance on the validation data.

Verdict: Although the data points are spread across the whole area, we can see that neutral tweets sit slightly to the right of the inner area of the ellipse. However, no significant pattern is present in the plot, which indicates that the predictive power of any model is likely to be low in general.
Base model
We will now create a base model without any text cleaning and assess its performance:
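1. Logistic Regression
############# Creating a Logistic Regression-based model ###############
# Hold out part of the training data for evaluation. The split is not shown in
# this article, so the 80/20 ratio and random_state below are assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X_train_tfidf, train['sentiment_class'], test_size=0.2, random_state=42)
# Initialize the Logistic Regression classifier
clf = LogisticRegression(max_iter=10000)
# Train the model on the training data
model = clf.fit(X_train, y_train)
# Make predictions on the held-out data
pred_test = model.predict(X_test)
# Calculate and display the weighted F1 score (true labels first, predictions second)
display(f1_score(y_test, pred_test, average='weighted'))
# Display the confusion matrix (rows: true classes, columns: predicted classes)
display(confusion_matrix(y_test, pred_test))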
2. Naive Bayes
############# Creating a Naive Bayes-based model ###############
# Initialize the Gaussian Naive Bayes classifier
clf = GaussianNB()
# Train the model on the training data
model = clf.fit(X_train, y_train)
# Make predictions on the test data
pred_test = model.predict(X_test)
# Calculate and display the weighted F1 score
display(f1_score(y_test, pred_test, average='weighted'))
# Display the confusion matrix (rows: true classes, columns: predicted classes)
display(confusion_matrix(y_test, pred_test))
3. LGBM
############# Creating an LGBM-based model ###############
# Initialize the LGBM classifier
clf = lgbm.sklearn.LGBMClassifier()
# Train the model on the training data
model = clf.fit(X_train, y_train)
# Make predictions on the test data
pred_test = model.predict(X_test)
# Calculate and display the weighted F1 score (true labels first, predictions second)
display(f1_score(y_test, pred_test, average='weighted'))
# Display the confusion matrix (rows: true classes, columns: predicted classes)
display(confusion_matrix(y_test, pred_test))

LGBM and logistic regression perform much better than Naive Bayes, so we will continue with LGBM and logistic regression and drop Naive Bayes in the next sections.
4. Using complete data: In the models above, we used only the information in the training set and not the test set. Let’s now combine the train and test sets and rerun the analysis to make predictions. We will choose the model that performs best on the test set.
While doing this, we have also used truncated SVD to reduce the dimensionality and improve model performance.
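As a small illustration of the truncated SVD step, the sketch below reduces a TF-IDF matrix built from a hypothetical mini-corpus to three components. The corpus and the choice of three components are assumptions for the example only; the article itself uses 50 components on the full data.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for the combined train and test tweets
docs = [
    'happy mothers day to the best mom',
    'flowers and cake for mothers day',
    'worst delivery service ever',
    'the gift arrived late and broken',
    'thank you mom for everything',
    'nothing special happening today',
]

# Sparse matrix with one column per term
tfidf = TfidfVectorizer().fit_transform(docs)

# n_components must be smaller than the number of TF-IDF columns
svd = TruncatedSVD(n_components=3, random_state=42)
reduced = svd.fit_transform(tfidf)  # dense array with 3 columns

print(tfidf.shape, reduced.shape)
print(svd.explained_variance_ratio_.sum())
```

Unlike PCA, truncated SVD works directly on the sparse TF-IDF matrix, which is why it is a common choice for text features.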
############## Let's create a setup to calculate an LGBM-based model ##############
# Combine the training and test data (Full_data) without the 'sentiment_class' column
Full_data = train.drop('sentiment_class', axis=1)
Full_data = pd.concat([Full_data, test], ignore_index=True)
# Initialize a TF-IDF Vectorizer with specific settings
tfidfvectorizer = TfidfVectorizer(max_df=1.0, min_df=1, max_features=20000)
# Transform the original text into TF-IDF features
Y = tfidfvectorizer.fit_transform(list(Full_data['original_text']))
# Apply Truncated Singular Value Decomposition (SVD) to reduce dimensionality
svd = TruncatedSVD(n_components=50, n_iter=7, random_state=42)
X = svd.fit_transform(Y)
# Convert the SVD output into a DataFrame
X = pd.DataFrame(data=X, index=range(X.shape[0]), columns=range(X.shape[1]))
# Extract and process the retweet_count feature
feature_retweet = pd.to_numeric(Full_data[False == Full_data['retweet_count'].\
str.contains(r'[a-zA-Z]')]['retweet_count']).round().abs()
X['retweet_count'] = feature_retweet
# Add a feature column for text_length
X['text_length'] = Full_data['original_text'].str.split().apply(len)
# Initialize a LabelEncoder for encoding the 'original_author' feature
le = preprocessing.LabelEncoder()
# Fit the LabelEncoder on the 'original_author' column
le.fit(Full_data['original_author'])
# Display the classes after encoding
display(le.classes_)
# Transform the 'original_author' column with the encoded labels
X['original_author'] = le.transform(Full_data['original_author'])
# Split the data back into training and testing sets
X_train = X[:train.shape[0]]
X_test = X[train.shape[0]:]
y_train = train['sentiment_class']
# Initialize the LGBM classifier
clf = lgbm.sklearn.LGBMClassifier(random_state=42)
# Train the model on the training data
model = clf.fit(X_train, y_train)
# Predict on the training data and calculate the weighted F1 score
y_train_pred = model.predict(X_train)
display(f1_score(y_train, y_train_pred, average='weighted'))
# Predict on the test data
pred = model.predict(X_test)
# Prepare the submission DataFrame and save it to a CSV file
sub = pd.DataFrame()
sub['id'] = test['id']
sub['sentiment_class'] = pred
sub.to_csv('submission.csv', index=False)
# This serves as our baseline, with a test score of 41.92 (weighted F1)
We have presented the LGBM model, as it performed the best, with a weighted F1 score of 41.92 (about 0.42) on the test data. This can be considered our base model, and we can now try different methods to clean the text data and extract features to further refine it.
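The F1 score printed on the training data is usually optimistic, because the model has already seen those rows. A minimal sketch of a cross-validated estimate follows; it uses synthetic three-class data and logistic regression as stand-ins, so the data, model, and fold count are all assumptions rather than the setup used in this article.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic three-class data standing in for the SVD features (an assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5,
    n_classes=3, random_state=42,
)

clf_demo = LogisticRegression(max_iter=1000)
# Stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each fold is scored on data the model did not see while fitting
cv_scores = cross_val_score(clf_demo, X_demo, y_demo, cv=cv, scoring='f1_weighted')

print(cv_scores.mean(), cv_scores.std())
```

The same pattern should work with an LGBM classifier in place of logistic regression, and it gives a local estimate before spending a leaderboard submission.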
Data Cleaning
We use various text cleaning methods, such as removing punctuation and stop words, stemming, and lemmatization, to refine the tweet data and rerun our models.
####################### Text Cleaning Steps ##########################
# Steps:
# 1. Remove punctuations
# 2. Tokenization: Convert a sentence into a list of words
# 3. Remove stopwords
# 4. Lemmatization/Stemming: Transform any word form to its root word
# Remove punctuations from the 'original_text' column
tweets = Full_data['original_text']
display(string.punctuation)
def remove_punct(tweet):
    # Remove punctuation characters
    tweet = "".join([char for char in tweet if char not in string.punctuation])
    # Remove numbers
    tweet = re.sub('[0-9]+', '', tweet)
    return tweet
tweets = tweets.apply(lambda x: remove_punct(x))
display(tweets.head(10))
# Tokenize the text by splitting into words
def tokenization(tweet):
    # Raw string avoids an invalid-escape warning for \W
    tweet = re.split(r'\W+', tweet)
    return tweet
tweets = tweets.apply(lambda x: tokenization(x))
display(tweets.head(10))
# Remove stopwords using NLTK stopwords list
stopwords = nltk.corpus.stopwords.words('english')
def remove_stopwords(tweet):
    tweet = [word for word in tweet if word not in stopwords]
    return tweet
tweets = tweets.apply(lambda x: remove_stopwords(x))
display(tweets.head(10))
# Perform stemming using Porter Stemmer
ps = nltk.PorterStemmer()
def stemming(tweet):
    tweet = [ps.stem(word) for word in tweet]
    return tweet
tweets = tweets.apply(lambda x: stemming(x))
display(tweets.head(10))
# Perform lemmatization using WordNet Lemmatizer
wn = nltk.WordNetLemmatizer()
def lemmatizer(tweet):
    tweet = [wn.lemmatize(word) for word in tweet]
    return tweet
tweets = tweets.apply(lambda x: lemmatizer(x))
display(tweets.head(10))
# Join the tokenized words to create clean text
tweets = tweets.apply(lambda x: ' '.join(x))
display(tweets.head(10))
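The first three cleaning steps can be checked on a single tweet with the standard library alone. The sketch below is a simplified stand-in, not the pipeline above: the stop word set is a tiny hypothetical list, stemming and lemmatization are left out, and the lowercasing step is an addition so that capitalised stop words are matched.

```python
import re
import string

# Tiny illustrative stop word set (hypothetical); NLTK's English list is far longer
DEMO_STOPWORDS = {'a', 'an', 'and', 'the', 'to', 'is', 'my', 'you'}

def clean_tweet(tweet, stopwords=DEMO_STOPWORDS):
    """Strip punctuation and digits, tokenize, and drop stop words."""
    # Lowercase first so capitalised stop words such as "The" are matched
    tweet = tweet.lower()
    tweet = ''.join(ch for ch in tweet if ch not in string.punctuation)
    tweet = re.sub(r'[0-9]+', '', tweet)
    # Splitting on non-word characters can leave empty strings, so filter them
    tokens = [tok for tok in re.split(r'\W+', tweet) if tok]
    return ' '.join(tok for tok in tokens if tok not in stopwords)

print(clean_tweet("Happy Mother's Day to the BEST mom, 2020!!!"))
# prints: happy mothers day best mom
```

Running a function like this on a handful of raw tweets is a quick way to confirm each step behaves as intended before applying it to the whole column.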
Improved model
1. LGBM & XGBoost: We apply the same steps as in the base models to the cleaned data, using LGBM and XGBoost.
# Import necessary libraries
from sklearn.decomposition import TruncatedSVD
import lightgbm as lgbm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn import preprocessing
############## Setting up the data and models ##############
# Create a cleaned copy of Full_data
Full_data_cleaned = Full_data.copy()
# Replace the 'original_text' column with 'tweets'
Full_data_cleaned['original_text'] = tweets
# Initialize a TF-IDF Vectorizer with specific settings
tfidfvectorizer = TfidfVectorizer(max_df=1.0, min_df=1, max_features=2000)
# Transform the cleaned 'original_text' into TF-IDF features
Y = tfidfvectorizer.fit_transform(list(Full_data_cleaned['original_text']))
# Apply Truncated Singular Value Decomposition (SVD) for dimensionality reduction
svd = TruncatedSVD(n_components=50, n_iter=7, random_state=42)
X = svd.fit_transform(Y)
# Convert the SVD output into a DataFrame
X = pd.DataFrame(data=X, index=range(X.shape[0]), columns=range(X.shape[1]))
# Extract and process the retweet_count feature
feature_retweet = pd.to_numeric(Full_data[False == Full_data['retweet_count'].str.contains(r'[a-zA-Z]')]['retweet_count']).round().abs()
X['retweet_count'] = feature_retweet
# Add a feature column for text_length
X['text_length'] = Full_data['original_text'].str.split().apply(len)
# Initialize a LabelEncoder for encoding the 'original_author' feature
le = preprocessing.LabelEncoder()
# Fit the LabelEncoder on the 'original_author' column
le.fit(Full_data['original_author'])
# Get the classes after encoding
le.classes_
# Transform the 'original_author' column with the encoded labels
X['original_author'] = le.transform(Full_data['original_author'])
# Split the data back into training and testing sets
X_train = X[:train.shape[0]]
X_test = X[train.shape[0]:]
y_train = train['sentiment_class']
# Initialize the LGBM classifier
clf = lgbm.sklearn.LGBMClassifier(random_state=42)
# Train the LGBM model
model = clf.fit(X_train, y_train)
# Predict on the training data and calculate the weighted F1 score
y_train_pred = model.predict(X_train)
display(f1_score(y_train, y_train_pred, average='weighted'))
# Predict on the test data
pred = model.predict(X_test)
# Prepare and save the submission DataFrame
sub = pd.DataFrame()
sub['id'] = test['id']
sub['sentiment_class'] = pred
sub.to_csv('submission.csv', index=False)
# Prepare data for XGBoost
y_train_xg = y_train + 1
# Initialize the XGBoost classifier
clf = xgboost.sklearn.XGBClassifier(random_state=42)
# Train the XGBoost model
model = clf.fit(X_train, y_train_xg)
# Predict on the training data and calculate the weighted F1 score
y_train_pred = model.predict(X_train)
display(f1_score(y_train_xg, y_train_pred, average='weighted'))
# Predict on the test data and shift the labels back from 0..2 to -1..1
y_test_pred = model.predict(X_test) - 1
# Prepare and save the submission DataFrame
sub = pd.DataFrame()
sub['id'] = test['id']
sub['sentiment_class'] = y_test_pred
sub.to_csv('submission.csv', index=False)
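XGBoost expects class labels numbered from 0, which is why the labels are shifted by one above. A minimal sketch of an alternative that avoids manual arithmetic is scikit-learn's `LabelEncoder`, which can also map predictions back to the original labels. The prediction array here is a made-up stand-in for real model output.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Sentiment labels as used in this article
y = np.array([-1, 0, 1, 0, 1, -1])

encoder = LabelEncoder()
# Classes -1, 0, 1 become 0, 1, 2
y_encoded = encoder.fit_transform(y)

# Stand-in for model.predict(X_test); a real model would return encoded labels
pred_encoded = np.array([2, 0, 1])
# Map the encoded predictions back to the original sentiment labels
pred_original = encoder.inverse_transform(pred_encoded)

print(y_encoded.tolist(), pred_original.tolist())
```

Whichever approach is used, the submission file should contain the original labels (-1, 0, 1), not the encoded ones.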
2. Stacked Classifier: We also tried a stacking-based model, using an ensemble of RF and LGBM as base estimators and XGBoost as the final classifier.
################## Stacking LGBM and RF ensemble with XGB as final estimator #######################
#### Note: This configuration didn't perform well. ####
#### We could have considered other models like BERT, ####
#### but we will stop here. ###############
# Fill any missing values with 0 in the training data
X_train = X_train.fillna(0)
# Ensure column names are of string data type
X_train.columns = X_train.columns.astype(str)
# Define a list of estimators for stacking
estimators = [
('lgb', lgbm.sklearn.LGBMClassifier(random_state=42)),
('rf', RandomForestClassifier(random_state=42))]
# Initialize the Stacking Classifier with XGB as the final estimator
clf = StackingClassifier(
estimators=estimators, passthrough=True, final_estimator=xgboost.XGBClassifier(random_state=42))
# Train the Stacking model
model = clf.fit(X_train, y_train)
# Predict on the training data and calculate the weighted F1 score
y_train_pred = model.predict(X_train)
display(f1_score(y_train, y_train_pred, average='weighted'))
# Fill missing values and ensure column names are strings in the testing data
X_test = X_test.fillna(0)
X_test.columns = X_test.columns.astype(str)
# Predict on the test data
pred = model.predict(X_test)
# Prepare and save the submission DataFrame
sub = pd.DataFrame()
sub['id'] = test['id']
sub['sentiment_class'] = pred
sub.to_csv('submission.csv', index=False)
Out of these three models, LGBM performed the best, with a test score of ~43 (weighted F1).
References
- https://www.hackerearth.com/challenges/competitive/hackerearth-machine-learning-challenge-mothers-day/instructions/
- Notebook Link — https://github.com/girish9851/HackerEarth-Twitter-Sentiment-Analysis/blob/master/Twitter_sentiment_v2_medium_11_28_202310.ipynb
- https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
- https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
- https://pypi.org/project/wordcloud/
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
- https://lightgbm.readthedocs.io/en/latest/Python-Intro.html
- https://xgboost.readthedocs.io/en/stable/python/python_intro.html
- https://scikit-learn.org/stable/modules/naive_bayes.html
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
If you found the explanation helpful, follow me for more content! Feel free to leave comments with any questions or suggestions you might have.
You can also check out my other articles on data science and computing on Medium. If you like my work and want to contribute to my journey, you can always buy me a coffee :)