Here is how you can use NLP and web scraping to find the in-demand skills for your dream job.
In today's fast-evolving professional world, how do you know which skills your dream job requires? Going through job descriptions on LinkedIn one by one sounds like an arduous task. Let's solve it for you, so that you can focus more on learning skills than on figuring out what to learn.
Introduction
How do you go about finding which skills to learn for your desired job? Friends, colleagues, and LinkedIn posts are some of the most common sources of information. But these provide only surface-level knowledge and always leave room for a key skill to slip through the cracks.
Another way is to go through job descriptions one by one and take note of the skillsets. This is definitely better than the first approach, but it still leaves room for misses because it relies on manually reading each description. It can also be an exhausting task to go through hundreds of job descriptions.
In the sections below, we will use publicly available data on LinkedIn and a simple web-scraping tool to systematically fetch job postings. We will then apply fundamental NLP techniques to clean the data and extract keywords, and from these top keywords we will identify the most popular skillsets in demand.
Below are the libraries we will be using in this project:
# libraries import #
# Import necessary libraries for data retrieval, manipulation, and analysis
import requests                # For making HTTP requests
import re                      # For regular expressions
import time                    # For time-related functions
import csv                     # For CSV file handling
import string                  # For string operations
from bs4 import BeautifulSoup  # For web scraping
import numpy as np             # For numerical computing
import pandas as pd            # For data manipulation and analysis

# Text processing and analysis libraries #
# Libraries for text processing, feature extraction, and visualization
import nltk                    # Natural Language Toolkit for text processing
from sklearn.feature_extraction.text import TfidfVectorizer  # For TF-IDF vectorization
from wordcloud import WordCloud   # For creating word clouds
import matplotlib.pyplot as plt   # For data visualization

# Machine Learning libraries #
# Libraries for machine learning algorithms and dimensionality reduction
from sklearn import datasets           # To import datasets for practice
from sklearn.decomposition import PCA  # For Principal Component Analysis
Data Set
We are using LinkedIn’s publicly available job advertisements for this purpose.
Check out this page: linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=Front End Developer&location=India&start=1
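Note that the keywords and location must be URL-encoded before they go into the query string (spaces become %20, as in the POSITION constant used later). Below is a minimal sketch of building such a search URL with the Python standard library; build_search_url is a hypothetical helper introduced here for illustration only.
# Minimal sketch (illustration only): build the guest search URL with encoded parameters
from urllib.parse import quote

def build_search_url(position, location, start=0):
    base = "https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search"
    return "{}?keywords={}&location={}&start={}".format(
        base, quote(position), quote(location), start)

print(build_search_url("Front End Developer", "India", start=0))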

We visit each of these search pages and save the list of job links to a CSV file, using the job role and location as parameters. We perform the steps below to collect all the links to job vacancies:
- We specify the number of pages from which we plan to fetch the links
- We iterate over each page and extract the URLs
- We filter only the URLs that belong to job postings
- We append them to a temporary list and save the data in CSV format for later use
# inputs #
# Define the search parameters for job position and location on LinkedIn
POSITION = "Front%20End%20Developer"  # Desired job position (URL-encoded)
LOCATION = "India"                    # Desired job location

# Path to store the extracted job URLs in a CSV file
URL_LIST_PATH = "urls.csv"


# Function to retrieve a list of job URLs from the LinkedIn search results
def get_url_list(number_of_results=100):
    """
    This function fetches links of jobs from LinkedIn search results
    """
    web = requests.session()  # Create a session for making HTTP requests
    list_of_urls = []         # Initialize an empty list to store job URLs
    for x in range(number_of_results):
        link = "https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords={POSITION}&location={LOCATION}&start={start}".\
            format(POSITION=POSITION,
                   LOCATION=LOCATION,
                   start=x)
        # Send a GET request to the LinkedIn job search URL
        req = web.get(link)
        time.sleep(0.2)  # Pause execution to avoid spamming requests
        # Extract URLs from the response text
        list_temp = re.findall(r"href=\"(.*?)\"", req.text)
        # Filter job-related URLs
        list_temp_updated = [item_ for item_ in list_temp if '/jobs/' in item_]
        list_of_urls.extend(list_temp_updated)  # Add filtered URLs to the list
        # Print progress during URL extraction
        print("{} of {}".format(x, number_of_results))
    # Remove duplicate URLs using NumPy and return the unique job URLs
    return np.unique(list_of_urls)


# Function to write the extracted job URLs to a CSV file
def write_results_to_csv(unique_urls, path=URL_LIST_PATH):
    """
    This function saves the extracted job URLs to a CSV file
    """
    with open(path, 'w', newline='\n') as csvfile:
        urlwriter = csv.writer(csvfile)  # Create a CSV writer object
        for x in list(unique_urls):
            urlwriter.writerow([x])  # Write each URL to the CSV file
    return


# Call the functions to extract job URLs and save them to a CSV file
write_results_to_csv(get_url_list(number_of_results=1), path=URL_LIST_PATH)
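As an optional sanity check (not part of the original notebook), we can read the saved file back and count how many unique job links were collected:
# Optional sanity check: load the saved URLs and report how many were collected
saved_urls = pd.read_csv(URL_LIST_PATH, header=None, names=['url'])
print("{} unique job links saved to {}".format(saved_urls['url'].nunique(), URL_LIST_PATH))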
As a next step, we need to go through each of these job advertisements and pull out the most useful information.
Web-scraping
Let’s now go through each of the links stored in the CSV file and extract the job description. We use BeautifulSoup, one of the most popular libraries for web-scraping tasks. We perform the steps below:
1. We identify the class that contains the job description on the job advertisement webpage. This can be done with your browser's inspect-element tool by searching for the relevant job description text.
2. We define variables to keep track of how many links we have visited and how many are skipped because we could not find the relevant class on the webpage.
3. We iterate through the links, fetch each webpage, and extract the class of interest mentioned above.
4. We perform a basic cleaning of the extracted job description.
5. We convert the resulting dictionary to a data frame for further text processing.
# Function to retrieve job advertisement body from URLs stored in a CSV file
def get_job_advt_body(url_path=URL_LIST_PATH):
    """
    This function retrieves the text body of job advertisements
    from provided URLs and stores them in a dictionary
    """
    CLASS = "show-more-less-html__markup show-more-less-html__markup--clamp-after-5 relative overflow-hidden"
    dict_temp = {}  # Initialize an empty dictionary to store URL-text pairs
    c = 1  # Counter for processed URLs
    s = 1  # Counter for total URLs
    with open(url_path, newline='\n') as csvfile:
        urlReader = csv.reader(csvfile)  # Read URLs from the CSV file
        for row in urlReader:
            # Wait to avoid excessive requests
            time.sleep(0.2)
            content_ = requests.get(row[0])  # Get content from the URL
            # Parse HTML content
            bs_content = BeautifulSoup(content_.content, 'html.parser')
            # Find job description content
            jd_content = bs_content.findAll(attrs={'class': CLASS})
            if jd_content:
                # Get job description text
                updated_jd_content = jd_content[0].get_text("\n")
                # Encode text to ASCII, dropping non-ASCII characters
                updated_jd_content = updated_jd_content.encode("ascii", "ignore")
                # Decode back to a plain string
                updated_jd_content = updated_jd_content.decode()
                print("{} out of total urls".format(c))  # Print progress
                # Print skipped URLs count
                print("{} urls skipped".format(s - c))
                # Store URL-text pair in the dictionary
                dict_temp[row[0]] = updated_jd_content
                c += 1  # Increment counter for processed URLs
            s += 1  # Increment total URLs counter
    return dict_temp  # Return dictionary containing URL-text pairs


# Retrieve job advertisement bodies from URLs and store in the dictionary
DICT = get_job_advt_body(URL_LIST_PATH)
# Function to convert dictionary of job descriptions to a Pandas DataFrame
# for further processing
def get_dataset(DICT):
    """
    This function converts a dictionary of job descriptions to
    a DataFrame for further processing
    """
    df = pd.DataFrame()  # Create an empty DataFrame
    df.index = list(range(len(DICT)))  # Set DataFrame index
    # Add job advertisement content to the DataFrame
    df['job_advt'] = list(DICT.values())
    return df  # Return the DataFrame containing job advertisement content


# Convert the dictionary of job descriptions to a DataFrame
df = get_dataset(DICT)
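Optionally (this check is not part of the original notebook), a quick look at the resulting data frame confirms how many job descriptions were captured:
# Optional check: size and a preview of the assembled data frame
print(df.shape)      # (number of job descriptions, 1)
display(df.head())   # Preview the first few job descriptions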
Data processing
In this step, we clean the dataset further with traditional text-cleaning techniques using regex and NLTK. Regex is used to remove numbers and to tokenize the text, while NLTK is used to remove the stop words present in the text data.
# Extract job advertisement text from DataFrame column
job_advt = df['job_advt']


# Function to remove numbers from text
def remove_punct(tweet):
    """
    This function removes numbers from the text.
    """
    # Remove numbers
    tweet = re.sub('[0-9]+', ' ', tweet)
    return tweet


# Apply the remove_punct function to each job advertisement text
job_advt = job_advt.apply(lambda x: remove_punct(x))
display(job_advt.head(10))  # Display the processed job advertisement texts


# Tokenize the text by splitting into words
def tokenization(tweet):
    """
    This function tokenizes the text by splitting it into words.
    """
    tweet = re.split(r'\W+', tweet)
    return tweet


# Apply tokenization function to each processed job advertisement text
job_advt = job_advt.apply(lambda x: tokenization(x))
display(job_advt.head(10))  # Display the tokenized job advertisement texts


# Remove stopwords using NLTK stopwords list
nltk.download('stopwords')  # Download the stopwords corpus if not already present
stopwords = nltk.corpus.stopwords.words('english')  # Get English stopwords


def remove_stopwords(tweet):
    """
    This function removes stopwords from the text using
    NLTK's English stopwords list.
    """
    tweet = [word for word in tweet if word not in stopwords]
    return tweet


# Apply remove_stopwords function to each tokenized job advertisement text
job_advt = job_advt.apply(lambda x: remove_stopwords(x))
# Display the job advertisement texts after removing stopwords
display(job_advt.head(10))

# Join the tokens back into sentences
job_advt = job_advt.apply(lambda x: ' '.join(x))
# Display the processed job advertisement texts after rejoining tokens
display(job_advt.head(10))

# Remove duplicates in case we visited the same page more than once
job_advt.drop_duplicates(inplace=True)
TF-IDF Word Cloud
We now use TF-IDF to create a vector representation for each word in the job descriptions present in our cleaned data frame.
We then use these vector representations to create a word cloud covering the top 200 words, with word size indicating each word's score in the TF-IDF representation.
# Create a TF-IDF Vectorizer with specific settings
tfidfvectorizer = TfidfVectorizer(max_df=1.0, min_df=1, max_features=2000)
# Transform the training text data using the TF-IDF Vectorizer
X_train_tfidf = tfidfvectorizer.fit_transform(list(job_advt))
# Get the feature names from the TF-IDF Vectorizer
feature_names = tfidfvectorizer.get_feature_names_out()
# Convert the sparse TF-IDF matrix to a dense matrix and then to a DataFrame
dense_tfidf = X_train_tfidf.todense()
lst2 = dense_tfidf.tolist()
df_tfidf = pd.DataFrame(lst2, columns=feature_names)

# Generate a Word Cloud based on TF-IDF weighted word frequencies
Cloud_tfidf = WordCloud(background_color="white",
                        colormap='Dark2',
                        width=800,
                        height=600,
                        max_words=200).generate_from_frequencies(df_tfidf.T.sum(axis=1).sort_values())

# Create a figure for the TF-IDF Word Cloud
ch = plt.figure(figsize=(12, 9))
# Display the Word Cloud with appropriate settings
plt.imshow(Cloud_tfidf, interpolation='bilinear')
plt.axis("off")
# Save the TF-IDF Word Cloud as an image
plt.savefig('word_cloud.png')
# Show the TF-IDF Word Cloud
plt.show()
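Before moving on, it can also help to see the same scores as a plain ranked list rather than a word cloud. This is an optional check (not part of the original write-up) using the df_tfidf data frame built above:
# Optional check: terms with the highest aggregate TF-IDF score across all job ads
top_terms = df_tfidf.sum(axis=0).sort_values(ascending=False)
print(top_terms.head(20))  # Top 20 terms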

As we can see, the results are promising even with such a simple model. Let's try to improve upon this further.
PCA & Word Cloud
Since the TF-IDF representation has thousands of elements per vector, a simple idea is to reduce the dimensionality with a tool such as PCA, which retains the most important information in the first principal component, the next most important in the second, and so on.
To proceed with this step, we take the TF-IDF data frame created in the previous step and compute its first principal component. We then use the resulting data frame to create the word cloud.
# Transpose the TF-IDF DataFrame for further processing
df_tfidf_t = df_tfidf.transpose()

# Reduce components using Principal Component Analysis (PCA)
pca = PCA(n_components=1)
pca.fit(df_tfidf_t)

# Transform and create a DataFrame with the reduced component
df_fc_cleaned_reduced_euc = pd.DataFrame(pca.transform(df_tfidf_t).transpose(),
                                         index=['PC_1'],
                                         columns=df_tfidf_t.transpose().columns)
# Transpose the reduced DataFrame
df_fc_cleaned_reduced_euc_t = df_fc_cleaned_reduced_euc.transpose()

# Generate a Word Cloud based on the PCA-reduced TF-IDF weights
Cloud_tfidf = WordCloud(background_color="white",
                        max_words=200,
                        width=800,
                        height=600,
                        colormap='Dark2').generate_from_frequencies(df_fc_cleaned_reduced_euc_t.sum(axis=1))

# Create a figure for the Word Cloud
ch = plt.figure(figsize=(12, 9))
# Display the Word Cloud with appropriate settings
plt.imshow(Cloud_tfidf, interpolation='bilinear')
plt.axis("off")
# Save under a different name so the first word cloud is not overwritten
plt.savefig('word_cloud_pca.png')
# Show the Word Cloud
plt.show()
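One way to judge how much information the single principal component actually retains is to inspect the explained variance ratio of the fitted PCA object. This is an optional check, not part of the original notebook:
# Optional check: fraction of variance captured by the first principal component
print("Variance explained by PC_1: {:.2%}".format(pca.explained_variance_ratio_[0]))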

Future Work
Although the results are promising, we can improve them by:
- Using richer vectorization techniques than TF-IDF, such as Word2Vec or GloVe embeddings (see the sketch below)
- Performing clustering to separate technical and soft skills from the rest of the words
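As a rough illustration of the first idea, here is a minimal sketch of training Word2Vec on the tokenized job descriptions, assuming the gensim library is installed; the parameter values and the query word "react" are placeholders for illustration, not tuned recommendations.
# Minimal sketch (assumes gensim is installed): train Word2Vec on the cleaned job ads
from gensim.models import Word2Vec

# Re-tokenize the cleaned text into lists of lowercase words, one list per job ad
sentences = [text.lower().split() for text in job_advt]

# vector_size, window and min_count are placeholder values for illustration only
w2v = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=2)

# Example: words used in contexts similar to "react" (if it appears in the corpus)
if "react" in w2v.wv:
    print(w2v.wv.most_similar("react", topn=10))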
Reference
- How I used Python to extract keywords from LinkedIn job descriptions | by Ima | Medium [Web-scraping]
- Sentiment Analysis using Machine Learning: A Structured Approach towards the Optimal Solution [NLP]
- Job-Skills/Job_skills_v1_20231113.ipynb at main · girish9851/Job-Skills (github.com) [Notebook Link]
If you found the explanation helpful, follow me for more content! Feel free to leave comments with any questions or suggestions you might have.
You can also check out my other articles on data science and computing on Medium. If you like my work and want to contribute to my journey, you can always buy me a coffee :)