Here is how you can use NLP and web scraping to find the in-demand skills for your dream job.
In today's fast-evolving professional world, how do you know which skills your dream job requires? Going through job descriptions on LinkedIn one by one sounds like an arduous task. Let's solve it for you, so that you can focus more on learning skills than on figuring out what to learn.
Introduction
How do you go about finding which skills to learn for your desired job? Friends, colleagues, and LinkedIn posts are some of the most common sources of information. But these provide only surface-level knowledge and always leave room for a key skill to slip through the cracks.
Another way is to go through job descriptions one by one and take note of the skillsets. This is definitely better than the first approach, but it still leaves room for misses because it relies on manually reading each description. It can also be an exhausting task to go through hundreds of job descriptions.
In the sections below, we will use publicly available data on LinkedIn and a simple web-scraping tool to systematically fetch job postings. We will then apply fundamental NLP techniques to clean the data and extract keywords, and from these top keywords we will identify the most popular skillsets in demand.
Below are the libraries we will be using in this project:
# libraries import #
# Import necessary libraries for data retrieval, manipulation, and analysis
import requests                # For making HTTP requests
import re                      # For regular expressions
import time                    # For time-related functions
import csv                     # For CSV file handling
import string                  # For string operations
from bs4 import BeautifulSoup  # For web scraping
import numpy as np             # For numerical computing
import pandas as pd            # For data manipulation and analysis

# Text processing and analysis libraries #
# Libraries for text processing, feature extraction, and visualization
import nltk                    # Natural Language Toolkit for text processing
from sklearn.feature_extraction.text import TfidfVectorizer  # For TF-IDF vectorization
from wordcloud import WordCloud   # For creating word clouds
import matplotlib.pyplot as plt   # For data visualization

# Machine Learning libraries #
# Libraries for machine learning algorithms and dimensionality reduction
from sklearn import datasets           # To import datasets for practice
from sklearn.decomposition import PCA  # For Principal Component Analysis
Data Set
We are using LinkedIn’s publicly available job advertisements for this purpose.
Check out this page: linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=Front End Developer&location=India&start=1
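Note that the keywords and location must be URL-encoded before they go into the query string (spaces become %20, as in the POSITION constant used later). Below is a minimal sketch of building such a search URL with the Python standard library; build_search_url is a hypothetical helper introduced here for illustration only.
# Minimal sketch (illustration only): build the guest search URL with encoded parameters
from urllib.parse import quote

def build_search_url(position, location, start=0):
    base = "https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search"
    return "{}?keywords={}&location={}&start={}".format(
        base, quote(position), quote(location), start)

print(build_search_url("Front End Developer", "India", start=0))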

We visit each of these search pages and save the list of job links to a CSV file, using the job role and location as parameters. We perform the steps below to collect all the links to job vacancies:
- We specify the number of pages from which we plan to fetch the links
- We iterate over each page and extract the URLs
- We filter only the URLs that belong to job postings
- We append them to a temporary list and save the data in CSV format for later use
# inputs #
# Define the search parameters for job position and location on LinkedIn
POSITION = "Front%20End%20Developer"  # Desired job position (URL-encoded)
LOCATION = "India"                    # Desired job location

# Path to store the extracted job URLs in a CSV file
URL_LIST_PATH = "urls.csv"


# Function to retrieve a list of job URLs from the LinkedIn search results
def get_url_list(number_of_results=100):
    """
    This function fetches links of jobs from LinkedIn search results
    """
    web = requests.session()  # Create a session for making HTTP requests
    list_of_urls = []         # Initialize an empty list to store job URLs
    for x in range(number_of_results):
        link = "https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords={POSITION}&location={LOCATION}&start={start}".\
            format(POSITION=POSITION,
                   LOCATION=LOCATION,
                   start=x)
        # Send a GET request to the LinkedIn job search URL
        req = web.get(link)
        time.sleep(0.2)  # Pause execution to avoid spamming requests
        # Extract URLs from the response text
        list_temp = re.findall(r"href=\"(.*?)\"", req.text)
        # Filter job-related URLs
        list_temp_updated = [item_ for item_ in list_temp if '/jobs/' in item_]
        list_of_urls.extend(list_temp_updated)  # Add filtered URLs to the list
        # Print progress during URL extraction
        print("{} of {}".format(x, number_of_results))
    # Remove duplicate URLs using NumPy and return the unique job URLs
    return np.unique(list_of_urls)


# Function to write the extracted job URLs to a CSV file
def write_results_to_csv(unique_urls, path=URL_LIST_PATH):
    """
    This function saves the extracted job URLs to a CSV file
    """
    with open(path, 'w', newline='\n') as csvfile:
        urlwriter = csv.writer(csvfile)  # Create a CSV writer object
        for x in list(unique_urls):
            urlwriter.writerow([x])  # Write each URL to the CSV file
    return


# Call the functions to extract job URLs and save them to a CSV file
write_results_to_csv(get_url_list(number_of_results=1), path=URL_LIST_PATH)
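As an optional sanity check (not part of the original notebook), we can read the saved file back and count how many unique job links were collected:
# Optional sanity check: load the saved URLs and report how many were collected
saved_urls = pd.read_csv(URL_LIST_PATH, header=None, names=['url'])
print("{} unique job links saved to {}".format(saved_urls['url'].nunique(), URL_LIST_PATH))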
As a next step, we need to go through each of these job advertisements and pull out the most useful information.
Web-scraping
Let’s now go through each of the links stored in the CSV file and extract the job description. We use BeautifulSoup, one of the most popular libraries for web-scraping tasks. We perform the steps below:
1. We identify the class that contains the job description on the job advertisement webpage. This can be done with your browser's inspect-element tool by searching for the relevant job description text.
2. We define variables to keep track of how many links we have visited and how many are skipped because we could not find the relevant class on the webpage.
3. We iterate through the links, fetch each webpage, and extract the class of interest mentioned above.
4. We perform a basic cleaning of the extracted job description.
5. We convert the resulting dictionary to a data frame for further text processing.
# Function to retrieve job advertisement body from URLs stored in a CSV file
def get_job_advt_body(url_path=URL_LIST_PATH):
    """
    This function retrieves the text body of job advertisements
    from provided URLs and stores them in a dictionary
    """
    CLASS = "show-more-less-html__markup show-more-less-html__markup--clamp-after-5 relative overflow-hidden"
    dict_temp = {}  # Initialize an empty dictionary to store URL-text pairs
    c = 1  # Counter for processed URLs
    s = 1  # Counter for total URLs
    with open(url_path, newline='\n') as csvfile:
        urlReader = csv.reader(csvfile)  # Read URLs from the CSV file
        for row in urlReader:
            # Wait to avoid excessive requests
            time.sleep(0.2)
            content_ = requests.get(row[0])  # Get content from the URL
            # Parse HTML content
            bs_content = BeautifulSoup(content_.content, 'html.parser')
            # Find job description content
            jd_content = bs_content.findAll(attrs={'class': CLASS})
            if jd_content:
                # Get job description text
                updated_jd_content = jd_content[0].get_text("\n")
                # Encode text to ASCII, dropping non-ASCII characters
                updated_jd_content = updated_jd_content.encode("ascii", "ignore")
                # Decode back to a plain string
                updated_jd_content = updated_jd_content.decode()
                print("{} out of total urls".format(c))  # Print progress
                # Print skipped URLs count
                print("{} urls skipped".format(s - c))
                # Store URL-text pair in the dictionary
                dict_temp[row[0]] = updated_jd_content
                c += 1  # Increment counter for processed URLs
            s += 1  # Increment total URLs counter
    return dict_temp  # Return dictionary containing URL-text pairs


# Retrieve job advertisement bodies from URLs and store in the dictionary
DICT = get_job_advt_body(URL_LIST_PATH)
# Function to convert dictionary of job descriptions to a Pandas DataFrame
# for further processing
def get_dataset(DICT):
    """
    This function converts a dictionary of job descriptions to
    a DataFrame for further processing
    """
    df = pd.DataFrame()  # Create an empty DataFrame
    df.index = list(range(len(DICT)))  # Set DataFrame index
    # Add job advertisement content to the DataFrame
    df['job_advt'] = list(DICT.values())
    return df  # Return the DataFrame containing job advertisement content


# Convert the dictionary of job descriptions to a DataFrame
df = get_dataset(DICT)
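Optionally (this check is not part of the original notebook), a quick look at the resulting data frame confirms how many job descriptions were captured:
# Optional check: size and a preview of the assembled data frame
print(df.shape)      # (number of job descriptions, 1)
display(df.head())   # Preview the first few job descriptions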
Data processing
In this step, we clean the dataset further with traditional text-cleaning techniques using regex and NLTK. Regex is used to remove numbers and to tokenize the text, while NLTK is used to remove the stop words present in the text data.
# Extract job advertisement text from DataFrame column
job_advt = df['job_advt']


# Function to remove numbers from text
def remove_punct(tweet):
    """
    This function removes numbers from the text.
    """
    # Remove numbers
    tweet = re.sub('[0-9]+', ' ', tweet)
    return tweet


# Apply the remove_punct function to each job advertisement text
job_advt = job_advt.apply(lambda x: remove_punct(x))
display(job_advt.head(10))  # Display the processed job advertisement texts


# Tokenize the text by splitting into words
def tokenization(tweet):
    """
    This function tokenizes the text by splitting it into words.
    """
    tweet = re.split(r'\W+', tweet)
    return tweet


# Apply tokenization function to each processed job advertisement text
job_advt = job_advt.apply(lambda x: tokenization(x))
display(job_advt.head(10))  # Display the tokenized job advertisement texts


# Remove stopwords using NLTK stopwords list
nltk.download('stopwords')  # Download the stopwords corpus if not already present
stopwords = nltk.corpus.stopwords.words('english')  # Get English stopwords


def remove_stopwords(tweet):
    """
    This function removes stopwords from the text using
    NLTK's English stopwords list.
    """
    tweet = [word for word in tweet if word not in stopwords]
    return tweet


# Apply remove_stopwords function to each tokenized job advertisement text
job_advt = job_advt.apply(lambda x: remove_stopwords(x))
# Display the job advertisement texts after removing stopwords
display(job_advt.head(10))

# Join the tokens back into sentences
job_advt = job_advt.apply(lambda x: ' '.join(x))
# Display the processed job advertisement texts after rejoining tokens
display(job_advt.head(10))

# Remove duplicates in case we visited the same page more than once
job_advt.drop_duplicates(inplace=True)
TF-IDF Word Cloud
We now use TF-IDF to create a vector representation for each word in the job descriptions present in our cleaned data frame.
We then use these vector representations to create a word cloud covering the top 200 words, with word size indicating each word's score in the TF-IDF representation.
# Create a TF-IDF Vectorizer with specific settings
tfidfvectorizer = TfidfVectorizer(max_df=1.0, min_df=1, max_features=2000)
# Transform the training text data using the TF-IDF Vectorizer
X_train_tfidf = tfidfvectorizer.fit_transform(list(job_advt))
# Get the feature names from the TF-IDF Vectorizer
feature_names = tfidfvectorizer.get_feature_names_out()
# Convert the sparse TF-IDF matrix to a dense matrix and then to a DataFrame
dense_tfidf = X_train_tfidf.todense()
lst2 = dense_tfidf.tolist()
df_tfidf = pd.DataFrame(lst2, columns=feature_names)

# Generate a Word Cloud based on TF-IDF weighted word frequencies
Cloud_tfidf = WordCloud(background_color="white",
                        colormap='Dark2',
                        width=800,
                        height=600,
                        max_words=200).generate_from_frequencies(df_tfidf.T.sum(axis=1).sort_values())

# Create a figure for the TF-IDF Word Cloud
ch = plt.figure(figsize=(12, 9))
# Display the Word Cloud with appropriate settings
plt.imshow(Cloud_tfidf, interpolation='bilinear')
plt.axis("off")
# Save the TF-IDF Word Cloud as an image
plt.savefig('word_cloud.png')
# Show the TF-IDF Word Cloud
plt.show()
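Before moving on, it can also help to see the same scores as a plain ranked list rather than a word cloud. This is an optional check (not part of the original write-up) using the df_tfidf data frame built above:
# Optional check: terms with the highest aggregate TF-IDF score across all job ads
top_terms = df_tfidf.sum(axis=0).sort_values(ascending=False)
print(top_terms.head(20))  # Top 20 terms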

As we can see, the results are promising even with such a simple model. Let's try to improve upon this further.
PCA & Word Cloud
Since the TF-IDF representation has thousands of elements per vector, a simple idea is to reduce the dimensionality with a tool such as PCA, which retains the most important information in the first principal component, the next most important in the second, and so on.
To proceed with this step, we take the TF-IDF data frame created in the previous step and compute its first principal component. We then use the resulting data frame to create the word cloud.
# Transpose the TF-IDF DataFrame for further processing
df_tfidf_t = df_tfidf.transpose()

# Reduce components using Principal Component Analysis (PCA)
pca = PCA(n_components=1)
pca.fit(df_tfidf_t)

# Transform and create a DataFrame with the reduced component
df_fc_cleaned_reduced_euc = pd.DataFrame(pca.transform(df_tfidf_t).transpose(),
                                         index=['PC_1'],
                                         columns=df_tfidf_t.transpose().columns)
# Transpose the reduced DataFrame
df_fc_cleaned_reduced_euc_t = df_fc_cleaned_reduced_euc.transpose()

# Generate a Word Cloud based on the PCA-reduced TF-IDF weights
Cloud_tfidf = WordCloud(background_color="white",
                        max_words=200,
                        width=800,
                        height=600,
                        colormap='Dark2').generate_from_frequencies(df_fc_cleaned_reduced_euc_t.sum(axis=1))

# Create a figure for the Word Cloud
ch = plt.figure(figsize=(12, 9))
# Display the Word Cloud with appropriate settings
plt.imshow(Cloud_tfidf, interpolation='bilinear')
plt.axis("off")
# Save under a different name so the first word cloud is not overwritten
plt.savefig('word_cloud_pca.png')
# Show the Word Cloud
plt.show()
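One way to judge how much information the single principal component actually retains is to inspect the explained variance ratio of the fitted PCA object. This is an optional check, not part of the original notebook:
# Optional check: fraction of variance captured by the first principal component
print("Variance explained by PC_1: {:.2%}".format(pca.explained_variance_ratio_[0]))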

Future Work
Although the results are promising, we can improve them by:
- Using richer vectorization techniques than TF-IDF, such as Word2Vec or GloVe embeddings (see the sketch below)
- Performing clustering to separate technical and soft skills from the rest of the words
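As a rough illustration of the first idea, here is a minimal sketch of training Word2Vec on the tokenized job descriptions, assuming the gensim library is installed; the parameter values and the query word "react" are placeholders for illustration, not tuned recommendations.
# Minimal sketch (assumes gensim is installed): train Word2Vec on the cleaned job ads
from gensim.models import Word2Vec

# Re-tokenize the cleaned text into lists of lowercase words, one list per job ad
sentences = [text.lower().split() for text in job_advt]

# vector_size, window and min_count are placeholder values for illustration only
w2v = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=2)

# Example: words used in contexts similar to "react" (if it appears in the corpus)
if "react" in w2v.wv:
    print(w2v.wv.most_similar("react", topn=10))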
Reference
- How I used Python to extract keywords from LinkedIn job descriptions | by Ima | Medium [Web-scraping]
- Sentiment Analysis using Machine Learning: A Structured Approach towards the Optimal Solution [NLP]
- Job-Skills/Job_skills_v1_20231113.ipynb at main · girish9851/Job-Skills (github.com) [Notebook Link]
If you found the explanation helpful, follow me for more content! Feel free to leave comments with any questions or suggestions you might have.
You can also check out my other articles on data science and computing on Medium. If you like my work and want to contribute to my journey, you can always buy me a coffee :)