This project was inspired by our collective experience with job searching as data science students who are about to graduate. As we came to discover, there is a wide range of job boards hosting a significant number of data science jobs, which makes it a complex process to stay up to date with newly posted positions. Further exacerbating this struggle is the fact that these job boards tend to lack useful and reliable filters, making it difficult to find the jobs which are most relevant to your skills and experience. In our experience, we found it especially difficult to filter out jobs requiring many years of experience, even when using the built-in experience-level filters where they were available.
Another significant issue that we encountered was the wide variety of skills which can be requested for jobs in the data science field. Even jobs with exactly the same title might be looking for totally distinct sets of skills. For example, one job might be looking for experience with Python, Azure, and PowerBI, while another might be looking for experience with R, AWS, and Tableau. This significantly increases the time it can take to identify whether a job might be of interest to you as you scroll through search results.
In order to address the problem, we came up with the idea to create a searchable database which conveniently aggregates the job postings from many different job boards into one place. Furthermore, we theorized that it would be relatively simple to analyze the job descriptions and determine what kind of skills and experiences they are looking for in a candidate. This information could then be used to drive more consistent filtering and faster searching of these jobs than their original host websites allowed for, enabling the most relevant job postings to be found much more quickly. The ultimate goal of this project was to create a way for data science job candidates to input some basic information about their skills and experiences, and the locations or kinds of jobs they are interested in, and to receive back a list of jobs from all the major job boards which most closely match their profile.
A basic view of our project pipeline is shown below:
The data for this project was collected from a variety of job boards through web scrapers built in Python using the Selenium package. These websites included Indeed, LinkedIn, Blind, SimplyHired, Glassdoor, and CareerBuilder. The web scraper would search each job board for jobs posted within the last 30 days using the keyword “Data Scientist”. The scraper would then iterate through each job found in the results, collecting the following information about each: Job Title, Company Name, Location, Date Posted, Salary (if listed), and Description.
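For illustration, a stripped-down version of this scraping loop looks roughly like the following sketch. The CSS selectors are placeholders, since each board's markup is different and changes over time.

```python
# A minimal sketch of one scraping pass, assuming Chrome and illustrative CSS selectors.
from selenium import webdriver
from selenium.webdriver.common.by import By

SEARCH_URL = "https://www.indeed.com/jobs?q=Data+Scientist&fromage=30"  # last 30 days

def text_or_none(card, selector):
    """Return the text of a child element, or None if the field is not listed."""
    try:
        return card.find_element(By.CSS_SELECTOR, selector).text
    except Exception:
        return None

def scrape_search_page(url: str) -> list[dict]:
    driver = webdriver.Chrome()
    rows = []
    try:
        driver.get(url)
        # Each result card yields the six fields we collect per posting.
        for card in driver.find_elements(By.CSS_SELECTOR, ".job-card"):  # placeholder selector
            rows.append({
                "job_title": text_or_none(card, ".title"),
                "company_name": text_or_none(card, ".company"),
                "location": text_or_none(card, ".location"),
                "date_posted": text_or_none(card, ".date"),
                "salary": text_or_none(card, ".salary"),  # often missing
                "description": text_or_none(card, ".description"),
            })
    finally:
        driver.quit()
    return rows
```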
Data cleaning is a critical step in our process, as scraping data across multiple job boards can lead to inconsistencies in the data. Cleaning ensures normalization and consistency across all of our job listings, and prepares the data for the preprocessing steps that transform it into a form ready for analysis. Our data cleaning process includes:
Removing duplicates:
Here we remove duplicate job postings that may have been posted on multiple job boards. This involved identifying which columns to use to detect duplicates and then removing the duplicate rows; for this we used the job title, company name, location, and salary columns. Because this process works across multiple job boards, we additionally had to create a 'waterfall of trust' that determines which job board's version of a listing we prefer to keep when a duplicate is found.
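A simplified sketch of this deduplication, assuming a pandas DataFrame with a source column recording which board each listing came from (the priority ordering shown is illustrative, not our exact ranking):

```python
import pandas as pd

# 'Waterfall of trust': prefer listings from boards earlier in this list when deduplicating.
SOURCE_PRIORITY = ["Indeed", "LinkedIn", "Glassdoor", "SimplyHired", "CareerBuilder", "Blind"]

def drop_cross_board_duplicates(jobs: pd.DataFrame) -> pd.DataFrame:
    rank = {board: i for i, board in enumerate(SOURCE_PRIORITY)}
    jobs = jobs.assign(_rank=jobs["source"].map(rank))
    # Sort so the most-trusted board comes first, then keep only the first row per posting.
    deduped = jobs.sort_values("_rank").drop_duplicates(
        subset=["job_title", "company_name", "location", "salary"], keep="first"
    )
    return deduped.drop(columns="_rank")
```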
Normalizing the data:
Normalizing the data involves converting it into a standard format. This includes converting the posting date to a datetime format, converting the salary to a numerical format, and removing bad characters from the job title, company name, and location columns. Otherwise, the data was already fairly clean and in a standard format, since it was scraped from the job boards themselves.
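A minimal sketch of these normalization steps, assuming the column names used above (the salary regex is illustrative and only captures the first dollar figure in a listing):

```python
import pandas as pd

def normalize(jobs: pd.DataFrame) -> pd.DataFrame:
    # Standardize posting dates to datetime; unparseable values become NaT.
    jobs["date_posted"] = pd.to_datetime(jobs["date_posted"], errors="coerce")
    # Pull the first dollar figure out of free-text salary strings like "$120,000 a year".
    jobs["salary"] = (
        jobs["salary"]
        .str.extract(r"\$?([\d,]+)", expand=False)
        .str.replace(",", "", regex=False)
        .astype(float)
    )
    # Strip non-printable/bad characters from the text columns.
    for col in ["job_title", "company_name", "location"]:
        jobs[col] = jobs[col].str.replace(r"[^\x20-\x7E]", "", regex=True).str.strip()
    return jobs
```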
Vectorizing the descriptions with TF-IDF:
The job descriptions were vectorized using the TF-IDF vectorizer from the scikit-learn library. This was done to make the descriptions more amenable to machine learning algorithms. The vectorized descriptions were then used to identify the skills requested in the job descriptions.
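In outline, the vectorization looks like the following, assuming the cleaned postings live in a pandas DataFrame called jobs (the vectorizer parameters are illustrative rather than tuned values):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize every description; parameter values shown here are illustrative, not tuned.
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000, ngram_range=(1, 2))
description_matrix = vectorizer.fit_transform(jobs["description"].fillna(""))

# Each row is a sparse TF-IDF vector for one posting, ready for downstream models.
print(description_matrix.shape)  # (number of postings, number of terms kept)
```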
Named Entity Recognition (NER):
Named Entity Recognition (NER) was used to identify the skills requested in the job descriptions. For this purpose we used GLiNER, a recent NER package that combines traditional NER techniques, BERT (an advanced NLP model), and a small LLM. Stacking these models allows for more accurate identification of skills in the job descriptions. The identified skills were then used to filter the job postings based on the user's skills and the skills requested in each description.
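A rough illustration of how GLiNER can be queried for skills; the checkpoint name, label set, and threshold below are examples rather than our exact configuration:

```python
from gliner import GLiNER

# Load a public GLiNER checkpoint; the model name and labels are illustrative.
model = GLiNER.from_pretrained("urchade/gliner_base")

description = "We are looking for experience with Python, AWS, and Tableau."
labels = ["skill", "programming language", "software tool"]

# predict_entities returns spans tagged with a label and a confidence score.
entities = model.predict_entities(description, labels, threshold=0.5)
skills = sorted({entity["text"] for entity in entities})
print(skills)  # e.g. ['AWS', 'Python', 'Tableau']
```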
Job Title Classification:
The job titles were classified into one of five categories: Data Scientist, Data Analyst, Data Engineer, Machine Learning Engineer, and Software Engineer. This was done to make it easier to filter the job postings based on the user's desired job archetype. To do this we initially built a simple classification model using the scikit-learn library: a Random Forest Classifier trained to predict the job archetype from the job title, description, and salary. This model never performed well; at its best it reached about 65% accuracy. While that may have been acceptable, we found we could achieve better accuracy with less computational complexity by leveraging fuzzy matching instead, using the fuzzywuzzy library to match job titles to the job archetypes.
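The fuzzy-matching approach boils down to something like the following sketch (the similarity cutoff is illustrative):

```python
from fuzzywuzzy import process

ARCHETYPES = ["Data Scientist", "Data Analyst", "Data Engineer",
              "Machine Learning Engineer", "Software Engineer"]

def classify_title(job_title: str, min_score: int = 60):
    # extractOne returns the best-matching archetype and a 0-100 similarity score;
    # the 60-point cutoff shown here is illustrative.
    match, score = process.extractOne(job_title, ARCHETYPES)
    return match if score >= min_score else None

print(classify_title("Lead Data Scientist, Analytics"))  # 'Data Scientist'
```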
Seniority Identification:
The seniority of each job posting was identified based on its job title. Each title was assigned a binary indicator of seniority, with 1 indicating a senior position and 0 indicating a junior position. This was done to make it easier to filter the job postings based on the user's desired seniority level. To do this we used simple direct matching against terms in the job title that indicate seniority, which allowed us to filter senior positions out of the job postings.
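A sketch of this direct matching; the term list is illustrative and may differ slightly from the one we actually use:

```python
# Illustrative list of terms we treat as indicating a senior position.
SENIOR_TERMS = ("senior", "sr.", "sr ", "lead", "principal", "staff", "director", "manager")

def is_senior(job_title: str) -> int:
    # Binary flag: 1 if the title contains any seniority term, 0 otherwise.
    title = job_title.lower()
    return int(any(term in title for term in SENIOR_TERMS))

print(is_senior("Senior Data Scientist"))  # 1
print(is_senior("Data Analyst II"))        # 0
```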
We understood from the start of the project that we had a substantial amount of freedom in terms of how and where to store the data we were using. We decided early on that in order to have a decent tool for job aggregation and recommendation, we needed a data storage method that was easily scaled, flexible, free, and accessible remotely. For this, we turned to Railway. Given our shared experience with Railway's PostgreSQL databases in the data engineering course, it took almost no effort to spin up a useful relational database that would suit our needs. As an added bonus, we had plenty of credits left over from the prior semester, and hosting the data was essentially free for the duration of the project. As mentioned before, the data we were using came from a variety of sources, but was quite similar overall, and would fit in a very simple relational database. An image of the initial database structure is attached.
After some tinkering, we realized that the nature of our data was such that we could have a very orderly structure with only one table, which is what we ended up going with. While we thought a more complex database would have been more impressive, there really was not a way to split up our table into multiple others that made it more efficient or effective. The final table design is attached.
Data is fed initially into the jobs_staging table via a Python script, and this information is then copied to a read-only table within the database, denoted jobs. This way, we could send information to the necessary tables while ensuring that the end user could not execute commands through the script that generates the webpage for them. At the time of the project presentation, the database contained just over 29,000 job listings, all of which can be processed, cleaned, and sent to our database in a matter of minutes.
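Conceptually, the promotion from staging to the read-only table looks like the following sketch using psycopg2; the web_reader role name is a placeholder for the read-only role used by the web application:

```python
import os
import psycopg2

# DATABASE_URL points at the Railway PostgreSQL instance.
conn = psycopg2.connect(os.environ["DATABASE_URL"])
with conn, conn.cursor() as cur:
    # Promote freshly loaded rows from staging into the table the webpage reads from.
    cur.execute("INSERT INTO jobs SELECT * FROM jobs_staging ON CONFLICT DO NOTHING;")
    cur.execute("TRUNCATE jobs_staging;")
    # The web-facing role only ever gets SELECT on jobs, never write access.
    cur.execute("GRANT SELECT ON jobs TO web_reader;")
conn.close()
```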
The first step in our analysis was to look at the distribution of salaries across the job postings. These salaries were then compared to the job title, location, experience level, and skills requested in the job description. This analysis then allowed us to determine which factors had the most significant impact on the salary offered for a given job.
This graph was constructed using a subset of data scraped from Indeed, with duplicate postings removed and with more senior positions filtered out. Although this only provides a basic level of advice for job-seekers, it can definitely provide an advantage to know which areas have the highest number of entry-level data science jobs, enabling you to specifically target them in your search.
Here we can see the top and bottom five states by salary. This ranking of states by salary can help a job seeker narrow down where it is actually viable for them to look for a position. Maybe you like to ski and be in nature; Vermont may be tempting, but you'll be making, on average, roughly half of what your Washington counterpart earns. In the future, this plot will become a geographic map rather than a bar plot. The national average is denoted by the red line, resting just above $140,000.
This plot conveys some of the difficulty that comes with searching for a job, namely that your search terms don’t always return you the most relevant positions, an issue which our project intends to help combat. Here, we can see that despite using the search term “Data Scientist”, we actually ended up with a greater number of Engineer or Analyst positions than we did Data Scientist positions. In the future, our project will allow for stricter filtering of job titles than the various job boards seem to provide.
It became clear that not all job postings are equally likely to be filled, and that some are filled much faster than others. To analyze this, we used survival analysis to estimate the likelihood that a job posting has been filled based on the number of days since it was posted. This analysis allowed us to determine which factors were associated with a job being filled quickly and which with a posting remaining open longer.
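As a sketch of how such an analysis can be set up with the lifelines library (the column names are assumptions about our cleaned table, not its literal schema):

```python
from lifelines import KaplanMeierFitter

# 'days_posted' is how long a listing stayed up; 'filled' marks listings that have
# since disappeared, which we treat as the event of interest.
kmf = KaplanMeierFitter()
kmf.fit(durations=jobs["days_posted"], event_observed=jobs["filled"])

# Estimated probability that a posting is still open after a given number of days.
print(kmf.survival_function_.head())
kmf.plot_survival_function()
```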
Geographic Distribution of Mean Salary by State:
Here we can see the geographic distribution of mean salary by state. It shows that the west coast and the northeast have the highest salaries, while the midwest and the south have the lowest. This can be useful for job seekers to know where they are most likely to find the highest-paying jobs. Note that this data can be skewed, as not every state requires jobs to post salaries.
This box-and-whisker plot shows the distribution of salaries by state. It is useful because it shows the full range of salaries a Data Scientist can expect to make in each state, rather than just the average.
Distribution of Salaries by Job Title:
Here we have job archetype by count, colored by salary. What immediately becomes apparent is how much less Data Analysts make compared to the other archetypes. We can also see how much more ML Engineers make compared to regular Data Scientists. This is useful for job seekers to know what kind of salary they can expect based on the job archetype they are targeting.
Most Asked for Skills by Job Archetype:
This graphic shows the top five skills asked for in Data Scientist and Data Analyst postings. Here we get some insight into why Data Analysts make less than Data Scientists: Data Analysts are asked for SQL and Excel skills much more often than Data Scientists, while Data Scientists are asked for Python and machine learning skills much more often than Data Analysts. This is useful for job seekers to know what kind of skills they need to get the job they want, and which skills can help them get paid more. For the other job archetypes, see the supplemental data.
The final step in our analysis was to use natural language processing to identify the skills requested in the job descriptions. Specifically, we used a named entity recognition model (GLiNER) to extract the skills requested in each job. Combined with some manual additions, this produced a list of skills requested across the job descriptions, which was then reduced to a smaller set of skill tags. Each job is tagged with the skills requested in its description, and these tags are used to filter the job postings.
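At its core, the matching step can be thought of as an overlap score between the user's skill set and a job's skill tags, roughly like the sketch below (the exact scoring our app uses may weight skills differently):

```python
def match_score(user_skills: set[str], job_skills: set[str]) -> float:
    # Simple overlap score: the fraction of a job's requested skills the user already has.
    # Illustrative scoring rule, not necessarily the exact formula in our application.
    if not job_skills:
        return 0.0
    return len(user_skills & job_skills) / len(job_skills)

user = {"python", "sql", "machine learning"}
job = {"python", "sql", "tableau", "aws"}
print(round(match_score(user, job), 2))  # 0.5
```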
Because we allow users to upload their resumes, which often contain potentially sensitive personal information, we had to exercise caution when handling that data. Our project addresses this issue by quickly extracting the skills from each resume and then discarding the file, rather than storing it in our database. This ensures that our users are as protected as possible. We also considered some ethical issues related to our collection of the data. None of the job boards we collected our listings from had terms of service which prohibited web scraping, so our project was already protected on that front. However, we took the extra step of ensuring that our users have to visit the original job board to apply, rather than linking directly to the application from our website, to ensure that we are not depriving these sites of the web traffic which drives their revenue.
Although it is not implemented within the scope of this project, we did consider the future potential that our resume and job skill match system could have in working to combat bias in hiring. For example, companies could have their hiring practices examined by looking for discrepancies between the match scores of the candidates who apply for jobs with them, and the match scores of the people who are being hired. If companies are consistently hiring candidates who appear to be less qualified than a significant number of other applicants, that could suggest that there is an issue with their hiring process.
We have successfully collected over 29,000 jobs from a variety of job boards, cleaned and normalized the data, and used statistical analysis and machine learning to identify the skills requested in the job descriptions. We have also created a web application which allows users to upload their resume or manually select their skills, and returns to them a list of the jobs from our database which most closely match their profile.
In terms of success metrics, this project did not generate any quantitative measures internally. The best way to judge how successful we have been in accomplishing our goals will be to share our project and get user feedback (especially from our peers in this program), which we are in the process of doing. User feedback will be judged through several metrics, including general satisfaction with our tool, as well as comparison scores against other job boards to determine how interested users were in the top suggested jobs from each, given the same search parameters and user information. Pending that feedback, we can speak from our personal experience with testing our project to say that we have largely been successful in accomplishing our original goals, and have built a tool which is legitimately useful for data science job-seekers.
In terms of future improvements we would still like to make, a large focus would be simply expanding the amount of information we extract from job listings. While our skill match system does a great job, it obviously struggles with job descriptions that are slightly vague, or that don’t necessarily list desired technical skills. A key improvement we could make would be to identify what kind of industry a job is in, for example. Other improvements we would like to make include expanding the scope of our job search to include more websites and search terms, adding more search parameters such as filtering by location(s), and updating the UI for a more professional and streamlined look.