
LinkedIn Web Scraping Bot

In 2022, while looking for my first Data Analyst job after graduation, I had to perform the tedious daily task of scrolling through every single Data Analyst posting on LinkedIn, Glassdoor, Indeed, and so on. It was at that moment that I stumbled upon the idea of creating a web scraping bot to do all the manual work.


For more details about the project, please visit: https://github.com/hoangp27/LinkedIn-Job-Scraping-bot


The dashboard is the visualisation output of the scraped data. To see more dashboards I have built, please visit:


https://public.tableau.com/app/profile/hoang.pham3135



Formulating


At the time, web scraping was already common in Data Science, with popular Python packages such as Selenium and Beautiful Soup. After researching several similar projects, I began to conceptualize how my bot should run. Specifically, the standard procedure would look something like this:


  1. Using the Selenium package, the bot would access this link: https://ca.linkedin.com/jobs/search?keywords=Data%20Analyst&location=Canada&locationId=&geoId=101174742&f_TPR=r86400&position=1&pageNum=0 , which shows all the Data Analyst postings in Canada on LinkedIn within the last 24 hours (if you are looking for a different position, you can use a different link!).

  2. The bot would loop through every single record, find the relevant elements, and retrieve the data into a table format.

  3. The data would be written to a Google Sheet, which is live-connected to a Tableau Public dashboard that automatically updates whenever the data changes.

  4. Finally, with Windows Task Scheduler, the Python bot would automatically run daily (as a batch file) at a specific time, as sketched below. This automates (almost) the whole process.
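
As a sketch of step 4, a one-line batch file can wrap the Python script, and Windows Task Scheduler (or the schtasks command) can run it daily; the paths, file names, and task name below are placeholders, not the ones from the original project.

:: run_bot.bat -- placeholder name; point the paths at your own script
@echo off
cd /d "C:\path\to\project"
python linkedin_bot.py

:: Register the task once (or use the Task Scheduler GUI), e.g. daily at 09:00:
:: schtasks /create /tn "LinkedInJobBot" /tr "C:\path\to\project\run_bot.bat" /sc daily /st 09:00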


Coding


This seems like a straightforward process, but there were plenty of obstacles along the way. After roughly 4 weeks, I was able to get the bot running. Overall, the process looks like this.


Download and install the packages:

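A rough sketch of the package setup the bot relies on (the exact set may differ from the original script): Selenium for browser automation, pandas for the table, and gspread plus oauth2client for the Google Sheets connection.

# Install once from the command line (assumed package set):
#   pip install selenium pandas gspread oauth2client

# Imports used by the snippets below
import time

import pandas as pd
import gspread
from oauth2client.service_account import ServiceAccountCredentials
from selenium import webdriver
from selenium.webdriver.common.by import By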

Create a Google Sheet that takes in the new records.


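A minimal sketch of the sheet connection using gspread and a Google service-account key; the key file and spreadsheet names below are placeholders, not the ones from the original project.

# Authorize with a service-account JSON key that has access to the target sheet
scope = ["https://spreadsheets.google.com/feeds",
         "https://www.googleapis.com/auth/drive"]
creds = ServiceAccountCredentials.from_json_keyfile_name("service_account.json", scope)
client = gspread.authorize(creds)

# Open the spreadsheet and its first worksheet (placeholder name)
sheet = client.open("LinkedIn Job Postings").sheet1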

The Selenium package allows us to manipulate the website without manually clicking through it. After retrieving the page URL, I created a routine that automatically scrolls the page and clicks the "See more jobs" button until it reaches the bottom. The idea is to measure the page height before and after the scrolling action; if the two are equal, we have reached the bottom and can break the loop.


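A simplified sketch of that scroll-and-click loop, continuing from the imports above; the XPath for the button is an assumption, since LinkedIn's markup changes over time.

url = ("https://ca.linkedin.com/jobs/search?keywords=Data%20Analyst&location=Canada"
       "&locationId=&geoId=101174742&f_TPR=r86400&position=1&pageNum=0")

driver = webdriver.Chrome()   # assumes a matching chromedriver is available
driver.get(url)

while True:
    # Measure the page height, scroll to the bottom, then wait for new postings
    last_height = driver.execute_script("return document.body.scrollHeight")
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)

    # Click "See more jobs" whenever the button is present
    try:
        driver.find_element(By.XPATH, "//button[contains(., 'See more jobs')]").click()
        time.sleep(2)
    except Exception:
        pass

    # If the height did not change, we have reached the bottom
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break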


Once we're done with the above steps, it's time to start scraping data. For each box on the left side, we can get the Job title, Company name, and Location.

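Continuing from the snippet above, the card fields can be collected roughly like this (the class names are illustrative guesses and should be checked against the current page source):

job_titles, companies, locations = [], [], []

# Each posting on the left-hand side is rendered as a small card
cards = driver.find_elements(By.CLASS_NAME, "base-search-card__info")
for card in cards:
    job_titles.append(card.find_element(By.CLASS_NAME, "base-search-card__title").text)
    companies.append(card.find_element(By.CLASS_NAME, "base-search-card__subtitle").text)
    locations.append(card.find_element(By.CLASS_NAME, "job-search-card__location").text)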

Getting the job description and other information is more complicated. I created a loop that has Selenium click on every box on the left side and then retrieve the data shown on the right side. For postings that are missing a field, I simply assign a null value. This gives us the job_id, seniority, job description, employment type, function, and industry:


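A condensed sketch of that loop; the selectors are assumptions, job_id extraction is omitted for brevity, and a real run would need explicit waits and sturdier error handling.

fields = ["job_id", "seniority", "job_description",
          "employment_type", "function", "industry"]
details = []

jobs = driver.find_elements(By.CSS_SELECTOR, "ul.jobs-search__results-list li")
for job in jobs:
    job.click()          # open this posting so the right-hand panel refreshes
    time.sleep(1)

    record = dict.fromkeys(fields)   # every field starts as a null value

    try:
        record["job_description"] = driver.find_element(
            By.CLASS_NAME, "show-more-less-html__markup").text
    except Exception:
        pass    # no description found: keep the null value

    # Seniority, employment type, function and industry appear as criteria items
    criteria = driver.find_elements(By.CLASS_NAME, "description__job-criteria-text")
    for field, item in zip(["seniority", "employment_type", "function", "industry"],
                           criteria):
        record[field] = item.text or None

    details.append(record)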

After this, we can create a pandas DataFrame to store all the extracted data. I also performed some minor data cleansing and wrote the results to a daily CSV file and the Google Sheet.


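Roughly, the final step could look like the snippet below, building on the lists and the sheet object from the earlier snippets; the cleansing steps and the append call are assumptions based on the description above.

# Combine the detail records and the card fields into one DataFrame
df = pd.DataFrame(details)
df["job_title"], df["company"], df["location"] = job_titles, companies, locations

# Minor cleansing: drop duplicates and trim whitespace in text columns
df = df.drop_duplicates()
df = df.apply(lambda col: col.str.strip() if col.dtype == "object" else col)

# Write a dated CSV backup and append the new rows to the Google Sheet
today = pd.Timestamp.today().strftime("%Y-%m-%d")
df.to_csv(f"linkedin_jobs_{today}.csv", index=False)
sheet.append_rows(df.fillna("").values.tolist())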

Finding


I used all the retrieved data in Google Sheets to build the dashboard above. Since the dashboard uses a live connection, it updates automatically as more rows are added to the table each day. Some of the interesting insights I found:

  • Most jobs come from Ontario, followed by Quebec and British Columbia. I think this corresponds to the three largest cities: Toronto, Montreal, and Vancouver. Although we are almost two years past the Covid lockdowns, remote jobs still account for a large portion of postings and will probably set the new normal for many upcoming positions.

  • Out of more than 10,000 jobs, the most in-demand skill is SQL. Python and MS Excel are also required in many positions. I personally think Tableau and Power BI are not listed in as many postings as they should be, since these are the standard BI tools for the industry. Overall, I expect to see wider adoption of these tools in the future.

  • Most job postings tend to go up midweek, specifically anywhere from Tuesday to Thursday (somewhat trivial when you think about it). I also found that, overall, entry-level, mid-senior, and senior positions are fairly evenly distributed.


Concluding


This project took a lot of time and effort, but the experience was fun. After roughly 25 days, I had to shut the bot down: Google Chrome kept releasing new chromedriver versions that I did not have time to keep up with, and my Google Sheet data started to exceed the size limit for the Tableau connection. But the core concept is still worth sharing for anyone who wants to replicate this project.


