The aim of the project was to develop a model that extracts occupational fields and employee skills from job advertisements on a job platform.
The project is based on a data set of posted job advertisements obtained by web crawling.
Challenges & Solutions
Many job platforms prohibit web crawling. In addition, the portal used here changed its source code frequently, which also interfered with crawling.
A Python-based web crawler produced a structured data set containing position descriptions, company names, locations and publication dates. The data set was available for download as a CSV file.
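The CSV export step can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual crawler: the column names and the two example records are hypothetical, since the text only names the kinds of fields that were stored.

```python
import csv
import io

# Hypothetical column names; the text only lists the kinds of fields stored.
FIELDS = ["description", "company", "location", "published"]


def write_postings(postings, handle):
    """Write crawled postings (a list of dicts) as CSV rows with a header."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(postings)


# Invented example records standing in for crawler output.
postings = [
    {"description": "Data Analyst (m/w/d), SQL und Python",
     "company": "Example AG", "location": "Berlin", "published": "2019-05-02"},
    {"description": "Projektmanager (m/w/d)",
     "company": "Muster SE", "location": "Hamburg", "published": "2019-05-03"},
]

# An in-memory buffer keeps the sketch free of file-system side effects;
# a real export would pass an open file (newline="") instead.
buffer = io.StringIO()
write_postings(postings, buffer)
csv_text = buffer.getvalue()
```

Using `csv.DictWriter` means descriptions that contain commas are quoted automatically, which matters for free-text job descriptions.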
The occupational fields and skills were labeled with the Dataturks tool, and the annotations were downloaded in JSON format.
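Before training, the labeled JSON has to be turned into character-offset entity annotations. The sketch below shows one way this might look. It assumes an export with one JSON object per line carrying "content", "annotation", "label" and "points" keys, and it assumes that the exported end offsets are inclusive; both assumptions should be checked against the actual export. The label name "Skill" and the sample sentence are invented.

```python
import json


def dataturks_to_spacy(lines):
    """Convert JSON lines into (text, {"entities": [(start, end, label)]}) tuples."""
    examples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        text = record["content"]
        entities = []
        for annotation in record.get("annotation") or []:
            labels = annotation["label"]
            if not isinstance(labels, list):
                labels = [labels]
            for point in annotation["points"]:
                for label in labels:
                    # Assumption: exported end offsets are inclusive, while
                    # spaCy-style spans use exclusive ends, hence the +1.
                    entities.append((point["start"], point["end"] + 1, label))
        examples.append((text, {"entities": entities}))
    return examples


# Invented sample record in the assumed export layout.
sample = json.dumps({
    "content": "Erfahrung mit Python und SQL",
    "annotation": [
        {"label": ["Skill"], "points": [{"start": 14, "end": 19, "text": "Python"}]},
        {"label": ["Skill"], "points": [{"start": 25, "end": 27, "text": "SQL"}]},
    ],
})

training_examples = dataturks_to_spacy([sample])
```

A quick sanity check after conversion is to slice the text with each span and confirm it matches the labeled string; an off-by-one in the end offset shows up immediately.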
The labeled data were divided into training and test sets. The model was trained with the spaCy natural language processing library and saved so that it can be reloaded at any time for subsequent analysis. Evaluated on the test set with sklearn metrics, the model reached a precision of 95.1%. Finally, the 30 DAX member companies were crawled as well, and the named entities corresponding to occupational fields and skills were stored. Because both single words and phrases had been labeled, a few German stop words such as “und”, “sowie” and “auf” appeared among the entities. These stop words were removed to keep only the meaningful entities.
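The stop-word cleanup can be illustrated with a short sketch. It assumes the extracted entities are available as plain strings and that cleanup means removing stop-word tokens and discarding entities that end up empty; the exact rule used in the project is not stated. The stop-word set contains only the three words named above, and the entity strings are invented.

```python
# Only the three stop words named in the text; a real list would be longer.
GERMAN_STOP_WORDS = {"und", "sowie", "auf"}


def clean_entities(entities, stop_words=GERMAN_STOP_WORDS):
    """Remove stop-word tokens from entity strings and drop empty results."""
    cleaned = []
    for entity in entities:
        # Compare case-insensitively so "Und" is treated like "und".
        tokens = [t for t in entity.split() if t.lower() not in stop_words]
        if tokens:
            cleaned.append(" ".join(tokens))
    return cleaned


# Invented entity strings standing in for model output.
extracted = ["und", "Projektmanagement", "sowie Teamführung", "Python"]
cleaned = clean_entities(extracted)
```

Matching on whole tokens rather than substrings avoids damaging words that merely contain a stop word, such as "Aufgaben".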
The results are presented in a user-friendly dashboard. The user can select a company, skills or occupational fields as needed. Filtering by company produces a diagram in which the size of each rectangle corresponds to the number of open positions at that company.
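The numbers behind such a rectangle diagram can be computed with a simple count per company. The sketch below is an assumption about the data layout, not the dashboard's actual code: each posting is represented as a dict with a hypothetical "company" key, and the records are invented.

```python
from collections import Counter

# Invented postings; only the company field matters for this view.
postings = [
    {"company": "Example AG", "title": "Data Analyst"},
    {"company": "Example AG", "title": "Projektmanager"},
    {"company": "Muster SE", "title": "Softwareentwickler"},
    {"company": "Beispiel GmbH", "title": "Controller"},
]


def open_positions_per_company(postings):
    """Count the open positions for each company."""
    return Counter(p["company"] for p in postings)


counts = open_positions_per_company(postings)

# Fraction of the total area each company's rectangle would occupy
# if rectangle size is proportional to the number of open positions.
total = sum(counts.values())
shares = {company: n / total for company, n in counts.items()}
```

The `counts` mapping can be passed to any charting library that draws area-proportional rectangles.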
Selecting a specific skill or type of experience displays the relative number of open positions across all companies. Further analyses, such as combinations of skills, can be carried out as well.
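A skill filter of this kind, including combined skills, might be sketched as follows. The sketch reads the relative number as the share of each company's postings that require all selected skills, which is an interpretation rather than something the text states. The "company" and "skills" keys and all records are hypothetical.

```python
from collections import Counter

# Invented postings with the skills extracted for each one.
postings = [
    {"company": "Example AG", "skills": ["Python", "SQL"]},
    {"company": "Example AG", "skills": ["Excel"]},
    {"company": "Muster SE", "skills": ["Python"]},
    {"company": "Muster SE", "skills": ["Python", "SQL", "Excel"]},
]


def positions_with_skills(postings, required):
    """Return {company: (matching count, share of that company's postings)}.

    A posting matches only if it lists every skill in `required`.
    """
    required = set(required)
    totals = Counter(p["company"] for p in postings)
    matching = Counter(
        p["company"] for p in postings if required <= set(p["skills"])
    )
    return {
        company: (matching[company], matching[company] / totals[company])
        for company in totals
    }


single = positions_with_skills(postings, ["Python"])
combined = positions_with_skills(postings, ["Python", "SQL"])
```

Treating the selection as a set and testing it with the subset operator keeps single-skill and combined-skill queries on the same code path.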
The algorithm can be extended to scan the large number of candidate résumés a company receives.
The results offer companies a way to save resources on market research.