Reform Index der Bertelsmann Stiftung

Labour market analysis using online job vacancies – strengths, weaknesses, challenges and opportunities

Online job vacancies are increasingly being used to develop insights into the labour market. Such analyses are enabled by innovative data processing methods that allow us to tap into the potential of large amounts of text data. However, these do not come without methodological and statistical challenges. This article therefore draws attention to different aspects of the work with this type of data.


Due to the rapid increase in the quantity of job ads being published online and the development and accessibility of modern data analysis tools to extract information from text bodies, there is a growing interest in using online job vacancies (OJVs) as a source of information and insights into developments in the labour market.

However, before reaching any conclusions based on an analysis of OJVs, there are important aspects involved in the collection, processing, and usage of OJVs that must be taken into account. The purpose of this article is therefore to list and draw attention to some important strengths, weaknesses, challenges and opportunities of online job vacancies that researchers need to be aware of when using this type of data as a source of information about the labour market.


OJVs are useful for identifying developments in the labour market mainly because of their timely availability (they can be collected and analysed shortly after being published) and the growing use of online job ads (meaning a large amount of job ad text is available for analysis). These allow one to:

1. Identify changes in hiring behaviour due to labour market trends

  • Job vacancies posted online can indicate which jobs employers have difficulty filling through internal or informal recruiting channels. For example, scarcity of labour supply can be identified when companies start posting jobs on many different job websites or on foreign job ad platforms (e. g. due to local labour market tightness in the country of origin), or by comparing job ad statistics to employment statistics.

2. Identify trends in job requirements and conditions

  • With OJVs one can identify not only skills that have appeared (e. g. blockchain programming) or disappeared (e. g. floppy disk data management) over time, but also differences in job requirements and conditions (e. g. an increase in remote working) across time, regions, industries, occupations, firm size, etc.

3. Identify branding strategies of companies

  • Based on the wording that companies use to describe themselves or the working environment, OJVs can also shed light on the branding strategies of companies.
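
As a minimal sketch of the second point, the share of ads mentioning a given term can be tracked over time. The function name `share_mentioning`, the simple substring match and the toy ads below are assumptions made for illustration only; a real analysis would need proper skill extraction and a deduplicated dataset.

```python
from collections import defaultdict


def share_mentioning(ads, term):
    """Return {year: share of that year's ads whose text contains `term`}.

    `ads` is an iterable of (year, text) pairs. Matching is a plain
    case-insensitive substring test, which is an illustrative simplification.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for year, text in ads:
        totals[year] += 1
        if term.lower() in text.lower():
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Made-up toy ads for illustration only: (publication year, ad text)
toy_ads = [
    (2019, "Office-based accountant, on-site in Berlin"),
    (2019, "Software developer, remote working possible"),
    (2021, "Data analyst, fully remote"),
    (2021, "Project manager, remote or hybrid"),
]

shares = share_mentioning(toy_ads, "remote")
print(shares)  # {2019: 0.5, 2021: 1.0}
```

The same counting could be repeated per region, industry or firm size by replacing the year with another grouping variable.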

OJVs don't offer a complete picture of job requirements and the labour market. This happens for a few reasons:

  • Many vacancies are not published online. This creates representativity issues, as certain industries and occupation types make more or less use of online job portals than others, and different seniority levels and occupation types require different hiring strategies (e. g. head-hunting, tapping networks). And since there is no ground truth to OJV data (that is, we do not know the real distribution of job data across any given variable), attempts to assess the representativeness of any OJV dataset are limited.
  • Information is also missing due to implicit requirements (skills that are obviously needed and therefore not even mentioned), competition strategies (e. g. hiding what you are doing from competitors) and hiring strategies (e. g. signalling intent for branding purposes while real hiring happens through head-hunting; or even 'faking' an increase in open vacancies to suggest business growth).
  • Arguably, information in job ads primarily represents what companies think is important to attract suitable candidates and not necessarily to give candidates a full understanding of the requirements and tasks of the position.

Working with OJVs is not without challenges.

1. Data collection, cleaning, and processing complications

  • Data collection difficulties: there are three main ways of extracting information from webpages, usually in this order of preference: a) direct access (also called ‘API access’), which basically means having access to the website’s database and usually requires some sort of formal agreement with the website provider; b) scraping (extracting only relevant structured data from webpages); or c) crawling (browsing and downloading entire webpages, then using algorithms to “read” and understand information in them). Using different methods on different websites leads to very different results and thus makes it difficult to extract information consistently.
  • (De-)duplication and merging difficulties:
    • job ads for a single open position are usually published across multiple platforms;
    • these job ads might be posted at different time periods with slight alterations to their text bodies (creating conflicting information across job ads for the same position);
    • one single job ad can also be used to promote multiple open positions;
    • thus different merging and information prioritisation methodologies during the deduplication process can lead to very different outcomes.

  • Scraping and text processing errors:
    • distinguishing job ad text from other text that appears on the webpage;
    • distinguishing where one job ad ends and the next begins (especially in poorly designed webpages);
    • designing different scrapers for different website structures (especially when scraping from many different websites).

  • Meaning and context identification challenges:
    • extracting information from the text and categorising the job ad according to this information (e. g. identifying the employing company and assigning the sector it belongs to);
    • differentiating between different meanings of similar words and phrases (e. g. differentiating the skills “adapt to changing situations”, “adapt to change”, “adapt to change in marketing”);
    • identifying the subject which words refer to (e. g. a job ad that states the applicant will have to “work in an innovative environment” – does that mean the company is innovative or that the person needs to be innovative, or both?).

2. Differences in information disclosure, naming and language conventions

  • Information disclosure:
    • different habits in information disclosure (e. g. disclosing salary details, gender requirements);
    • differences in the amount of information dedicated to describing the company, job responsibilities or skill requirements;
    • differences in the purpose of job ads (e. g. job ads being the main source of information about a job versus just a signalling tool while candidates are expected to get in touch with recruiters);
    • differences in education systems and thus education requirements.

  • Naming: different job titles for the same position; the same job title for different positions.
  • Language: inconsistencies due to different languages and characters used in different countries (e. g. alphabetic vs logographic systems).
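
To make the (de-)duplication challenge listed above more concrete, the following is a minimal sketch of one possible way to group near-identical ad texts, using Jaccard similarity over word shingles. This is an illustrative technique, not a method described in this article; the function names, the shingle length `k=3`, the threshold of 0.5 and the toy ads are all assumptions.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a lower-cased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_near_duplicates(ads, threshold=0.5):
    """Greedily put each ad into the first group whose representative
    (the group's first ad) is at least `threshold` similar to it."""
    groups = []  # list of (representative shingle set, [ad indices])
    for idx, text in enumerate(ads):
        sh = shingles(text)
        for rep, members in groups:
            if jaccard(sh, rep) >= threshold:
                members.append(idx)
                break
        else:
            # No similar group found: this ad starts a new group.
            groups.append((sh, [idx]))
    return [members for _, members in groups]


# Made-up toy ads for illustration only
toy_ads = [
    "Senior data analyst wanted for our Berlin office, full time",
    "Senior data analyst wanted for our Berlin office, full time or part time",
    "Truck driver needed in Hamburg, night shifts",
]

print(group_near_duplicates(toy_ads))  # [[0, 1], [2]]
```

The choice of threshold and of which ad in a group to treat as authoritative is exactly the kind of decision that, as noted above, can lead to very different outcomes. Text similarity alone also cannot tell whether one ad promotes several open positions.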

Much of the potential of OJVs as a data source can be leveraged and improved by integrating online job ads with other types of data.

  • Comparing current information on economic activity with information available in OJVs might allow one to acquire early insights about companies’ activities (e. g. moves into new locations; adoption of new technology; entry into new markets, etc.).
  • Complementing OJV data with existing labour market information also allows one to acquire new or missing insights. Examples include using supply-side data (like CVs of employees) to identify supply and demand mismatches; using data from online job platforms to identify how different OJV descriptions impact the number of applications received; comparing skills demand with occupation salaries to identify which skills possibly lead to the highest salary increases, etc.
  • Comparing OJV data to official employment statistics data (e. g. number of employed or number of initiated employment relationships/recruitments by industry, region, company size, etc.) can also help uncover biases and problems with data coverage.
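
The comparison with official statistics in the last point can be sketched as a simple ratio of shares per category. The function name `coverage_ratios` and all counts below are made up for illustration; they are not real statistics, and the choice of reference figure (e. g. number of employed versus number of recruitments) is an assumption that changes the interpretation.

```python
def coverage_ratios(ojv_counts, reference_counts):
    """For each category, divide its share among OJVs by its share in the
    reference statistics. Values well above 1 suggest the category may be
    over-represented in the OJV data, values well below 1 the opposite."""
    ojv_total = sum(ojv_counts.values())
    ref_total = sum(reference_counts.values())
    ratios = {}
    for category, ref_count in reference_counts.items():
        ojv_share = ojv_counts.get(category, 0) / ojv_total
        ref_share = ref_count / ref_total
        ratios[category] = ojv_share / ref_share
    return ratios


# Made-up illustrative counts by industry (not real data)
toy_ojv = {"IT": 500, "Construction": 100, "Health": 400}
toy_reference = {"IT": 200, "Construction": 300, "Health": 500}

for industry, ratio in coverage_ratios(toy_ojv, toy_reference).items():
    print(f"{industry}: {ratio:.2f}")
```

A ratio far from 1 is only a hint at a coverage bias: it can also reflect genuine differences in turnover or labour demand between categories, so it should be read alongside other evidence.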

To conclude, there are inherent benefits to using large quantities of online job vacancies as a source of information about developments in the labour market. However, this type of data also has some weaknesses that must be carefully considered and challenges that must be overcome to ensure the data is of good quality and can produce reliable insights. Conducting more experimentation and further developing methodologies to process and analyse OJVs are therefore important in order to better understand how this type of data can help researchers to produce relevant information about the labour market.