Respondents to open-ended survey questions sometimes interpret a question differently than intended, and their answers range from a few words to complete sentences. The result is a wide variety of possible answers.
Spelling mistakes, abbreviations, synonyms, and incomplete or unclear descriptions occur regularly. A job description such as “doctor”, for example, is very general and can have multiple meanings: veterinarian, ophthalmologist, general practitioner, pediatrician, or surgeon. Open-ended answers are therefore not immediately suitable for structured research, and extensive data preprocessing is required.
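To make the preprocessing step concrete, the following is a minimal sketch of how raw answers might be normalized before coding. It is not the pipeline used in the study: the abbreviation table, the vocabulary of known terms, and the function names are hypothetical and chosen only for illustration.

```python
import difflib
import re

# Hypothetical lookup tables, for illustration only.
ABBREVIATIONS = {"gp": "general practitioner", "dr": "doctor", "mgr": "manager"}
KNOWN_TERMS = ["teacher", "nurse", "carpenter", "doctor"]


def fix_spelling(token):
    """Replace a token by a very similar known term, if there is one."""
    match = difflib.get_close_matches(token, KNOWN_TERMS, n=1, cutoff=0.8)
    return match[0] if match else token


def normalize(answer):
    """Lowercase, strip punctuation, expand abbreviations, fix near-miss spellings."""
    text = re.sub(r"[^\w\s]", " ", answer.lower())  # punctuation becomes spaces
    tokens = []
    for token in text.split():
        if token in ABBREVIATIONS:
            tokens.append(ABBREVIATIONS[token])
        else:
            tokens.append(fix_spelling(token))
    return " ".join(tokens)


print(normalize("  GP, part-time "))
print(normalize("Techer (primary school)"))
```

A step like this does not resolve ambiguity (a bare "doctor" remains ambiguous), but it reduces the surface variation that a matching step has to cope with.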
For the study Beeld van de Nederlandse Bevolking (BNB), participants were asked open-ended questions about their job functions and job activities, such as "What is your position?" and "What are your activities?" Radboud University (RU) Nijmegen wanted to use these answers for research.
CentERdata was asked to bring structure to the participants' job functions on the basis of these open-ended answers, approximately 6,000 in total. Manually assessing and categorizing job functions is time-consuming, and it is easy to lose the overview in the myriad of possible job functions. The aim was therefore to categorize the job functions automatically into ISCO codes (International Standard Classification of Occupations) using AI techniques. This not only saves time and costs, but is also not subject to the prejudices, loss of overview, and fatigue that can affect human assessors.
We implemented advanced text analytics techniques to assign the respondents' job titles to the appropriate ISCO codes. For this we used BERT (Bidirectional Encoder Representations from Transformers; J. Devlin et al. 2019). This state-of-the-art technique, released by Google in 2018, makes it possible to extract contextual meaning from pieces of text. That meaning is then compared with the descriptions of the ISCO codes in order to find the best-matching code.
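The matching idea described above can be sketched as follows: represent the answer and every ISCO description as a vector, then pick the code whose description is most similar. This is a simplified illustration, not the implementation used in the project. To keep the example self-contained, a toy bag-of-words vector stands in for a BERT embedding, similarity is measured with cosine similarity (an assumption; the source does not name the measure), and the ISCO descriptions are abbreviated paraphrases rather than the official texts.

```python
import math
import re
from collections import Counter

# Illustrative, abbreviated ISCO-08 entries; the official descriptions are longer.
ISCO_DESCRIPTIONS = {
    "2211": "generalist medical practitioners general practitioner family doctor",
    "2250": "veterinarians diagnose and treat diseases of animals",
    "2341": "primary school teachers teach children at primary education level",
    "2512": "software developers design and write software programs",
}


def embed(text):
    """Toy stand-in for a BERT sentence embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_isco(answer, descriptions=ISCO_DESCRIPTIONS):
    """Return the (code, score) pair whose description is most similar to the answer."""
    vector = embed(answer)
    scores = {code: cosine(vector, embed(text)) for code, text in descriptions.items()}
    code = max(scores, key=scores.get)
    return code, scores[code]


code, score = best_isco("I teach children at a primary school")
print(code, round(score, 2))
```

Replacing `embed` with a function that returns contextual embeddings from a BERT model would leave the rest of the sketch unchanged, apart from computing the cosine on dense vectors. A word-count vector cannot capture synonyms or context, which is precisely what a contextual model adds.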
In the end, we produced a list of appropriate ISCO codes for all answers, which were also subjected to random human (double) validation to verify their correctness. In principle, text matching and the coding, clustering, and categorization of topics can be applied to any kind of text, not only job descriptions.