Text mining (also known as text analytics) is the exploration and analysis of large amounts of unstructured text data. With the aid of computer algorithms, many useful aspects can be identified from text data, such as concepts, patterns, topics, keywords and sentiments.
Thanks to the development of big data platforms and complex algorithms that can analyze vast amounts of unstructured data, text mining is becoming more practical and accessible to data scientists and other data users. Increasingly, text mining is being used to extract valuable insights from corporate documents, customer emails, call center logs, social media messages, medical records, annual reports, legal texts and other important sources of text-based data.
Semantic analysis is the process of interpreting natural language, the way people communicate, based on meaning and context. Machine learning (ML) can be used to capture the meaning of a piece of text. This applies to answers to open-ended questions, dossier texts, paragraphs and reports, as well as to messages such as emails and social media posts. The textual information can then be analyzed and visualized in terms of semantic weighting and coherence.
At CentERdata we apply natural language processing (NLP) techniques and write code in Python, R and C#, among other languages. We use advanced packages and toolboxes in this area. Successful text mining, data processing and analysis rest on writing strong algorithms, setting up automated processes (scripts) and linking to open data interfaces such as APIs.
Prior to the 2017 Dutch national parliamentary elections, we analyzed Twitter messages for sentiment and party interaction. By tracing patterns, analyzing semantics and determining pragmatic properties of these messages (such as sentiment value), we were able to establish correlations between political parties and the nature of the Twitter messages.
Another example of this type of analysis is the so-called happiness live ticker. In this project, sentiment information is extracted from Dutch tweets to find out how happy people feel. The result is displayed visually on the "temporary happiness score of Dutch tweets" web page.
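To make the idea of a happiness score concrete, here is a minimal Python sketch of lexicon-based sentiment scoring. It is not the method used in the project, which is not described here. The word list, its weights, the function names and the rescaling to a 0-100 range are all assumptions made for illustration.

```python
import re
from statistics import mean

# Hypothetical toy lexicon: Dutch words with invented sentiment weights in [-1, 1].
LEXICON = {
    "blij": 1.0,         # happy
    "gelukkig": 1.0,     # happy, fortunate
    "mooi": 0.5,         # nice
    "slecht": -0.5,      # bad
    "boos": -1.0,        # angry
    "verdrietig": -1.0,  # sad
}


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"\w+", text.lower())


def tweet_score(text):
    """Mean lexicon weight of the matched words; 0.0 if no word matches."""
    weights = [LEXICON[token] for token in tokenize(text) if token in LEXICON]
    return mean(weights) if weights else 0.0


def happiness_score(tweets):
    """Average tweet score over a batch, rescaled from [-1, 1] to [0, 100]."""
    if not tweets:
        raise ValueError("need at least one tweet")
    average = mean(tweet_score(tweet) for tweet in tweets)
    return 50.0 * (average + 1.0)


if __name__ == "__main__":
    sample = [
        "Ik ben blij!",
        "Wat een slecht weer, ik ben boos",
        "Morgen vergadering",
    ]
    print(round(happiness_score(sample), 1))
```

A word list like this ignores negation, irony and context, so a production system would more likely rely on a trained model. The sketch only shows how scores per tweet can be aggregated into a single figure that is recomputed as new tweets arrive.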
Commissioned by Tilburg Law School, we applied topic modeling to English-language documents, such as annual reports and minutes of shareholders' meetings, from companies listed on various European stock exchanges. These documents contain approximately five million unique words. The topics are displayed graphically using data science techniques. In this way, it becomes clear what the most important words are per topic, how important each topic is within the corpus, and how similar it is to other topics.
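The three quantities mentioned above (top words per topic, topic importance within the corpus, and similarity between topics) can be derived from the output of any fitted topic model. The Python sketch below assumes such output is already available as plain dictionaries. The topic names, words and all numbers are invented for illustration, and cosine similarity is one common choice rather than the measure used in the project.

```python
import math

# Hypothetical output of a fitted topic model (all values invented).
# topic_word[k][w]: weight of word w in topic k.
topic_word = {
    "governance": {"board": 0.40, "shareholder": 0.35, "vote": 0.20, "risk": 0.05},
    "finance": {"revenue": 0.45, "profit": 0.30, "risk": 0.15, "shareholder": 0.10},
    "meetings": {"vote": 0.40, "shareholder": 0.30, "board": 0.25, "agenda": 0.05},
}
# doc_topic[d][k]: share of document d attributed to topic k.
doc_topic = [
    {"governance": 0.7, "finance": 0.2, "meetings": 0.1},
    {"governance": 0.1, "finance": 0.8, "meetings": 0.1},
    {"governance": 0.3, "finance": 0.1, "meetings": 0.6},
]


def top_words(topic_word, topic, n=3):
    """Return the n highest-weighted words of a topic."""
    weights = topic_word[topic]
    return sorted(weights, key=weights.get, reverse=True)[:n]


def topic_prevalence(doc_topic, topic):
    """Mean share of the topic across all documents in the corpus."""
    return sum(doc.get(topic, 0.0) for doc in doc_topic) / len(doc_topic)


def topic_similarity(topic_word, a, b):
    """Cosine similarity between the word-weight vectors of two topics."""
    wa, wb = topic_word[a], topic_word[b]
    dot = sum(wa[w] * wb[w] for w in wa.keys() & wb.keys())
    norm_a = math.sqrt(sum(v * v for v in wa.values()))
    norm_b = math.sqrt(sum(v * v for v in wb.values()))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    for topic in topic_word:
        print(topic, top_words(topic_word, topic),
              round(topic_prevalence(doc_topic, topic), 2))
    print(round(topic_similarity(topic_word, "governance", "meetings"), 2))
```

In this toy data, "governance" and "meetings" share several heavily weighted words and therefore come out as more similar to each other than "governance" and "finance". A graphical display of topics is typically built on the same kind of quantities.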