How Multilingual Text Analytics Advances Patient Centricity in Health Care

By Matthias Hartung

At Semalytix, we are committed to supporting pharmaceutical companies in patient-focused drug development by providing access to the authentic, accurate, and unsolicited voice of billions of patients. Using Semalytix's analytical platform Pharos®, our customers can obtain meaningful insights from a variety of online sources to find out what really matters to patients coping with a particular disease:

  • How do patients live and feel with their disease? Which burdens are they facing?
  • How do they experience treatments? Which treatment benefits and values do they acknowledge?
  • What are their unmet needs?
  • Which trade-offs and preferences determine their choice of treatment?
  • What makes them feel better?


Pharos® enables brand and marketing managers, HEOR analysts and other stakeholders from the pharmaceutical industry to tune in to the previously unheard and unbiased voice of patients in real time, and to learn from thousands of disease-specific patient populations around the world.


In order to gather, analyze and transform patients’ voices into valuable patient-reported health outcomes, Semalytix follows a machine reading approach that implements various methodologies from Natural Language Processing, Machine Learning and Artificial Intelligence. One of our machine learning models, for instance, focuses on sentiment analysis in order to predict whether a patient perceived a particular drug positively or negatively. Other algorithms analyze which aspects of patients’ quality of life are impacted the most, and in exactly which way, when patients are treated with a particular drug.


An inherent challenge is posed by the fact that many patients around the globe prefer to make their voice heard in their native language. Therefore, in order to understand patients’ preferences across different local markets, or to account for differences between ethnographic populations, multilingual processing capabilities are crucial for online patient listening. Given the existence of more than 7,000 languages across the world, tailoring a separately engineered solution to each of these languages is neither efficient nor scalable, given the cost and effort it takes to develop machine reading algorithms at performance levels that come close to human understanding.


At Semalytix, we therefore follow strict cross-lingual transfer policies in extending our multilingual machine reading stack: we take models and algorithms that have already reached a substantial degree of technical maturity in some languages as a starting point and transfer them into other languages without the need to start from scratch. For this transfer step from a source language model into a target language model, various technical approaches exist in the scientific literature, depending on the type and complexity of the source model, the particular language pair, and other parameters. Among these approaches, cross-lingual transfer learning can be seen as the state of the art because it is highly versatile across tasks and adaptable to particular technical domains or linguistic genres.


While effectively avoiding the need to tailor machine reading models to every individual language of interest, cross-lingual transfer learning still comes at its own cost. In the source language, ground-truth training signals need to be provided as manually annotated data points. Usually, the availability of such annotations is a given from the training phase of the already existing source language model. However, the main challenge in cross-lingual transfer approaches is for the model to learn a mapping of feature representations across languages, which requires some guidance in order to ensure that features strongly associated with task-specific information in the source language model (e.g., words and phrases that are highly indicative of positive sentiment in the English language) are appropriately “transferred” into the target language model.
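To make the idea of a cross-lingual feature mapping more concrete, here is a minimal sketch of one technique described in the literature on cross-lingual word embeddings: learning an orthogonal projection between two embedding spaces from a small set of translation pairs (orthogonal Procrustes). It is not a description of how Pharos® works internally. The function name `learn_projection`, the toy dimensionality and the synthetic vectors are assumptions made purely for illustration; real embeddings would come from pretrained models.

```python
import numpy as np


def learn_projection(tgt_vecs: np.ndarray, src_vecs: np.ndarray) -> np.ndarray:
    """Return the orthogonal matrix W that minimizes ||tgt_vecs @ W - src_vecs||.

    Row i of tgt_vecs and row i of src_vecs are assumed to be the embeddings
    of a translation pair (target-language word, source-language word).
    """
    u, _, vt = np.linalg.svd(tgt_vecs.T @ src_vecs)
    return u @ vt


# Synthetic stand-in for real embeddings (assumption): the "target language"
# space is simply a rotated copy of the "source language" space.
rng = np.random.default_rng(0)
dim = 5
src_vecs = rng.normal(size=(20, dim))                    # 20 hypothetical translation pairs
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # random orthogonal matrix
tgt_vecs = src_vecs @ rotation.T

W = learn_projection(tgt_vecs, src_vecs)
projected = tgt_vecs @ W  # target-language vectors mapped into the source space

# Largest deviation between projected target vectors and their source counterparts
max_error = float(np.abs(projected - src_vecs).max())
```

Once target-language vectors live in the source space, a classifier trained on source-language features can in principle be applied to them. With real embeddings the fit is never exact, and its quality depends on which translation pairs are used to learn the projection.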


Appropriate bilingual lexical resources such as translation dictionaries or task-specific vocabularies are required to inform the transfer learning procedure; the performance of the resulting target language model will strongly depend on the availability of high-quality, task-specific lexical resources. Hence, compared to traditional supervised learning approaches, cross-lingual transfer learning can be seen as trading manual annotation costs for selection and optimization efforts in leveraging lexical resources. Put differently, following a cross-lingual transfer learning methodology changes the fundamental underlying question from “What’s the most effective way to get labeled training data?” into “What are the most appropriate language resources I can access?”
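The dependence on task-specific lexical resources can be illustrated with a deliberately simplified toy example. The sketch below projects a tiny English sentiment lexicon into German through two dictionaries and compares how much of the sentiment signal survives. All word lists, weights, the language pair and the function names are hypothetical and chosen for illustration only; they do not reflect actual Semalytix resources or models.

```python
# Hypothetical English sentiment lexicon: word -> polarity weight.
EN_SENTIMENT = {"relief": 1.0, "better": 1.0, "pain": -1.0, "nausea": -1.0}

# Two hypothetical English -> German dictionaries: a small general-purpose one
# and one that also covers domain-specific vocabulary.
GENERAL_DICT = {"better": "besser", "pain": "schmerzen"}
DOMAIN_DICT = {
    "better": "besser",
    "pain": "schmerzen",
    "relief": "linderung",
    "nausea": "übelkeit",
}


def project_lexicon(src_lexicon, dictionary):
    """Carry polarity weights over to the target language via the dictionary."""
    return {dictionary[w]: s for w, s in src_lexicon.items() if w in dictionary}


def coverage(src_lexicon, dictionary):
    """Fraction of source lexicon entries that the dictionary can translate."""
    return sum(w in dictionary for w in src_lexicon) / len(src_lexicon)


def score(tokens, lexicon):
    """Sum the polarity weights of all tokens found in the lexicon."""
    return sum(lexicon.get(t, 0.0) for t in tokens)


# Toy German post: "strong nausea after taking it"
post = "starke übelkeit nach der einnahme".split()

general_score = score(post, project_lexicon(EN_SENTIMENT, GENERAL_DICT))
domain_score = score(post, project_lexicon(EN_SENTIMENT, DOMAIN_DICT))
```

In this toy setting, the general-purpose dictionary translates only half of the lexicon and assigns the post a neutral score, while the domain-specific dictionary recovers the negative signal. Real transfer methods are considerably more involved, but the same coverage effect is one reason why the choice of lexical resource matters.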


In this context, Semalytix is very fortunate to be part of the EU-funded Horizon 2020 research project “Prêt-à-LLOD”, which aims to develop technical infrastructure for the automated discovery and practical, industry-scale deployment of language resources in multilingual text analytics and many other language technology tasks and applications. Prêt-à-LLOD capitalizes on the Linguistic Linked Open Data (LLOD) cloud – an initiative founded in 2012 by the Open Linguistics Group of the Open Knowledge Foundation. In its current state, the LLOD cloud hosts more than 100,000 language resources for over 1,000 languages. As one of their main goals, the collaboration partners in Prêt-à-LLOD aim at transforming these resources into ready-to-use interoperable data assets that can be deployed in multilingual language technology workflows and applications.


For Semalytix, participating in Prêt-à-LLOD as an industry partner creates a major opportunity to continuously enhance the multilingual capabilities of Pharos®. Based on our highly sophisticated technical solutions, we want to provide a quantum leap in patient centricity in health care, because we believe that every patient’s voice should be heard – wherever they live on earth, and whichever language they speak.



Literature links:

  • Matthias Hartung, Matthias Orlikowski and Susana Veríssimo (2020): Evaluating the Impact of Bilingual Lexical Resources on Cross-lingual Sentiment Projection in the Pharmaceutical Domain.
  • Thierry Declerck, John McCrae, Matthias Hartung et al. (2020): Recent Developments for the Linguistic Linked Open Data Infrastructure. LREC 2020.
  • Jeremy Barnes and Roman Klinger (2019): Embedding Projection for Targeted Cross-lingual Sentiment: Model Comparisons and a Real-World Study. Journal of Artificial Intelligence Research 66.
  • Anders Søgaard, Ivan Vulić, Sebastian Ruder and Manaal Faruqui (2019): Cross-lingual Word Embeddings. Synthesis Lectures on Human Language Technologies. Morgan & Claypool.
