At Semalytix, we are developing cutting-edge technology for machine reading, allowing machines to automatically sift through large volumes of unstructured content (see our blog entry on Dark Data) to extract key insights that support decision making.
Our algorithms rely on human-provided training data that teaches them what to look for and what to extract. Providing this training data is costly and typically a bottleneck.
In most scenarios in which the Semalytix machine reading stack is applied, we need to analyze data in multiple languages. Two approaches to multilingual content are possible. The first relies on a lingua franca, typically English: translation technology (e.g. DeepL) is used to translate all content into English, so that only an English analytical pipeline is needed. The drawback is that errors made by the automatic translation are carried into the analysis, and linguistic nuances that are not preserved in the translation are lost.
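To make the translate-then-analyze idea concrete, here is a minimal sketch. The `translate_to_english` function is a hypothetical stand-in for a machine-translation service such as DeepL; a toy lookup table plays that role here so the example is self-contained, and the "English pipeline" is a deliberately naive cue-word check.

```python
# Sketch of the lingua-franca pipeline: translate everything to English,
# then run a single English-only analysis step. All names and data below
# are illustrative assumptions, not a real translation or analysis system.

TOY_MT = {  # hypothetical, drastically simplified "translation model"
    "Das Produkt ist großartig": "The product is great",
    "Le produit est décevant": "The product is disappointing",
}

POSITIVE_CUES = {"great", "excellent"}
NEGATIVE_CUES = {"disappointing", "awful"}

def translate_to_english(text: str) -> str:
    """Stand-in for an MT service; in practice this step introduces errors."""
    return TOY_MT[text]

def english_sentiment(text: str) -> str:
    """A naive English-only analysis step based on cue words."""
    words = set(text.lower().replace(".", "").split())
    if words & POSITIVE_CUES:
        return "positive"
    if words & NEGATIVE_CUES:
        return "negative"
    return "neutral"

def analyze(text: str) -> str:
    # Everything is routed through English, so any translation error
    # or lost nuance propagates into the analysis.
    return english_sentiment(translate_to_english(text))

print(analyze("Das Produkt ist großartig"))  # positive
print(analyze("Le produit est décevant"))    # negative
```

Note that the analysis quality is capped by the translation quality: if the toy lookup mistranslated a cue word, the pipeline would misclassify the text with no way to recover.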
The second approach consists of training extraction systems for each language of interest. This requires training data in every language and is therefore costly.
In a new paper accepted for presentation at the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2019), one of the top-5 conferences worldwide in the field of natural language processing, we have proposed an alternative. The approach relies on cross-lingual representations of word meaning, so-called embeddings. Models trained on these cross-lingual representations can then be used directly for multiple languages. It is a "zero-shot" approach: it can be applied to a new language after having seen zero training examples in that language. Our paper shows that this approach is feasible for opinion extraction across languages, with impressive performance that is competitive with state-of-the-art systems for a number of languages, without using any training data for those languages. This lays the foundation for a cost-effective approach to multilingual machine reading.
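The core intuition behind zero-shot transfer can be sketched in a few lines. The tiny hand-made vectors below are illustrative stand-ins for real cross-lingual word embeddings (where translation-equivalent words land near each other in a shared space), and the nearest-centroid classifier is a deliberately simple placeholder for a trained model, not the method of the paper.

```python
# Minimal sketch of zero-shot cross-lingual classification over a shared
# embedding space: train on English labels only, then apply the same model
# to German words it has never seen. Vectors are made up for illustration.

EMB = {
    # English training vocabulary
    "good":      [0.9, 0.1, 0.0],
    "terrible":  [0.0, 0.9, 0.1],
    "excellent": [0.8, 0.2, 0.1],
    "awful":     [0.1, 0.8, 0.0],
    # German words (never seen during training) sit near their English
    # counterparts, as a cross-lingual embedding space would arrange them
    "gut":         [0.85, 0.15, 0.05],
    "schrecklich": [0.05, 0.85, 0.10],
}

def centroid(words):
    """Average the embedding vectors of the given words."""
    vecs = [EMB[w] for w in words]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def classify(word, pos_c, neg_c):
    """Nearest-centroid sentiment decision in the shared space."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    v = EMB[word]
    return "positive" if sqdist(v, pos_c) < sqdist(v, neg_c) else "negative"

# "Training" uses English examples only ...
pos_c = centroid(["good", "excellent"])
neg_c = centroid(["terrible", "awful"])

# ... yet the model transfers to German with zero German examples.
print(classify("gut", pos_c, neg_c))          # positive
print(classify("schrecklich", pos_c, neg_c))  # negative
```

Because the two languages share one vector space, nothing in the classifier is language-specific; that is what makes the zero-shot setting possible.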
The paper additionally shows that adding only a few training examples in a target language on top of the zero-shot model increases performance substantially, yielding top performance on the task of opinion and sentiment extraction with only a few dozen training examples.
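A toy sketch of this few-shot step, under the same illustrative assumptions as before: the "model" is a pair of class centroids learned from English, and a handful of labelled German examples nudge those centroids via simple interpolation. The vectors and the update rule are assumptions for illustration, not the paper's actual training procedure.

```python
# Few-shot adaptation sketch: an English-trained sentiment "model"
# (two class centroids) is adapted with one labelled German example
# per class, improving decisions on further German words.

EMB = {
    # English training vocabulary
    "good": [0.9, 0.1],
    "bad":  [0.1, 0.9],
    # German vocabulary in a (toy) shared embedding space
    "gut":      [0.70, 0.30],
    "schlecht": [0.30, 0.70],
    "toll":     [0.75, 0.25],
}

def mix(a, b, alpha=0.5):
    """Move centroid a toward b by interpolation weight alpha."""
    return [(1 - alpha) * x + alpha * y for x, y in zip(a, b)]

def classify(word, pos_c, neg_c):
    def sqdist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))
    v = EMB[word]
    return "positive" if sqdist(v, pos_c) < sqdist(v, neg_c) else "negative"

# Zero-shot model: centroids come from English examples only.
pos_c, neg_c = EMB["good"], EMB["bad"]

# Few-shot step: fold in one labelled German example per class.
pos_c = mix(pos_c, EMB["gut"])
neg_c = mix(neg_c, EMB["schlecht"])

print(classify("toll", pos_c, neg_c))  # positive
```

The point mirrors the paper's finding at toy scale: a small amount of target-language supervision shifts an already-transferring model toward that language, rather than training it from scratch.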
This result will allow Semalytix to cover more languages with greater accuracy than before.
Semalytix. Never content with the state-of-the-art. Continuing to innovate.