Embedding Non-English Data

Mark Berger
6 min read · May 16, 2021

Introduction

Sentence embedding models like Sentence-BERT [1] have become indispensable tools for working with sentence-level data in areas like semantic search, summarization, question answering, and many more. The strength of these models lies in their ability to assign semantically meaningful representations to sentences directly, thanks to the way they were trained. Before that, the most common way to generate a sentence-level representation was to apply some heuristic to the individual token embeddings within the sentence, think average- and max-pooling. While that approach is explainable and straightforward to use, it inherently loses semantic information in the pooling step.

Without going into too much detail on sentence embedding models here, for this article it is important to know two things:

  1. Models like Sentence-BERT produce semantically meaningful representations for full sentences that can be used successfully both in downstream tasks and for comparing embeddings directly in a Semantic Textual Similarity (STS) task.
  2. Sentence-BERT is trained on English data (the SNLI [2] and MultiNLI [3] datasets). That means the model can only produce meaningful embeddings for the English language! It is so widely popular, of course, because there is a lot of English data out there and many contemporary NLP challenges in business and academia involve English-speaking users or researchers.

However, data practitioners and researchers seeking to work with non-English data are faced with a monumental question:

How can we produce meaningful sentence-level embeddings for non-English data?

In the past few years, a number of publications have aimed to address this question by introducing models that can produce sentence-level embeddings for other languages. One example is the Multilingual Universal Sentence Encoder for Semantic Retrieval, which builds upon Google’s Universal Sentence Encoder. It produces sentence representations using a dual-encoder model trained to maximize the representational similarity between sentence pairs drawn from parallel data [4]. Another notable one is LASER (Language-Agnostic SEntence Representations), which is trained in a similar fashion. And here we get to the gist of what we want:

We want to have language agnostic sentence representations that map to nearby regions within semantic space if the sentences are similar.

Yes, that’s right: semantically similar sentences in different languages have similar representations! That essentially means that if we translated an English sentence to, for example, Dutch, and then encoded both sentences using such a language-agnostic model, we would get embeddings with a high cosine similarity between them.

That has a lot of possible applications!

Think multilingual neural semantic search, question answering, translation ranking, and semantic textual similarity itself. Here I want to zoom in on the first one, as it is a subject of personal interest to me.

In semantic search, we generally want to search through a collection of documents using a query. We do so by first embedding all the documents in the collection and indexing those embeddings. Afterwards, we embed the query using the same embedding model and then perform some form of nearest-neighbour search within the document embedding space. There are many great resources out there that explain this process, so for this article, I will assume you know the gist ;)

The thing is, the document embeddings and the query embedding have to be in the same embedding space for this search to work! This means that, until recently, if you wanted to use Sentence-BERT, you could only perform this operation on English data, as the model only produces meaningful embeddings for that language.

Language-agnostic BERT sentence embedding

Enter LaBSE [5]. And now comes the interesting part: in a setting where you have many English and Dutch documents, but your users are not sufficiently proficient in English or Dutch, they can write their query in their own language (as long as it is supported by the model, of course) and retrieve documents using the embedding of that query!

Let’s see how this works in practice.

We are going to use the sentence-transformers Python library by the great folks over at UKPLab.

First, install the library, do the imports, and initialize a LaBSE model. This will download the model files once, so be patient with that! :)

Preparing to use LaBSE
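The setup takes only a few lines. A minimal sketch, assuming the LaBSE checkpoint published on the Hugging Face model hub as sentence-transformers/LaBSE:

```python
# Install the library first (once):
# pip install sentence-transformers

from sentence_transformers import SentenceTransformer

# Load LaBSE; the first call downloads the model files to the local cache.
model = SentenceTransformer("sentence-transformers/LaBSE")
```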

Now that we have loaded the LaBSE model in memory, we can apply it to our documents to produce the document embeddings.

Given the sentence “My name is Sarah and I work in London”, we can now produce a sentence embedding as follows:

Embedding a single English sentence
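Continuing with the model loaded above, a sketch of this step could look as follows:

```python
sentence = "My name is Sarah and I work in London"

# encode() returns a numpy array; LaBSE embeddings have 768 dimensions.
embedding = model.encode(sentence)
print(embedding.shape)  # (768,)
```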

Let’s explore three possible applications.

English sentence similarity

The first one is the most common: given two sentences in English, we want to see how similar they are. This can be done with many different embedding libraries, but for the sake of completeness, I will show how you can do it with LaBSE as well:

Calculating similarity between two English sentences
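A sketch of this step, with an illustrative sentence pair (the score you get will of course depend on the sentences you choose):

```python
from sentence_transformers import util

# An illustrative pair of similar English sentences.
english_sentences = [
    "My name is Sarah and I work in London",
    "I am called Sarah and my job is in London",
]

embeddings = model.encode(english_sentences)

# Cosine similarity between the two sentence embeddings.
similarity = util.pytorch_cos_sim(embeddings[0], embeddings[1])
print(float(similarity))
```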

Our two similar sentences do indeed have a high cosine similarity between them, at 0.91! This means that the LaBSE model produced meaningful embeddings for the sentences, and can be used for calculating the similarities between different sentences.

Now, this is not new, and as I said earlier, this can be achieved by multiple other (and more lightweight) models. The true strength of this model comes in the applications I’m about to showcase.

Dutch sentence similarity

In the introduction, I discussed how much we need the ability to produce meaningful embeddings for languages other than English. LaBSE is a language-agnostic model, which means we can input text in any supported language (technically not any language, but a lot of them, 109 to be exact) and get meaningful embeddings! Let’s test it on a language I work with heavily: Dutch.

Calculating similarity between two Dutch sentences
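Again, the Dutch pair below is illustrative; both sentences say roughly “my name is Sarah and I work in London”:

```python
dutch_sentences = [
    "Mijn naam is Sarah en ik werk in Londen",     # "My name is Sarah and I work in London"
    "Ik heet Sarah en ik ben werkzaam in Londen",  # "I am called Sarah and I am employed in London"
]

dutch_embeddings = model.encode(dutch_sentences)
print(float(util.pytorch_cos_sim(dutch_embeddings[0], dutch_embeddings[1])))
```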

Great, the two Dutch sentences are indeed very similar (believe me on this one), and their cosine similarity is a whopping ~0.95! Mind you, this is produced using the same model checkpoint as for the English example above! This means that LaBSE has an inherent understanding of Dutch and can successfully be used to compare Dutch sentences with one another. This in itself is already reason enough for me to be very happy and to use this great model, but it gets even cooler than that.

Because the next and last functionality I want to show in this blog is quite mind-blowing.

Cross-language sentence similarity

Sometimes, your data is in all kinds of languages, but the users of your platform are not proficient in all those languages.

Ideally, you want to be able to search through multilingual data using one language. Let’s see what LaBSE can make of this task by embedding sentences in different languages with the same model and checking the similarity between the produced embeddings.

Calculating similarity between English and Dutch sentences
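An illustrative English–Dutch pair (a sentence and its translation), embedded with the very same model:

```python
english_embedding = model.encode("My name is Sarah and I work in London")
dutch_embedding = model.encode("Mijn naam is Sarah en ik werk in Londen")  # Dutch translation

# Both embeddings live in the same space, so we can compare them directly.
print(float(util.pytorch_cos_sim(english_embedding, dutch_embedding)))
```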

Wow! The similarity between the Dutch and the English sentences is really high! This means that the model understands that the Dutch and the English sentences convey a very similar message!

Let’s try this with German and Dutch:

Calculating similarity between German and Dutch sentences
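The same idea with an illustrative German–Dutch translation pair:

```python
german_embedding = model.encode("Mein Name ist Sarah und ich arbeite in London")  # German
dutch_embedding = model.encode("Mijn naam is Sarah en ik werk in Londen")         # Dutch
print(float(util.pytorch_cos_sim(german_embedding, dutch_embedding)))
```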

The two sentences are literal translations of each other in their respective languages, and the model confirms that by producing sentence representations with an unbelievable 0.995 cosine similarity between them!

If we had a semantic search system based on cosine similarities between document embeddings, and the dataset contained a set of German sentences, we could query this dataset with a Dutch query and almost certainly retrieve relevant results!
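As a rough sketch of what that could look like with the same library (the tiny German “corpus” and the Dutch query below are purely illustrative):

```python
# Illustrative German documents to search through.
corpus = [
    "Mein Name ist Sarah und ich arbeite in London",  # "My name is Sarah and I work in London"
    "Das Wetter ist heute sehr schön",                # "The weather is very nice today"
    "Ich esse am Wochenende gerne Pizza",             # "I like to eat pizza on the weekend"
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# A Dutch query: "Where does Sarah work?"
query_embedding = model.encode("Waar werkt Sarah?", convert_to_tensor=True)

# Cosine-similarity nearest-neighbour search over the corpus embeddings.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```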

It’s easy to see how useful this is! The possibilities stream to my mind as I write this sentence ;)

Conclusion

On the basis of the few examples above, we can see that the LaBSE model is a gateway to a whole range of possible new applications, from producing meaningful embeddings for non-English textual data to cross-lingual embedding comparisons.

I’m very excited to use LaBSE embeddings in my projects, and I hope I made you excited as well!

[1] Nils Reimers and Iryna Gurevych. 2019. SentenceBERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.

[2] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.

[3] Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.

[4] Muthuraman Chidambaram, Yinfei Yang, Daniel Cer, Steve Yuan, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Learning Cross-Lingual Sentence Representations via a Multi-task Dual-Encoder Model. arXiv preprint arXiv:1810.12836.

[5] Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2020. Language-agnostic BERT Sentence Embedding. arXiv preprint arXiv:2007.01852.
