Corinne Sharabi
Corinne is the Social Media and Content Lead at BLEND. She is dedicated to keeping global business professionals up to date on all things localization, translation, language and culture.
In this episode of the Localization Leaders podcast, BLEND CEO Yoav Ziv sat down with one of the most influential voices in the language technology industry, Jaap van der Meer, CEO of TAUS. Together, they explore Jaap’s journey from linguistics student to founder, how TAUS became a key data provider, and the evolution of translation technology in the age of AI and automation.
Watch the full interview below:
That’s a long story, as you can imagine, given that I’m somewhat older now. I’ve been working in this industry for half a century. I’m joking, but it’s been a long time.
Before TAUS, I studied linguistics and literature at the University of Amsterdam, and then found a job alongside my studies at a company that I didn’t know before, but it had three letters: IBM, International Business Machines. They needed somebody to proofread all of their materials, documents, and brochures, which I was very capable of doing because of my linguistic studies. But one thing led to another. I bypassed Randstad, the temporary staffing company, and started invoicing IBM directly.
Then the first desktop computers came out; they came with lots of manuals that needed to be translated. So I started a business.
I started maybe one of the first localization companies, working for IBM. Then Microsoft and all those other computer companies joined and started outsourcing their translation work to us, which gradually became localization work.
I built up a company with offices in all the major capitals of Europe, providing this service to the computer industry. We sold this company to RR Donnelley, and after some further changes it became Lionbridge.
Then, I joined another company headquartered in Salt Lake City called ALPS, Automated Language Processing Systems, a spin-off of Brigham Young University, which had the mission to translate into all the languages of the world. It was a machine translation company by origin, so to speak, but it spun off into services because machine translation technology couldn’t be sold just like that. [Machine translation] wasn’t working very well compared to what we see today. That was a great adventure. It was a NASDAQ-listed company, and we managed to make it grow very quickly. That company later became part of SDL and then RWS. So you can see I have my roots in some of the big companies in this industry.
From the start, I had no idea about computers. I was not a technologist, but when I started translating in the early days for IBM and some of the software companies, I didn’t think it was a very rewarding or an intellectually challenging job for somebody with an academic background. I very quickly became intrigued by what computers could do for language. Obviously, at the time, computers did number crunching, but not much with text apart from word processing and desktop publishing applications.
So in my first company, I started investing in technology. First translation memory software, then dictionary lookup tools, and one thing led to another. We started doing corpus analysis and early NLP-related applications. I learned it on the job, so to speak, and became a big advocate of automation. So, around 2005, when I was done running big global translation companies, I naturally moved into more of a think tank.
I organized meetings for early adopters of machine translation to come together and talk about whether statistical machine translation technology could potentially be implemented and used for real services – and that became TAUS.
We started TAUS as a think tank, but soon became collectors of language data or translation data. Everybody in these meetings said, “Can I have your translation memory data? Because if we have more data, then we can train our statistical engines to do a better job.” I thought it was a great idea. We had a small team in San Francisco working together for a couple of months to create the first incarnation of the TAUS data cloud, a reciprocal model. If you upload your data, you earn credits to download other people’s data. That was a good formula. Everybody started contributing.
Yeah, we started this even before Google Translate, way too early. Everybody who knows me says, “Jaap, you’re always too early.” But the thing is, what I told myself is that innovation and automation is always accelerating. As the time gap gets smaller and smaller, then eventually I’ll be just in time. Now, I’m just in time.
Let me back up a little bit and tell you how it happened. Through the TAUS data cloud, we became owners of the largest repository of data in our industry. With massive volumes of data – 70 billion words and 500 language pairs – it was very tempting for us to develop our own machine translation platform, because everybody knows that to get a good machine translation engine, you need data and you need algorithms.
Often at our events we would ask the MT gurus, “What is more important to advance MT, the algorithms or the data?” Typically, the answer would be data because the algorithms are generally available, very often open source. We thought we should launch our own machine translation engine since we have direct access to so much data. But after careful consideration over the years, we never did it because we realized there will be plenty of machine translation engines out there. And as a small company, we can’t beat the big tech companies anyway.
But the question always stayed on our minds – how do we then utilize all this data? We began to transform from a think tank into a software and data company. We sold data, managed it, cleaned it, and made more domain-specific data sets for companies that were developing machine translation like Microsoft, Google, and Amazon, as well as the dedicated MT companies.
About three years ago, Uber came to us and said, “Hey, guys, can you help us develop a quality prediction model, a quality estimation model?” So we said, “Sure, we have the data, we can do that.”
I had heard about quality prediction models as a research topic for maybe a decade, but there were no practical implementations. Nobody was doing it. So we started building it, and to our own surprise, the model was able to assess whether something was properly translated or not. That was still the job of the human post-editor or the LQA specialist in many translation or localization companies. That was an eye opener. Then we started to refine it and build the infrastructure around it to make it scalable. That’s how it happened.
The way it works is not very different from how machine translation works. It uses transformers, embeddings, and mathematical models in a language-agnostic, universal representation of languages. You transform a sentence into a mathematical representation, and then you match that with another language. Then you get a score that represents how close the two segments are from one language to the other.
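As a rough illustration of that idea (not the actual EPIC implementation), here is a minimal sketch that encodes a source and a target segment with an off-the-shelf multilingual sentence-embedding model and compares them with cosine similarity. The model name and example sentences are assumptions chosen for the sketch.

```python
# Minimal sketch of embedding-based cross-lingual comparison.
# Not the EPIC implementation: it only illustrates the idea of mapping
# both segments into one language-agnostic vector space and scoring closeness.
# Assumes the sentence-transformers package and the public LaBSE checkpoint.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")  # multilingual encoder

source = "The invoice is due within 30 days."
target = "La factura vence en un plazo de 30 días."

# Encode both segments into the shared embedding space.
src_vec, tgt_vec = model.encode([source, target], convert_to_tensor=True)

# Cosine similarity serves as a crude "how close are these segments" score.
score = util.cos_sim(src_vec, tgt_vec).item()
print(f"cross-lingual similarity: {score:.3f}")
```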
You train the models with data, the same as you train machine translation models, except now you also feed it bad-quality or negative examples so the model learns to distinguish good from bad quality and cover all the nuances.
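A hypothetical sketch of that kind of training setup: treat known-good translation pairs as positive examples and deliberately mismatched or truncated pairs as negative examples, then fit a simple classifier on a couple of features. Everything here (the features, labels, and classifier) is an illustrative assumption, far simpler than a production quality estimation model.

```python
# Toy sketch of training with positive and negative examples so a model
# learns to separate good from bad translations. Purely illustrative:
# real QE models fine-tune large multilingual transformers on much richer data.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Pretend features per segment pair, e.g. [embedding similarity, length ratio].
X = np.array([
    [0.92, 1.05],  # good translation: high similarity, plausible length ratio
    [0.88, 0.97],  # good translation
    [0.35, 1.10],  # negative example: mismatched target sentence
    [0.71, 0.40],  # negative example: truncated / omitted content
])
y = np.array([1, 1, 0, 0])  # 1 = acceptable quality, 0 = not acceptable

clf = LogisticRegression().fit(X, y)

# Score a new segment pair: the probability acts as a rough quality estimate.
new_pair = np.array([[0.85, 0.95]])
print(clf.predict_proba(new_pair)[0, 1])
```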
It not only checks for accuracy, but also for fluency. As some of our users say, it provides an astonishing and surprisingly accurate judgment of quality. But to create the model, a lot of the work goes into finding the right data, tuning the data, tuning the models – it’s a lot of empirical work.
It’s very hard for people who are trained very traditionally to just automatically trust the model. We provide what we call a “Confidence Index” where you can check what the confidence level is and what the risk factor is for each language. Indeed, many of our customers run the EPIC model in parallel with human review in the beginning to judge performance and build trust.
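One hypothetical way to run that parallel phase: collect the model’s pass/fail decisions alongside the human reviewer’s verdicts for the same segments and measure agreement per language. The threshold and records below are invented for illustration, not real EPIC output.

```python
# Hypothetical parallel run: compare model pass/fail decisions with human
# LQA verdicts on the same segments to judge how much to trust the model.
from collections import defaultdict

THRESHOLD = 0.8  # illustrative cut-off: scores above it count as "pass"

# (language, model_score, human_verdict) - toy records for the sketch
records = [
    ("de", 0.91, "pass"), ("de", 0.65, "fail"), ("de", 0.84, "pass"),
    ("ja", 0.88, "fail"), ("ja", 0.55, "fail"), ("ja", 0.93, "pass"),
]

agree = defaultdict(int)
total = defaultdict(int)
for lang, score, human in records:
    model_verdict = "pass" if score >= THRESHOLD else "fail"
    total[lang] += 1
    agree[lang] += model_verdict == human

for lang in total:
    print(f"{lang}: model/human agreement {agree[lang] / total[lang]:.0%}")
```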
That’s a very interesting question.
I was talking to the CEO of one of the super agencies some time ago who said that this dilemma reminded him of something the CEO of Unilever said: “I know that 50% of my marketing budget is wasted, except I don’t know which 50%.”
You could apply the same to localization because probably more than 50% of the machine translation output is good to go. But still, we have humans looking at everything because we don’t know which 50%. It’s a huge saving opportunity that’s out there. We can realize those efficiency gains with quality estimation, with EPIC.
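A sketch of how quality estimation turns that into savings: route only segments that score below a confidence threshold to human review and let the rest go straight through. The threshold and scores are illustrative assumptions, not values from EPIC.

```python
# Illustrative routing: only low-scoring segments go to human review,
# everything else is auto-approved, which is where the savings come from.
segments = [
    {"id": 1, "qe_score": 0.95},
    {"id": 2, "qe_score": 0.62},
    {"id": 3, "qe_score": 0.88},
    {"id": 4, "qe_score": 0.40},
]

AUTO_APPROVE_THRESHOLD = 0.85  # assumption: tuned per language / content type

needs_review = [s for s in segments if s["qe_score"] < AUTO_APPROVE_THRESHOLD]
auto_approved = [s for s in segments if s["qe_score"] >= AUTO_APPROVE_THRESHOLD]

savings = len(auto_approved) / len(segments)
print(f"{savings:.0%} of segments skip human review")
```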
This happened before in the translation industry when translation memory was invented. That’s the advantage I have now, of course, with my long history. I started translating on a typewriter with a magnetic card, so we could store one and a half pages on one card and then put a new card in.
When desktop computers came out, my company developed the first translation memory tools, designed to work on a desktop computer. That way translators and small agencies could improve their margins. But of course, customers found out and said, “Hey, we want this.” Then the first TMS systems came and centralized translation memory.
CSA Research says the localization industry is handling less than 1% of the content that’s out there in the world. If we make localization more efficient, you’ll get more content. The volume of your business will grow. But will it really? And will it grow fast enough to compensate for the loss in revenue when all your prices are going down and you give away all these benefits?
It was only natural to start with a segment-based system, but we do get requests from customers to provide a document-based/file-based system or batch processing. We’re working on that, and we’ll probably have that ready within the next cycle.
There will be a file-based option in EPIC. We need to adapt. We’re also working on a self-hosted language model so that we don’t need to send everything out to public services.
The pace of change is speeding up. You can’t look much further ahead than maybe one or two years at the moment. On our new website, we refer to EPIC as your AI quality companion – we’re already making it a bit broader, imagining that this technology at its core will be all about helping humans with scalability.
Well, I was going to say Dutch because we hardly ever speak Dutch anymore, even in our own team. We only have three Dutch people, although we’re in Amsterdam. But I like Italian, too.
Taos, New Mexico. I discovered it after someone recommended I organize an event there. I looked it up, and we traveled out to it. We held a foundational meeting of the TAUS data cloud there because it’s a remote place, amongst all the adobe houses in the middle of the desert next to the Rio Grande. It’s beautiful.
That’s a tough one. I never actually think of myself as a “localization” person.
What I say to myself is that if you’re on the service side in this industry, you can only be as good as your best customer. So that is one thing that I learned: get deep into your customer’s business perspective, see what you can learn from it, and copy it.
I think maybe I did that in the past with IBM and how I started in the business. And then again recently with Uber.
I think if you ask me to give any advice to the listeners of this podcast, it’s to know your customer and learn from them.