
The dynamic sentence creation anti-pattern

By Isofarro on March 5th, 2010 - One comment

The natural internationalisation stumbling block, particularly for technical people, is that localisation isn’t just about translating static text strings. Surprisingly, many developers and programmers fail to consider that sentences in one natural language cannot be simply translated one word at a time to another language.

Differences in grammar

Every human language has its own grammatical rules and style. The chance of a grammatical structure being the same across a range of natural languages is extremely low. So code written around a specific grammatical construct in one human language presents an impossible internationalisation barrier when it needs to be translated into another language.

There are only two approaches that could work here:

1. Write a custom replacement function per language
2. Throw away the code and try again

Drawbacks of language dependent code

Unfortunately option 1 – writing a custom replacement function per language – would require a developer to be involved every time a new language needs to be supported. And that developer needs to know this new language well enough to make the necessary changes or additions to get his function returning the grammatically correct output each time.

For every natural language that needs to be supported, the developer has to supply a new function. That function most likely already contains business logic. Adding a new language means duplicating, or reimplementing existing business logic to meet the grammatical structures of the new language.

So what happens when the business logic needs to be updated? Yep, the developer now has to update every single copy of the function. Each time checking the natural language syntax is correct. That means that the developer maintaining this piece of code has to be broadly familiar with every language his code supports. And that is totally unrealistic.

Developer zugzwang.
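
To make the duplication concrete, here is a minimal sketch of what option 1 leads to. Everything in it is an assumption for illustration: the function names, the minutesLeft helper, and the German wording (which has not been vetted by a translator). Note that the time check, the one piece of business logic, appears once per language.

```javascript
// Hypothetical helper: whole minutes between two millisecond timestamps.
function minutesLeft(now, then) {
	return Math.ceil((then - now) / 60000);
}

// English version: business logic and grammar are tangled together.
function getOpenStatusEn(market, now) {
	if (now < market.open.time) {
		return market.name + " open in " +
			minutesLeft(now, market.open.time) + " minutes";
	}
	return market.name + " open";
}

// German version (illustrative wording only): the same time check
// has to be written out, and maintained, a second time.
function getOpenStatusDe(market, now) {
	if (now < market.open.time) {
		return market.name + " öffnen in " +
			minutesLeft(now, market.open.time) + " Minuten";
	}
	return market.name + " geöffnet";
}

var now      = 0;
var opening  = { time: 20 * 60000 };
var marketEn = { name: "UK markets", open: opening };
var marketDe = { name: "UK-Märkte",  open: opening };

console.log(getOpenStatusEn(marketEn, now)); // UK markets open in 20 minutes
console.log(getOpenStatusDe(marketDe, now)); // UK-Märkte öffnen in 20 Minuten
```

A change to the rule for when the countdown is shown would have to be made in both functions, and in every further language added later.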

A real-world case

I ran into a piece of dynamic sentence generation code about two years ago when I was tasked with localising some “global-ready” JavaScript for use in Europe. I was assured all that needed to be done was to replace the static strings in the code with a reference to a JavaScript translations lookup.

Here is a simplified JavaScript pseudo-code version of the code I uncovered. (Simplified so we can focus on the internationalisation issues without getting bogged down in convoluted business logic.)


function getMarketStatus(market) {
	var message = market.name;
	var now     = new Date().getTime();

	if (market.open) {

		message += " open";

		if (market.open.early) {
			message += " early";

			if (market.reason) {
				message += " for " + market.reason;
			}

		} else if (market.open.late) {
			message += " late";

			if (market.reason) {
				message += " for " + market.reason;
			}
		}

		// Opening time is still in the future
		if (now < market.open.time) {
			message += " in " +
				formatTimeLeft(now, market.open.time);
		}

	} else if (market.close) {

		message += " close";

		if (market.close.early) {
			message += " early";

			if (market.reason) {
				message += " for " + market.reason;
			}

		} else if (market.close.late) {
			message += " late";

			if (market.reason) {
				message += " for " + market.reason;
			}
		}

		// Closing time is still in the future
		if (now < market.close.time) {
			message += " in " +
				formatTimeLeft(now, market.close.time);
		}

	} else {
		message += " closed";
	}

	return message;
}

This piece of code generates one sentence of text summarising the market status. Either the market is open or closed, opening soon or closing soon, maybe earlier or later than normal (perhaps for a specified reason).

Identifying the possibilities

The function returns one of the following patterns (variable data identified with curly braced place-holders):

{market} open
{market} open in {timePeriod}
{market} open early
{market} open early in {timePeriod}
{market} open early for {reason}
{market} open early for {reason} in {timePeriod}
{market} open late
{market} open late in {timePeriod}
{market} open late for {reason}
{market} open late for {reason} in {timePeriod}
{market} close
{market} close in {timePeriod}
{market} close early
{market} close early in {timePeriod}
{market} close early for {reason}
{market} close early for {reason} in {timePeriod}
{market} close late
{market} close late in {timePeriod}
{market} close late for {reason}
{market} close late for {reason} in {timePeriod}
{market} closed

Where:

{market} is the name of the market under scrutiny, e.g. UK markets
{timePeriod} is the time, in hours and minutes, before the market opens or closes
{reason} is the stated reason why a market opened or closed early or late.

The simple difficulty

That’s 21 different text strings. The likelihood of the word order remaining the same across different languages is close to zero.
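
The count can be checked mechanically. The sketch below is not part of the original code; enumerateCombinations is a hypothetical helper that walks the same branches as getMarketStatus (two states, an optional early or late qualifier, a reason only after a qualifier, an optional time period, plus the lone closed case) and collects every pattern.

```javascript
// Enumerate every sentence pattern the branching logic can produce.
function enumerateCombinations() {
	var combos  = [];
	var states  = ["open", "close"];
	var timings = ["", " early", " late"];

	states.forEach(function (state) {
		timings.forEach(function (timing) {
			// A reason is only ever appended after "early" or "late"
			var reasons = timing ? ["", " for {reason}"] : [""];

			reasons.forEach(function (reason) {
				["", " in {timePeriod}"].forEach(function (time) {
					combos.push("{market} " + state + timing + reason + time);
				});
			});
		});
	});

	// The final else branch
	combos.push("{market} closed");
	return combos;
}

var combos = enumerateCombinations();
console.log(combos.length); // 21
```

Each state contributes 2 + 4 + 4 = 10 patterns, so two states plus the closed case gives 21.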

The simple case is that the order of atomic elements works in English, but is unlikely to work consistently in other languages. For this piece of logic to be fit for internationalisation, that order cannot be assumed. A translator needs to be able to use the most appropriate and correct order for the targeted language and culture.

Moreover, the difficulties don’t end there.

The disguised change of meaning

Perhaps the most insidious feature of the above code is that adding in an extra word significantly changes the meaning of previous words. Take for example these two generated sentences:

UK markets open
UK markets open in 20 minutes

The first sentence is a declaration that the market is currently open. The second, however, is not; it states that the market will open after a defined period of time. So the sentence has changed from a present tense declaration to a future tense expectation.

The English grammar barely holds together in this change of tense, and it’s unlikely that more regular and refined languages could pull off this form of grammatical gymnastics.

Factor out the natural language

So how do we fix this? Rather simply, by avoiding constructing sentences a fragment at a time. Figure out which pieces of information are needed, and then look up the most appropriate translatable sentence that matches the information.

We refactor the code above in a two-step process:

1. Replace the sentence appending logic with something that keeps track of which bits of information need to be conveyed, and pick the most appropriate sentence.
2. Add the dynamic data into the sentence by means of token or place-holder replacement.

Step 1 requires a rethink of the business logic implementation. We have to keep track of what pieces of information we need to display, and from that pick the most appropriate sentence. An obvious way of doing this is to keep a translations hash with all the possible combinations, keyed in a way the code can compute.

I’ve done this the same way as the original code builds up a sentence, except I’m building up a lookup key. And the lookup key then maps to a full sentence. This level of abstraction divorces the actual sentence grammar from the business logic rather neatly. (This approach is analogous to bitwise logic, something familiar to most C developers.)
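
The bitwise analogy can be sketched as follows. This is purely an illustration of the analogy, not the approach used in this article, which builds string keys; the flag names, their values, and the buildKey helper are all assumptions.

```javascript
// Hypothetical flag values: each fact gets its own bit.
var OPEN = 1, CLOSE = 2, EARLY = 4, LATE = 8, REASON = 16, TIME = 32;

// Combine the facts that apply into a single numeric key.
function buildKey(facts) {
	var key = 0;
	if (facts.open)   { key |= OPEN; }
	if (facts.close)  { key |= CLOSE; }
	if (facts.early)  { key |= EARLY; }
	if (facts.late)   { key |= LATE; }
	if (facts.reason) { key |= REASON; }
	if (facts.time)   { key |= TIME; }
	return key;
}

// Each combination of flags maps to one complete sentence.
var SENTENCES = {};
SENTENCES[OPEN]                  = "{market} open";
SENTENCES[OPEN | TIME]           = "{market} open in {timePeriod}";
SENTENCES[OPEN | EARLY | REASON] = "{market} open early for {reason}";

var key = buildKey({ open: true, early: true, reason: true });
console.log(SENTENCES[key]); // {market} open early for {reason}
```

In both forms the business logic only decides which facts apply; the wording lives entirely in the lookup table.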

After that, it’s a simple case of retrieving the correct sentence, and replacing any data place-holders with the actual information.
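
Place-holder replacement needs very little code. The refactored function uses YAHOO.lang.substitute from the YUI 2 library for this; the sketch below is a hypothetical plain JavaScript stand-in for projects without YUI, and its behaviour for unknown tokens (leaving them in place) is a design assumption rather than a description of the YUI function.

```javascript
// Replace each {token} with the matching property of `data`.
// Unknown tokens are left untouched so missing data is easy to spot.
function substitute(template, data) {
	return template.replace(/\{(\w+)\}/g, function (match, name) {
		return Object.prototype.hasOwnProperty.call(data, name)
			? String(data[name])
			: match;
	});
}

var text = substitute(
	"{market} open in {timePeriod}",
	{ market: "UK markets", timePeriod: "20 minutes" }
);
console.log(text); // UK markets open in 20 minutes
```

Because the translator controls the whole template, the place-holders can appear in whatever order the target language requires.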

After these refactoring steps the code looks like this:


function getMarketStatus(market) {
	var status;
	var now     = new Date().getTime();

	// Collect the pertinent pieces of information
	// So we can pick the right translation string
	var message = {
		market:   market.name,
		sentence: ""
	};

	if (market.open) {
		status = market.open;
		message.sentence = "O";

	} else if (market.close) {
		status = market.close;
		message.sentence = "C";

	} else {
		message.sentence = "X";

	}

	if (status) {

		// Make a note of early/late status
		if (status.early) {
			message.sentence += "E";

		} else if (status.late) {
			message.sentence += "L";

		}

		// Make a note of any reason. As in the original code,
		// a reason only applies to an early or late change.
		if ((status.early || status.late) && market.reason) {
			message.reason    = market.reason;
			message.sentence += "R";

		}

		// Make a note of any time period still to run.
		// Use status.time so this works for closing as well as opening.
		if (now < status.time) {
			message.timePeriod =
				formatTimeLeft(now, status.time);
			message.sentence += "T";

		}
	}

	// Pick the right sentence to display
	var sentence = TRANSLATIONS[message.sentence];

	// Replace dynamic data
	return YAHOO.lang.substitute(
		sentence, message
	);
}

// Mapping each combination into a sentence
var TRANSLATIONS = {
	O:    "{market} open",
	OT:   "{market} open in {timePeriod}",
	OE:   "{market} open early",
	OET:  "{market} open early in {timePeriod}",
	OER:  "{market} open early for {reason}",
	OERT: "{market} open early for {reason} in {timePeriod}",
	OL:   "{market} open late",
	OLT:  "{market} open late in {timePeriod}",
	OLR:  "{market} open late for {reason}",
	OLRT: "{market} open late for {reason} in {timePeriod}",
	C:    "{market} close",
	CT:   "{market} close in {timePeriod}",
	CE:   "{market} close early",
	CET:  "{market} close early in {timePeriod}",
	CER:  "{market} close early for {reason}",
	CERT: "{market} close early for {reason} in {timePeriod}",
	CL:   "{market} close late",
	CLT:  "{market} close late in {timePeriod}",
	CLR:  "{market} close late for {reason}",
	CLRT: "{market} close late for {reason} in {timePeriod}",
	X:    "{market} closed"
};

Once we move the TRANSLATIONS object to a language-specific file we can then allow translators to translate the sentences to the targeted language.
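
One way this could look once there is more than one language is sketched below. The TRANSLATIONS_BY_LOCALE structure, the getSentence helper, and the fall-back to English are assumptions for illustration, and the German sentence is illustrative wording that has not been vetted by a translator. In practice each table would sit in its own language-specific file.

```javascript
// Hypothetical per-locale tables, shown together here for brevity.
var TRANSLATIONS_BY_LOCALE = {
	"en": {
		OER: "{market} open early for {reason}",
		X:   "{market} closed"
	},
	// Illustrative German wording only. Note that the reason
	// now comes before the word for "early".
	"de": {
		OER: "{market} öffnen wegen {reason} früher"
	}
};

// Look up a sentence, falling back to English when the
// locale is unknown or has no entry for this key.
function getSentence(locale, key) {
	var table = TRANSLATIONS_BY_LOCALE[locale] || {};
	return table[key] || TRANSLATIONS_BY_LOCALE.en[key];
}

console.log(getSentence("de", "OER")); // {market} öffnen wegen {reason} früher
console.log(getSentence("de", "X"));   // {market} closed
```

The business logic never changes: it still produces the key OER, and each language decides how that combination of facts is phrased.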

This technique is even flexible enough to tackle irregular grammar. When the market.close object exists but carries no useful data, {market} close could easily be corrected to {market} will close soon.

The same flexibility also makes handling the {reason} wildcard a little easier, if not perfect.

So we avoid the need to create dynamic sentences on the fly, and instead focus on what pieces of information we need to share. And then offer the translator sufficient flexibility, through the use of replaceable tokens, to set the most appropriate translation for the targeted language.

Back in the real world…

My refactored solution didn’t go live. Instead, the feature it powered was descoped in Europe. That was the cost of using this particular anti-pattern.


One Response to “The dynamic sentence creation anti-pattern”

1 Dirk Ginader says: March 20th, 2010 at 1:18 am

This is a brilliant solution that is solid and easy to adopt. Instant best practice right there. Thanks for sharing.