
Internationalisation Gotchas

By Isofarro on March 29th, 2010

The amount of work required to internationalise a website is woefully underestimated, and sometimes, the codebase is compromised to the extent that many features or capabilities are impossible to deliver with the existing code. In many cases, complete ground-up rewrites are needed to separate cleanly the business logic from the localisation requirements of countries.

Mostly this requires developers and engineers of platforms to be aware of the internationalisation implications of their implementation choices. What may seem like an engineering best practice could very well be a massive barrier for localising sites to various countries.

Two major internationalisation steps

The common explanation of the work required is covered by two basic proclamations:

Everything must be in UTF-8
Static text strings must be replaced by translation references

Yes, UTF-8 is essential to working in many countries, especially in Asian markets and non-Western European countries. Even today, new systems freshly built are failing on this essential step.

Being able to specify translations for text strings is essential for any site that needs more than one language. But does this even cover 80% of the work required to internationalise a site?

Head-first internationalisation

In my nine-month stint making a “global-ready” Finance codebase barely functional in Europe, I encountered a number of internationalisation issues that should have already been dealt with, but hadn’t.

From the outset, it seemed fairly straightforward: take the new quotes system code, which runs both the US Finance site and the Canadian Finance site, and launch it in Europe. The UTF-8 work was supposedly already done, so I just needed to extract the text strings and replace them with translation lookups.

It took three months to get one page live in one non-English country. We lost a number of features on the way, and the end result is sub-par in a number of respects, including regulatory requirements. Not a resounding success, and a tough lesson in how internationalisation can hurt when it isn’t considered properly in building global-ready platforms.

Complex translations

Translating static text sentences and phrases is bog standard. However, sometimes we need this static text to contain variable information (for example, the text string “You are on page 1 of 5”). It would be madness to force a translation of every single combination of page number and total number of pages, so we use token replacement to solve this tiny problem.

Be careful with token replacement when a string contains more than one token. Don’t rely on the order of the tokens to match them to their values. Take a simple sprintf example: sprintf("Welcome, %s of %s", $title, $land); what happens when, in one cultural locale, $land needs to appear before $title? Use named replacements instead. If your chosen programming language doesn’t allow named token replacements, then it would be wise to change your programming language to one that does.
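As a minimal sketch of named replacement, here is the same welcome message in Python. The TRANSLATIONS catalogue, the "xx" locale and the welcome function are made up for illustration; the point is only that each translation decides where its tokens go.

```python
# Hypothetical translation catalogue; the locale codes and strings are illustrative only.
TRANSLATIONS = {
    "en": "Welcome, {title} of {land}",
    # A made-up locale whose grammar puts the land before the title.
    "xx": "{land}: welcome, {title}",
}


def welcome(locale: str, title: str, land: str) -> str:
    """Fill a translated template by token name, so translators can reorder tokens freely."""
    return TRANSLATIONS[locale].format(title=title, land=land)


print(welcome("en", "Duke", "York"))  # Welcome, Duke of York
print(welcome("xx", "Duke", "York"))  # York: welcome, Duke
```

The calling code never changes when a new locale needs a different word order; only the translation string does.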

Sometimes developers and engineers get too clever and write code that builds a sentence one word at a time, depending on a plethora of business logic. This becomes unlocalisable, as grammar constructs vary across languages. I call this the dynamic sentence creation anti-pattern, and show a safer method of accomplishing this.
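One possible shape for a safer approach, sketched in Python: let the business logic pick a key, and give translators one complete sentence per key. The message keys, the MESSAGES catalogue and the price_summary function are assumptions made for this example, not taken from any real codebase.

```python
# Hypothetical catalogue: one whole sentence per outcome, per locale.
MESSAGES = {
    "en": {
        "price.up": "{symbol} rose {percent} today",
        "price.down": "{symbol} fell {percent} today",
        "price.flat": "{symbol} was unchanged today",
    },
    "de": {
        "price.up": "{symbol} ist heute um {percent} gestiegen",
        "price.down": "{symbol} ist heute um {percent} gefallen",
        "price.flat": "{symbol} blieb heute unverändert",
    },
}


def price_summary(locale: str, symbol: str, change: float, percent_text: str) -> str:
    """Choose a sentence key from the data, then fill in a fully translated sentence.

    percent_text is assumed to be already formatted for the locale.
    """
    if change > 0:
        key = "price.up"
    elif change < 0:
        key = "price.down"
    else:
        key = "price.flat"
    # Unused named tokens (percent in the "flat" sentence) are simply ignored.
    return MESSAGES[locale][key].format(symbol=symbol, percent=percent_text)


print(price_summary("en", "AAPL", 2.5, "2.5%"))   # AAPL rose 2.5% today
print(price_summary("de", "AAPL", -2.5, "2,5%"))  # AAPL ist heute um 2,5% gefallen
```

The code decides which sentence applies, but never how the sentence is worded or ordered.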

Other forms of token replacement

Formatting currencies is an interesting variant of token replacement. The location of the currency symbol is either before or after the currency amount. But some currencies have tokens in the middle, too.
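Treating the currency layout as just another translatable pattern covers all three positions. The sketch below is in Python; the pattern names and the format_money function are hypothetical, and it only handles non-negative amounts given in minor units.

```python
# Hypothetical layout patterns; a real system would store one per locale and currency.
CURRENCY_PATTERNS = {
    "symbol-first": "{symbol}{major}.{minor}",
    "symbol-last": "{major},{minor} {symbol}",
    "symbol-middle": "{major}{symbol}{minor}",
}


def format_money(pattern_name: str, minor_units: int, symbol: str) -> str:
    """Format a non-negative amount held in minor units (for example cents)."""
    major, minor = divmod(minor_units, 100)
    return CURRENCY_PATTERNS[pattern_name].format(
        symbol=symbol, major=major, minor=f"{minor:02d}"
    )


print(format_money("symbol-first", 1250, "$"))   # $12.50
print(format_money("symbol-last", 1250, "€"))    # 12,50 €
print(format_money("symbol-middle", 250, "$"))   # 2$50
```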

Plural forms also catch people out. In some languages zero is singular; in others it is plural. Even choosing the plural form of a word isn’t just a case of checking whether there is one or more than one. Polish, for example, uses one form for 1, another for numbers ending in 2, 3 or 4 (except 12, 13 and 14), and a third for everything else. Some frameworks get this right, Symfony for example.
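To make the Polish case concrete, here is a small Python sketch of per-language plural selection for whole numbers. The function names and the FORMS catalogue are assumptions for illustration; a production system would normally lean on its framework's plural rules instead.

```python
def plural_index_en(n: int) -> int:
    """English: one form for exactly 1, another for everything else (including 0)."""
    return 0 if n == 1 else 1


def plural_index_pl(n: int) -> int:
    """Polish, whole numbers only: 1 / ends in 2-4 but not 12-14 / everything else."""
    if n == 1:
        return 0
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return 1
    return 2


# Hypothetical catalogue: the translator supplies one string per plural form.
FORMS = {
    "en": (plural_index_en, ["{n} file", "{n} files"]),
    "pl": (plural_index_pl, ["{n} plik", "{n} pliki", "{n} plików"]),
}


def pluralise(locale: str, n: int) -> str:
    pick, forms = FORMS[locale]
    return forms[pick(n)].format(n=n)


for count in (0, 1, 2, 5, 12, 22):
    print(pluralise("en", count), "|", pluralise("pl", count))
```

Note that the number of forms differs per language, so a two-slot singular/plural translation entry cannot hold the Polish strings.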

Then come ordinal indicators (1st, 2nd, 3rd …); the English pattern is fairly regular, but in Czech it takes a bit of thought to calculate the right indicator algorithmically.
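Even the fairly regular English pattern has exceptions (11th, 12th, 13th), which is a hint that ordinals need a function per language rather than a translated suffix. A minimal Python sketch for English only, assuming non-negative whole numbers; english_ordinal is a name invented for this example:

```python
def english_ordinal(n: int) -> str:
    """Return n with its English ordinal indicator, e.g. 1 -> '1st'."""
    if 11 <= n % 100 <= 13:
        # 11, 12, 13 (and 111, 112, ...) break the st/nd/rd pattern.
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


print([english_ordinal(n) for n in (1, 2, 3, 4, 11, 12, 13, 21, 101, 111)])
```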

Formatting numbers

This is one of those obvious matters. Numbers should be formatted using the preferred cultural approaches, using the appropriate separators. Piece of cake. Run all numbers through a locale-aware formatter, and you’re done.

Simple number formatting for countries
United States	1,200.50
United Kingdom	1,200.50
Germany	1.200,50
France	1 200,50

One approach is a translation string for the thousands separator, and another for the decimal point character. Except that if your translation system strips leading spaces out of text (because of sloppy XML imports), you run into a nasty problem: the French thousands separator becomes an empty string. You need to be able to trust your translation system.
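The separator approach, and the way whitespace trimming breaks it, can be sketched in a few lines of Python. The SEPARATORS table simply restates the examples above; the format_number function and the locale codes are assumptions for illustration.

```python
# (thousands separator, decimal character), restating the table above.
SEPARATORS = {
    "en-US": (",", "."),
    "en-GB": (",", "."),
    "de-DE": (".", ","),
    "fr-FR": (" ", ","),
}


def format_number(value: float, group: str, decimal: str, decimals: int = 2) -> str:
    """Format with Python's fixed ',' and '.', then swap in the locale's characters."""
    text = f"{value:,.{decimals}f}"
    whole, _, fraction = text.partition(".")
    whole = whole.replace(",", group)
    return (whole + decimal + fraction) if fraction else whole


print(format_number(1200.5, *SEPARATORS["en-US"]))  # 1,200.50
print(format_number(1200.5, *SEPARATORS["de-DE"]))  # 1.200,50
print(format_number(1200.5, *SEPARATORS["fr-FR"]))  # 1 200,50

# What a whitespace-stripping import would have stored for French:
trimmed_group = SEPARATORS["fr-FR"][0].strip()      # now an empty string
print(format_number(1200.5, trimmed_group, ","))    # 1200,50
```

The last line produces no error, just quietly wrong output, which is why this kind of bug tends to reach production.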

In financial reporting there’s lots of data and lots of long numbers, so to squeeze as much information as possible onto the page we need to shorten long numbers. Apple’s Market Cap of $209,380,000,000, for example, is shortened to $209.38B in the US, which just about fits into the space available.

But, for France, this is a little more tricky for two reasons:

There isn’t a short form of Billions in France. The closest is ‘Md’, which means thousand million.
The French Finance industry prefers a space between the number and its suffix.

So to display Apple’s Market Cap in France we would print it as $209,38 Md. Again, if your translation system likes trimming off leading spaces, the French translation string for the shortened form of Billions is then incorrect.
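Here is the same shortening as a small Python sketch. The SHORT_FORMS table and the shorten_billions function are hypothetical; the suffix strings stand in for translation entries, so the French one carries its leading space.

```python
# Hypothetical per-locale rules; the suffix plays the role of a translation string.
SHORT_FORMS = {
    "en-US": {"decimal": ".", "billions_suffix": "B"},
    "fr-FR": {"decimal": ",", "billions_suffix": " Md"},
}


def shorten_billions(amount: int, locale: str, symbol: str = "$") -> str:
    """Shorten an amount to billions with two decimals, using the locale's rules."""
    rules = SHORT_FORMS[locale]
    number = f"{amount / 1_000_000_000:.2f}".replace(".", rules["decimal"])
    return f"{symbol}{number}{rules['billions_suffix']}"


market_cap = 209_380_000_000
print(shorten_billions(market_cap, "en-US"))  # $209.38B
print(shorten_billions(market_cap, "fr-FR"))  # $209,38 Md

# If the translation system trims the entry, the required space is lost.
print(repr(SHORT_FORMS["fr-FR"]["billions_suffix"].strip()))  # 'Md'
```

One way to avoid depending on significant whitespace in a translation string would be to store "space before suffix" as a separate setting, though that is a design choice rather than anything the original system did.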

Calculations

Numbers are numbers. Until they are formatted for your chosen locale. Then they are just strings. Adding up strings does one of three things:

Addition is overloaded and the strings are concatenated together. So 1 + 1 equals 11.
The strings, not being clean numbers, are parsed up to the first non-numeric character and the rest is dropped. So 3,500 + 4,500 equals 7.
The strings, not being clean numbers, are evaluated to zero. So 3,500 + 4,500 equals 0.

None of these does what you expect. Why would anyone want to add up formatted numbers? One feature of a finance site is tracking the value of your own share portfolio. As you add a new share to your portfolio and enter the number of shares you own, that number is multiplied by the current share price and added to the other share valuations to give you a portfolio value.

Then periodically, when the share prices tick up or down, your portfolio valuation changes accordingly. Unfortunately, when you use the DOM as the source of your data, and those numbers are locale-formatted, doing arithmetic on them produces incorrect values, because the underlying assumptions are broken.

As soon as you use the DOM as a data source, you are at the mercy of localisation formatting of the data. Either you need to unlocalise the data and get back the raw numbers, or you need to store the raw data somewhere else. The third possibility is to drop the feature.
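The first option, unlocalising the data, might look roughly like the Python sketch below. It assumes you know exactly which separators were used to format the text; the SEPARATORS table and parse_localised function are invented for this example, and the same idea applies to script running in the browser.

```python
from decimal import Decimal

# (thousands separator, decimal character) used when the text was formatted.
SEPARATORS = {
    "en-US": (",", "."),
    "de-DE": (".", ","),
    "fr-FR": (" ", ","),
}


def parse_localised(text: str, locale: str) -> Decimal:
    """Turn a locale-formatted number back into a raw number."""
    group, decimal = SEPARATORS[locale]
    # Drop the grouping character first, then normalise the decimal character.
    cleaned = text.replace(group, "").replace(decimal, ".")
    return Decimal(cleaned)


# Adding the formatted strings directly just concatenates them.
print("3,500" + "4,500")  # 3,5004,500

# Parsing first gives the expected total, whatever the display locale.
print(parse_localised("3,500", "en-US") + parse_localised("4,500", "en-US"))    # 8000
print(parse_localised("3.500,25", "de-DE") + parse_localised("4 500,25", "fr-FR"))  # 8000.50
```

This only works if the parser and the formatter agree on the separators, which is one reason keeping the raw value somewhere alongside the displayed text is usually the less fragile option.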

Regulations

If your site is focused on a particular industry, you need to be well aware of the regulations of that industry in the countries you choose to support. If a country specifies that all financial information is displayed to at least four decimal places, then it isn’t a great idea to base your entire system on the assumption that producing data to two decimal places will be sufficient.

Be aware of the implications of regulation in various countries, especially regulation aimed at how information is presented. This is still an important part of an internationalisable platform or site.
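A small Python sketch of why the two-decimal assumption hurts: if precision is a per-market display setting applied to full-precision data, a four-decimal rule is a configuration change; if the data was already rounded upstream, the missing digits cannot be recovered. The market names and decimal counts below are made up for illustration and do not describe any real regulation.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical display rules; "market-a" stands in for a country requiring four decimals.
DISPLAY_DECIMALS = {"default": 2, "market-a": 4}


def display_price(value: Decimal, market: str) -> str:
    """Round only at display time, to the number of places the market requires."""
    places = DISPLAY_DECIMALS.get(market, DISPLAY_DECIMALS["default"])
    quantum = Decimal(1).scaleb(-places)  # e.g. 0.0001 for four places
    return str(value.quantize(quantum, rounding=ROUND_HALF_UP))


raw = Decimal("12.34567")
print(display_price(raw, "market-a"))   # 12.3457
print(display_price(raw, "elsewhere"))  # 12.35

# Data that was rounded to two places upstream can only be padded, not restored.
already_rounded = Decimal("12.35")
print(display_price(already_rounded, "market-a"))  # 12.3500
```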

Assumptions of timezones

Time calculations are impossible if you do not know what timezone your timestamps are in. They are still fraught with difficulty when the timezones are known, and are different.

Mostly, if you know the timezones, you can easily offset the hours. Except when the times coincide with Daylight Saving Time adjustments. Not all countries move their clocks forward or back on the same day. That’s why we have dedicated libraries on the server to deal with these calculations, and why anything time and date related should be calculated there rather than left to the developer to do in the browser.
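To illustrate the mismatch, here is a short Python sketch using the standard zoneinfo module (Python 3.9 or later, and it needs the IANA timezone database to be available on the machine). The hours_ahead function is a name invented for this example. In 2010 the US moved its clocks forward on March 14th and the UK on March 28th, so for two weeks the usual five-hour gap between London and New York was four.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def hours_ahead(moment_utc: datetime, zone_a: str, zone_b: str) -> float:
    """How many hours zone_a's clocks are ahead of zone_b's at a given instant."""
    offset_a = moment_utc.astimezone(ZoneInfo(zone_a)).utcoffset()
    offset_b = moment_utc.astimezone(ZoneInfo(zone_b)).utcoffset()
    return (offset_a - offset_b).total_seconds() / 3600


for day in (datetime(2010, 3, 1, 12, tzinfo=timezone.utc),
            datetime(2010, 3, 20, 12, tzinfo=timezone.utc),
            datetime(2010, 4, 1, 12, tzinfo=timezone.utc)):
    print(day.date(), hours_ahead(day, "Europe/London", "America/New_York"))
# 2010-03-01 5.0
# 2010-03-20 4.0
# 2010-04-01 5.0
```

A hard-coded "London is New York plus five" would be wrong for every timestamp in that two-week window.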

Conclusion

The last thing you want to do for an internationalised site is to turn off features that cannot be localised, or knowingly produce something that lands in a legal quagmire of regulation. Unfortunately many of the above problems are directly related to taking a code base built for one country and adapting it as is to an international market. Certain aspects became uninternationalisable because of compromises on the server-side, and in the architecture.

Internationalisation of a code base involves cleanly separating the default locale assumptions from the business logic. This step is paramount in enabling the localisation of the platform to other countries. Skip this step at your peril, because the next step of localising it will be a painful and frustrating experience.

Filed in Translations Tagged with abbreviations, calculations, currencies, date, DOM, formatting, JavaScript, numbers, regulation, text, time