PikeDevel/I18n

Pike provides some useful tools to ease internationalization (i18n). Unfortunately, they're not terribly well documented. This page attempts to remedy this shortcoming.

Pike supports wide strings natively, both in its string datatype and in Pike source code. Internally, a string is stored with 8, 16, or 32 bits per character, whichever is the narrowest width that fits every character in it; this is the value String.width() reports. Almost all Pike functions that deal with strings are wide-string aware. When using regular expressions, note that Regexp.SimpleRegexp is not wide-string aware. Regexp.PCRE is wide-string aware if the PCRE library was compiled with UTF-8 support. Note that this implies a conversion from Pike's string representation to UTF-8, so performance may be negatively impacted.

A few builtin functions are available for inspecting the width of a string and for converting to and from common wide string encodings:

  String.width();
  utf8_to_string();
  string_to_utf8();
  utf16_to_string();
  string_to_utf16();
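
As a quick illustration of the first three functions, here is a minimal sketch of a UTF-8 round trip. The sample character (the euro sign, U+20AC) is just an arbitrary choice for the example.

```pike
int main()
{
  // Build a one-character string holding U+20AC (euro sign).
  string wide = sprintf("%c", 0x20ac);

  // String.width() reports the bits per character the string needs.
  write("width of wide string: %d\n", String.width(wide));

  // Encode to UTF-8; the result is an ordinary 8-bit string of bytes.
  string bytes = string_to_utf8(wide);
  write("UTF-8 length: %d, width: %d\n", sizeof(bytes), String.width(bytes));

  // Decoding the bytes restores the original wide string.
  write("round trip ok: %d\n", utf8_to_string(bytes) == wide);
  return 0;
}
```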

Additionally, the Locale.Charset module provides encoders and decoders for a vast array (nearly 400 at last count) of alternate character set encodings.
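
The following sketch shows how an encoder and decoder pair from Locale.Charset might be used. It assumes the "iso-8859-15" charset name is available in your Pike build; in newer Pike releases the same interface is, as far as we know, provided by a top-level Charset module instead, so adjust the module name if Locale.Charset does not resolve.

```pike
int main()
{
  // A string containing the euro sign, which iso-8859-15 can represent.
  string text = sprintf("100 %c", 0x20ac);

  // feed() hands data to the encoder, drain() returns the encoded bytes.
  string encoded = Locale.Charset.encoder("iso-8859-15")->feed(text)->drain();

  // The decoder works the same way in the opposite direction.
  string decoded = Locale.Charset.decoder("iso-8859-15")->feed(encoded)->drain();

  write("encoded width: %d\n", String.width(encoded));
  write("round trip ok: %d\n", decoded == text);
  return 0;
}
```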

Localization using Gettext

Pike provides two options for handling i18n: Gettext and a native Pike translation system. Gettext is located within the Locale module and requires that the iconv and gettext libraries be present at compile time. The Gettext module may be desirable in command-line and graphical applications where a single locale is used for the duration of program execution. Additionally, because the popular gettext library is used, a number of ancillary tools are available to make translation easier.
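
For the single-locale case, a rough sketch of the Gettext approach might look like the following. The text domain "myapp" and the catalog directory are made up for the example, and the program only works if Pike was built with Gettext support and a compiled message catalog exists for the user's locale.

```pike
int main()
{
  // Pick up the locale from the environment (LANG, LC_ALL and friends).
  Locale.Gettext.setlocale(Locale.Gettext.LC_ALL, "");

  // "myapp" and the directory below are hypothetical; point these at
  // wherever your compiled .mo catalogs are installed.
  Locale.Gettext.bindtextdomain("myapp", "/usr/share/locale");
  Locale.Gettext.textdomain("myapp");

  // gettext() returns the translation, or the original string if none exists.
  write("%s\n", Locale.Gettext.gettext("Hello, world!"));
  return 0;
}
```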

Gettext has a number of drawbacks, including the dependence on one or more external libraries. Gettext is also not suitable for scenarios where the locale may change during execution, such as web applications where each request may require a different locale. In situations where Gettext is not usable, we can use the i18n tools available within the Locale module.

Localization using Locale

The i18n support in the Locale module is very flexible; it enables strings to be localized dynamically, making the localization of web applications possible.

This module provides the ability to split all of the translated strings into one or more "projects". A project is a set of translated strings provided together in a set of files, one for each language. A simple application may use one project, while a more complex or modular application may use many projects. Normally, the minimum project size is one Pike class file, though it is possible to have more than one project per class.

Pike provides some simple tools for assisting in string extraction, such as "pike -x extract_locale", which we'll explore in more detail later.

The heart of the whole approach is the use of Locale.translate(). This function is called each time a localized string is to be used, as in the following example:

  return Locale.translate("MyProject", "en", 21, "Some string");

As you can see, Locale.translate() takes four arguments: the project name, the language, the string identifier and a fallback value to use in the event the string isn't available in the desired language.
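
To see the fallback argument at work, consider this small sketch. It deliberately registers no project, so there is nothing for the Locale module to look up and we would expect the fallback text to come back unchanged.

```pike
int main()
{
  // No project named "MyProject" has been registered in this program,
  // so the lookup cannot succeed and the fourth argument is the
  // value we expect to get back.
  string text = Locale.translate("MyProject", "eng", 21, "Some string");
  write("%s\n", text);
  return 0;
}
```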

Obviously, this isn't a very useful approach, as we've hardcoded the language parameter. Normally we'd replace the second argument with a call to some function that will tell us what the desired language is. As an example, the Fins framework provides a function called "get_lang()" located in the Request object.

  return Locale.translate("MyProject", id->get_lang(), 21, "Some string");

That's a little more useful, but it's rather cumbersome to write all of this every time we want to use a localized string. Let's write a macro to simplify matters:

  #define LOCALE(X, Y) Locale.translate("MyProject", id->get_lang(), X, Y)
  return LOCALE(21, "Some string");

That's much better. Of course, we'd need to make sure that "id" is available everywhere we use this macro, and in applications that use more than one localization project we'd also need to decide how the macro gets its project name.
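
Because the macro expands textually, "id" must name a variable that is in scope wherever LOCALE is used. The sketch below makes that concrete; FakeRequest is a hypothetical stand-in for a framework request object and is not part of Fins or Pike.

```pike
// Hypothetical stand-in for a framework request object.
class FakeRequest
{
  string get_lang() { return "eng"; }
}

#define LOCALE(X, Y) Locale.translate("MyProject", id->get_lang(), X, Y)

// The parameter must be called "id" because the macro refers to it by name.
string greeting(FakeRequest id)
{
  return LOCALE(21, "Some string");
}

int main()
{
  write("%s\n", greeting(FakeRequest()));
  return 0;
}
```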

That's all fine and well, but how do we tell the Locale module what projects and translations are available? Well, the first task is to extract the strings that need to be translated from your code.

Note that the following description considers only one possible configuration; there is a certain amount of flexibility in how you arrange the extracted language files in a directory hierarchy. That said, this convention is employed successfully by Roxen and other applications, so it makes sense to follow it unless there are reasons not to.

In each file from which you wish to extract strings, you need to insert a snippet that tells the string extractor how to find strings, and where to put them. This snippet looks like:

  // <locale-token project="MyProject">LOCALE</locale-token>

Normally, you'd place this near the top of your file. The string extractor will look for this snippet and configure itself accordingly. As you might guess, the value of the project attribute is the name of the localization project you wish to place the translated strings in. The value of the locale-token element is the name of the "mock function" we defined using a macro in the previous step.

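// <locale-token project="MyProject">LOCALE</locale-token>
// The language is hard-coded here only so that this standalone file compiles;
// there is no request object "id" available inside main().
#define LOCALE(X, Y) Locale.translate("MyProject", "swe", X, Y)
int main()
{
  werror(LOCALE(0, "Hello, World!"));
  werror(LOCALE(0, "I'm sorry, Dave, I'm afraid I can't do that."));
  return 0;
}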

In order to perform extraction, the extract_locale tool needs to know what files to search, where to put the extracted strings, and so on. The easiest way to do this is to create a configuration file for your localization project. This configuration file looks like this:

<?xml version="1.0" encoding="iso-8859-1"?>
<project name="MyProject">
  <nocopy />
  <baselang>eng</baselang>
  <xmlpath>translations/%L/MyProject.xml</xmlpath>
  <file>fileA.pike</file>
  <file>fileB.pike</file>
</project>

This configuration file tells the string extractor the name of the project, the base language to use for generating extracted strings, where to store the extracted strings, and what files to search. Note the "%L" in the xml path. That's used to identify the language of a given translation. Therefore, in our example, the English translation file for MyProject would be in translations/eng/MyProject.xml, Swedish in translations/swe/MyProject.xml and so forth. Make sure that the path to the base language directory exists before running the extractor.

Let's imagine that we're using the sample configuration above, and that translations/eng exists, as well as the two files specified in the configuration. We should get something similar to this:

$ pike -x extract_locale --config=MyProject.xml
Reading config for project "MyProject" in MyProject.xml
Reading fileA.pike, parsing… (2 localizations)
Reading fileB.pike, parsing… (2 localizations)
Writing [eng] MyProject.xml… (4 ids)

When you run this command, the output reports how many localized strings the extractor found in each file and that it is generating a translation file. The extractor also assigns an identifier to each string and writes out source that includes these identifiers. Our example configuration uses the nocopy element to tell the extractor to overwrite the existing Pike source files, but it makes sense to do a dry run first to make sure nothing gets fouled up. Note that we used id values of zero (0) initially. Those get reassigned by the string extractor so that there are no duplicate identifiers in a given project.

When you extract files initially, only the base language file will exist. To create additional translations, simply make copies and name them following the same pattern, for example translations/lang/MyProject.xml (where lang is the three-character ISO 639-2 code for your language). It makes sense to store the translations for a given project in a directory created specifically for this purpose.

Information for each string that should be translated is kept within a <str id="foo">...</str> container. The id is unique for a certain string.

The original string (as it appears in the source) is kept inside the <str>-container, in an <o>...</o> element. Do not change this: it is used to detect whether the original string has changed since the translation was done.

Additionally, there can be <new/>-markers (useful for searching for untranslated strings), and <changed from="...">-tags (if the original string has changed so a translator can verify that the translation still is correct).

Once you've generated the translation files, the Locale module needs to be told where to find the translations for your project. We can use Locale.register_project() and Locale.set_default_project_path() for this:

  Locale.register_project("MyProject", "translations/%L/MyProject.xml");

  Locale.set_default_project_path("/path/to/translations/%L/%P.xml");

Note the use of "%L" and "%P": the language being requested will be substituted for "%L" and the project name will be substituted for "%P" when looking for the translation file.

One of these two methods should be called before your code encounters a string that needs to be translated. The following is an example of our "fileA.pike". Note that the paths are relative to the current working directory; in real life you'd probably want to consider whether that's appropriate for your application.

// <locale-token project="MyProject">LOCALE</locale-token>
// note that in this case, we've hard-coded the language.
// that's probably not appropriate for actual real-world use.
#define LOCALE(X, Y) Locale.translate("MyProject", "swe", X, Y)
int main()
{
  Locale.register_project("MyProject", "translations/%L/MyProject.xml");
  werror(LOCALE(1, "Hello, world!"));
  werror(LOCALE(2, "I'm sorry, Dave, I'm afraid I can't do that."));
  return 0;
}
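
If a path relative to the current working directory isn't appropriate, one option is to anchor the translation path to the directory of the source file instead. This is only a sketch of that idea; it assumes the translations directory sits next to the Pike file, which may not match your layout.

```pike
// As above, the language is hard-coded purely for the example.
#define LOCALE(X, Y) Locale.translate("MyProject", "swe", X, Y)

int main()
{
  // Directory containing this source file, made absolute if necessary.
  string here = combine_path(getcwd(), dirname(__FILE__));

  // "%L" is left in place; the Locale module fills it in per language.
  Locale.register_project("MyProject",
                          combine_path(here, "translations/%L/MyProject.xml"));

  werror("%s\n", LOCALE(1, "Hello, world!"));
  return 0;
}
```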

Strings generated by an application often include dynamically inserted values. Normally, an application would use string concatenation or sprintf() to perform this task, as in the following example:

  sprintf("My name is %s and I enjoy %s wine.", "John Doe", "red");

Now, if we were to translate "My name is %s and I enjoy %s wine." into other languages, the order of the replacements may vary from language to language. Depending on the language, the user's name ("John Doe") may not be the first string replaced. In order to accommodate this, we need a way to identify a specific replacement within our string by means other than its position.

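A few options exist for handling this, depending on how you prefer to approach the problem. Pike's sprintf() function supports non-ordered substitutions, where "%[n]s" selects argument number n (counting from zero) without affecting which argument the next plain "%s" consumes, as in the following example: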

  sprintf("%[2]s %s %[2]s %[0]s/%s %s %[2]s",
        "and", "or", "Hello!");

Hello! and Hello! and/or Hello! Hello!

Additionally, the Fins web framework provides a function called Tools.String.named_sprintf() which can be very helpful:

  Tools.String.named_sprintf("My name is %{name} and I enjoy %{color} wine.",
        (["name": "John Doe", "color": "red"]));

As you can see, named_sprintf() uses a name to identify the appropriate replacement value to use from a given dataset. By using Tools.String.named_sprintf(), we're no longer dependent on positional replacement parameters.
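
If Fins isn't available, the same idea can be approximated in plain Pike. The fill() function below is a hypothetical stand-in written for this page, not the Fins implementation; it simply replaces each "%{key}" marker with the matching value from a mapping, which is enough to show why named parameters survive a change in word order.

```pike
// Hypothetical helper: replaces every "%{key}" in the template with the
// corresponding value from the mapping.
string fill(string template, mapping(string:string) values)
{
  mapping(string:string) subst = ([]);
  foreach (values; string name; string value)
    subst["%{" + name + "}"] = value;
  return replace(template, subst);
}

int main()
{
  mapping(string:string) data = (["name": "John Doe", "color": "red"]);

  // Two templates using the same parameters in a different order,
  // as two translations of one string might.
  write("%s\n", fill("My name is %{name} and I enjoy %{color} wine.", data));
  // prints: My name is John Doe and I enjoy red wine.
  write("%s\n", fill("%{color} wine is what %{name} enjoys.", data));
  // prints: red wine is what John Doe enjoys.
  return 0;
}
```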

To take our earlier localization macro to the next level (for situations where there are parameters), we might define a macro like this:

  #define LOCALE(X,Y,Z) Tools.String.named_sprintf(Locale.translate("MyProject", id->get_lang(), X, Y), Z)
  return LOCALE(21, "%{name} prefers %{color} wine.", (["name": "John Doe", "color": "red"]));
