
Structuring the language file and charset support

Posted by Maurice Makaay 
Structuring the language file and charset support
May 07, 2008 04:37AM
I'm starting this post in reaction to a whole bunch of German language files that Oliver uploaded to get rid of inconsistencies in the German files that are posted for Phorum and several modules.

Before doing a lot of refactoring of these language files in my modules, I think I'd rather see a different file naming scheme. Since UTF-8 is the default charset at install time for Phorum nowadays, it would be best to have a UTF-8 language file set too, and to make a clear distinction between supported character sets in the filenames. This is not only interesting for German IMO; it's something I've been thinking about for a while already. We need to get charsets nailed down some more, to make charset support clearer for our users and to keep them from falling into the trap where they get weird results because they use a non-UTF-8 language file with their fresh 5.2 forum.

What I would propose is to have the charset in the filename, using the following schema:

<english language name>[-<arbitrary specification>].<charset>.php

The specification would be what you require for "du-male" and the like. For the German language files I would also suggest using "informal-male" style specifications, so people who do not speak German understand what the language file is about.
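
As an illustration only, the schema above could be parsed along these lines. This is a minimal Python sketch, not Phorum code: the regular expression, the `LanguageFile` type and the `parse_language_filename` helper are all hypothetical, and the sketch assumes that the language name itself contains no hyphen and that charset names contain no dot.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern for: <english language name>[-<specification>].<charset>.php
# Assumptions: language names are plain letters, charset names have no dots.
FILENAME_RE = re.compile(
    r"^(?P<language>[a-z]+)"       # e.g. "german"
    r"(?:-(?P<spec>[a-z0-9-]+))?"  # optional, e.g. "informal-male"
    r"\.(?P<charset>[a-z0-9_-]+)"  # e.g. "iso-8859-1"
    r"\.php$"
)


class LanguageFile(NamedTuple):
    language: str
    spec: Optional[str]
    charset: str


def parse_language_filename(filename: str) -> Optional[LanguageFile]:
    """Split a language filename into its parts, or return None if it
    does not follow the proposed schema."""
    match = FILENAME_RE.match(filename.lower())
    if match is None:
        return None
    return LanguageFile(match["language"], match["spec"], match["charset"])


if __name__ == "__main__":
    print(parse_language_filename("german-informal-male.iso-8859-1.php"))
```

A filename without a charset part, such as `german.php`, does not match the pattern, so the helper returns `None` for it.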

So based on this, I would like to propose the following naming for your German language file set:
german.iso-8859-1.php
german-informal-male.iso-8859-1.php
german-informal-female.iso-8859-1.php
With these filenames it's fully clear what the files are about and what character set they use. Based on such filenames, we could even build a module that automatically creates UTF-8 packages from the <charset> ones when somebody uploads a zip to the forums that only contains iso-8859-1 language files. In fact, I already do something like that locally for my Dutch language files, where I create the UTF-8 ones from my ISO-8859-1 files. We could also do some checks on the language file setup and, for example, only offer the language files that correspond to the charset of the active MySQL connection. So if the MySQL charset is set to "latin1", we could map that to supporting only language charset iso-8859-1.
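
The MySQL check could look roughly like the following. This is a hedged Python sketch rather than anything Phorum does today: the mapping table and the `selectable_language_files` helper are hypothetical, the table is deliberately incomplete, and the `utf-8` spelling in filenames is an assumption. Note also that MySQL's "latin1" is really closer to Windows-1252 than to strict ISO-8859-1, so even this single mapping is an approximation.

```python
from typing import Iterable, List

# Hypothetical, incomplete mapping from MySQL connection charset names
# to the charset part used in language filenames.
MYSQL_TO_FILE_CHARSET = {
    "latin1": "iso-8859-1",  # approximation: MySQL latin1 is close to cp1252
    "utf8": "utf-8",         # assumed filename spelling
}


def selectable_language_files(filenames: Iterable[str], mysql_charset: str) -> List[str]:
    """Return the language files whose charset matches the MySQL charset.

    If the MySQL charset is not in the table, nothing is filtered out,
    so the admin keeps the full choice.
    """
    names = sorted(filenames)
    wanted = MYSQL_TO_FILE_CHARSET.get(mysql_charset.lower())
    if wanted is None:
        return names
    suffix = "." + wanted + ".php"
    return [name for name in names if name.lower().endswith(suffix)]


if __name__ == "__main__":
    files = ["german.iso-8859-1.php", "german.utf-8.php", "dutch.utf-8.php"]
    print(selectable_language_files(files, "latin1"))
```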

What do you guys think about this proposal? I think it would be a step in the right direction for handling charset mayhem.


Maurice Makaay
Phorum Development Team
my blog linkedin profile secret sauce
Re: Structuring the language file and charset support
May 07, 2008 04:51AM
sounds good to me.
just that a mapping table between mysql and php wouldn't be that easy.


Thomas Seifert
Re: Structuring the language file and charset support
May 07, 2008 05:40AM
Hi Maurice,

I agree that it is necessary to clear up this situation, and your ideas have my approval (even if it means some work for me renaming a lot of language files).

Since I just updated my Phorum from 3 to 5.2, I still have to use ISO language files because of the latin1 charset in MySQL. Next month, when we switch to a new server, I'll also switch to UTF-8.

It's exhausting to maintain language files for two different charsets. Using named HTML entities for special characters (ä = &auml;) isn't a solution either, because beginners can run into problems when they start to edit language files (without knowing HTML and unaware of charset problems).

If you want to handle some conversion automatically, I would prefer to use UTF-8 as the base and generate ISO (or ASCII with named HTML entities) from it. Every programmer who contributes a language file for Phorum should be able to store UTF-8.

Regards
Oliver


Using Phorum since 7/2000: forum.langzeittest.de (current version 5.2.23)
Modules "Made in Germany" for version 5.2: Author_as_Sender, CarCost, Close_Topic, Conceal_Message_Timestamp,
Format_Email, Index_Structure, Mailing_List, Pervasive_Forum, Spritmonitor, Terms_of_Service and German_Language_Files_Package.
Re: Structuring the language file and charset support
May 07, 2008 05:50AM
maintaining two charsets for the language files is really not much work. you can simply use command line tools to convert between different charsets without trouble. you would only maintain one file and convert to the other one(s).


Thomas Seifert
Re: Structuring the language file and charset support
May 07, 2008 05:56AM
"Should be able to ..." and "Can ..." are two totally different things =) But I don't mind what way the conversion would go really. The iconv() function takes care of that nicely.
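
To make the conversion step concrete, here is a small Python sketch of what an iconv-style conversion of a language file amounts to. It is not the PHP iconv() call itself; `convert_to_utf8` and `utf8_filename` are hypothetical helpers, and the `.utf-8.php` target name is an assumption. Any charset declaration inside the language file would have to be updated separately, which the sketch does not attempt.

```python
def convert_to_utf8(data: bytes, source_charset: str) -> bytes:
    """Re-encode raw file contents from source_charset to UTF-8.

    Decoding is strict, so invalid input raises UnicodeDecodeError.
    (Single-byte charsets such as iso-8859-1 accept every byte value,
    so a wrongly labeled file will not be detected this way.)
    """
    return data.decode(source_charset).encode("utf-8")


def utf8_filename(filename: str, source_charset: str) -> str:
    """Derive the UTF-8 filename from a '<name>.<charset>.php' filename."""
    suffix = "." + source_charset + ".php"
    if not filename.lower().endswith(suffix.lower()):
        raise ValueError("filename does not end in " + suffix)
    return filename[: -len(suffix)] + ".utf-8.php"


if __name__ == "__main__":
    latin1_bytes = "Grüße".encode("iso-8859-1")
    print(convert_to_utf8(latin1_bytes, "iso-8859-1"))
    print(utf8_filename("german-informal-male.iso-8859-1.php", "iso-8859-1"))
```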

Fact is that I want to converge towards UTF-8, since that's the charset that poses the fewest problems for our users when used right. I would therefore even use the logic: convert non-utf8 to utf8 and do not convert utf8 to non-utf8. Reasons for this logic:
  • One problem with converting from utf8 to some other charset is that the target charset has to be registered somewhere. If I get a utf8 encoded German file, I have no easy way of knowing that the file could be correctly converted to iso-8859-1. A (hacky?) way could be to convert the file to a lot of other charsets and back and see which charsets survive the conversion. Based on that, you could determine possible valid charsets. Don't like it though.
  • Languages for which only UTF-8 language files are provided probably never ran in other charsets anyway. Providing more charset types for those would only complicate business, and it would collide with converging the charset support towards utf8.
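
The round-trip idea from the first point can be sketched as follows, in Python and purely as an illustration. The `candidate_charsets` helper is hypothetical. It also shows why the approach is unsatisfying: a short German text survives more than one target charset, so the result is a list of candidates, not an answer.

```python
from typing import Iterable, List


def candidate_charsets(utf8_data: bytes, charsets: Iterable[str]) -> List[str]:
    """Return the charsets that can represent the UTF-8 text without loss.

    Each charset is tried by encoding the text and decoding it again;
    a charset survives only if the round trip gives back the same text.
    """
    text = utf8_data.decode("utf-8")
    survivors = []
    for charset in charsets:
        try:
            if text.encode(charset).decode(charset) == text:
                survivors.append(charset)
        except UnicodeEncodeError:
            # At least one character does not exist in this charset.
            pass
    return survivors


if __name__ == "__main__":
    german = "Grüße".encode("utf-8")
    print(candidate_charsets(german, ["ascii", "iso-8859-1", "iso-8859-2", "koi8-r"]))
```

For the German sample, both iso-8859-1 and iso-8859-2 survive, because both contain ü and ß.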

The use of HTML entities is only partly feasible, since it wouldn't work well for the really freaky languages that only have special characters in them. Additionally, it does not work for the mail messages in the language file that need the correctly encoded characters. HTML entities in the mail messages would show up as the entities in the mail messages and not as the decoded special character.
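
The mail problem is easy to reproduce. The Python sketch below is only an illustration of the effect and has nothing to do with how Phorum builds its mails; `build_plain_text_mail` is a hypothetical helper. A browser decodes the entities, but a plain-text mail body carries them through literally.

```python
import html
from email.message import EmailMessage


def build_plain_text_mail(body: str, charset: str = "utf-8") -> EmailMessage:
    """Build a plain-text mail; the body is sent exactly as given."""
    msg = EmailMessage()
    msg["Subject"] = "New message"
    msg.set_content(body, charset=charset)
    return msg


if __name__ == "__main__":
    entity_text = "Gr&uuml;&szlig;e aus dem Forum"

    # What a browser would render from the entity version:
    print(html.unescape(entity_text))

    # What the mail recipient sees: the entities are still there.
    print(build_plain_text_mail(entity_text).get_content())

    # With correctly encoded characters the mail reads as intended.
    print(build_plain_text_mail("Grüße aus dem Forum").get_content())
```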


Maurice Makaay
Phorum Development Team
Re: Structuring the language file and charset support
May 07, 2008 07:09AM
Hi Maurice,

So, what consequences does this have for my part of the work?

- For the next version of the German Language Files Package I'll use the proposed filenames.
- Since I'm switching to a utf-8 database next month, I also have to create utf-8 language files. Because of this, one of the next versions of the German Language Files Package will include both charsets.
- In the future, will I have to maintain both charsets until Phorum offers additional support?

Regards
Oliver


Re: Structuring the language file and charset support
May 07, 2008 07:28AM
Using the proposed names for your next language pack is okay. I will start wrapping up the Dutch language files according to these guidelines as well. I think the checks are something for 5.3, where we can add support that fully depends on this kind of filename. I don't want to do that in a stable tree. Until Phorum (you mean phorum.org uploads, right?) supports automatic conversions, you'll have to maintain the two language file versions yourself. Of course you are in no way forced into that. If you only want to take care of a UTF-8 tree after your conversion, then that's fine with us too.


Maurice Makaay
Phorum Development Team
Re: Structuring the language file and charset support
May 07, 2008 08:30AM
Hi Maurice,

Quote
Maurice Makaay
Until Phorum (you mean phorum.org uploads, right?) supports automatic conversions, you'll have to maintain the two language file versions yourself.

You wrote that but I didn't read it attentively. Now I understand.

Why not handle language files like template files and keep them in a cache? If the original file is not in utf-8, you could convert it to utf-8 when caching it. Could this be a smart way to handle charset problems?

Regards
Oliver


Re: Structuring the language file and charset support
May 07, 2008 08:37AM
I don't think that's a good idea, because it would require certain system tools or PHP extensions which are otherwise not required.
It's far easier to do something specific to phorum.org which could handle the stuff on upload, as we have a controlled environment here.


Thomas Seifert
Re: Structuring the language file and charset support
May 11, 2008 05:50AM
Hi Maurice,

Quote
Oliver Riesen
- For the next version of the German Language Files Package I'll use the proposed filenames.
- Since I'm switching to a utf-8 database next month, I also have to create utf-8 language files. [...]
- In the future, will I have to maintain both charsets until Phorum offers additional support?

I changed the file names in my German Language Files Package and also added utf-8 files. It works well in my test forum. After updating the documentation I'll publish a new version.

Regards
Oliver

