Structuring the language file and charset support
Posted by Maurice Makaay
May 07, 2008 04:37AM |
Admin Registered: 19 years ago Posts: 8,532 |
I'm starting this thread in reaction to the whole bunch of German language files that Oliver uploaded to get rid of inconsistencies in the German files that are posted for Phorum and several modules.
Before doing a lot of refactoring on these language files in my modules, I'd rather settle on a different file naming scheme. Since UTF-8 is the default charset at install time for Phorum nowadays, it would be best to have a UTF-8 language file set too, and to make a clear distinction between supported character sets in the filenames. This isn't only interesting for German IMO; it's something I've been thinking about for a while already. We need to nail charsets down some more, to make charset support clearer for our users and to keep them from falling into the trap of getting weird results because they use a non-UTF-8 language file with their fresh 5.2 forum.
What I would propose is to have the charset in the filename, using the following schema:
<english language name>[-<arbitrary specification>].<charset>.php
The specification would be what you require for "du-male" and the like. For the German language files I would also suggest using "informal-male" style specifications, so people who do not speak German understand what the language file is about.
So based on this, I would like to propose the following naming for your German language file set:
german.iso-8859-1.php
german-informal-male.iso-8859-1.php
german-informal-female.iso-8859-1.php
With these filenames it's fully clear what the files are about and what character set they use. Based on such filenames, we could even build a module that automatically creates UTF-8 packages from the <charset> ones when somebody uploads a zip to the forums that only contains iso-8859-1 language files. In fact, I already do something like that locally for my Dutch language files, where I create the UTF-8 ones from my ISO-8859-1 files. We could also do some checks on the language file setup and, for example, only offer language files that correspond to the active MySQL connection. So if the MySQL charset is set to "latin1", we could map that to only supporting the language charset iso-8859-1.
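To make the naming schema and the MySQL check a bit more concrete, here is a rough sketch in Python (not Phorum code). The function names, the regular expression and the MySQL-to-charset mapping are all assumptions for illustration; the proposal itself only fixes the filename layout.

```python
import re
from typing import List, NamedTuple, Optional


class LanguageFile(NamedTuple):
    language: str
    specification: Optional[str]
    charset: str


# <english language name>[-<arbitrary specification>].<charset>.php
# Assumption: names are lowercase ASCII; the specification may contain dashes.
_PATTERN = re.compile(
    r"^(?P<language>[a-z]+)"
    r"(?:-(?P<specification>[a-z0-9-]+))?"
    r"\.(?P<charset>[a-z0-9-]+)\.php$"
)

# Hypothetical mapping from MySQL connection charset to language file charset.
MYSQL_TO_FILE_CHARSET = {"latin1": "iso-8859-1", "utf8": "utf-8"}


def parse_language_filename(filename: str) -> Optional[LanguageFile]:
    """Split a filename into its parts, or return None if it does not fit."""
    match = _PATTERN.match(filename.lower())
    if match is None:
        return None
    return LanguageFile(
        match["language"], match["specification"], match["charset"]
    )


def files_for_connection(filenames: List[str], mysql_charset: str) -> List[str]:
    """Keep only the language files whose charset fits the MySQL charset."""
    wanted = MYSQL_TO_FILE_CHARSET.get(mysql_charset)
    selected = []
    for name in filenames:
        parsed = parse_language_filename(name)
        if parsed is not None and parsed.charset == wanted:
            selected.append(name)
    return selected
```

A filename without a charset part, such as "english.php", does not match the pattern, so a check like this could also be used to spot files that still use the old naming.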
What do you guys think about this proposal? I think it would be a step in the right direction for handling charset mayhem.
Maurice Makaay
Phorum Development Team



Re: Structuring the language file and charset support May 07, 2008 04:51AM |
Admin Registered: 21 years ago Posts: 9,240 |
May 07, 2008 05:40AM |
Admin Registered: 17 years ago Posts: 744 |
Hi Maurice,
I agree that this situation needs to be cleared up, and your ideas have my approval (even if it means some work for me renaming a lot of language files).
Since I just upgraded my Phorum from 3 to 5.2, I still have to use ISO language files because of the latin1 charset in MySQL. Next month, when we switch to a new server, I'll also switch to UTF-8.
It's exhausting to maintain language files for two different charsets. Using named HTML entities for special characters (ä = &auml;) isn't a solution either, because beginners can run into problems when they start editing language files (without knowing HTML and unaware of charset problems).
If you want to handle some conversion automatically, I'd prefer to use UTF-8 as the base and generate ISO (or ASCII with named HTML entities) from it. Every programmer who contributes a language file for Phorum should be able to save it as UTF-8.
Regards
Oliver
Using Phorum since 7/2000: forum.langzeittest.de (current version 5.2.23)
Modules "Made in Germany" for version 5.2: Author_as_Sender, CarCost, Close_Topic, Conceal_Message_Timestamp,
Format_Email, Index_Structure, Mailing_List, Pervasive_Forum, Spritmonitor, Terms_of_Service and German_Language_Files_Package.
Re: Structuring the language file and charset support May 07, 2008 05:50AM |
Admin Registered: 21 years ago Posts: 9,240 |
May 07, 2008 05:56AM |
Admin Registered: 19 years ago Posts: 8,532 |
"Should be able to ..." and "Can ..." are two totally different things =) But I don't mind what way the conversion would go really. The iconv() function takes care of that nicely.
Fact is that I want to converge towards UTF-8, since that's the charset that poses the fewest problems for our users when used right. I would therefore even use this logic: convert non-UTF-8 to UTF-8, and do not convert UTF-8 to non-UTF-8. Reasons for this logic:
- One problem with converting from UTF-8 to some other charset is that the target charset has to be registered somewhere. If I get a UTF-8 encoded German file, I have no easy way of knowing that the file could be correctly converted to iso-8859-1. A (hacky?) way could be to convert the file to a lot of other charsets and back, and see which charsets survive the conversion. Based on that, you could determine the possible valid charsets. I don't like it though.
- Languages for which only UTF-8 language files are provided probably never ran in other charsets anyway. Providing more charset types for those would only complicate matters, and it would conflict with converging the charset support towards UTF-8.
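The round-trip idea from the first point could look roughly like the Python sketch below. The candidate list and the function name are made up for the example, and this only illustrates the hacky approach, it is not a recommendation.

```python
from typing import List

# Assumption: a small, arbitrary list of candidate target charsets.
CANDIDATES = ["iso-8859-1", "iso-8859-2", "iso-8859-15", "windows-1252", "koi8-r"]


def surviving_charsets(utf8_bytes: bytes, candidates: List[str] = CANDIDATES) -> List[str]:
    """Return the candidate charsets that can hold the text without loss."""
    text = utf8_bytes.decode("utf-8")
    survivors = []
    for charset in candidates:
        try:
            # Convert to the candidate charset and back again.
            round_trip = text.encode(charset).decode(charset)
        except UnicodeEncodeError:
            # At least one character does not exist in this charset.
            continue
        if round_trip == text:
            survivors.append(charset)
    return survivors
```

For German text, several of the Latin charsets in this list survive at once, so the result does not say which one the forum actually needs. That is the registration problem described above.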
The use of HTML entities is only partly feasible, since it wouldn't work well for the really freaky languages that consist only of special characters. Additionally, it does not work for the mail messages in the language file, which need the correctly encoded characters. HTML entities in the mail messages would show up as literal entities in the mail and not as the decoded special characters.
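A small Python illustration of the mail problem, using only the standard library. The example string and the helper name are assumptions; real language file strings will differ.

```python
import html
from email.message import EmailMessage

# Hypothetical language string that uses named HTML entities.
ENTITY_STRING = "Gr&uuml;&szlig;e aus dem Forum"


def plain_text_mail(body: str) -> EmailMessage:
    """Build a plain-text mail; nothing in it interprets HTML entities."""
    message = EmailMessage()
    message["Subject"] = "Test"
    message.set_content(body, charset="utf-8")
    return message


# The recipient of this mail sees the literal text "&uuml;".
mail_with_entities = plain_text_mail(ENTITY_STRING)

# Only with the real characters in the string does the mail read correctly.
mail_with_characters = plain_text_mail(html.unescape(ENTITY_STRING))
```

A browser decodes the entities when the string is shown on a forum page, but a plain-text mail has no such step, so the language file needs the correctly encoded characters themselves.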
Maurice Makaay
Phorum Development Team



May 07, 2008 07:09AM |
Admin Registered: 17 years ago Posts: 744 |
Hi Maurice,
so, what are the consequences for my part of the work?
- For the next version of the German Language Files Package I'll use the proposed filenames.
- Since I'm changing to a UTF-8 database next month, I have to create UTF-8 language files as well. Because of this, one of the next versions of the German Language Files Package will include both charsets.
- In the future, I have to maintain both charsets until Phorum offers additional support?
Regards
Oliver
May 07, 2008 07:28AM |
Admin Registered: 19 years ago Posts: 8,532 |
Using the proposed names for your next language pack is okay. I will start wrapping up the Dutch language files according to these guidelines as well. I think that the checks are something for 5.3, where we can add support for fully depending on this kind of filename. I don't want to do that in a stable tree. Until Phorum (you mean phorum.org uploads, right?) supports automatic conversions, you'll have to maintain the two language file versions yourself. Of course you are in no way forced into that. If you only want to take care of a UTF-8 tree after your conversion, then that's fine with us too.
Maurice Makaay
Phorum Development Team



May 07, 2008 08:30AM |
Admin Registered: 17 years ago Posts: 744 |
Hi Maurice,
Quote
Maurice Makaay
Until Phorum (you mean phorum.org uploads, right?) supports automatic conversions, you'll have to maintain the two language file versions yourself.
You wrote that, but I didn't read it attentively. Now I understand.
Why not handle language files like template files and keep them in the cache? If the original file is not in UTF-8, you can convert it to UTF-8 for caching. Could this be a smart way to handle charset problems?
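A rough sketch of what I mean, written in Python instead of PHP just to show the idea. The function name, the cache file suffix and the staleness check are my assumptions; Phorum's template cache may work differently.

```python
from pathlib import Path


def cached_utf8_copy(source: Path, charset: str, cache_dir: Path) -> Path:
    """Return a UTF-8 copy of a language file, converting only when needed.

    `charset` is the charset of the original file, for example taken from
    its filename. The ".utf8-cache" suffix is a made-up convention.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / (source.name + ".utf8-cache")

    # Rebuild the cached copy if it is missing or older than the original.
    stale = (
        not target.exists()
        or target.stat().st_mtime < source.stat().st_mtime
    )
    if stale:
        text = source.read_bytes().decode(charset)
        target.write_bytes(text.encode("utf-8"))
    return target
```

This converts the raw file contents only. If a language file also declares its charset somewhere inside, that declaration would have to be adjusted too, which the sketch does not do.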
Regards
Oliver
Re: Structuring the language file and charset support May 07, 2008 08:37AM |
Admin Registered: 21 years ago Posts: 9,240 |
May 11, 2008 05:50AM |
Admin Registered: 17 years ago Posts: 744 |
Hi Maurice,
Quote
Oliver Riesen
- For the next version of the German Language Files Package I'll use the proposed filenames.
- Since I'm changing to a UTF-8 database next month, I have to create UTF-8 language files as well. [...]
- In the future I have to maintain both charsets until Phorum offers additional support?
I changed the file names in my German Language Files Package and also added UTF-8 files. It's working well in my test forum. After updating the documentation I'll publish a new version.
Regards
Oliver