ReNamer:Rules:Translit
Latest revision as of 09:43, 6 January 2023
This rule transliterates one alphabet into another. Its main goal is to transliterate non-English characters from different languages into their English/Latin representation. For example, the German character ü can be transliterated to ue (the name Müller can also be written as Mueller).
This rule uses transliteration maps (explained below).
Transliteration maps
To transliterate, we create a pair of equivalent characters, like this: ü=ue
(Note that the right side of this equation has two characters. Any number of characters may be placed on either side of the equation.)
We need several such equivalent character pairs to convert one language into another. The entire set is called a transliteration map. (In effect, a map is a list of find-and-replace rules.)
ReNamer has several such built-in maps. Each map is named after a language (the second language in all maps is English).
Each map can be used in both directions (e.g. French-to-English or English-to-French).
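The idea of a map as a list of find-and-replace pairs can be illustrated with a minimal Python sketch. This is not ReNamer's implementation: the helper names `parse_map` and `apply_map` are hypothetical, and only the pair ü=ue comes from this page (the other German pairs are assumed for illustration). Case handling is ignored here.

```python
def parse_map(text):
    """Parse 'left=right' lines into a list of (left, right) pairs."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        left, right = line.split("=", 1)
        pairs.append((left, right))
    return pairs


def apply_map(name, pairs, backward=False):
    """Apply each pair as a plain find-and-replace, in listed order."""
    for left, right in pairs:
        if backward:
            left, right = right, left  # use the map right-to-left
        if left:
            name = name.replace(left, right)
    return name


# Hypothetical German-to-English map; only the first pair is from this page.
GERMAN = parse_map("ü=ue\nö=oe\nä=ae\nß=ss")

print(apply_map("müller", GERMAN))                 # forward direction
print(apply_map("mueller", GERMAN, backward=True))  # backward direction
```

Because the sketch replaces pairs one after another, the output of one pair could be picked up again by a later pair; a real implementation would need to guard against that.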
When you start up the Translit Rule, its window does not show any maps. You are free to do any of the following:
- Use any of the built-in maps, in either the forward or the reverse direction.
- Create your own map and use it.
- Edit a built-in map first, and then use it.
Let us see how to do this.
Automatic case conversion
The Translit rule performs automatic case conversion, using an algorithm adapted specifically for transliteration. The rule ignores the case of the map entries, i.e. "A=B" is the same as "a=b". The case of the output is decided by the case of the input fragment. Multi-character fragments are treated as parts of words, and their case is decided by the case of the letters around them.
The logic for the case conversion is as follows (ReNamer Beta from 23 Aug 2009):
    set OUTPUT-PART to lower case
    if first letter in INPUT-PART is upper case then
      if length of OUTPUT-PART bigger than 1 then
        if next letter in original name is upper case then
          convert whole OUTPUT-PART to upper case
        else
          convert only first letter in OUTPUT-PART to upper case
      else
        convert whole OUTPUT-PART to upper case
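The case-conversion logic above can be sketched in Python as follows. This is an illustration only, not ReNamer's code. The function name `adjust_case` and its parameters are hypothetical, and the sketch assumes that "next letter in original name" means the character immediately after the matched fragment, with an empty string passed when the fragment ends the name.

```python
def adjust_case(input_part, output_part, next_char):
    """Return output_part with its case adjusted to match the input.

    input_part  -- the fragment matched in the original name
    output_part -- the replacement taken from the map
    next_char   -- character following the fragment ("" at end of name; assumption)
    """
    out = output_part.lower()
    if input_part[:1].isupper():
        if len(out) > 1:
            if next_char.isupper():
                out = out.upper()                # inside an all-caps word
            else:
                out = out[0].upper() + out[1:]   # start of a capitalized word
        else:
            out = out.upper()                    # single-letter output
    return out


print(adjust_case("Ü", "ue", "l"))  # Ü in "Über..." style input
print(adjust_case("Ü", "ue", "B"))  # Ü in "ÜBER..." style input
```

Under these assumptions, an upper-case Ü followed by a lower-case letter becomes Ue, while the same Ü followed by an upper-case letter becomes UE.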
Using a built-in transliteration map
To select any of the built-in maps, press the transliteration maps button. A list of available transliteration maps pops up.
Click on the desired transliteration map. As an example, let us click on the French (to English) transliteration map.
The Rules window changes immediately to show the French characters and their English equivalents.
You can edit any entry in this list, add new entries, or delete entries.
Note that such editing does not alter the saved version of the map. The map is edited for one-time use only. If you select the same Translit map again, ReNamer will load the original version, not the edited version. A later section shows how to alter a saved transliteration map.
Next, select the rule's parameters as shown below:
| Parameter | Details |
|---|---|
| forward | Transliterates in the left-to-right direction, as defined in the map. |
| backward | Transliterates in the right-to-left direction, as defined in the map. |
| skip extension | If this check box is selected, the rule ignores the file extension. |
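The effect of the skip extension option can be sketched in Python. This is a hedged illustration: `transliterate_name` and `demo_map` are hypothetical names, the single pair t=т is made up for the demo, and the extension is taken to be whatever `os.path.splitext` returns, which may differ from how ReNamer defines it.

```python
import os


def demo_map(text):
    """Hypothetical one-pair map: Latin 't' to Cyrillic 'т'."""
    return text.replace("t", "т")


def transliterate_name(filename, transliterate, skip_extension=True):
    """Apply `transliterate` to the name, optionally leaving the extension alone."""
    if skip_extension:
        base, ext = os.path.splitext(filename)
        return transliterate(base) + ext
    return transliterate(filename)


print(transliterate_name("notes.txt", demo_map, skip_extension=True))
print(transliterate_name("notes.txt", demo_map, skip_extension=False))
```

With the option on, only the base name changes; with it off, the letters in the extension are transliterated as well.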
Finally, press the Add Rule button to add the rule to the stack.
Making your own transliteration map
Click in the Translit Alphabet window, and start entering your custom alphabet.
A transliteration alphabet consists of pairs of equivalent parts (couples), entered one couple per line, with the two parts separated by "=" (equals sign). The alphabet should not contain spaces, and the case of its entries does not matter (case is adjusted automatically). Also, put couples that contain more characters at the top, so that they are processed first and are not partially matched by shorter entries. Below is a simple example:
    щ=sh
    ю=yu
    я=ya
    ь='
    э=e
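The advice to list longer couples first can be illustrated with a small Python sketch. It is not ReNamer's algorithm: the functions `sort_longest_first` and `transliterate` are hypothetical, and the three pairs are invented to show the problem. The scanner tries the pairs in listed order at each position, so a short entry listed first can split a longer fragment.

```python
def sort_longest_first(pairs):
    """Order pairs so that longer left-hand parts are tried first."""
    return sorted(pairs, key=lambda pair: len(pair[0]), reverse=True)


def transliterate(name, pairs):
    """Scan left to right, using the first pair that matches at each position."""
    out = []
    i = 0
    while i < len(name):
        for left, right in pairs:
            if left and name.startswith(left, i):
                out.append(right)
                i += len(left)
                break
        else:
            out.append(name[i])  # no pair matched; keep the character
            i += 1
    return "".join(out)


# Hypothetical pairs, deliberately listed with the short entries first.
pairs = [("s", "с"), ("h", "х"), ("sh", "ш")]

print(transliterate("sha", pairs))                      # "sh" is split into s + h
print(transliterate("sha", sort_longest_first(pairs)))  # "sh" is matched whole
```

In the unsorted order the fragment "sh" is consumed as two separate letters; after sorting, the two-character couple wins.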
After entering all such transliterations, press the Add Rule button to add the rule to the rule stack.
Note that this map is not saved yet (it was composed for one-time use only). The next section shows how to save a map.
Saving a transliteration map
To save a newly composed transliteration map:
- Press the button.
- Select the last option (Save Translit...).
- Enter a new name for the map and press OK. The new map is saved.
The process of saving an edited transliteration map is similar. The only difference is that the Save Translit window shows the current map's name. You can press OK to save the changes you have just made, or enter a new name to save the edited version as a new map.
The new map's name is added to the map list.
From now on, the new map is available in the list just like the built-in ("standard") maps.
Unicode character forms
Have you encountered a case where some characters do not get converted, despite a visually identical character being defined in the Translit alphabet?
The same Unicode character can be stored either as a single exact character code or as a base character followed by combining characters. The displayed characters look identical, but their binary content is completely different. The conversion between these forms is covered by the Unicode Normalization standard.
Alphabets in the Translit rule are normally defined using exact character codes, so characters written with combining characters are not matched. You can put a piece of text through a Unicode analyzer to see exactly how each character is defined and to identify the use of combining characters.
To handle every form of the same visual character, you can either define all of the forms in the Translit alphabet, or simply strip away the combining characters with the "Strip unicode marks" option of the Clean Up rule.
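The difference between the two forms can be seen with Python's standard `unicodedata` module. The sketch below is only an illustration: `describe` plays the role of a very small Unicode analyzer, and `strip_marks` is an assumed approximation of what stripping combining marks means, not the Clean Up rule's actual code.

```python
import unicodedata

composed = "\u00fc"     # ü stored as one exact character code
decomposed = "u\u0308"  # u followed by a combining diaeresis

# The two strings look the same on screen but are different data.
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)


def describe(text):
    """List the code point and Unicode name of every character in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '?')}" for ch in text]


def strip_marks(text):
    """Decompose the text and drop all combining characters (assumed behaviour)."""
    parts = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in parts if not unicodedata.combining(ch))


print(describe(decomposed))
print(strip_marks(decomposed))
```

An alternative to stripping marks, not described on this page, is to normalize names to the composed form (NFC) before transliteration, so that a map written with exact character codes matches both forms.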