diff --git a/doc/draft/draft-jseng-idn-admin-01.txt b/doc/draft/draft-jseng-idn-admin-01.txt new file mode 100644 index 0000000000..c0998e4b64 --- /dev/null +++ b/doc/draft/draft-jseng-idn-admin-01.txt @@ -0,0 +1,1175 @@ +INTERNET DRAFT Editors: James SENG +draft-jseng-idn-admin-01.txt John KLENSIN +18th Oct 2002 Authors: K. KONISHI +Expires 18th April 2003 K. HUANG, H. QIAN, Y. KO + + Internationalized Domain Names Registration and Administration + Guideline for Chinese, Japanese and Korean + +Status of this Memo + + This document is an Internet-Draft and is in full conformance + with all provisions of Section 10 of RFC2026 except that the + right to produce derivative works is not granted. + + Internet-Drafts are working documents of the Internet + Engineering Task Force (IETF), its areas, and its working + groups. Note that other groups may also distribute working + documents as Internet-Drafts. + + Internet-Drafts are draft documents valid for a maximum of + six months and may be updated, replaced, or obsoleted by other + documents at any time. It is inappropriate to use Internet- + Drafts as reference material or to cite them other than as + "work in progress." + + The list of current Internet-Drafts can be accessed at + http://www.ietf.org/ietf/1id-abstracts.txt + + The list of Internet-Draft Shadow Directories can be accessed at + http://www.ietf.org/shadow.html. + +Abstract + +Achieving internationalized access to domain names raises many complex +issues. These include not only issues associated with basic protocol design +(i.e., how the names are represented on the network, compared, and +converted to appropriate forms) but also issues and options for +deployment, transition, registration and administration. + +The IETF IDN working group focused on the development of a standards +track specification for access to domain names in a broader range of +scripts than the original ASCII. 
It became clear during its efforts +that there was great potential for confusion, and difficulties in +deployment and transition, due to characters with similar appearances +or interpretations and that those issues could best be addressed +administratively, rather than through restrictions embedded in the +protocols. + +This document provides guidelines for zone administrators (including +but not limited to registry operators and registrars), and information +for all domain name holders, on the administration of those domain +names which contain characters drawn from Chinese, Japanese and Korean +scripts (CJK). Other language groups are encouraged to develop their +own guidelines as needed, based on these guidelines if that is helpful. + +Comments on this document can be sent to the authors at +idn-admin@jdna.jp. + +Table of Contents + +0. Pre-Note for ASCII-version of this document + +1. Introduction + +2. Definitions + +3. Administrative Framework +3.1. Principles underlying these Guidelines +3.2. Registration of IDL +3.2.1. Language character variant table +3.2.2. Formal syntax +3.2.3. Registration Algorithm +3.3. Deletion and Transfer of IDL and IDL Package +3.4. Activation and De-activation of IDL variants +3.5. Adding/Deleting language(s) association +3.6. Versioning of the language character variant tables + +4. Example of Guideline Adoption + +i. Notes + +ii. Acknowledgements + +iii. Authors + +iv. Appendix A + +v. Normative References + +vi. Non-normative References + +vii. Other Issues + + + +0. Pre-Note for ASCII-version of this document + +In order to make meanings clear, especially in examples, Han ideographs +are used in several places in this document. Of course, these +ideographs do not appear in the ASCII form of this document. 
So, for +the convenience of readers of the ASCII format and some readers not +familiar with recognizing and distinguishing Chinese characters, each +use of a particular character will be associated with both its Unicode +code point and an "asterisk tag" with its corresponding Chinese +Romanization [ISO7098] with the tone mark represented by a number 1 to +4. Those tags have no meaning outside this document; they are intended +simply to provide a quick visual and reading reference to facilitate +the combinations and transformations of characters in the guideline and +table excerpts. Appendix A would provide the Romanization of the +ideographs in Japanese (ISO 3602) and Korean (ISO 11941). + +1. Introduction + +Defining and specifying protocols for Internationalized Domain Names +has been one of the most controversial tasks initiated by the IETF in +recent years. Domain names are the fundamental naming architecture of +the Internet; many Internet protocols and applications rely on the +stability, continuity, and absence of ambiguity of the DNS. + +The introduction of internationalized domain names (IDN) amplifies the +difficulty of putting names into identifiers and the confusion between +scripts and languages. It impacts many internet protocols and +applications and creates more complexity in technical administration +and services. + +While the IETF IDN working group [IDN-WG] focused on the technical +problems of IDN, administrative guidelines are also important in order +to reduce unnecessary user confusion and domain name disputes among +domain name holders. + +The IDN working group has completed working group last call for the +following internet-drafts: + +1. Preparation of Internationalized Strings [STRINGPREP] +2. Internationalizing Host Names In Applications [IDNA] +3. Punycode version 0.3.3 [PUNYCODE] +4. 
A Stringprep Profile for Internationalized Domain Names [NAMEPREP] + +These drafts specify that the intersystem protocols that make up the +domain name system infrastructure remain unchanged. Instead, they +introduce internationalization (I18N) [Note1] in client software +(particularly via the IDNA protocol) using an ASCII Compatible Encoding +(ACE) known as Punycode. + +The domain name protocols [STD13] also specify that characters are to +be interpreted so that upper and lower case Latin-based characters are +considered equivalent. But with the introduction of Unicode characters +beyond US-ASCII, and the possibility to represent a single character in +multiple ways in ISO10646/Unicode [UNICODE], a normalization process, +known as Nameprep, has been proposed to handle the more complex +problems of character-matching for those additional characters. +Nameprep is also executed by client software as described in IDNA. + +While Nameprep normalizes domain names so that the users have an +improved chance of getting the right domain name from information +provided in other forms, as required for I18N, Nameprep does not handle +any localization (L10N). + +This becomes significant when a domain name holder attempts to use a +Unicode string forming a "name", "word", or "phrase" that may have +certain meaning in a certain language or when used as a domain name. +Such Unicode string may have different variants in the context of the +language or culture. + +Generally, these localized variants in CJK can be classified into four +categories, as described by Halpern et al. [C2C]: [Note2] + +a. Character (or Code) variants + +Character (or Code) variants refer to variants that are generated by +character-by-character (or code-by-code) substitution. + +An example in English would be "A" or "a" (U+0041 or U+0061). +Two examples in Chinese would be U+98DB *fei1* or U+98DE *fei1* +and U+6A5F *ji1* or U+673A *ji1*. 
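As a rough illustration of character-by-character substitution, the following Python sketch expands a string into every string obtainable by swapping each code point for one of its variants. The VARIANTS mapping and the function name character_variants are hypothetical and exist only for this illustration; the mapping holds just the two Chinese pairs mentioned above and is not a real variant table.

```python
from itertools import product

# Hypothetical variant sets, taken only from the two Chinese examples above.
VARIANTS = {
    0x98DB: {0x98DB, 0x98DE},  # *fei1*
    0x98DE: {0x98DB, 0x98DE},
    0x6A5F: {0x6A5F, 0x673A},  # *ji1*
    0x673A: {0x6A5F, 0x673A},
}


def character_variants(label):
    """Return every string formed by code-point-by-code-point substitution."""
    # A code point without an entry in VARIANTS only maps to itself.
    choices = [sorted(VARIANTS.get(ord(ch), {ord(ch)})) for ch in label]
    return {"".join(map(chr, combo)) for combo in product(*choices)}


if __name__ == "__main__":
    for variant in sorted(character_variants("\u98db\u6a5f")):
        print(" ".join(f"U+{ord(ch):04X}" for ch in variant))
```

The sketch treats every combination as equally valid; deciding which of the generated strings are meaningful in a given language is a separate question.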
+ +Note that this does not mean the choice between U+6A5F and U+673A is +always symmetric like the one between "A" and "a" -- it is a choice only +for Chinese but not for Japanese. + +For particular characters, the variant may simply be to drop them. For +example, point and vowel characters in Hebrew (U+05B0 to U+05C4) and +Arabic (U+064B to U+0652) are optional; the variants for strings +containing them are constructed by simply dropping those points and +vowels. + +Code variants may also occur when different code points are assigned to +what visually or abstractly are the "same" character, possibly due +to compatibility issues, type face differences or script range. For +example, LATIN CAPITAL LETTER A (U+0041) normally has an appearance +identical to GREEK CAPITAL LETTER ALPHA (U+0391). CJK scripts have font +variants for compatibility (either U+4E0D or U+F967 may be used) and +"zVariant" (e.g. U+5154 and U+514E). + +The difficulty lies in defining which characters are the "same" and +which are not. + +b. Orthographic variants + +Orthographic variants refer to variants that are generated by word-by- +word substitution. + +An example in English would be "color" and "colour". + +It is possible for some of these orthographic variants to be generated +by character variants. For example "airplane" in Chinese may be either +U+98DB U+6A5F *fei1 ji1* or U+98DE U+673A *fei1 ji1*. + +Other orthographic variants may not be generated by character variants. +For example, in Chinese, both U+767C *fa1* and U+9AEE *fa4* +are related to U+53D1 *fa1 or fa4* depending on the word. For hair, +U+5934 U+53D1 *tou2 fa4*, the variant should be U+982D U+9AEE +*tou2 fa4* but not U+982D U+767C *tou2 fa1*. + +c. Lexemic variants + +Lexemic variants refer to variants that can be generated when language +is considered, by word-by-word substitution. + +An example in English would be cab, taxi, or taxicab. + +An example in Chinese would be U+8CC7 U+8A0A *zi1 xun4* or +U+4FE1 U+606F *xin4 xi1*. 
+ +Note that there is no relationship between U+8CC7 and U+4FE1 or U+8A0A +and U+606F, i.e., the sequence U+8CC7 U+606F *zi1 xi1* does not +exist in Chinese. + +d. Contextual variants + +Contextual variants refer to variants that are generated by word-by- +word substitutions with context considered. + +In English, the word "plane" has different meanings and could be +replaced by different equivalent words (synonyms) such as +"airplane" or "plane" (as in a flat-surface or device for smoothing +wood) depending on context. And, of course, "plain", which is +pronounced the same way, and indistinguishable in speech-to-text +contexts such as computer input systems for the visually impaired, is a +different word entirely. + +Similarly, the word U+6587 U+4EF6 *wen2 jian4* could be either +document U+6587 U+4EF6 *wen2 jian4* or data file U+6A94 U+6848 +*dang3 an4* depending on context. + +Although domain names were designed to be identifiers without any +language context, users have not been prevented from using strings in +domain names and interpreting them as "words" or "names". It is likely +that users will do this with IDN as well. Therefore, given the added +complications of using a much broader range of characters, precautions +will be required when deploying IDN to minimize confusion and fraud. + +The intention of these guidelines is to provide advice about the +deployment of IDNs, with language consideration, but focusing only on +the category of character variants to increase the possibility of +successful resolution and reduced confusion while accepting inherent +DNS limitations. + +2. Definitions + +Unless otherwise stated, the definitions of the terms used in this +document are consistent with "Terminology Used in Internationalization +in the IETF" [I18NTERMS]. + +"FQDN" refers to a fully-qualified domain name and "domain name label" +refers to a label of a FQDN. + +RFC3066 [RFC3066] defines a system for coding and representing +languages. 
+ +ISO/IEC 10646 is a universal multiple-octet coded character set that is +a product of ISO/IEC JTC1/SC2/WG2, Work Item JTC1.02.18 (ISO/IEC 10646). +It is a multi-part standard: Part 1, published as ISO/IEC +10646-1:2000(E), covers the Architecture and Basic Multilingual Plane; Part +2, published as ISO/IEC 10646-2:2001(E), covers the supplementary +(additional) planes. + +The Unicode Consortium publishes "The Unicode Standard -- Version 3.0", +ISBN 0-201-61633-5. In March 2002, the Unicode Consortium published Unicode +Standard Annex #28. That annex defines Version 3.2 of The Unicode +Standard, which is fully synchronized with ISO/IEC 10646-1:2000 (with +Amendment 1). + +The term "Unicode character" is used here to refer to characters chosen +from The Unicode Standard Version 3.2 (and hence from ISO/IEC 10646). +In this document, the characters are identified by their positions (or +"code points"). The notation U+12AB, for example, indicates the +character at the position 12AB (hexadecimal) in the Unicode 3.2 table. + +Similarly, "Unicode string" refers to a string of Unicode characters. +The Unicode string is identified by the sequence of the Unicode +characters regardless of the encoding scheme. + +The term "IDN" is often used to refer to many different things: (a) an +abbreviation for "Internationalized Domain Name" (b) a fully-qualified +domain name that contains at least one label that contains characters +not appearing in ASCII (c) a label of a domain name that contains at +least one character beyond ASCII (d) a Unicode string to be processed +by Nameprep (e) an IDN Package (in the context of this document) (f) a +Nameprep processed string (g) a Nameprep and Punycode processed string +(h) the IETF IDN Working Group (i) ICANN IDN Committee (j) other IDN +activities in other companies/organizations etc. + +Because of the potential confusion, this document shall use the term +"IDN" as an abbreviation for "Internationalized Domain Name" only. 
+ +Also, because this document provides a guideline to be applied on a +per-zone basis, one label at a time, the term "Internationalized Domain +Name Label" or "IDL" will be used instead. + +In this document, the term "registration" refers to the process by +which a potential domain name holder requests that a label be placed in +the DNS, either as an individual name within a domain or as a sub- +domain delegation from another domain name holder. A successful +registration would then lead to the label or delegation records being +placed in the relevant zone file. The guidelines presented here are +recommended for all zones, at any hierarchy level, in which CJK +characters are to appear, not just domains at the first or second level. + +CJK characters are characters commonly used in the Chinese, Japanese or +Korean languages including but not limited to ASCII (U+0020 to U+007F), +Han Ideograph (U+3400 to U+9FAF and U+20000 to U+2A6DF), Bopomofo +(U+3100 to U+312F and U+31A0 to U+31BF), Kana (U+3040 to U+30FF), Jamo +(U+1100 to U+11FF and U+3130 to U+318F), Hangul (U+AC00 to U+D7AF and +U+3130 to U+318F) and their respective compatibility forms. + +3. Administrative Framework + +Zone administrators are responsible for the administration of the +domain name labels under their control. A zone administrator might be +responsible for a large zone such as a Top Level Domain (TLD), generic +or country code, or a smaller one such as a typical second or third +level domain. A large zone would often be more complex than a smaller +one (sometimes it is just larger). However, normally, actual technical +administrative tasks -- such as addition, deletion, delegation and +transfer of zones between domain name holders -- are similar for all +zones. + +At the same time, different zones may have different policies and +processes. For example, a pay-per-domain policy and registry/registrar +model for .COM may not be applicable to such domains as .SG or .IBM.COM. 
+The latter, for example, has very restricted policies about who is +permitted to have a domain name label under IBM.COM, the types of +strings that are permitted, and different procedures for obtaining those +strings. + +This document only provides guidelines for how CJK characters should be +handled within a zone, how language issues should be considered and +incorporated, and how domain name labels containing CJK characters +should be administered (including registration, deletion and transfer +of labels). It does not provide any guidance for handling of non-CJK +characters or languages in zones. + +Other IDN policies, such as the creation of new TLDs or the cost structure +for registrations, are outside the scope of this document. Such +discussions should be conducted in forums outside the IETF as well. + +Technical implementation issues are not discussed here either. For +example, the decision as to whether the various guidelines should be +implemented as registry or registrar actions is left to zone +administrators, possibly differing from zone to zone. + +3.1. Principles underlying these Guidelines + +In many places, this document assumes "First-Come-First-Serve" +(FCFS) as a conflict policy in the event of a dispute although FCFS is +not listed as one of the principles. If other policies dominate +priorities and "rights", one can use these guidelines by replacing uses +of FCFS in this document by appropriate other policy rules specific to +the zone. In other cases, some of these guidelines may not be +applicable, although some alternatives for determining rights to labels +-- such as use of UDRP or mutual exclusion -- might have little impact +on other aspects of these guidelines. + +(a) Each IDL to be registered should be associated with one or more +languages. 
+ +Although some Unicode strings may be pure identifiers made up of an +assortment of characters from many languages and scripts, IDLs are +likely to be names or phrases that have certain meaning in some +language. While a zone administration might or might not require +"meaning" as a registration criterion, the possibility of meaning +provides a useful tool when trying to avoid user confusion. + +Zone administrators should administratively associate one or more +languages with each IDL. These associations should either be pre- +determined by the zone administrator and applied to the entire zone or +chosen by the registrants on a per-IDL basis. The latter may be +necessary for some zones, but will make administration more difficult +and will increase the likelihood of conflicts in variant forms. + +A given zone might have multiple languages associated with it, or have +no language specified at all, but doing so may provide additional +opportunities for user confusion, and is therefore not recommended. + +The zone administrator must also verify the validity of the IDL +requested by using information associated with the chosen language and +possibly other rules as appropriate. + +(b) When an IDL is registered, all of the character variants for the +associated language(s) should be reserved for the registrant. Each +language associated with the IDL will lead to different character +variants. + +IDL reservations of the type described here normally do not appear in +the distributed DNS zone file. In other words, these reserved IDLs do +not resolve. Domain name holders could request these reserved IDLs to +be placed in the zone file and made active and resolvable as, e.g., +aliases or synonyms. + +Since different languages may imply different sets of variants, the +IDLs reserved for one IDL may overlap those reserved for another. 
In +this case, the reserved IDLs should be bound to one registration or the +other, or excluded from both, according to the applicable registration +or dispute resolution policy for the zone. + +(c) For a given base language, the IDL may have one or more recommended +variants that should be suggested to the domain name holder for active +registration as synonyms. + +Some language rules may prefer certain variants over others. To +increase the likelihood of correct and predictable resolution of the +IDL by end-users, the recommended variants should be active. + +(d) The IDL and its reserved variants with the language(s) association +must be atomic. + +The IDL and its reserved variants for the associated language(s) are to +be considered as a single unit -- an "IDL Package". For a given IDL, +that IDL package is defined by these guidelines and created upon +registration. + +The IDL Package is atomic: Transfer and deletion of IDL are performed +on the IDL Package as a whole. IDL, either active or reserved, within +the IDL Package must not be transferred or deleted individually. I.e., +any re-registration, transfers, or other actions that impact the IDL +should also impact the reserved variants. Separate registration or +other actions for the variants are not possible if these guidelines are +to accomplish their purpose. + +Conflict policy of the zone may result in violation of the IDL Package +atomicity. In such case, the conflict policy would take precedence. + +3.2. Registration of IDL + +Conforming to the principles described in 3.1, the registration of an +IDL would require at least two components, i.e., the character variant +tables for the language and the registration algorithm. + +3.2.1. Language character variant table + +Any lines starting with, or portions of lines after, the hash +symbol("#") are treated as comments. 
Comments have no significance in +the processing of the tables, nor are there any syntax requirements +between the hash symbol and the end of the line. Blank lines in the +tables are ignored completely. + +Every language should have a character variant table provided by a +relevant group (or organization or other body) and based on established +standards. The group that defines a particular character variant table +should document references to the appropriate standards at the beginning of +the table, tagged with the word "Reference" followed by an integer (the +reference number) followed by the description of the reference. For +example, + +Reference 1 CP936 (commonly known as GBK) +Reference 2 zVariant, zTradVariant, zSimpVariant in Unihan.txt +Reference 3 List of Simplified character Table (Simplified column) +Reference 4 zSimpVariant in Unihan.txt +Reference 5 variant that exists in GB2312, common simplified hanzi + +Each language character variant table must have a version number. This +is tagged with the word "Version" followed by an integer then followed +by the date in the format YYYYMMDD, where YYYY is the 4 digit Year, MM +is the 2 digit Month and DD is the 2 digit Day of the publication date +of the table. For example, + +Version 1 20020701 # July 2002 Version 1 + +The table has three fields, separated by semicolons. The fields are: +"valid code point"; "recommended variant(s)"; and "character +variant(s)". + +Only code points listed in the "valid code point" field are allowed to +be registered as part of an IDL associated with that language. + +There can be one or more "recommended variant(s)" (i.e., entries in the +"recommended variant(s)" column). If the "recommended variant(s)" +column is empty, then there is no corresponding variant. + +The "character variant(s)" column contains all variants of the code +point, including but not limited to the code point itself and the +"recommended variant(s)". 
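The line-level layout described so far (comments, "Reference" lines, the "Version" line, and three semicolon-separated fields) can be read with a few lines of Python. This is only a minimal sketch under stated assumptions: the names strip_comment, read_table and SAMPLE are hypothetical, the sample lines are copied from the example tables in Section 4, and the three fields are kept as raw strings without interpreting their contents.

```python
from datetime import datetime

# Sample lines copied from the example tables in Section 4 of this document.
SAMPLE = """\
Reference 1 CP936 (commonly known as GBK)
Reference 2 zVariant, zTradVariant, zSimpVariant in Unihan.txt
Version 1 20020701 # July 2002 Version 1

56E2(1);56E2(5);5718(2) # sphere, ball, circle; mass, lump
60F3(1);60F3(5); # think, speculate, plan, consider
"""


def strip_comment(line):
    """Drop everything from the first '#' onward and trim whitespace."""
    return line.split("#", 1)[0].strip()


def read_table(text):
    """Split a variant table into references, version info and raw entries."""
    references = {}  # reference number -> description
    version = None   # (version number, publication date)
    entries = []     # (valid, recommended, variants) as raw strings
    for raw in text.splitlines():
        line = strip_comment(raw)
        if not line:
            continue  # blank and comment-only lines are ignored
        if line.startswith("Reference "):
            _, number, description = line.split(" ", 2)
            references[int(number)] = description
        elif line.startswith("Version "):
            _, number, stamp = line.split()
            published = datetime.strptime(stamp, "%Y%m%d").date()
            version = (int(number), published)
        else:
            # Exactly three fields are expected; anything else raises ValueError.
            valid, recommended, variants = (f.strip() for f in line.split(";"))
            entries.append((valid, recommended, variants))
    return references, version, entries


if __name__ == "__main__":
    refs, version, entries = read_table(SAMPLE)
    print(refs, version, entries, sep="\n")
```

Comments are removed before the fields are split because a comment may itself contain semicolons, as the first sample entry shows.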
+ +If the variant is composed of a sequence of code points, then the sequence +of code points is listed, separated by spaces, in the "recommended +variant(s)" or "character variant(s)". + +If there are multiple variants, each variant must be separated by a +comma in the "recommended variant(s)" or "character variant(s)". + +Any code point listed in the "recommended variant(s)" column must be +allowed, by the rules for the relevant language, to be registered. +However, this is not a requirement for the entries in the "character +variant(s)" column; it is possible that some of those entries may not +be allowed to be registered. + +Every code point in the table should have a corresponding reference +number (associated with the references) specified to justify the entry. +The reference number is placed in parentheses after the code point. If +there is more than one reference, then the numbers are placed within a +single set of parentheses and separated by commas. + +3.2.2. Formal syntax + +This section uses the IETF "ABNF" metalanguage [ABNF]. + +LanguageCharacterVariantTable = 1*ReferenceLine VersionLine 1*EntryLine +ReferenceLine = "Reference" SP RefNo SP RefDescription [ Comment ] CRLF +RefNo = 1*DIGIT +RefDescription = *VCHAR +VersionLine = "Version" SP VersionNo SP VersionDate [ Comment ] CRLF +VersionNo = 1*DIGIT +VersionDate = YYYYMMDD +EntryLine = ( VariantEntry / Comment ) CRLF +VariantEntry = ValidCodePoint [ "(" RefList ")" ] ";" RecommendedVariant +";" CharacterVariant [ Comment ] +ValidCodePoint = CodePoint +RefList = RefNo *( "," RefNo ) +RecommendedVariant = CodePointSet *( "," CodePointSet ) +CharacterVariant = CodePointSet *( "," CodePointSet ) +CodePointSet = CodePoint *( SP CodePoint ) +CodePoint = 4*6HEXDIG +Comment = "#" *VCHAR + +YYYYMMDD is an integer representing a date where YYYY is the 4 digit +year, MM is the 2 digit month and DD is the 2 digit day. + +3.2.3. Registration Algorithm + +(An explanation of these steps follows them) + +1. 
IN <= IDL to be registered and + {L} <= Set of languages associated with IN +2. {V} <= Set of version numbers of the language character + variant tables derived from {L} +3. NP(IN) <= Nameprep processed IN and + check availability of NP(IN). + If not available, route to conflict policy. +4. For each AL in {L} +4.1. Check validity of NP(IN) in AL. If failed, stop processing. +4.2. PV(IN,AL) <= Set of available Nameprep processed recommended + variants of NP(IN) in AL +4.3. RV(IN,AL) <= Set of available Nameprep processed character + variants of NP(IN) in AL +4.4. End of Loop +5. {PV} <= Set of all PV(IN,AL) with optional processing. +6. {ZV} <= {PV} set-union NP(IN) +7. {RV} <= Set of all RV(IN,AL) set-minus {ZV} +8. Create IDL Package for IN using IN, {L}, {V}, {ZV} and {RV} +9. Put {ZV} into zone file + +Explanation + +Step 1 takes the IDL to be registered and the associated language(s) as +input to the process. + +Step 2 extracts the set of version numbers of the associated language(s) +tables. + +Step 3 processes the IDL with Nameprep. If the Nameprep processed IDL is +already registered or reserved, then the conflict policy is applied +here. For example, if FCFS is used, the registration process would stop +here. + +Step 4 goes through all languages associated with the proposed IDL, +checks for validity in each language, and generates the recommended +variants and the reserved variants. + +In step 4.1, IDL validation is done by checking that every code point +in the Nameprep processed IDL is a code point allowed by the "valid +code point" column of the character variant table for the language. If +one or more code points are invalid, the registration process must stop +here. + +Step 4.2 generates the list of recommended variants of the IDL by doing +a combination of all possible variants listed in "recommended variant(s)" +column for each code point in the Nameprep processed IDL. Generated +variants must be processed with Nameprep. 
If any of the recommended +variants of the IDL is registered or reserved, then the conflict policy +will be applied although this does not prevent the IDL from being +registered. For example, if FCFS is used, then the conflicting +variant(s) will be removed from the list. + +Step 4.3 generates the list of reserved variants by doing a combination +of all the possible variants listed in "character variant(s)" column +for each code point in the Nameprep processed IDL. Generated variants +must be Nameprep processed. If any of the variants are registered or +reserved, then the conflict policy will apply here although this does +not prevent the IDL from being registered. For example, if FCFS is +used, then the conflicting variants will be removed from the list. + +The "combination" in Step 4.2 and Step 4.3 could be achieved by a recursive +function similar to the following pseudo code: + +Function Combination(Str) + F <= first code point of Str + SStr <= Substring of Str, without the first code point + NSC <= {} + + If SStr is empty Then + For each V in (Variants of code point F) + NSC <= NSC set-union (the string with the code point V) + End of Loop + Else + SubCom <= Combination(SStr) + For each V in (Variants of code point F) + For each SC in SubCom + NSC <= NSC set-union (the string with the + first code point V followed by the string SC) + End of Loop + End of Loop + Endif + + Return NSC + + +Step 5 generates the list of all recommended variants for all languages. +Optionally, the algorithm may reduce the list of recommended variants +by prompting the user to select the recommended variants. + +Step 6 generates the list of variants, including the Nameprep processed +IDL, which are to be activated and Step 7 generates the list of reserved +variants. + +Then an "IDL Package" for IDL is created in Step 8 with the original +IDL, the associated language(s), the list of activated IDLs and the +list of variants. 
The version numbers of the language character +variants tables are also stored in the IDL Package. + +Lastly, the activated IDLs are converted using ToASCII [IDNA] with +UseSTD13ASCIIRules on and then put into the zone file. If the IDL is a +subdomain name, it will be delegated. The activated IDLs may be +delegated to a different domain name server so long as it is owned by the +same domain name holder. + +3.3. Deletion and Transfer of IDL and IDL Package + +In normal domain administration, every domain name label is independent +of all other domain name labels. Registration, deletion and transfer +of domain name labels is done on a per domain name label basis. +Depending on the zone's administrative policies, aliases (e.g., "CNAME" +entries) may be bound to particular labels with rules about whether one +can be changed without the other. Current policies in gTLDs generally +prohibit registration of such aliases, in part to avoid needing to form +and enforce policies about these change (or binding) rules. + +However, with internationalization, each IDL is bound to a list of +variant IDLs (with the list depending on the associated language), +bound together in an IDL Package. + +Because all variants of the IDL should belong to a single domain name +holder, the IDL Package should be treated as a single entity. +Individual IDL, either active or reserved, within the IDL Package must +not be deleted or transferred independently of the other IDLs. +Specifically, if an IDL is to be deleted or transferred, that action +must be taken only as part of an action that affects the entire IDL +Package. + +If the local conflict policy requires IDL to be transferred and deleted +independently of the IDL Package, the conflict policy would take +precedence. In such an event, the conflict policy should be associated +with a transfer or delete procedure taking IDL Package into +consideration. + +When an IDL Package is deleted, all the active and reserved variants +would be available again. 
IDL Package deletion does not change any +other IDL Packages, including IDL Packages that have variants that +conflict with the variants in the deleted IDL Package. This is to be +consistent with the atomicity and predictability of the IDL Package. + +3.4. Activation and De-activation of IDL variants + +As there are active IDLs and inactive IDLs within an IDL Package, +processes are required to activate or de-activate IDL variants in an +IDL Package. + +The activation algorithm is described below: + +1. IN <= IDL to be activated & PA <= IDL Package +2. NP(IN) <= Nameprep processed IN +3. If NP(IN) not in {RV} then stop +4. {RV} <= {RV} set-minus NP(IN) and {ZV} <= {ZV} set-union NP(IN) +5. Put {ZV} into the zone file + +Similarly, the deactivation algorithm: +1. IN <= IDL to be deactivated & PA <= IDL Package +2. NP(IN) <= Nameprep processed IN +3. If NP(IN) not in {ZV} then stop +4. {RV} <= {RV} set-union NP(IN) and {ZV} <= {ZV} set-minus NP(IN) +5. Put {ZV} into the zone file + +3.5. Adding/Deleting language(s) association + +The list of variants is generated from the IDL and tables for the +associated languages. If the language associations are changed, then +the lists of variants have to be updated. On the other hand, the IDL +Package is atomic and the list of variants must not be changed after +creation. + +Therefore, this document recommends deleting the IDL Package followed +by a registration with the new set of languages rather than attempting +to add or delete language(s) association within the IDL Package. Zone +administrators may find it desirable to devise procedures to prevent +other parties from capturing the labels in the IDL Package during these +operations. + +3.6. Versioning of the language character variant tables + +Language character variants tables are subject to change over time +and the changes may or may not be backward compatible. 
It is possible +that different version of the language character variants tables may +produce a different set of recommended variants and reserved variants. + +New IDL Packages should use the latest version of the language +character variants tables. + +Existing IDL Packages created using previous version of language +character variants tables are not affected when there a new version of +the character variants table is released. + +4. Example of Guideline Adoption + +To provide a meaningful example, some language character variant tables +have to be defined. Assume, then, that the following four language +character variants tables are defined (note that these tables are not a +representation of the actual table and they do not contain sufficient +entries to be used in any actual implementation): + +a) language character variants tables for zh-cn and zh-sg + +Reference 1 CP936 (commonly known as GBK) +Reference 2 zVariant, zTradVariant, zSimpVariant in Unihan.txt +Reference 3 List of Simplified character Table (Simplified column) +Reference 4 zSimpVariant in Unihan.txt +Reference 5 variant that exists in GB2312, common simplified hanzi + +Version 1 20020701 # July 2002 + +56E2(1);56E2(5);5718(2) # sphere, ball, circle; mass, lump +5718(1);56E2(4);56E2(2),56E3(2) # sphere, ball, circle; mass, lump +60F3(1);60F3(5); # think, speculate, plan, consider +654E(1);6559(5);6559(2) # teach +6559(1);6559(5);654E(2) # teach, class +6DF8(1);6E05(5);6E05(2) # clear +6E05(1);6E05(5);6DF8(2) # clear, pure, clean; peaceful +771E(1);771F(5);771F(2) # real, actual, true, genuine +771F(1);771F(5);771E(2) # real, actual, true, genuine +8054(1);8054(3);806F(2) # connect, join; associate, ally +806F(1);8054(3);8054(2),8068(2) # connect, join; associate, ally +96C6(1);96C6(5); # assemble, collect together + + +b) language variants table for zh-tw + +Reference 1 CP950 (commonly known as BIG5) +Reference 2 zVariant, zTradVariant, zSimpVariant in Unihan.txt +Reference 3 List of Simplified 
Character Table (Traditional column) +Reference 4 zTradVariant in Unihan.txt + +Version 1 20020701 # July 2002 + +5718(1);5718(4);56E2(2),56E3(2) # sphere, ball, circle; mass, lump +60F3(1);60F3(1); # think, speculate, plan, consider +6559(1);6559(1);654E(2) # teach, class +6E05(1);6E05(1);6DF8(2) # clear, pure, clean; peaceful +771F(1);771F(1);771E(2) # real, actual, true, genuine +806F(1);806F(3);8054(2),8068(2) # connect, join; associate, ally +96C6(1);96C6(1); # assemble, collect together + +c) language variants table for ja + +Reference 1 CP932 (commonly known as Shift-JIS) +Reference 2 zVariant in Unihan.txt +Reference 3 variant that exists in JIS X0208, commonly used Kanji + +Version 1 20020701 # July 2002 + +5718(1);5718(3);56E3(2) # sphere, ball, circle; mass, lump +60F3(1);60F3(3); # think, speculate, plan, consider +654E(1);6559(3);6559(2) # teach +6559(1);6559(3);654E(2) # teach, class +6DF8(1);6E05(3);6E05(2) # clear +6E05(1);6E05(3);6DF8(2) # clear, pure, clean; peaceful +771E(1);771E(1);771F(2) # real, actual, true, genuine +771F(1);771F(1);771E(2) # real, actual, true, genuine +806F(1);806F(1);8068(2) # connect, join; associate, ally +96C6(1);96C6(3); # assemble, collect together + +d) language variants table for ko + +Reference 1 CP949 (commonly known as EUC-KR) +Reference 2 zVariant in Unihan.txt + +Version 1 20020701 # July 2002 + +5718(1);56E2(1);56E3(2) # sphere, ball, circle; mass, lump +60F3(1);60F3(1); # think, speculate, plan, consider +654E(1);6559(1);6559(2) # teach +6DF8(1);6E05(1);6E05(2) # clear +771E(1);771F(1);771F(2) # real, actual, true, genuine +806F(1);8054(1);8068(2) # connect, join; associate, ally +96C6(1);96C6(1); # assemble, collect together + +Example 1: IDL = (U+6E05 U+771F U+6559) *qing2 zhen1 jiao4* + {L} = {zh-cn, zh-sg, zh-tw} + +NP(IN) = (U+6E05 U+771F U+6559) +PV(IN,zh-cn) = (U+6E05 U+771F U+6559) +PV(IN,zh-sg) = (U+6E05 U+771F U+6559) +PV(IN,zh-tw) = (U+6E05 U+771F U+6559) +{ZV} = {(U+6E05 U+771F U+6559)} +{RV} = 
{(U+6E05 U+771E U+6559), + (U+6E05 U+771E U+654E), + (U+6E05 U+771F U+654E), + (U+6DF8 U+771E U+6559), + (U+6DF8 U+771E U+654E), + (U+6DF8 U+771F U+6559), + (U+6DF8 U+771F U+654E)} + +Example 2: IDL = (U+6E05 U+771F U+6559) *qing2 zhen1 jiao4* + {L} = {ja} + +NP(IN) = (U+6E05 U+771F U+6559) +PV(IN,ja) = (U+6E05 U+771F U+6559) +{ZV} = {(U+6E05 U+771F U+6559)} +{RV} = {(U+6E05 U+771E U+6559), + (U+6E05 U+771E U+654E), + (U+6E05 U+771F U+654E), + (U+6DF8 U+771E U+6559), + (U+6DF8 U+771E U+654E), + (U+6DF8 U+771F U+6559), + (U+6DF8 U+771F U+654E)} + +Example 3: IDL = (U+6E05 U+771F U+6559) *qing2 zhen1 jiao4* + {L} = {zh-cn, zh-sg, zh-tw, ja, ko} + +NP(IN) = (U+6E05 U+771F U+6559) *qing2 zhen1 jiao4* +Invalid registration because U+6E05 is invalid in L = ko + +Example 4: IDL = (U+806F U+60F3 U+96C6 U+5718) + *lian2 xiang3 ji2 tuan2* + {L} = {zh-cn, zh-sg, zh-tw} + +NP(IN) = (U+806F U+60F3 U+96C6 U+5718) +PV(IN,zh-cn) = (U+8054 U+60F3 U+96C6 U+56E2) +PV(IN,zh-sg) = (U+8054 U+60F3 U+96C6 U+56E2) +PV(IN,zh-tw) = (U+806F U+60F3 U+96C6 U+5718) +{ZV} = {(U+8054 U+60F3 U+96C6 U+56E2), + (U+806F U+60F3 U+96C6 U+5718)} +{RV} = {(U+8054 U+60F3 U+96C6 U+56E3), + (U+8054 U+60F3 U+96C6 U+5718), + (U+806F U+60F3 U+96C6 U+56E2), + (U+806F U+60F3 U+96C6 U+56E3), + (U+8068 U+60F3 U+96C6 U+56E2), + (U+8068 U+60F3 U+96C6 U+56E3), + (U+8068 U+60F3 U+96C6 U+5718)} + +Example 5: IDL = (U+8054 U+60F3 U+96C6 U+56E2) + *lian2 xiang3 ji2 tuan2* + {L} = {zh-cn, zh-sg} + +NP(IN) = (U+8054 U+60F3 U+96C6 U+56E2) +PV(IN,zh-cn) = (U+8054 U+60F3 U+96C6 U+56E2) +PV(IN,zh-sg) = (U+8054 U+60F3 U+96C6 U+56E2) +{ZV} = {(U+8054 U+60F3 U+96C6 U+56E2)} +{RV} = {(U+8054 U+60F3 U+96C6 U+56E3), + (U+8054 U+60F3 U+96C6 U+5718), + (U+806F U+60F3 U+96C6 U+56E2), + (U+806F U+60F3 U+96C6 U+56E3), + (U+806F U+60F3 U+96C6 U+5718), + (U+8068 U+60F3 U+96C6 U+56E2), + (U+8068 U+60F3 U+96C6 U+56E3), + (U+8068 U+60F3 U+96C6 U+5718)} + +Example 6: IDL = (U+8054 U+60F3 U+96C6 U+56E2) + *lian2 xiang3 ji2 tuan2* + {L} = {zh-cn,
zh-sg, zh-tw} + +NP(IN) = (U+8054 U+60F3 U+96C6 U+56E2) +Invalid registration because U+8054 is invalid in L = zh-tw + +Example 7: IDL = (U+806F U+60F3 U+96C6 U+5718) + *lian2 xiang3 ji2 tuan2* + {L} = {ja,ko} + +NP(IN) = (U+806F U+60F3 U+96C6 U+5718) +PV(IN,ja) = (U+806F U+60F3 U+96C6 U+5718) +PV(IN,ko) = (U+806F U+60F3 U+96C6 U+5718) +{ZV} = {(U+806F U+60F3 U+96C6 U+5718)} +{RV} = {(U+806F U+60F3 U+96C6 U+56E3), + (U+8068 U+60F3 U+96C6 U+5718), + (U+8068 U+60F3 U+96C6 U+56E3)} + +i. Notes + +1. The terms "i18n" and "l10n", sometimes used in upper-case form (i.e., +"I18N" and "L10N"), have become popular in international standards +usage as abbreviations for "internationalization" and "localization", +respectively. The abbreviations were derived by using the first and +last letters of the words, with the number of characters that appear +between them. I.e., in "internationalization", there are 18 characters +between the initial "i" and the terminal "n". + +2. Every human language is unique and therefore, every linguistic and +localization issue is also unique. It is difficult or impossible to +make comparisons across multiple languages or to classify them into +categories. And any cross-language analogies are, by their very nature, +imperfect at best. + +For example, to classify Traditional Chinese/Simplified Chinese as +upper/lower case makes as much sense as to classify TC/SC as "spelling +variant" like "color" and "colour". Both comparisons are potentially +useful but neither is completely correct. + +3. The variants in CJK are very complex and require many different +layers of solution. This guideline is one of the solution components, +but is not sufficient, by itself, to solve the whole problem. + +ii. Acknowledgements + +The authors gratefully acknowledge the contributions of: + +V.CHEN, N.HSU, H.HOTTA, S.TASHIRO, Y.YONEYA and other Joint Engineering +Team members at the JET meeting in Bangkok.
+ +Yves Arrouye, an observer at the JET meeting, for his contribution on +the IDL Package. + +Soobok LEE +L.M TSENG +Patrik FALTSTROM +Paul HOFFMAN +Erin CHEN +LEE Xiaodong +Harald ALVESTRAND + +iii. Author(s) + +James SENG +PSB Certification +3 Science Park Drive +#03-12 PSB Annex +Singapore 118233 +Phone: +65 6885-1657 +Email: jseng@pobox.org.sg + +Kazunori KONISHI +JPNIC +Kokusai-Kougyou-Kanda Bldg 6F +2-3-4 Uchi-Kanda, Chiyoda-ku +Tokyo 101-0047 +JAPAN +Phone: +81 49-278-7313 +Email: konishi@jp.apan.net + +Kenny HUANG +TWNIC +3F, 16, Kang Hwa Street, Taipei +Taiwan +TEL : 886-2-2658-6510 +Email: huangk@alum.sinica.edu + +QIAN Hualin +CNNIC +No.6 Branch-box of No.349 Mailbox, Beijing 100080 +Peoples Republic of China +Email: Hlqian@cnnic.net.cn + +KO YangWoo +PeaceNet +Yangchun P.O. Box 81 Seoul 158-600 +Korea +Email: newcat@peacenet.or.kr + +John C KLENSIN +1770 Massachusetts Ave, No. 322 +Cambridge, MA 02140 +USA +Email: Klensin+ietf@jck.com + +iv. Appendix A + +[How to read the Han Ideograph provided in this document. -- Will +complete this section in next revision] + +v. Normative References + +[ABNF] Augmented BNF for Syntax Specifications: ABNF, RFC 2234, D. + Crocker and P. Overell, Eds., November 1997. + +[I18NTERMS] Terminology Used in Internationalization in the IETF, + draft-hoffman-i18n-terms-07.txt, September 2002, + Paul Hoffman, work in progress + +[RFC3066] Tags for the Identification of Languages, RFC3066, + Jan 2001, H. Alvestrand + +[IDNA] Internationalizing Domain Names in Applications, + draft-ietf-idn-idna, Feb 2002, Patrik Faltstrom, + Paul Hoffman, Adam M. Costella, work in progress + +[PUNYCODE] Punycode: An encoding of Unicode for use with IDNA, + draft-ietf-idn-punycode, Feb 2002, Adam M. 
Costello, + work in progress + +[STRINGPREP] Preparation of Internationalized Strings, + draft-hoffman-stringprep, Feb 2002, Paul Hoffman, + Marc Blanchet, work in progress + +[NAMEPREP] Nameprep: A Stringprep Profile for Internationalized + Domain Names, draft-ietf-idn-nameprep, + Feb 2002, Paul Hoffman, Marc Blanchet, work in progress + +[UNIHAN] Unicode Han Database, Unicode Consortium + ftp://ftp.unicode.org/Public/UNIDATA/Unihan.txt + +[UNICODE] The Unicode Consortium, "The Unicode Standard -- Version + 3.0", ISBN 0-201-61633-5. Unicode Standard Annex #28, + (http://www.unicode.org/unicode/reports/tr28/) defines + Version 3.2 of The Unicode Standard. + +[ISO7098] ISO 7098:1991 Information and documentation -- Romanization + of Chinese, ISO/TC46/SC2. + +vi. Non-normative References + +[IDN-WG] IETF Internationalized Domain Names Working Group, + idn@ops.ietf.org, James Seng, Marc Blanchet. + http://www.i-d-n.net/ + +[STD13] Paul Mockapetris, "Domain names - concepts and facilities" + (RFC 1034) and "Domain names - implementation and + specification" (RFC 1035), STD 13, November 1987. + +[C2C] Pitfalls and Complexities of Chinese to Chinese Conversion, + http://www.cjk.org/cjk/c2c/c2c.pdf, Jack Halpern, Jouni + Kerman + +vii. Other Issues + +It is possible that many of the variants generated may have no meaning +in the associated language or languages. The intention is not to +generate meaningful "words" but to generate similar variants to be +reserved. + +The language character variants tables are critical to the success of +the guideline. A badly designed table may either generate too many +meaningless variants or may not generate enough meaningful variants. +The principles to be used to generate the tables are not within the +scope of this document, nor are the tables themselves. + +This document recommends against registration of IDL in a particular +language until the language character variants table for that language +is available.
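As a rough illustration of why the number of generated variants grows
quickly, the following Python sketch enumerates the variant labels for
Example 1 of section 4. It assumes, as a simplification of the
registration algorithm, that each code point may be kept or replaced by
any of its listed character variants. The names CHAR_VARIANTS and
all_variants are illustrative only and are not defined by this guideline.

```python
from itertools import product

# Character variants (third column) copied from the sample zh-cn/zh-sg
# table in section 4; only the code points used in Example 1 are listed.
CHAR_VARIANTS = {
    0x6E05: [0x6DF8],
    0x771F: [0x771E],
    0x6559: [0x654E],
}


def all_variants(label, table):
    """Return every label formed by keeping or substituting each code point.

    Assumption: a code point with no table entry has no variants.
    """
    choices = [[cp] + table.get(cp, []) for cp in label]
    return set(product(*choices))


idl = (0x6E05, 0x771F, 0x6559)   # the IDL registered in Example 1
zv = {idl}                       # active (zone) variants
rv = all_variants(idl, CHAR_VARIANTS) - zv   # reserved variants

for variant in sorted(rv):
    print(" ".join(f"U+{cp:04X}" for cp in variant))
```

For this three-character label the sketch produces the seven reserved
variants listed in Example 1. Each further character that has variants
multiplies the count, whether or not the resulting strings are
meaningful words.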
+ +Outstanding Issues + +(1) Erin suggested (if I (JcK) correctly understood her) that, if +multiple languages are associated with a given name, the recommended +variant list for a given code point be treated as the intersection of +the variant lists for each of the languages, not the union. As I +understand the current algorithm, it effectively takes the union. +Taking the intersection has the technical advantage that it would +significantly reduce the number of variant strings that must be +reserved. It also has the policy advantage of discouraging people +from registering with multiple languages if they don't need to - +otherwise, we will have everyone trying to register in all of the +possibly-relevant languages, which would make this effort a good deal +less effective than it might be. + +Taking the intersection is also consistent with a rule that appears to +exist now. As shown in Example 3, if an attempt is made to register a +name and associate it with multiple languages, it must be valid in all +of those languages or the registration attempt will fail. So we +intersect the validity criteria on a language basis, and should +probably intersect the variants. + +But that is an algorithm change, since we have to extract the variant +lists for each code point for each language, take the intersection, +and then process against that, rather than against each language in +turn. + +[JS - I disagree in taking the intersection of the set. No doubt by +doing intersection we will reduce the abuse of specifying multiple +language to increase the set of reserved variants, our goal is +precisely to reserve as much variants as possible for the domain name +holder, not vice versa. + +Suppose we have a string ABC with variants ABD ACD ABF in Chinese, ABE +ACD in Japanese and CBD ACD in Korean. + +Assuming a registrant register ABC in CJK, right now he will get the +reserved set of {ABC, ACD, ABF, ABE, CBD}. 
+ +On the other hand, if we do intersection, this set will be reduced to +{ACD}, leaving other variants like ABF, ABE and CBD open for potential +conflict. And the only way he can protect against this confusion is to +register ABF, ABE and CBD individually, something we are trying to +prevent.] + +[Further explanation by Erin: + +I'm sorry, maybe my previous suggestion was not clear enough. + +I mean that if multiple languages are associated with a given name, the +range of valid code points should be the intersection over all the +associated languages. + +But, if multiple languages are associated with a given name, the +recommended variants should be the union over those languages, and that +union is put into the zone file. Likewise, the character variants +should be the union over the associated languages.] + +(2) A note went by indicating that the plan was to drop the Han +characters from the IETF-submission version of this document. We can +post I-Ds in PDF and publish RFCs in PDF and/or Postscript, as long as +we provide ASCII. I find having the Han characters very useful, and +trust that those of you who can read them find them even more so. So +I would suggest that we hand off the pair of an ASCII document (with +the Han characters removed) and a PDF document (that looks like the +Word text we have been looking at) to the I-D editor. I've got full +Acrobat here and can presumably produce the thing if needed. + +(3) We still need to sort out the issue of whether reserving a +variant that may (in a current or future table) conflict with another +character, with the possibility of activating it, is an invitation to +cybersquatting and other abuses. That isn't clear, so let me try an +illustration: suppose we have a character X, with variants A, B, and C, +and a character Y, with variants D and C. Now, if Y is registered +first, then its package includes {Y*, D, C}, using the symbol "*" to +denote an active name. When X is registered, its package consists of +{X, A, B}.
X's owner can't reserve or activate C, since it was +reserved to Y. But much of the reason for doing all of this work was +the concern that C can be confused with either Y or X. So doesn't +this create an opportunity for Y to threaten, or extort money from, X +by threatening to activate C? + +[JS -- The conflict of X & Y over C in this case could be resolved by +existing conflict policy. The revised guideline now makes it possible +to modify the IDL Package in the event of dispute] + +That problem gets worse, I think, if Erin's suggestion in (1) is not +adopted. And I continue to believe that the only solution that will +work is to prevent anyone from activating C. Or, more generally, at +any given time, there will be a set of language variant tables that +will be considered valid by the administrator of a particular zone. +The zone administrator would take the union of all of those tables, +using the 'valid code point' as the key as usual, and then permanently +reserve any character that appeared more than once in a variant column. +Small matter of programming. + +(4) On page 9, in the paragraph starting with "The character +variant(s) column contains ..." + +This seems to be saying that the code points listed in the third +column will always be a proper superset of the union of the first and +second columns. If that is correct, it violates a fundamental +principle that I was taught about good programming and systems design +-- minimization of duplication of information, since such duplicates +are error-prone. And, if I have not interpreted the intent correctly, +the text needs to be fixed. Somehow. + +[JS -- correct, it is duplicated. The duplication is bad from a +system design view but it makes it 'complete' and easy to explain.]
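The reservation rule suggested under issue (3) might be sketched in
Python as follows. This is only one reading of that suggestion, not an
agreed algorithm: it assumes each table maps a valid code point to its
character variants, and the function name contested_variants is
hypothetical. The data is the X/Y illustration from issue (3).

```python
from collections import defaultdict


def contested_variants(tables):
    """Return variants that are listed under more than one valid code point.

    `tables` is a list of dicts mapping a valid code point to its list of
    character variants (an assumed, simplified table layout).
    """
    # Step 1: union of all tables, keyed by the valid code point.
    merged = defaultdict(set)
    for table in tables:
        for valid, variants in table.items():
            merged[valid] |= set(variants)

    # Step 2: record which valid code points list each variant.
    owners = defaultdict(set)
    for valid, variants in merged.items():
        for variant in variants:
            owners[variant].add(valid)

    # Step 3: a variant claimed by two or more keys is contested.
    return {v for v, keys in owners.items() if len(keys) > 1}


# X has variants A, B and C; Y has variants D and C.
tables = [{"X": ["A", "B", "C"]}, {"Y": ["D", "C"]}]
permanently_reserved = contested_variants(tables)
print(permanently_reserved)
```

Under this reading, only C would be permanently reserved, so neither the
holder of X nor the holder of Y could activate it.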
diff --git a/doc/draft/draft-klensin-idn-tld-00.txt b/doc/draft/draft-klensin-idn-tld-00.txt new file mode 100644 index 0000000000..cbe2e15b31 --- /dev/null +++ b/doc/draft/draft-klensin-idn-tld-00.txt @@ -0,0 +1,437 @@ +INTERNET-DRAFT John C Klensin +21 October 2002 +Expires April 2003 + + National and Local Characters in DNS TLD Names + draft-klensin-idn-tld-00.txt + +Status of this Memo + + This document is an Internet-Draft and is in full conformance + with all provisions of Section 10 of RFC2026 except that the + right to produce derivative works is not granted. + + Internet-Drafts are working documents of the Internet Engineering + Task Force (IETF), its areas, and its working groups. Note that + other groups may also distribute working documents as + Internet-Drafts. + + Internet-Drafts are draft documents valid for a maximum of six + months and may be updated, replaced, or obsoleted by other + documents at any time. It is inappropriate to use Internet- + Drafts as reference material or to cite them other than as + "work in progress." + + The list of current Internet-Drafts can be accessed at + http://www.ietf.org/ietf/1id-abstracts.txt + + The list of Internet-Draft Shadow Directories can be accessed at + http://www.ietf.org/shadow.html. + + +Abstract + + In the context of work on internationalizing the Domain Name System + (DNS), there have been extensive discussions about "multilingual" or + "internationalized" top level domain names (TLDs), especially for + countries whose predominant language is not written in a Roman-based + script. This document reviews some of the motivations for such + domains and the constraints that the DNS imposes.
It then suggests + an alternative, local translation, that may solve a superset of the + problem while avoiding protocol changes, serious deployment delays, + and other difficulties. + +Table of Contents + +1 Introduction +1.1 Background on the "Multilingual Name" Problem +1.2 Domain Name System Constraints +1.3 Internationalization and Localization +2. Client-side solutions +2.1 IDNA and the client +2.2 Local translation tables for TLD names +3. Advantages and disadvantages of local translation +3.1 Every TLD in the local language and character set + +3.2 Unification of country code domains +3.3 User understanding of local and global reference +3.4 Limits on TLD propagation +4. Security Considerations +5. References +6. Acknowledgements +7. Author's Address + + +1. Introduction + +1.1 Background on the "Multilingual Name" Problem + +People who share a language prefer to communicate in it, using whatever +characters are normally used to write that language, rather than in some +"foreign" one. There have been standards for using mutually-agreed +characters and languages in electronic mail message bodies and selected +headers since the introduction of MIME in 1992 [MIME] and the Web has +permitted multilingual text since its inception. However, since domain +names are exposed to users in email addresses and URLs, and +corresponding arrangements in other protocols, demand rapidly arose to +permit domain names in applications that used characters other than +those of the very restrictive, ASCII-subset, "LDH" conventions [LDH]. +The effort to do this rapidly became known as "multilingual domain +names", although that is a misnomer, since the DNS deals only with +characters and identifier strings, and not, except by accident, what +people usually think of as "names". 
And there has been little +interest in what would actually be a "multilingual name" -- i.e., a name +that contains components from more than one language -- but only the use +of strings conforming to different languages in the context of the DNS. + +1.1.1 Approaches to the requirement + +If the requirement is seen, not as "modifying the DNS", but as +"providing users with access to the DNS from a variety of languages and +character sets", three sets of proposals have emerged in the IETF and +elsewhere. They are: + + (1) Perform processing in client software that recodes a user-visible + string into an ASCII-compatible form that can safely be passed + through the DNS protocols and stored in the DNS. This is the + approach used, for example, in the IETF's "IDNA" protocol [IDNA]. + + (2) Modify the DNS to be more hospitable to non-ASCII names and + strings. There have been a variety of proposals to do this in almost + as many ways, some of which have been implemented on a proprietary + basis by various vendors. None of them have gained acceptance in the + IETF community, primarily because they would take a long time to + deploy and would leave many problems unsolved. + + (3) Move the problem out of the DNS entirely, relying instead on a + "directory" or "presentation" layer to handle internationalization. + The rationale for this approach is discussed in [DNSROLE]. + +This document proposes a fourth approach, applicable to the top level +domains (TLDs) only (see section 1.2.1 for a discussion of the special +issues that make TLDs problematic). That approach could be used as an +alternative or supplement to the strategies summarized above. + + +1.1.2 Writing the name of one's country in its own characters + +An early focus of the "multilingual domain name" efforts was expressed +in statements such as "users in my country, in which ASCII is rarely +used, should be able to write an entire domain name in their own +character set."
In particular, since all top-level domain names, at +present, follow the LDH rules, the somewhat more restrictive naming +rules discussed in [STD3], and the coding conventions specified in +[RFC1591], all fully-qualified DNS names were effectively required to +contain at least one ASCII label (the TLD name), and that was considered +inappropriate. One should, instead, be able to write the name of the +ccTLD for China in Chinese, the name of the ccTLD for Saudi Arabia in +Arabic, and so on. + +1.1.3 Countries with multiple languages and countries with multiple + names + +From a user interface standpoint, writing ccTLD names in local +characters is a problem. As discussed in section 1.2.2, the DNS itself +does not easily permit a domain to be referred to by more than one name +(or spelling or translation of a name). Countries with more than one +official language would require that the country name be represented in +each of those languages. And, just as it is important that a user in +China be able to represent the name of the Chinese ccTLD in Chinese +characters, she should be able to access a Chinese-language site in +France using Chinese characters, requiring that she be able to write the +name of the French ccTLD in those characters rather than in a form based +on a Roman character set. + + +1.2 Domain Name System Constraints + +1.2.1 Administrative hierarchy + +The domain name system is designed around the idea of an "administrative +hierarchy", with the entity responsible for a given node of the +hierarchy responsible for policies applicable to its subhierarchies (Cf. +[STD13]). The model works quite well for the domain and subdomains of a +particular enterprise, where the hierarchy can be organized to match the +organizational structure, there are established ways to set policies and +there are, at least presumably, shared assumptions about overall goals +and objectives among all registrants in the domain.
It is more +problematic when a domain is shared by unrelated entities which lack +common policy assumptions. It is difficult to reach agreement on rules +that should apply to all of them. That situation always prevails for +the labels registered in a TLD (second-level names) except in those TLDs +for which the second level is structural (e.g., the .CO, .AC, .GOV +conventions in many ccTLDs) in which case, it exists for the labels +within that structural level. + +TLDs may, but need not, have consistent registration policies for those +second (or third) level names. Countries (or ccTLD administrators) have +often adopted rules about what entities may register in those ccTLDs, +and the forms the names may take. RFC 1591 outlined registration norms +for most of the gTLDs, even though those norms have been largely ignored +in recent years. And some recent "sponsored" domains are based on quite +specific rules about appropriate registrations. Homogeneous +registration rules for the root are, by contrast, impossible: almost by +definition, the subdomains registered in it are diverse and no single +policy applying to all root subdomains (TLDs) is feasible. + +1.2.2 Aliases + +In an environment different from the DNS, a rational way to permit +assigning local-language names to a country code (or other) domain would +be to set up an alias for the name, or to use some sort of "see instead" +reference. But the DNS does not have quite the right facilities for +either. Instead, it supports a "CNAME" record, whose label can refer +only to a particular label and not to a subtree. For example, if A.B.C +is a fully-qualified name, then a CNAME reference from X to A would make +X.B.C appear to have the same values as A.B.C. However, a CNAME +reference from Y to C would not make A.B.Y referenceable (or even +defined) at all. A second record type, DNAME [RFC2672], can provide an +alias for a portion of the tree.
But it is problematic technically, and +its use is strongly discouraged except for transition uses from one +domain to another. + + +1.3 Internationalization and Localization + +It has often been observed that while many people talk about +"internationalization" (a term we typically use for making something +globally accessible while incorporating a broad-range "universal" +character set and conventions appropriate to all languages), they often really +mean, and want, "localization" (making things work well in a particular +locality, or well, but potentially differently, for a broad range of +localities). Anything that actually involves the DNS must be global and +hence internationalized since the DNS cannot meaningfully support +different responses based, e.g., on the location of the user making a +query. While the DNS cannot support localization internally, many of +the features discussed earlier in this section are much more easily +thought about in local terms --whether localized to a geographical area, +users of a language, or using some other criteria -- than in global ones. + +2. Client-side solutions + +Traditionally, the IETF has avoided becoming involved in standardization +for actions that take place strictly on individual hosts on the network, +assuming that it should confine itself to behavior that is observable +"on the wire", i.e., in protocols between network hosts. Exceptions to +this general principle have been made when different clients were +required to utilize data or interpret values in compatible ways to +preserve interoperability: the standards for email and web body formats, +and IDNA itself, are examples of these exceptions. Regardless of what +is required to be standardized, it is almost never required, and often +unwise, that a user interface, by default, present on-the-wire formats +to the user. 
However, in most cases when the presentation format and +the wire format differ, the client program must take precautions that +the wire format can be reconstructed from user input, or to keep the +wire format, while hidden, bound to the presentation mechanism so that +it can be reconstructed. And, while it is rarely a goal in itself, it +is often necessary that the user be at least vaguely aware that the wire +("real") format is different from the presentation one and that the wire +format be available for debugging. + + +2.1 IDNA and the client + +As mentioned above, IDNA itself is entirely a client-side protocol. It +works by providing labels to the DNS in a special format (so-called +"ACE"). When labels in that format are encountered, they are +transformed, by the client, back into internationalized (normally +Unicode) characters. In the context of this document, the important +observation about IDNA is that any application program that supports it +is already doing considerable transformation work on the client; it is +not simply presenting the on-the-wire formats to the user. + + +2.2 Local translation tables for TLD names + +We suggest that, in addition to maintaining the code and tables required +to support IDNA, clients may want to maintain a table that contains a +list of TLDs and that maps between them and locally-desirable names. +For ccTLDs, these might be the names (or locally-standard abbreviations) +by which the relevant countries are known locally (whether in ASCII +characters or others). With some care on the part of the application +designer (e.g., to ensure that local forms do not conflict with the +actual TLD names), a particular TLD name input from the user could be +either in local or standard form without special tagging or problems.
+When DNS names are received by these client programs, the TLD labels +would be mapped to local form before IDNA is applied to the rest of the +name; when names are received from users, local TLD names would be +mapped to the global ones before being passed into IDNA or for other DNS +processing. + + +3. Advantages and disadvantages of local translation + +3.1 Every TLD in the local language and character set + +The notion of a top-level domain whose name matches, e.g., the name that +is used for a country in that country or the name of a language in that +language is, as mentioned above, immediately appealing. But most of the +reasons for it argue equally strongly for other TLDs being accessible +from that language. A user in Korea who can access the national ccTLD +in the Korean language and character set has every reason to expect that +both generic top level domains and domains associated with other +countries would be similarly accessible, especially if the second-level +domains bear Korean names. A user in Spain or Portugal, or in Latin +America, would presumably have similar expectations, but would expect to +use Spanish names, not Korean ones. + +That level of local optimization is not realistic --some would argue not +possible-- with the DNS since it would ultimately require that every top +level domain be replicated for each of the world's languages. That +replication process would involve not just the top level domain itself: +in principle, all of its subtrees would need to be completely replicated +as well (or at least all of the subtrees for which the language +associated with a given replicant was relevant). The administrative +hierarchy characteristics of the DNS (see section 1.2.1) turn the +replication process into an administrative nightmare: every +administrator of a second-level domain in the world would be forced to +maintain dozens, probably hundreds, of similar zone files for the +replicates of the domain.
Even if only the zones relevant to a +particular country or language were replicated, the administrative and +tracking problems to bind these to the appropriate top-level domain and +keep all of the replicas synchronized would be extremely difficult at +best. And many administrators of third- and fourth-level domains, and +beyond, would be faced with similar problems. + +By contrast, dealing with the names of TLDs as a localization problem, +using local translation, is fairly simple. Each function represented by +a TLD -- a country, generic registrations, or purpose-specific +registrations -- could be represented in the local language and +character set as needed. And, for countries with many languages, or +users living in, working in, or visiting countries where their language +was not dominant, "local" could be defined in terms of the needs or +wishes of each particular user. + +3.2 Unification of country code domains + +It follows from some of the comments above that, while there appears to +be some immediate appeal in having (at least) two domains for each +country, one using the ISO 3166-1 code and another one using a name +based on the national name in the national language, such a situation +would create considerable problems for registrants in the multiple +domains. For registrants maintaining enterprise or organizational +subdomains, ease of administration in a single family of zone files will +usually make a registration in a single top-level domain preferable to +replicated sets of them, at least as long as their functional +requirements (such as local-language access) are met by the unified +structure. + +Of course, having replicated domains might be popular with registries +and registrars, since replication would almost inevitably increase the +total number of domains to be registered.
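
The local translation model of sections 2.2 and 3.1 can be illustrated
with a minimal sketch. Everything named here is hypothetical: the class
LocalTldTable, its methods, and the English-locale entries are chosen
only for illustration and are not part of IDNA or of any existing
library. Local names are ASCII in this example, but nothing in the
approach requires that. The sketch maps only the rightmost label,
rejects local forms that collide with actual TLD names, and leaves
trailing-dot names and IDNA processing of the other labels out of scope.

```python
class LocalTldTable:
    """Hypothetical two-way map between global TLD labels and local names."""

    def __init__(self, global_to_local):
        self._to_local = {}
        self._to_global = {}
        known_tlds = {tld.lower() for tld in global_to_local}
        for tld, local in global_to_local.items():
            tld, local = tld.lower(), local.lower()
            # A local form must not collide with a different real TLD,
            # otherwise user input would be ambiguous.
            if local in known_tlds and local != tld:
                raise ValueError(f"local name {local!r} conflicts with a TLD")
            # Two TLDs must not share one local form either.
            if local in self._to_global:
                raise ValueError(f"local name {local!r} is used twice")
            self._to_local[tld] = local
            self._to_global[local] = tld

    @staticmethod
    def _map_last_label(name, table):
        # Split off the rightmost label and translate it if it is known;
        # unknown labels pass through unchanged.
        head, _, last = name.rpartition(".")
        last = table.get(last.lower(), last)
        return f"{head}.{last}" if head else last

    def to_global(self, name):
        """User input -> standard form, before IDNA or other DNS processing."""
        return self._map_last_label(name, self._to_global)

    def to_local(self, name):
        """Name received from the DNS -> local form for presentation."""
        return self._map_last_label(name, self._to_local)


# Illustrative entries for an assumed English-language locale.
table = LocalTldTable({"kr": "korea", "jp": "japan"})
print(table.to_global("example.korea"))
print(table.to_local("example.jp"))
```

Because a name already in standard form passes through to_global
unchanged, the user can type either form without special tagging, which
is the property section 2.2 asks the application designer to preserve.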
+ +3.3 User understanding of local and global references + +While the IDNA tables (actually Nameprep and Stringprep -- see the IDNA +specification) must be identical globally for IDNA to work reliably, the +tables for mapping between local names and TLD names could be locally +determined, and differ from one locale to another, as long as users +understood that international interchange of names required using the +standard forms. That understanding could be assisted by software. It +is likely that, at least for the foreseeable future, DNS names being +passed among users in different countries, or using different languages, +will be forced to be in ACE form to guarantee compatibility in any +event, so the marginal knowledge or effort needed to put TLD names into +standard form and transmit them that way would be very small. + +3.4 Limits on TLD propagation + +The concept of using local translation does have one side-effect, which +some portions of the Internet community might consider undesirable. +The size and complexity of translation tables, and the effort of +maintaining those tables, will be, to a considerable extent, a function +of the number of top-level domains, the frequency with which new domains +are added, and the number of domains that are added at a time. A +country or other locale that wished to maintain a full set of +translations (i.e., so that every TLD had a representation in the local +language) would presumably find setting up a table for the current +collection of a few hundred domains to be a task that would take some +days. If the number of TLDs were relatively stable, with a relatively +small number being added at infrequent intervals, the updates could +probably be dealt with on an ad hoc basis. But, if large numbers of +domains were added frequently, or if the total number of TLDs became +very large, maintaining the table might require dedicated staff.
Worse, updating the tables stored on +client machines might require update and synchronization protocols and +all of the related complexities. + +4. Security Considerations + +IDNA provides a client-based mechanism for presenting Unicode names in +applications while passing only ASCII-based names on the wire. As such, +it constitutes a major step along the path of introducing a client-based +presentation layer into the Internet. Client-based presentation layer +transformations introduce risks from variant tables that can change +meaning without external protection. For example, if a mapping table +normally maps A onto C and that table is altered by an attacker so that +A maps onto D instead, much mischief can be committed. On the other +hand, these are not the usual sort of network attacks: they may be +thought of as falling into the "users can always cause harm to +themselves" category. The local translation model outlined here does +not significantly increase the risks over those associated with IDNA, +but may provide some new avenues for exploiting them. + +Both this approach and IDNA rely on having updated programs present +information to the user in a very different form than the one in which +it is transmitted on the wire. Unless the internal (wire) form is +always used in interchange, there are possibilities for ambiguity and +confusion about references. + +5. References + +[DNSROLE] Klensin, J.C., "Role of the Domain Name System", work in + progress (draft-klensin-dns-role-04.txt). + +[IDNA] Faltstrom, P., P. Hoffman, A. M. Costello, "Internationalizing + Domain Names in Applications (IDNA)", work in progress + (draft-ietf-idn-idna-13.txt). + +[LDH] STD13 and comments + +[MIME] Borenstein, N. and N. Freed, "MIME (Multipurpose Internet Mail + Extensions): Mechanisms for Specifying and Describing the Format of + Internet Message Bodies", RFC 1341, June 1992. Updated and replaced + by Freed, N. and N.
Borenstein, "Multipurpose Internet Mail + Extensions (MIME) Part One: Format of Internet Message Bodies", + RFC 2045, November 1996. Also, Moore, K., "Representation of + Non-ASCII Text in Internet Message Headers", RFC 1342, June 1992. + Updated and replaced by Moore, K., "MIME (Multipurpose Internet + Mail Extensions) Part Three: Message Header Extensions for + Non-ASCII Text", RFC 2047, November 1996. + +[RFC1591] Postel, J., "Domain Name System Structure and Delegation", + RFC 1591, March 1994. + +[RFC2672] Crawford, M., "Non-Terminal DNS Name Redirection", RFC 2672, + August 1999. + +[STD3] Braden, R., Ed., "Requirements for Internet Hosts - Application and + Support", RFC 1123, October 1989. + +[STD13] Mockapetris, P.V., "Domain names - concepts and + facilities", RFC 1034, and "Domain names - implementation and + specification", RFC 1035, November 1987. + +6. Acknowledgements + +This document was inspired by a number of conversations in ICANN, IETF, +MINC, and private contexts about the future evolution and +internationalization of top level domains. Discussions within, and +about, the ICANN IDN Committee have been particularly helpful, although +several of the members of that committee may be surprised about where +those discussions led. + +7. Author's Address + +John C Klensin +1770 Massachusetts Ave, #322 +Cambridge, MA 02140 USA +email: john+ietf@jck.com +