Internet Draft                                                Mark Davis
draft-ietf-idn-lace-01.txt                                           IBM
January 5, 2001                                             Paul Hoffman
Expires July 5, 2001                                          IMC & VPNC

          LACE: Length-based ASCII Compatible Encoding for IDN
|
||
|
||
Status of this memo
|
||
|
||
This document is an Internet-Draft and is in full conformance with all
|
||
provisions of Section 10 of RFC2026.
|
||
|
||
Internet-Drafts are working documents of the Internet Engineering Task
|
||
Force (IETF), its areas, and its working groups. Note that other
|
||
groups may also distribute working documents as Internet-Drafts.
|
||
|
||
Internet-Drafts are draft documents valid for a maximum of six months
|
||
and may be updated, replaced, or obsoleted by other documents at any
|
||
time. It is inappropriate to use Internet-Drafts as reference
|
||
material or to cite them other than as "work in progress."
|
||
|
||
The list of current Internet-Drafts can be accessed at
|
||
http://www.ietf.org/ietf/1id-abstracts.txt
|
||
|
||
The list of Internet-Draft Shadow Directories can be accessed at
|
||
http://www.ietf.org/shadow.html.
|
||
|
||
|
||
Abstract
|
||
|
||
This document describes a transformation method for representing
|
||
non-ASCII characters in host name parts in a fashion that is completely
|
||
compatible with the current DNS. It is a potential candidate for an
|
||
ASCII-Compatible Encoding (ACE) for internationalized host names, as
|
||
described in the comparison document from the IETF IDN Working Group.
|
||
This method is based on the observation that many internationalized host
|
||
name parts will have a few substrings from a small number of rows of the
|
||
ISO 10646 repertoire. Run-length encoding for these types of
|
||
host names will be fairly compact, and is fairly easy to describe.
|
||
|
||
|
||
1. Introduction
|
||
|
||
There is a strong world-wide desire to use characters other than plain
|
||
ASCII in host names. Host names have become the equivalent of business
|
||
or product names for many services on the Internet, so there is a need
|
||
to make them usable by people whose native scripts are not representable
|
||
by ASCII. The requirements for internationalizing host names are
|
||
described in the IDN WG's requirements document, [IDNReq].
|
||
|
||
The IDN WG's comparison document [IDNComp] describes three potential
|
||
main architectures for IDN: arch-1 (just send binary), arch-2 (send
|
||
binary or ACE), and arch-3 (just send ACE). This document describes an
ACE, called Length-based ACE or LACE, that can be used with protocols that match arch-2
|
||
or arch-3. LACE specifies an ACE format as specified in ace-1 in
|
||
[IDNComp]. Further, it specifies an identifying mechanism for ace-2 in
|
||
[IDNComp], namely ace-2.1.1 (add hopefully-unique legal tag to the
|
||
beginning of the name part).
|
||
|
||
In formal terms, LACE describes a character encoding scheme of the
|
||
ISO/IEC 10646 [ISO10646] coded character set (whose assignment of
|
||
characters is synchronized with Unicode [Unicode3]) and the rules for
|
||
using that scheme in the DNS. As such, it could also be called a
|
||
"charset" as defined in [IDNReq]. It can also be viewed as a specialized
|
||
UTF (transformation format), designed to work within the restrictions of
|
||
the DNS.
|
||
|
||
The LACE protocol has the following features:
|
||
|
||
- There is exactly one way to convert internationalized host parts to
|
||
and from LACE parts. Host name part uniqueness is preserved.
|
||
|
||
- Host parts that have no international characters are not changed.
|
||
|
||
- Names using LACE can include more internationalized characters than
|
||
with other ACE protocols that have been suggested to date. LACE-encoded
|
||
names are variable length, depending on the number of transitions
|
||
between rows in the ISO 10646 repertoire that appear in the name part.
|
||
Name parts that cannot be compressed using run-length encoding can have
|
||
up to 17 characters, and names that can be compressed can have up to 35
|
||
characters. Further, a name that has just a few row transitions
|
||
typically can have over 30 characters.
|
||
|
||
It is important to note that the following sections contain many
|
||
normative statements with "MUST" and "MUST NOT". Any implementation that
|
||
does not follow these statements exactly is likely to cause damage to
|
||
the Internet by creating non-unique representations of host names.
|
||
|
||
1.1 Terminology
|
||
|
||
The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
|
||
"MAY" in this document are to be interpreted as described in RFC 2119
|
||
[RFC2119].
|
||
|
||
Hexadecimal values are shown preceded with an "0x". For example,
|
||
"0xa1b5" indicates two octets, 0xa1 followed by 0xb5. Binary values are
|
||
shown preceded with an "0b". For example, a nine-bit value might be
|
||
shown as "0b101101111".
|
||
|
||
Examples in this document use the notation for code points and names
|
||
from the Unicode Standard [Unicode3] and ISO 10646. For example, the
|
||
letter "a" may be represented as either "U+0061" or "LATIN SMALL LETTER
|
||
A".
|
||
|
||
LACE converts strings with internationalized characters into
|
||
strings of US-ASCII that are acceptable as host name parts in current
|
||
DNS host naming usage. The former are called "pre-converted" and the
|
||
latter are called "post-converted".
|
||
|
||
1.2 IDN summary
|
||
|
||
Using the terminology in [IDNComp], LACE specifies an ACE format as
|
||
specified in ace-1. Further, it specifies an identifying mechanism for
|
||
ace-2, namely ace-2.1.1 (add hopefully-unique legal tag to the beginning
|
||
of the name part).
|
||
|
||
LACE has the following length characteristics.
|
||
|
||
- LACE-encoded names are variable length, depending on the number of
|
||
transitions between rows that appear in the name part.
|
||
|
||
- Name parts that cannot be compressed using run-length encoding can
|
||
have up to 17 characters.
|
||
|
||
- Names that can be compressed can have up to 35 characters.
|
||
|
||
- A name that has just a few row transitions typically can have over 30
|
||
characters.
|
||
|
||
|
||
2. Host Part Transformation
|
||
|
||
According to [STD13], host parts must be case-insensitive, start and
|
||
end with a letter or digit, and contain only letters, digits, and the
|
||
hyphen character ("-"). This, of course, excludes any internationalized
|
||
characters, as well as many other characters in the ASCII character
|
||
repertoire. Further, domain name parts must be 63 octets or shorter in
|
||
length.
|
||
|
||
2.1 Name tagging
|
||
|
||
All post-converted name parts that contain internationalized characters
|
||
begin with the string "lq--". (Of course, because host name parts are
|
||
case-insensitive, this might also be represented as "Lq--" or "lQ--" or
|
||
"LQ--".) The string "lq--" was chosen because it is extremely unlikely
|
||
to exist in host parts before this specification was produced. As a
|
||
historical note, in late October 2000, none of the second-level host
|
||
name parts in any of the .com, .edu, .net, and .org top-level domains
|
||
began with "lq--"; there are many tens of thousands of other strings of
|
||
three characters followed by a hyphen that have this property and could
|
||
be used instead. The string "lq--" will change to other strings with the
|
||
same properties in future versions of this draft.
|
||
|
||
Note that a zone administrator might still choose to use "lq--" at the
|
||
beginning of a host name part even if that part does not contain
|
||
internationalized characters. Zone administrators SHOULD NOT create host
|
||
part names that begin with "lq--" unless those names are post-converted
|
||
names. Creating host part names that begin with "lq--" but that are not
|
||
post-converted names may cause two distinct problems. Some display
|
||
systems, after converting the post-converted name part back to an
|
||
internationalized name part, might display the name parts in a
|
||
possibly-confusing fashion to users. More seriously, some resolvers,
|
||
after converting the post-converted name part back to an
|
||
internationalized name part, might reject the host name if it contains
|
||
illegal characters.
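
The tag test described above can be sketched as follows. This is only an
illustration: the names LACE_TAG and hasLaceTag are hypothetical and do
not appear in this draft; only the string "lq--" and the
case-insensitive comparison come from the text.

```javascript
// Hypothetical helper; only the "lq--" tag itself comes from the draft.
const LACE_TAG = "lq--";

// Returns true if the name part starts with the tag in any letter case
// ("lq--", "Lq--", "lQ--" or "LQ--").
function hasLaceTag(label) {
  return label.slice(0, LACE_TAG.length).toLowerCase() === LACE_TAG;
}
```

A name part for which this returns false is, under this draft, never
treated as a post-converted name.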
|
||
|
||
2.2 Converting an internationalized name to an ACE name part
|
||
|
||
To convert a string of internationalized characters into an ACE name
|
||
part, the following steps MUST be performed in the exact order of the
|
||
subsections given here.
|
||
|
||
If a name part consists exclusively of characters that conform to the
|
||
host name requirements in [STD13], the name MUST NOT be converted to
|
||
LACE. That is, a name part that can be represented without LACE MUST NOT
|
||
be encoded using LACE. This absolute requirement prevents there from
|
||
being two different encodings for a single DNS host name.
|
||
|
||
If any checking for prohibited name parts (such as checking for
prohibited characters, case-folding, or canonicalization) is to be done,
|
||
it MUST be done before doing the conversion to an ACE name part.
|
||
|
||
Characters outside the first plane of characters (those with codepoints
|
||
above U+FFFF) MUST be represented using surrogates, as described in
|
||
RFC 2781 [RFC2781].
|
||
|
||
The input name string consists of characters from the ISO 10646
|
||
character set in big-endian UTF-16 encoding. This is the pre-converted
|
||
string.
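
As an illustration of the input format, the following sketch turns a
Javascript string into big-endian UTF-16 octets. It relies on the fact
that Javascript strings are already sequences of UTF-16 code units, so
characters above U+FFFF arrive as surrogate pairs. The function name
toUtf16BE is hypothetical.

```javascript
// Hypothetical helper: big-endian UTF-16 octets of a string.
function toUtf16BE(str) {
  const octets = [];
  for (let i = 0; i < str.length; i++) {
    // charCodeAt returns one 16-bit code unit; the two halves of a
    // surrogate pair are returned as two separate units.
    const unit = str.charCodeAt(i);
    octets.push(unit >>> 8, unit & 0xff); // high octet, then low octet
  }
  return octets;
}
```

For example, the single character U+10000 produces the four octets
<D8 00 DC 00>, which is its surrogate pair in big-endian order.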
|
||
|
||
2.2.1 Check the input string for disallowed names
|
||
|
||
If the input string consists only of characters that conform to the host
|
||
name requirements in [STD13], the conversion MUST stop with an error.
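
A minimal sketch of this check, assuming that "conform to the host name
requirements" is read as a per-character test for letters, digits, and
the hyphen. The function name isStd13Only is hypothetical, and the
sketch does not test the rule that a name part starts and ends with a
letter or digit.

```javascript
// Hypothetical helper: true if every character is an ASCII letter,
// an ASCII digit, or "-". An empty string returns false.
function isStd13Only(label) {
  return /^[A-Za-z0-9-]+$/.test(label);
}
```

Under this reading, conversion stops with an error whenever the
function returns true for the input string.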
|
||
|
||
2.2.2 Compress the pre-converted string
|
||
|
||
The entire pre-converted string MUST be compressed using the compression
|
||
algorithm specified in section 2.4. The result of this step is the
|
||
compressed string.
|
||
|
||
2.2.3 Check the length of the compressed string
|
||
|
||
The compressed string MUST be 36 octets or shorter. If the compressed
|
||
string is 37 octets or longer, the conversion MUST stop with an error.
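
The 36-octet limit follows from the 63-octet limit on name parts: Base32
emits one character per five bits, and the tag adds four characters. The
arithmetic can be sketched as follows (the names TAG_LENGTH and
encodedLabelLength are hypothetical).

```javascript
const TAG_LENGTH = 4; // length of "lq--"

// Length of the final name part for a compressed string of the given
// number of octets: tag plus one Base32 character per 5 bits, rounded up.
function encodedLabelLength(compressedOctets) {
  return TAG_LENGTH + Math.ceil((compressedOctets * 8) / 5);
}
```

With 36 octets the name part is 62 characters long, which fits in 63
octets; with 37 octets it would be 64 characters, which does not.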
|
||
|
||
2.2.4 Encode the compressed string with Base32
|
||
|
||
The compressed string MUST be converted using the Base32 encoding
|
||
described in section 2.5. The result of this step is the encoded string.
|
||
|
||
2.2.5 Prepend "lq--" to the encoded string and finish
|
||
|
||
Prepend the characters "lq--" to the encoded string. This is the host
|
||
name part that can be used in DNS resolution.
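
The order of the steps in sections 2.2.1 through 2.2.5 can be sketched
as follows. This is an illustration under stated assumptions, not a
reference implementation: the function name toLaceLabel is hypothetical,
the compression (section 2.4.1) and Base32 (section 2.5.1) routines are
passed in by the caller as the parameters compress and base32Encode, and
the [STD13] test is read as a per-character test.

```javascript
// Hypothetical sketch of the encoding steps. compress and base32Encode
// are caller-supplied functions that take and return what sections
// 2.4.1 and 2.5.1 describe (arrays of octets, and a string).
function toLaceLabel(name, compress, base32Encode) {
  // 2.2.1: names that already satisfy [STD13] must not be encoded.
  if (/^[A-Za-z0-9-]+$/.test(name)) {
    throw new Error("name does not need LACE");
  }
  // Pre-converted string: big-endian UTF-16 octets.
  const utf16 = [];
  for (let i = 0; i < name.length; i++) {
    const unit = name.charCodeAt(i);
    utf16.push(unit >>> 8, unit & 0xff);
  }
  const compressed = compress(utf16); // 2.2.2
  if (compressed.length > 36) { // 2.2.3
    throw new Error("compressed string longer than 36 octets");
  }
  return "lq--" + base32Encode(compressed); // 2.2.4 and 2.2.5
}
```

Passing the two routines as parameters keeps the sketch independent of
any particular implementation of the later sections.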
|
||
|
||
2.3 Converting a host name part to an internationalized name
|
||
|
||
The input string for conversion is a valid host name part. Note that if
|
||
any checking for prohibited name parts (such as prohibited characters,
|
||
case-folding, or canonicalization) is to be done, it MUST be done after
|
||
doing the conversion from an ACE name part.
|
||
|
||
If a decoded name part consists exclusively of characters that conform
|
||
to the host name requirements in [STD13], the conversion from LACE MUST
|
||
fail. Because a name part that can be represented without LACE MUST NOT
|
||
be encoded using LACE, the decoding process MUST check for name parts
that consist exclusively of characters that conform to the host name
requirements in [STD13]. If such a name part is found, it MUST be
considered an error (and possibly a security violation).
|
||
|
||
2.3.1 Strip the "lq--"
|
||
|
||
The input string MUST begin with the characters "lq--". If it does not,
|
||
the conversion MUST stop with an error. Otherwise, remove the characters
|
||
"lq--" from the input string. The result of this step is the stripped
|
||
string.
|
||
|
||
2.3.2 Decode the stripped string with Base32
|
||
|
||
The entire stripped string MUST be checked to see if it is valid Base32
|
||
output. The entire stripped string MUST be changed to all lower-case
|
||
letters and digits. If any resulting characters are not in Table 1, the
|
||
conversion MUST stop with an error; the input string is the
|
||
post-converted string. Otherwise, the entire resulting string MUST be
|
||
converted to a binary format using the Base32 decoding described in
|
||
section 2.5. The result of this step is the decoded string.
|
||
|
||
2.3.3 Decompress the decoded string
|
||
|
||
The entire decoded string MUST be converted to ISO 10646 characters
|
||
using the decompression algorithm described in section 2.4. The result
|
||
of this is the internationalized string.
|
||
|
||
2.3.4 Check the internationalized string for disallowed names
|
||
|
||
If the internationalized string consists only of characters that conform
|
||
to the host name requirements in [STD13], the conversion MUST stop with
|
||
an error.
|
||
|
||
2.4 Compression algorithm
|
||
|
||
The basic method for compression is to reduce a substring that consists
|
||
of characters all from a single row of the ISO 10646 repertoire to a
|
||
count octet followed by the row header followed by the lower octets of
|
||
the characters. If this ends up being longer than the input, the string
|
||
is not compressed, but instead has a unique one-octet header attached.
|
||
|
||
Although the uncompressed mode limits the number of characters in a LACE
|
||
name part to 17, this is still generally enough for all names in almost
all scripts. Also, this limit is close to the limits set by other encoding
|
||
proposals.
|
||
|
||
Note that the compression and decompression rules MUST be followed
|
||
exactly. This requirement prevents a single host name part from having
|
||
two encodings. Thus, for any input to the algorithm, there is only one
|
||
possible output. An implementation cannot choose to use one-octet mode or
|
||
two-octet mode using anything other than the logic given in this
|
||
section.
|
||
|
||
2.4.1 Compressing a string
|
||
|
||
The input string is in the UTF-16 encoding (big-endian UTF-16 with no
|
||
byte order mark).
|
||
|
||
Design note: No checking is done on the input to this algorithm. It is
|
||
assumed that all checking for valid ISO/IEC 10646 characters has already
|
||
been done by a previous step in the conversion process.
|
||
|
||
1) If the length (measured in octets) of the input is not even, or is
|
||
less than 2, stop with an error.
|
||
|
||
2) Set the input pointer, called IP, to the first octet of the input
|
||
string.
|
||
|
||
3) Set the variable called HIGH to the octet at IP.
|
||
|
||
4) Determine the number of contiguous pairs at or after IP that have
|
||
HIGH as the first octet; call this COUNT.
|
||
|
||
5) Put into an output buffer the single octet for COUNT followed by the
|
||
single octet for HIGH, followed by all those low octets. Move IP to the
|
||
end of those pairs; that is, set IP to IP+(2*COUNT).
|
||
|
||
6) If IP is not at the end of the input string, go to step 3.
|
||
|
||
7) If the length of the output buffer is less than or equal to the
|
||
length of the input buffer (in octets, not in characters), emit the
|
||
output buffer. Otherwise, output the octet 0xFF followed by the input
|
||
buffer. Note that there can only be one possible representation for a
|
||
name part, so that outputting the wrong name part is a serious security
|
||
error. Decompression schemes MUST accept only the valid form and MUST
|
||
NOT accept invalid forms.
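
The seven steps above can be sketched compactly as follows. This is a
minimal illustration that works on arrays of octet values and throws on
error; it is separate from the sample code in Appendix B, and the
function name compress is hypothetical.

```javascript
// Hypothetical sketch of steps 1 to 7. Input and output are arrays of
// octet values (0..255); the input is big-endian UTF-16.
function compress(input) {
  // Step 1: the input must hold at least one complete 16-bit unit.
  if (input.length < 2 || input.length % 2 !== 0) {
    throw new Error("input is not valid UTF-16");
  }
  const out = [];
  let ip = 0; // step 2
  while (ip < input.length) { // step 6
    const high = input[ip]; // step 3
    // Step 4: count the contiguous pairs whose first octet is HIGH.
    let count = 0;
    while (ip + 2 * count < input.length &&
           input[ip + 2 * count] === high) {
      count++;
    }
    // Step 5: COUNT, HIGH, then the low octets of the run.
    out.push(count, high);
    for (let k = 0; k < count; k++) {
      out.push(input[ip + 2 * k + 1]);
    }
    ip += 2 * count;
  }
  // Step 7: keep the run-length form only if it is not longer.
  return out.length <= input.length ? out : [0xff, ...input];
}
```

Because the choice in step 7 depends only on the two lengths, every
input has exactly one output.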
|
||
|
||
2.4.2 Decompressing a string
|
||
|
||
1. Set the input pointer, called IP, to the first octet of the input
|
||
string. If there is no first octet, stop with an error.
|
||
|
||
2. If the octet at IP is 0xFF, set IP to IP+1, copy the rest of the
|
||
input buffer to the output buffer, and go to step 9.
|
||
|
||
3. Get the octet at IP, call it COUNT. If COUNT equals zero or is
|
||
greater than 36, stop with an error. Set IP to IP+1. If IP is now at the
|
||
end of the input string, stop with an error.
|
||
|
||
4. Get the octet at IP, call it HIGH. Set IP to IP+1.
|
||
|
||
5. If IP is now at the end of the input string, stop with an error. Get
|
||
the octet at IP, call it LOW. Set IP to IP+1.
|
||
|
||
6. Output HIGH, then LOW, to the output buffer.
|
||
|
||
7. Decrement COUNT. If COUNT is greater than 0, go to step 5.
|
||
|
||
8. If IP is not at the end of the input buffer, go to step 3.
|
||
|
||
9. If the length of the output buffer is odd, stop with an error.
|
||
Compress the output buffer into a separate comparison buffer following
|
||
the steps for compression above. If the contents of the comparison
buffer do not equal the input string of this decompression, stop with an
|
||
error. Otherwise, send out the output buffer and stop.
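
The nine steps above can be sketched as follows. This is an illustration
only: the function name decompress is hypothetical, and the
re-compression used by the check in step 9 is supplied by the caller as
the parameter compress, which is assumed to implement section 2.4.1 on
arrays of octet values and to throw on error.

```javascript
// Hypothetical sketch of steps 1 to 9. input is an array of octets;
// compress is a caller-supplied implementation of section 2.4.1.
function decompress(input, compress) {
  if (input.length === 0) throw new Error("empty input"); // step 1
  let out = [];
  if (input[0] === 0xff) { // step 2
    out = input.slice(1);
  } else {
    let ip = 0;
    while (ip < input.length) { // step 8
      let count = input[ip++]; // step 3
      if (count === 0 || count > 36) throw new Error("bad run length");
      if (ip === input.length) throw new Error("truncated before row");
      const high = input[ip++]; // step 4
      for (; count > 0; count--) { // steps 5 to 7
        if (ip === input.length) throw new Error("truncated run");
        out.push(high, input[ip++]);
      }
    }
  }
  // Step 9: the output must be whole 16-bit units, and compressing it
  // again must reproduce the input exactly.
  if (out.length % 2 !== 0) throw new Error("odd-length output");
  const check = compress(out);
  if (check.length !== input.length ||
      check.some((octet, i) => octet !== input[i])) {
    throw new Error("input is not the unique compressed form");
  }
  return out;
}
```

The final comparison is what rejects inputs such as <FF 30 E6 30 CB>,
whose run-length form <02 30 E6 CB> is not longer than the raw octets
and is therefore the only valid encoding.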
|
||
|
||
2.4.3 Compression examples
|
||
|
||
The five input characters <U+30E6 U+30CB U+30B3 U+30FC U+30C9> are
|
||
represented in big-endian UTF-16 as the ten octets <30 E6 30 CB 30 B3 30
|
||
FC 30 C9>. All the code units are in the same row (30). The output
|
||
buffer has seven octets <05 30 E6 CB B3 FC C9>, which is shorter than
|
||
the input string. Thus the output is <05 30 E6 CB B3 FC C9>.
|
||
|
||
The four input characters <U+012F U+0111 U+0149 U+00E5> are represented
|
||
in big-endian UTF-16 as the eight octets <01 2F 01 11 01 49 00 E5>. The
|
||
output buffer has eight octets <03 01 2F 11 49 01 00 E5>, which is the
|
||
same length as the input string. Thus, the output is <03 01 2F 11 49 01
|
||
00 E5>.
|
||
|
||
The three input characters <U+012F U+00E0 U+014B> are represented in
|
||
big-endian UTF-16 as the six octets <01 2F 00 E0 01 4B>. The output
|
||
buffer is nine octets <01 01 2F 01 00 E0 01 01 4B>, which is longer than
|
||
the input buffer. Thus, the output is <FF 01 2F 00 E0 01 4B>.
|
||
|
||
2.5 Base32
|
||
|
||
In order to encode non-ASCII characters in DNS-compatible host name parts,
|
||
they must be converted into legal characters. This is done with Base32
|
||
encoding, described here.
|
||
|
||
Table 1 shows the mapping between input bits and output characters in
|
||
Base32. Design note: the digits used in Base32 are "2" through "7"
instead of "0" through "5" in order to avoid digits "0" and "1". This
|
||
helps reduce errors for users who are entering a Base32 stream and may
|
||
misinterpret a "0" for an "O" or a "1" for an "l".
|
||
|
||
Table 1: Base32 conversion
|
||
     bits   char  hex         bits   char  hex
     00000   a    0x61        10000   q    0x71
     00001   b    0x62        10001   r    0x72
     00010   c    0x63        10010   s    0x73
     00011   d    0x64        10011   t    0x74
     00100   e    0x65        10100   u    0x75
     00101   f    0x66        10101   v    0x76
     00110   g    0x67        10110   w    0x77
     00111   h    0x68        10111   x    0x78
     01000   i    0x69        11000   y    0x79
     01001   j    0x6a        11001   z    0x7a
     01010   k    0x6b        11010   2    0x32
     01011   l    0x6c        11011   3    0x33
     01100   m    0x6d        11100   4    0x34
     01101   n    0x6e        11101   5    0x35
     01110   o    0x6f        11110   6    0x36
     01111   p    0x70        11111   7    0x37
|
||
|
||
2.5.1 Encoding octets as Base32
|
||
|
||
The input is a stream of octets. However, the octets are then treated
|
||
as a stream of bits.
|
||
|
||
Design note: The assumption that the input is a stream of octets
|
||
(instead of a stream of bits) was made so that no padding was needed.
|
||
If you are reusing this algorithm for a stream of bits, you must add a
|
||
padding mechanism in order to differentiate different lengths of input.
|
||
|
||
1) Set the read pointer to the beginning of the input bit stream.
|
||
|
||
2) Look at the five bits after the read pointer. If there are not five
|
||
bits, go to step 5.
|
||
|
||
3) Look up the value of the set of five bits in the bits column of
|
||
Table 1, and output the character from the char column (whose hex value
|
||
is in the hex column).
|
||
|
||
4) Move the read pointer five bits forward. If the read pointer is at
|
||
the end of the input bit stream (that is, there are no more bits in the
|
||
input), stop. Otherwise, go to step 2.
|
||
|
||
5) Pad the bits seen with zero bits on the right until there are five bits.
|
||
|
||
6) Look up the value of the set of five bits in the bits column of
|
||
Table 1, and output the character from the char column (whose hex value
|
||
is in the hex column).
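
The encoding steps above can be sketched as follows. This is an
illustration that is separate from the sample code in Appendix B; the
names BASE32_ALPHABET and base32Encode are hypothetical, and the
alphabet string is simply the char column of Table 1 in order. Padding
is done with zero bits, as in the example in section 2.5.3.

```javascript
// Char column of Table 1, indexed by the 5-bit value.
const BASE32_ALPHABET = "abcdefghijklmnopqrstuvwxyz234567";

// Hypothetical sketch: octets is an array of values 0..255.
function base32Encode(octets) {
  let bits = 0; // bits read but not yet emitted
  let bitCount = 0; // how many such bits there are (at most 12)
  let text = "";
  for (const octet of octets) {
    bits = (bits << 8) | octet;
    bitCount += 8;
    while (bitCount >= 5) { // steps 2 to 4
      bitCount -= 5;
      text += BASE32_ALPHABET[(bits >> bitCount) & 0x1f];
      bits &= (1 << bitCount) - 1; // drop the bits just emitted
    }
  }
  if (bitCount > 0) { // steps 5 and 6: pad with zero bits on the right
    text += BASE32_ALPHABET[(bits << (5 - bitCount)) & 0x1f];
  }
  return text;
}
```

Applied to the octets <3A 27 0F 93>, this yields "hitq7ey".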
|
||
|
||
2.5.2 Decoding Base32 as octets
|
||
|
||
The input is octets in network byte order. The input octets MUST be
|
||
values from the second column in Table 1.
|
||
|
||
1) Count the number of octets in the input and divide it by 8; call the
|
||
remainder INPUTCHECK. If INPUTCHECK is 1 or 3 or 6, stop with an error.
|
||
|
||
2) Set the read pointer to the beginning of the input octet stream.
|
||
|
||
3) Look up the character value of the octet in the char column (or hex
|
||
value in hex column) of Table 1, and add the five bits from the bits
|
||
column to the output buffer.
|
||
|
||
4) Move the read pointer one octet forward. If the read pointer is not
|
||
at the end of the input octet stream (that is, there are more octets in
|
||
the input), go to step 3.
|
||
|
||
5) Count the number of bits that are in the output buffer and divide it
|
||
by 8; call the remainder PADDING. If the PADDING number of bits at the
|
||
end of the output buffer are not all zero, stop with an error.
|
||
Otherwise, emit the output buffer and stop.
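
The decoding steps above can be sketched as follows. This is an
illustration that is separate from the sample code in Appendix B; the
names BASE32_ALPHABET and base32Decode are hypothetical. The sketch
assumes the input has already been changed to lower case, as section
2.3.2 requires, and it returns whole octets only, leaving out the
padding bits after checking that they are zero.

```javascript
// Char column of Table 1, indexed by the 5-bit value.
const BASE32_ALPHABET = "abcdefghijklmnopqrstuvwxyz234567";

// Hypothetical sketch: text is a lower-case Base32 string.
function base32Decode(text) {
  const inputCheck = text.length % 8; // step 1
  if (inputCheck === 1 || inputCheck === 3 || inputCheck === 6) {
    throw new Error("impossible Base32 length");
  }
  const octets = [];
  let bits = 0; // bits read but not yet emitted
  let bitCount = 0;
  for (const ch of text) { // steps 2 to 4
    const value = BASE32_ALPHABET.indexOf(ch);
    if (value < 0) throw new Error("not a Base32 character: " + ch);
    bits = (bits << 5) | value;
    bitCount += 5;
    if (bitCount >= 8) {
      bitCount -= 8;
      octets.push(bits >> bitCount);
      bits &= (1 << bitCount) - 1; // drop the bits just emitted
    }
  }
  // Step 5: whatever is left over is padding and must be all zero.
  if (bits !== 0) throw new Error("non-zero padding bits");
  return octets;
}
```

Applied to "hitq7ey", this yields the octets <3A 27 0F 93>. The string
"hitq7ez" differs only in its padding bits and is rejected.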
|
||
|
||
2.5.3 Base32 example
|
||
|
||
Assume you want to encode the value 0x3a270f93. The bit string is:
|
||
|
||
3   a    2   7    0   f    9   3
00111010 00100111 00001111 10010011
|
||
|
||
Broken into chunks of five bits, this is:
|
||
|
||
00111 01000 10011 10000 11111 00100 11
|
||
|
||
Padding is added to make the last chunk five bits:
|
||
|
||
00111 01000 10011 10000 11111 00100 11000
|
||
|
||
The output of encoding is:
|
||
|
||
00111 01000 10011 10000 11111 00100 11000
  h     i     t     q     7     e     y
|
||
or "hitq7ey".
|
||
|
||
|
||
3. Security Considerations
|
||
|
||
Much of the security of the Internet relies on the DNS. Thus, any
|
||
change to the characteristics of the DNS can change the security of
|
||
much of the Internet. Thus, LACE makes no changes to the DNS
|
||
itself.
|
||
|
||
Host names are used by users to connect to Internet servers. The
|
||
security of the Internet would be compromised if a user entering a
|
||
single internationalized name could be connected to different servers
|
||
based on different interpretations of the internationalized host
|
||
name.
|
||
|
||
LACE is designed so that every internationalized host name part
|
||
can be represented as one and only one DNS-compatible string. If there
|
||
is any way to follow the steps in this document and get two or more
|
||
different results, it is a severe and fatal error in the protocol.
|
||
|
||
|
||
4. References
|
||
|
||
[IDNComp] Paul Hoffman, "Comparison of Internationalized Domain Name Proposals",
|
||
draft-ietf-idn-compare.
|
||
|
||
[IDNReq] James Seng, "Requirements of Internationalized Domain Names",
|
||
draft-ietf-idn-requirement.
|
||
|
||
[ISO10646] ISO/IEC 10646-1:1993. International Standard -- Information
|
||
technology -- Universal Multiple-Octet Coded Character Set (UCS) --
|
||
Part 1: Architecture and Basic Multilingual Plane. Five amendments and
|
||
a technical corrigendum have been published up to now. UTF-16 is
|
||
described in Annex Q, published as Amendment 1. 17 other amendments are
|
||
currently at various stages of standardization. [[[ THIS REFERENCE
|
||
NEEDS TO BE UPDATED AFTER DETERMINING ACCEPTABLE WORDING ]]]
|
||
|
||
[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
|
||
Requirement Levels", March 1997, RFC 2119.
|
||
|
||
[RFC2781] Paul Hoffman and Francois Yergeau, "UTF-16, an encoding of ISO
|
||
10646", February 2000, RFC 2781.
|
||
|
||
[STD13] Paul Mockapetris, "Domain names - implementation and
|
||
specification", November 1987, STD 13 (RFC 1035).
|
||
|
||
[Unicode3] The Unicode Consortium, "The Unicode Standard -- Version
|
||
3.0", ISBN 0-201-61633-5. Described at
|
||
<http://www.unicode.org/unicode/standard/versions/Unicode3.0.html>.
|
||
|
||
|
||
A. Acknowledgements
|
||
|
||
Rick Wesson pointed out some error conditions that need to be
|
||
tested for. Scott Hollenbeck pointed out some errors in the
|
||
compression.
|
||
|
||
Base32 is quite obviously inspired by the tried-and-true Base64
|
||
Content-Transfer-Encoding from MIME.
|
||
|
||
|
||
B. Sample code
|
||
|
||
The following is sample Javascript code for the LACE algorithm.
|
||
This code is believed to be correct, but there may be errors in
|
||
it. The code is provided as-is and comes with no warranty of
|
||
fitness, correctness, blah blah blah.
|
||
|
||
/**
 * Converts to LACE compression format (without Base32) from
 * UTF-16BE array
 * @parameter iArray Array of bytes in UTF16-BE
 * @parameter iCount Number of elements. Must be even, 2..62
 * @parameter oArray Array for output of LACE bytes.
 * Must be at least 100 octets long to provide internal working space
 * @return Length of output array used
 * @parameter parseResult output error value if any
 * @author Mark Davis
 */

function toLACE(iArray, iCount, oArray, parseResult) {
    //debugger;
    if (iCount < 1 || iCount > 62) {
        parseResult.set("Lace: count out of range", iCount);
        return;
    }
    if ((iCount % 2) == 1) {
        parseResult.set("Lace: odd length, can't be UTF-16", iCount);
        return;
    }
    var op = 0;  // output index
    var ip = 0;  // input index
    var lastHigh = -1;
    var lenp = 0;
    while (ip < iCount) {
        var high = iArray[ip++];
        if (high != lastHigh) {
            if (lastHigh != -1) {  // store last length
                var len = op - lenp - 2;
                oArray[lenp] = len;
            }
            lenp = op++; // reserve space
            oArray[op++] = high;
            lastHigh = high;
        }
        oArray[op++] = iArray[ip++];
    }

    // store last len

    var len = op - lenp - 2;
    oArray[lenp] = len;

    // see if the input is short, and we should
    // just copy

    if (op > iCount) {
        if (op > 63) {
            parseResult.set("Lace: output too long", op);
            return;
        }
        oArray[0] = 0xFF;
        copyTo(iArray, 0, iCount, oArray, 1);
        op = iCount + 1;
    }
    return op;
}
|
||
|
||
/**
 * Converts from LACE compressed format (without Base32) to
 * UTF-16BE array
 * @parameter iArray Array of bytes in LACE format
 * @parameter iCount Number of elements
 * @parameter oArray Array for output of bytes, UTF16-BE.
 * Must be at least iCount+1 long
 * @return Length of output array used
 * @parameter parseResult output error value if any
 * @author Mark Davis
 */

function fromLACE(iArray, iCount, oArray, parseResult) {
    var high;
    if (iCount < 1 || iCount > 63) {
        parseResult.set("fromLACE: count out of range", iCount);
        return;
    }
    var op = 0;
    var ip = 0;
    var result = 0;
    if (iArray[ip] == 0xFF) {  // special case FF
        copyTo(iArray, 1, iCount-1, oArray, 0);
        result = iCount-1;
    } else {
        while (ip < iCount) {  // loop over runs
            var count = iArray[ip++];
            if (ip == iCount) {
                parseResult.set("fromLACE: truncated before high", ip);
                return;
            }
            high = iArray[ip++];
            for (var i = 0; i < count; ++i) {
                oArray[op++] = high;
                if (ip == iCount) {
                    parseResult.set("fromLACE: truncated from count", ip);
                    return;
                }
                oArray[op++] = iArray[ip++];
            }
        }
        result = op;
    }

    // check for uniqueness

    var checkArray = [];
    var checkCount = toLACE(oArray, result, checkArray, parseResult);
    if (!equals(iArray, iCount, checkArray, checkCount)) {
        parseResult.set("fromLACE: illegal input form");
        return;
    }
    return result;
}
|
||
|
||
/**
 * Utility routine for comparing arrays
 * @parameter array1 first array to compare
 * @parameter count1 number of elements to compare in first array
 * @parameter array2 second array to compare
 * @parameter count2 number of elements to compare in second array
 * @return true iff counts are same, and elements from 0 to count-1
 * are the same
 */

function equals(array1, count1, array2, count2) {
    if (count1 != count2) return false;
    for (var i = 0; i < count1; ++i) {
        if (array1[i] != array2[i]) return false;
    }
    return true;
}
|
||
|
||
/**
 * Utility routine for getting array of bytes from UTF-16 string
 * @parameter str source string
 * @parameter oArray output array to fill in
 * @return count of bytes put into oArray
 */

function utf16FromString(str, oArray) {
    var op = 0;
    for (var i = 0; i < str.length; ++i) {
        var code = str.charCodeAt(i);
        oArray[op++] = (code >>> 8);  // top byte
        oArray[op++] = (code & 0xFF); // bottom byte
    }
    return op;
}
|
||
|
||
/**
 * Utility routine to see if string doesn't need LACE
 * @parameter str source string
 * @return true if ok already
 */

function okAlready(str) {
    for (var i = 0; i < str.length; ++i) {
        var c = str.charAt(i);
        if (c == '-' || 'a' <= c && c <= 'z' || '0' <= c && c <= '9')
            continue;
        return false;
    }
    return true;
}
|
||
|
||
/**
 * Convert from bytes to base32
 * @parameter input Input buffer of bytes with values 00 to FF
 * @parameter inputLength Length of input buffer
 * @parameter output Output buffer, to be filled with values from
 * a-z2-7.
 * Must be of at least length input*8/5 + 1
 * @return Length of output buffer used
 * @author Mark Davis
 */

// Debug switch read below. It is not declared in the original sample;
// it is assumed here to be a global flag that enables alert() tracing.
var baseDebugTo = false;

function toBase32(input, inputLength, output, parseResult) {
    //debugger;
    var bits = 0;
    var bitCount = 0;
    var ip = 0;
    var op = 0;
    var val = 0;
    while (true) {

        // get bits if we don't have enough

        if (bitCount < 5) {
            if (ip >= inputLength) break;
            // get another input
            bits <<= 8;
            if (baseDebugTo) alert("byte: " + input[ip].toString(16) +
                ", bitCount: " + (bitCount+8));

            bits = bits | input[ip++];
            bitCount += 8;
        }

        // emit and remove them

        bitCount -= 5;
        val = (bits >> bitCount);
        if (baseDebugTo) alert("Val: " + val.toString(16) + ", bitCount: "
            + bitCount);
        output[op++] = toLetter(val);
        //if (baseDebugTo) alert("out: " + output[op-1].toString(16));
        bits &= ~(0x1F << bitCount);
    }

    // add padding and output if necessary

    if (bitCount > 0) {
        if (baseDebugTo) alert("bits*: " + bits.toString(16) +
            ", bitCount: " + bitCount);
        val = bits << (5 - bitCount);
        if (baseDebugTo) alert("out*: " + val.toString(16));
        output[op++] = toLetter(val);
    }
    return op;
}
|
||
|
||
/**
 * Convert from base32 to bytes
 * @parameter input Input buffer of bytes with values from a-z2-7
 * @parameter inputLength Length of input buffer
 * @parameter output Output buffer, to be filled with bytes from
 * 00 to FF
 * Must be of at least length input*5/8 + 1
 * @return Length of output buffer used
 * @author Mark Davis
 */

// Debug switch read below. It is not declared in the original sample;
// it is assumed here to be a global flag that enables alert() tracing.
var baseDebugFrom = false;

function fromBase32(input, inputLength, output, parseResult) {
    //debugger;
    var inputCheck = inputLength % 8;
    if (inputCheck == 1 || inputCheck == 3 || inputCheck == 6) {
        parseResult.set("Base32 excess length", null, inputLength);
        return;
    }
    var bits = 0;
    var bitCount = 0;
    var ip = 0;
    var op = 0;
    var val = 0;
    while (ip < inputLength) {

        // get more bits
        val = input[ip++];
        val = fromLetter(val);
        // a Base32 value is five bits, so anything above 0x1F is invalid
        if (val < 0 || val > 0x1F) {
            parseResult.set("Bad Base32 byte", val, ip-1);
            return;
        }
        if (baseDebugFrom) alert("base32: " + val.toString(16));
        bits <<= 5;
        bits = bits | val;
        bitCount += 5;
        if (baseDebugFrom) alert("from: " + val.toString(16) +
            ", bitCount: " + bitCount);

        // emit & remove if we can

        if (bitCount >= 8) {
            bitCount -= 8;
            output[op++] = bits >> bitCount;
            if (baseDebugFrom) alert("out2: " + (bits >> bitCount) +
                ", bitCount: " + bitCount);
            bits &= ~(0xFF << bitCount);
        }
    }

    // check that padding is with zero!
    if (bits != 0) return -ip;
    return op;
}
|
||
|
||
|
||
function toLetter(val) {
    if (val > 25) return val - 26 + 0x32;
    return val + 0x61;
    // return val + (val < 26 ? 0x61 : 0x18);
}

function fromLetter(val) {
    if (0x61 <= val && val <= 0x7a) return val - 0x61;      // a-z
    if (0x32 <= val && val <= 0x37) return val + 26 - 0x32; // 2-7
    return -1; // not in Table 1
}
|
||
|
||
|
||
|
||
C. Differences between -00 and -01
|
||
|
||
1: Minor typos.
|
||
|
||
2.1: Changed the tag to 'lq--'.
|
||
|
||
2.2 and 2.3: Added check for all-STD13 names in the steps.
|
||
|
||
2.4.1: Clarified first sentence. Step 5: fixed the moving of the IP.
|
||
|
||
2.4.2: Moved the last sentence of step 4 to be the first sentence of
|
||
step 5. Added the check for odd-length output. Changed the exit
|
||
comparison to doing a full comparison (instead of looking for lengths).
|
||
|
||
2.5.2: Changed the sense of the test in step 3 and added step 4 to check
|
||
for malformed input. Also made the output a buffer. Also added new step
|
||
1.
|
||
|
||
Changed Appendix B from IANA Considerations (of which there are none) to
|
||
Javascript code sample.
|
||
|
||
|
||
D. Author Contact Information
|
||
|
||
Mark Davis
|
||
IBM
|
||
10275 N. De Anza Blvd
|
||
Cupertino, CA 95014
|
||
mark.davis@us.ibm.com and mark.davis@macchiato.com
|
||
|
||
Paul Hoffman
|
||
Internet Mail Consortium and VPN Consortium
|
||
127 Segre Place
|
||
Santa Cruz, CA 95060 USA
|
||
paul.hoffman@imc.org and paul.hoffman@vpnc.org
|