Talk:Punycode


Untitled

First, the linked "Punycode Exploit" should link to the advisory text, not the simple demonstration page. Secondly, this is not an exploit of punycode so much as it is an exploit of the fact that not running domain names through nameprep is asking for spoofing problems.

I would change the paragraph from:

  Punycode is easily exploitable, and for an example see Punycode exploit

to:

  Note that browsers which fail to run a string
  through nameprep before using it as a DNS name are
  vulnerable to spoofing exploits, as in Punycode spoof.

I suspect it should read more like:

   The fact that Unicode and Punycode strings result in homographs (strings
   which are different but visually indistinguishable or almost so) is a matter of
   some concern.  Phishing (a class of social engineering exploits) will make use of
   homographs like "www.paypa1.com" (note: this contains the digit 1 ("one") rather
   than the letter l ("ell")).

I'd like to add some links to the discussion of "petnames" (which I've read from the cap-talk mailing list that's maintained by Jonathan Shapiro, creator of EROS).

Unfortunately I, User:JimD, lack the time at the moment to do this properly. So, I'm tacking this in here in the hopes that it will help me remember or that someone here will take the ball and run with it. JimD 22:43, 2005 Feb 22 (UTC)

With the last two words an external link to the exploit.

You are right that this isn't a Punycode exploit: it's an IDNA exploit, and it's mentioned in the security section of RFC 3490. But I think you must be wrong about Nameprep - as far as I know, these browsers are already using Nameprep, and Nameprep has no effect on the domain label in question (pаypal). --Zundark 18:59, 7 Feb 2005 (UTC)
D'oh! - You're right, my bad. Nameprep wouldn't help since Unicode NFKC does not, in fact, normalize Cyrillic "а" to Latin "a". So never mind the suggested rephrasing. Instead, someone should rephrase it to mention that this is an IDNA problem or, better yet, move this paragraph and link to the IDNA page.
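
The point about NFKC can be checked directly. This is a minimal Python sketch, assuming the spoofed label is "pаypal" with U+0430 (Cyrillic "а") as its second letter, as in the comment above; the variable names are only illustrative.

```python
import unicodedata

latin = "paypal"
spoof = "p\u0430ypal"  # assumption: second letter is U+0430 CYRILLIC SMALL LETTER A

# NFKC leaves the Cyrillic letter alone, so the two labels stay distinct.
normalized = unicodedata.normalize("NFKC", spoof)
print(unicodedata.name(spoof[1]))
print(normalized == latin)

# The labels look alike on screen but map to different ASCII forms.
print(latin.encode("idna"))
print(spoof.encode("idna"))
```

Python's "idna" codec implements the older IDNA (RFC 3490) rules, so this only illustrates the behaviour discussed here, not what a current browser does.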



The explanation for how Punycode is supposed to work is unintelligible to me, and I don't consider myself particularly dull about technical matters. The main problem is that the author doesn't seem to have a grasp of where to begin and ends up explaining the thing in bits and pieces. Can someone replace this with something more coherent? 82.92.119.11 8 July 2005 17:17 (UTC)

The article contains a lot of information that is actually about IDNA rather than Punycode, and I think this is what makes it so confusing. There's no reason for this IDNA information to be here, since it's already in the appropriate article. Much of this information has been removed before, but it just gets added back in a different form. So I don't really hold out much hope for this article, and don't currently consider it worth the effort of working on. Fortunately, the article does have a link to RFC 3492, which is what people need to read if they want to understand Punycode. --Zundark 8 July 2005 18:24 (UTC)

Patent issues

Is Punycode patented? --84.61.51.126 13:01, 6 July 2006 (UTC)

Nobody's asserted a patent against it as far as I know. Given the nature of the patent system, that's the most anyone can say for certain. --Alvestrand 15:07, 18 October 2006 (UTC)

How about showing some examples?

Examples

From the text it's not really clear how exactly you get back to the name, whether that is even possible, or how the position of the character is encoded. Currently it looks to me as if "bücher", "bächer", "bchüer" etc. all get the same bcher-kva code?

I've rewritten the first part of the encoding description based on reading the RFC; more work is still needed, though. Plugwash 01:45, 27 February 2007 (UTC)

Shouldn't "bücüher" be punycoded as "bcher-kvaba", and "bücherü" as "bcher-kvaea"? Moreover, should "ýbücher" be punycoded as "bcher-kvafa" or as "bcher-fakva"? (are both possible?) --Preceding unsigned comment added by 193.145.147.38 (talk) 18:21, 27 February 2008 (UTC)

Someone more knowledgeable than I should explain why the non-ASCII character "ü", inserted at various places WITHIN an ASCII string, magically transmutes to "ý" if it is inserted BEFORE the string. It all sounds like an obfuscation of alchemy to me. Xojo (talk) 01:42, 3 September 2010 (UTC)
Oh, that's just because "ý" (U+00FD) is the next code point after "ü" (U+00FC). No alchemy there! -- Sebastian 07:13, 20 December 2011 (UTC)
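
The questions in this thread can be tried out with Python's built-in "punycode" codec. The sketch below encodes the strings mentioned above and decodes them again; `puny` is just a hypothetical helper name. Since decoding restores each original, both the inserted character and its position must be carried by the digits after the last hyphen, and the encoder emits exactly one string per input.

```python
def puny(s):
    """Encode with Python's built-in "punycode" codec (no "xn--" prefix)."""
    return s.encode("punycode").decode("ascii")

samples = ["bücher", "bächer", "bchüer", "bücüher", "bücherü", "ýbücher"]
encoded = {s: puny(s) for s in samples}

for s, p in encoded.items():
    # Decoding gives back the original string, including character positions.
    restored = p.encode("ascii").decode("punycode")
    print(f"{s!r} -> {p!r} -> {restored!r}")
```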

Encoding principles example

The heading "Encoding procedure" says:

"The 'bücher' example given above . . ."

There is no "bücher" example "above" in the current form of the article.

This is probably an editorial anomaly in the wake of rearrangement of the article.

The red pencil never sleeps.

Doug Kerr 17:56, 11 March 2007 (UTC)

I've reworded it. --Zundark 10:42, 12 March 2007 (UTC)

I did not understand this article but had no problem understanding Punycode from reading the RFC. I will reorder and enhance this article by starting with a quick description of the reasons why this particular encoding scheme is used and then following the order of explanation in the RFC, which starts with the decoding, which is far easier to understand than the encoding. Roeschter 19:25, 25 March 2007 (UTC)

Bootstring

'Bootstring' redirects here, even though it doesn't strike me as synonymous to 'Punycode' (Punycode, going by the title of ref1, Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA), seems to be a kind of Bootstring).

Could someone with more knowledge on the matter either create a Bootstring article describing what that is or make it clear that Bootstring indeed is merely a synonym? I'm curious which/what it is.

-pinkgothic (talk) 12:42, 12 November 2010 (UTC)

The section "Encoding of non-ASCII character insertions as code numbers" needs a complete rewrite. Someone went out of their way to use computer science jargon. 188.103.59.142 (talk) 16:24, 9 September 2011 (UTC)

Ambiguity in separation of ASCII characters

Resolved

Section Separation of ASCII characters says "Since it is a basic character, the ASCII hyphen may still appear in the string before this additional character, but the addition does not cause ambiguity". But precisely because the hyphen may still appear, we have a problem: Suppose you wanted to convert the string "bcher-kva" to a label - wouldn't that be just the same, thus rendering the label "bcher-kva" ambiguous? -- Sebastian 07:19, 20 December 2011 (UTC)

The description appears to be incorrect: a hyphen is always added (unless there are no basic characters at all) so "bcher-kva" would be encoded as "bcher-kva-". See example (S) in RFC 3492. (But in practice Punycode is hardly used outside of IDNA, and IDNA wouldn't apply Punycode to such a string anyway.) I've undone 195.242.190.233's changes of 17 November 2010, which introduced the error. --Zundark (talk) 09:58, 20 December 2011 (UTC)
Thanks! -- Sebastian 10:28, 20 December 2011 (UTC)
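
The resolution above can be reproduced with Python's built-in "punycode" codec. This is a small sketch rather than a statement about any other implementation: the all-ASCII string gets a trailing delimiter hyphen, which keeps it distinct from the encoding of "bücher".

```python
plain = "bcher-kva"

# Encoding an all-ASCII string appends the delimiter hyphen.
encoded = plain.encode("punycode")
print(encoded)

# With the trailing hyphen, nothing follows the delimiter, so nothing is inserted.
print(encoded.decode("punycode"))

# Without it, "kva" is read as an encoded insertion.
print(b"bcher-kva".decode("punycode"))
```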

The RFC Website Has Changed

The current link to the Punycode RFC didn't work; it was redirected to an IETF home page. <https://www.rfc-editor.org/rfc/rfc3492.txt> is the new link to the Punycode RFC. --Preceding unsigned comment added by CTMacUser (talk • contribs) 06:37, 14 March 2019 (UTC)

WHY IS IT CALLED PUNYCODE?

Anyone know? 179.98.225.10 (talk) 14:36, 16 March 2019 (UTC)

Because it is a PUN on "UNIcode" (Unicode): "Uni" rhymes with "puny". Firejuggler86 (talk) 21:31, 4 April 2020 (UTC)

Do you have a source? Your comment here was just cited in a Stack Exchange Q&A. If we had a reputable source, I could add this info both to the Q&A and to the Wikipedia article. Maxlaumeister (talk) 23:34, 26 June 2020 (UTC)
It seems to be so obvious that nobody said it explicitly. Even the Wiktionary article on Punycode just says "Blend of puny + Unicode" under etymology, but without any source. Someone could perhaps ask Adam M. Costello, the inventor of Punycode, personally. He seems like a really nice guy. But asking him doesn't directly give us a reference, unless the question is answered publicly on the internet somehow. --Jhertel (talk) 22:19, 28 June 2020 (UTC)
I just emailed Adam directly and published an article on my blog with our exchange. He mentioned three specific ways that he believes it's "puny"; he didn't mention "pun" at all. I'll let someone else determine how best to incorporate this info into Wikipedia. Maxlaumeister (talk) 20:32, 18 July 2020 (UTC)

Limited to 35!

What happens if the number of states to be skipped gets larger than 35! = 35*34*...*2 ≈ 1.03e40? Unlikely given the application, but as far as I can tell the article gives no answer. Am I correct? --Preceding unsigned comment added by 212.82.199.28 (talk) 15:40, 9 December 2019 (UTC)
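
One way to probe this empirically is shown below. RFC 3492 writes each delta as a generalized variable-length integer, so a larger delta takes more digits, and the RFC asks implementations to detect integer overflow. The sketch uses Python's built-in "punycode" codec with an arbitrarily chosen test string; it only shows that a fairly large delta round-trips and does not settle what the article should say.

```python
# One very high code point after 60 basic characters gives a large delta.
text = "a" * 60 + "\U0010FFFF"
encoded = text.encode("punycode").decode("ascii")

# Everything after the last hyphen is the variable-length encoding of the delta.
basic, _, digits = encoded.rpartition("-")
print(len(digits), digits)

# The codec still round-trips the string.
print(encoded.encode("ascii").decode("punycode") == text)
```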

More examples are needed, especially pure ASCII examples and corner cases

I must admit that I understand very little of this article, especially because there are almost no examples. What I especially miss is how to encode pure basic ASCII strings in Punycode. It only seems to focus on strings that include at least one non-ASCII character, probably because Punycode was created specifically for handling this subset of Unicode strings. But I would assume that Punycode is also able to encode pure basic ASCII strings? At least, that's what the first sentence in the article claims: "Punycode is a representation of Unicode", which means it can handle all strings including basic ASCII strings which is a subset of all Unicode strings. If that is the case, how does it do that? For instance, what is the Punycode encoding of "London"? My guess is that it would be "London", but I can't find that information in the article; the section "Separation of ASCII characters" seems to say that "London" would be encoded as "London-" ("If any characters were copied, an ASCII hyphen is added to the output"), but I find that quite odd. Also, what is the encoding of the empty string? Is it the empty string? Simple examples are needed, preferably a table that especially lists both very common cases like pure basic ASCII strings ("a", "abc", "London", "A string with spaces" if a space is a basic ASCII character) as well as corner cases like the empty string (""), the string "3", the string "-", and a string consisting of a single non-basic ASCII letter such as "ü".

As a first step, though, I would simply add the example of "London" next to the mentioned example of München. But I still don't know the Punycode encoding of "London". I can't even use an online Punycode converter to figure it out, as those I have tried all seem to get it wrong, as they all add "xn--" in the beginning of encodings of non-ASCII strings, which does not seem to be part of Punycode (they seem to mistakenly mix in Nameprep, which is not part of Punycode, in the process).

--Jhertel (talk) 16:11, 15 February 2020 (UTC)
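
On the "xn--" point: that prefix is the ACE prefix that IDNA (RFC 3490) puts in front of the Punycode output, so converters that add it are applying IDNA rather than bare Punycode. Python exposes both as separate codecs, which makes the difference visible. This is a minimal sketch; note that the "idna" codec follows the older RFC 3490 rules.

```python
name = "München"

# Bare Punycode (RFC 3492): case of basic characters is kept, no prefix.
print(name.encode("punycode"))

# The "idna" codec applies Nameprep (which lowercases), then Punycode,
# and finally prepends the ACE prefix "xn--".
print(name.encode("idna"))
```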

As a follow-up, the table below shows what I would like to see the encodings of, as that shows corner cases and gives a quick overview. I have filled out the only one I know the answer to (from the article); the rest are marked ???:

Input                 | Punycode of input | Description of input
----------------------+-------------------+---------------------
""                    | ???               | The empty string.
"a"                   | ???               | Only basic ASCII characters, one, lowercase.
"A"                   | ???               | Only basic ASCII characters, one, uppercase.
"3"                   | ???               | Only basic ASCII characters, one, a digit.
"-"                   | ???               | Only basic ASCII characters, one, a hyphen.
"--"                  | ???               | Only basic ASCII characters, two hyphens.
"abc"                 | ???               | Only basic ASCII characters, more than one, all lowercase.
"London"              | ???               | Only basic ASCII characters, more than one, one uppercase.
"Lloyd-Atkinson"      | ???               | Only basic ASCII characters, one hyphen.
"This has spaces"     | ???               | A string with spaces.
"ü"                   | ???               | No basic ASCII characters, one character.
"αβγ"                 | ???               | No basic ASCII characters, more than one character.
"München"             | "Mnchen-3ya"      | Mixed string, with one character that is not a basic ASCII character.
"Mnchen-3ya"          | ???               | Only basic ASCII characters, equal to the Punycode of "München" (effectively encoding "München" twice).
"München-Ost"         | ???               | Mixed string, with one character that is not basic ASCII, and a hyphen.
"Bahnhof München-Ost" | ???               | Mixed string, with one space, one hyphen, and one character that is not basic ASCII.

I just got the idea to use the Python library codecs with the encoding "punycode", which should be very trustworthy, so I will return soon with the actual punycodes for the table.

--Jhertel (talk) 18:07, 15 February 2020 (UTC)


Okay, so here is the table:

Input                 | Punycode of input        | Description of input
----------------------+--------------------------+---------------------
""                    | ""                       | The empty string.
"a"                   | "a-"                     | Only basic ASCII characters, one, lowercase.
"A"                   | "A-"                     | Only basic ASCII characters, one, uppercase.
"3"                   | "3-"                     | Only basic ASCII characters, one, a digit.
"-"                   | "--"                     | Only basic ASCII characters, one, a hyphen.
"--"                  | "---"                    | Only basic ASCII characters, two hyphens.
"abc"                 | "abc-"                   | Only basic ASCII characters, more than one, all lowercase.
"London"              | "London-"                | Only basic ASCII characters, more than one, one uppercase.
"Lloyd-Atkinson"      | "Lloyd-Atkinson-"        | Only basic ASCII characters, one hyphen.
"This has spaces"     | "This has spaces-"       | A string with spaces.
"ü"                   | "tda"                    | No basic ASCII characters, one character.
"αβγ"                 | "mxacd"                  | No basic ASCII characters, more than one character.
"München"             | "Mnchen-3ya"             | Mixed string, with one character that is not a basic ASCII character.
"Mnchen-3ya"          | "Mnchen-3ya-"            | Only basic ASCII characters, equal to the Punycode of "München" (effectively encoding "München" twice).
"München-Ost"         | "Mnchen-Ost-9db"         | Mixed string, with one character that is not basic ASCII, and a hyphen.
"Bahnhof München-Ost" | "Bahnhof Mnchen-Ost-u6b" | Mixed string, with one space, one hyphen, and one character that is not basic ASCII.

It was created by this Python (3.8) script:

# Each entry is (input string, description of the input).
examples = [
    ("", "The empty string."),
    ("a", "Only basic ASCII characters, one, lowercase."),
    ("A", "Only basic ASCII characters, one, uppercase."),
    ("3", "Only basic ASCII characters, one, a digit."),
    ("-", "Only basic ASCII characters, one, a hyphen."),
    ("--", "Only basic ASCII characters, two hyphens."),
    ("abc", "Only basic ASCII characters, more than one, all lowercase."),
    ("London", "Only basic ASCII characters, more than one, one uppercase."),
    ("Lloyd-Atkinson", "Only basic ASCII characters, one hyphen."),
    ("This has spaces", "A string with spaces."),
    ("ü", "No basic ASCII characters, one character."),
    ("αβγ", "No basic ASCII characters, more than one character."),
    ("München", "Mixed string, with one character that is not a basic ASCII character."),
    ("Mnchen-3ya", 'Only basic ASCII characters, equal to the Punycode of "München" (effectively encoding "München" twice).'),
    ("München-Ost", "Mixed string, with one character that is not basic ASCII, and a hyphen."),
    ("Bahnhof München-Ost", "Mixed string, with one space, one hyphen, and one character that is not basic ASCII."),
]


def punycode(s):
    # Encode with the built-in "punycode" codec and return the result as str.
    return s.encode("punycode").decode("ascii")


def nowrap(s):
    # Wrap text in the {{nowrap}} template so the cell does not break.
    return "{{nowrap|" + s + "}}"


def code(s):
    # Wrap text in <code> tags for monospaced display.
    return "<code>{}</code>".format(s)


def print_row(s, description):
    # Print one wikitable row: input, its Punycode, and the description.
    print("| {} || {} || {}".format(
        nowrap(code(s)),
        nowrap(code(punycode(s))),
        description))
    print("|-")


def print_table(examples):
    # Print the whole table in wikitable markup.
    print('{| class="wikitable"')
    print('|-')
    print('! Input !! Punycode of input !! Description of input')
    print('|-')

    for (s, description) in examples:
        print_row(s, description)

    print('|}')


print_table(examples)

--Jhertel (talk) 19:04, 15 February 2020 (UTC)

Wait a moment…

So, technically, this means that https://xn--xn--wikipedia--.org redirects to https://xn--wikipedia-.org and then to https://wikipedia.org because the conversion method still works even if the original domain is LDH? 154.5.234.189 (talk) 04:55, 17 November 2020 (UTC)

There is no "redirect" at all. The browser will try to use the name exactly as you entered it. An incautious browser might display "xn--xn--wikipedia--.org" as "xn--wikipedia-.org" but would always use "xn--xn--wikipedia--.org" when communicating over the network because that's what you typed in. And real browsers won't decode your example even for display because it doesn't pass the smell test; it looks like you're trying to exploit something. Real browsers only decode a domain for display if that decode results in a string that utilizes exactly one non-ASCII writing system. Your example uses zero. Other phishing attempts use more than one. 108.246.204.20 (talk) 03:01, 15 June 2021 (UTC)
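
The decoding arithmetic behind the question can be sketched with Python's "punycode" codec. `decode_label` and `ACE_PREFIX` are hypothetical names for this illustration. It only shows what a naive, repeated decode of the label would produce; as the reply says, nothing like this happens on the network.

```python
ACE_PREFIX = "xn--"

def decode_label(label):
    """Strip the ACE prefix, if present, and Punycode-decode the rest."""
    if label.startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label

# Naively decode over and over until no ACE prefix is left.
label = "xn--xn--wikipedia--"
while label.startswith(ACE_PREFIX):
    label = decode_label(label)
    print(label)
```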

Section "Encoding of non-ASCII character insertions as code numbers" needs revamp

Maybe the section "Encoding of non-ASCII character insertions as code numbers" should drop the hard-to-follow text that describes the state changes and instead give a formal definition: a proper state machine with a state graph, and/or just pseudocode. Explaining it in plain English was worth a try, but sadly I don't think it really succeeded. At least for me, it isn't possible to work out from the text how the encoding actually works. 2003:EA:7F11:DE00:BC43:6BFF:FE40:1BA5 (talk) 12:38, 17 December 2020 (UTC)
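
As a starting point for such pseudocode, here is a sketch of the encoding procedure in Python, following the parameters and steps of RFC 3492 section 6. The function and helper names (`encode`, `adapt`, `threshold`) are chosen for this sketch, and it leaves out the overflow checks and mixed-case annotation the RFC describes, so treat it as an illustration rather than a reference implementation.

```python
BASE, TMIN, TMAX = 36, 1, 26
SKEW, DAMP = 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"  # digit values 0..35

def adapt(delta, numpoints, first_time):
    """Bias adaptation: rescale delta and derive the next bias."""
    delta = delta // DAMP if first_time else delta // 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + ((BASE - TMIN + 1) * delta) // (delta + SKEW)

def threshold(k, bias):
    """Threshold t for digit position k, clamped to TMIN..TMAX."""
    if k <= bias:
        return TMIN
    if k >= bias + TMAX:
        return TMAX
    return k - bias

def encode(text):
    points = [ord(c) for c in text]
    # Copy the basic (ASCII) code points, then a delimiter if there were any.
    out = [chr(p) for p in points if p < INITIAL_N]
    b = h = len(out)  # h counts code points handled so far
    if b:
        out.append("-")
    n, delta, bias = INITIAL_N, 0, INITIAL_BIAS
    while h < len(points):
        # Next code point to insert: the smallest one not yet handled.
        m = min(p for p in points if p >= n)
        delta += (m - n) * (h + 1)
        n = m
        for p in points:
            if p < n:
                delta += 1
            elif p == n:
                # Emit delta as a generalized variable-length integer.
                q, k = delta, BASE
                while True:
                    t = threshold(k, bias)
                    if q < t:
                        break
                    out.append(DIGITS[t + (q - t) % (BASE - t)])
                    q = (q - t) // (BASE - t)
                    k += BASE
                out.append(DIGITS[q])
                bias = adapt(delta, h + 1, h == b)
                delta = 0
                h += 1
        delta += 1
        n += 1
    return "".join(out)

for word in ["bücher", "München", "αβγ"]:
    print(word, "->", encode(word))
```

The state the text tries to describe is the pair (n, position): delta counts how many such states are skipped before the next insertion, and each delta is written in a variable-length base-36 form whose thresholds depend on the adaptive bias.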