Talk:Unicode and HTML

Character chart links

For the Unicode character charts, I reverted the URL from http://www.unicode.org/charts/normalization/ back to http://www.unicode.org/charts/. The normalization charts only display the characters if you already have the font installed, and they do not seem to be as complete as the full charts available at the other URL. --Nate 15:37 Mar 7, 2003 (UTC)

Decimal vs hexadecimal

Hmm, I wasn't sure whether it was really more "intuitive" to use decimal instead of hexadecimal. I mean, both alternatives obscure the character, and the argument that "older web browsers" fail to parse hexadecimal is moot IMO, since those browsers will have problems with non-8-bit characters anyway. I felt NPOV would dictate saying that in HTML both hexadecimal and decimal can be used, so I changed that. Djmutex 10:51 May 2, 2003 (UTC)
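For illustration, a minimal example of the two equivalent forms (the character U+0151, LATIN SMALL LETTER O WITH DOUBLE ACUTE, is an arbitrary choice):

    <p>&#337;</p>    <!-- decimal reference -->
    <p>&#x151;</p>   <!-- hexadecimal reference, same character: ő -->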

Official 'character set'

The intro says: HTML 4.0 uses Unicode as its official character set.

Does somebody have a link to the place in the specification where this is stated? -- Hirzel

Section 5.1. The use of the term "character set" is misleading because it is so overloaded, but it is accurate. An HTML document must consist of Unicode characters. Those characters are, in turn, encoded (as iso-8859-1, utf-8, etc.). Today I added some text to the article to clarify this point. - mjb 23:48, 19 Aug 2004 (UTC)
I believe that statement is wrong. If I understand http://www.w3.org/TR/html401/charset.html#encodings correctly, it explicitly says that there is no "default" encoding. Instead, "conforming user agents" must be able to map any HTML document to Unicode (for example, to support all defined HTML named entities), and may apply heuristics on the HTML document if no charset is specified explicitly (either in the HTTP header, or a META tag, or a "charset" attribute on a specific element.) -- djmutex 21:20 10 Jun 2003 (UTC)
Replying to myself: the statement isn't really wrong, but it might be misleading. All Unicode characters must be supported by a browser, but there is no "default" character set. As a result, Unicode is the "official" character set, but it's not the default. djmutex 21:23 10 Jun 2003 (UTC)
Well, sorta, but not really. There is confusion resulting from the unfortunate use of the overloaded term "character set" and your apparent misunderstanding that Unicode is itself an encoding in the same sense as a "charset" (it's not).
HTML documents are indeed composed of Unicode characters, always, but Unicode characters are abstract concepts: just "the idea of" a unit in a writing system, mapped to a similarly abstract concept of a non-negative integer number, its "code point". An HTML or XML document is defined as being a sequence of such characters, and therefore is itself an abstract entity. It is only when the document manifests as a sequence of bits/bytes on the network, on disk, or in memory that it has an encoding associated with it. The encoding maps the characters/code points to sequences of bits. You're right that there is no default encoding, at least in the HTML spec itself, but depending on how the document is transmitted, there may be a default of us-ascii or iso-8859-1 (RFCs 2616 and 3023 address this topic). I've modified the article somewhat to explain this without going into too much detail; there are articles devoted to these topics and we don't need to repeat them in depth here. -- mjb 23:48, 19 Aug 2004 (UTC)
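A minimal illustration of the character/encoding distinction: the same abstract one-word document serialized two ways. The characters are identical; only the bytes differ.

    <!-- Declared as ISO-8859-1: é (U+00E9) is stored as the single byte 0xE9. -->
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    <p>café</p>

    <!-- Declared as UTF-8: the same é is stored as the two bytes 0xC3 0xA9. -->
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <p>café</p>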

Unicode display in Windows MSIE

Some multilingual web browsers that dynamically merge the required font sets on demand, e.g., Microsoft's Internet Explorer 5.0 and up on Windows, or Mozilla/Netscape 6 and up cross-platform, are capable of displaying all the Unicode characters on this page simultaneously after the appropriate "text display support packs" are downloaded. MSIE 5.5 would prompt the users if a new font were needed via its "install on demand" feature.

All of the characters in the table display correctly on my Mac's Safari and Firefox (thanks partly to Code2000 and Code2001 fonts). But my stock Windows XP installation doesn't show the last six letters in MSIE 6.0 or Firefox 1.0, and doesn't prompt me to do anything. Is the above passage incorrect, or is there something wrong with my Windows or Explorer? Michael Z. 00:35, 2005 Jan 20 (UTC)

What is a "text display support pack"? That phrase doesn't appear on the Internet, except for this page. Michael Z. 14:20, 2005 Jan 20 (UTC)

The sentence in the article only states that the browsers are able to switch between fonts if these are installed. So your stock XP system doesn't have enough or the right fonts, I assume.
Also note that the method by which Mozilla switches is more flexible: it will switch to another font for even a single missing diacritical character if necessary. Ugly, but better than nothing. See the Nirvana article for examples.
About the exact meaning of "text display support packs", I'm wondering myself.
Pjacobi 21:47, 2005 Jan 20 (UTC)
I'm still confused about that passage. I've been editing pages with Old Cyrillic and IPA characters on them. Windows users complain that they can't see some of the characters unless we put them in a <span> with Arial Unicode MS as the first font choice. The characters are supported by a font present in Windows, but I see no "dynamic merging on demand", and no "install on demand" prompting. I would rewrite the description, but I don't know much about Windows and maybe the original author knows something that I don't know.
In the Nirvana article, MSIE/Win shows the a-macrons. Firefox/Win also shows the n-dots and m-dots, and the font seems to match up with the rest of the page just fine. Both Mac browsers show all of that, plus the Chinese. But on the Mac, the n-dot isn't bold-faced where it should be—like the Moz method you describe.
Michael Z. 00:45, 2005 Jan 21 (UTC)
AFAIK the "dynamic font switching" in MSIE is only a lookup table mapping code ranges and languages to fonts. Fonts which aren't in these lookup tables are never considered for display. Now, if a piece of text is mapped to font X by these MSIE tables, all code points not covered in X will just not display! So, as a poor solution to this problem, MSIE users want an explicit font tag.
In contrast, Mozilla switches based on code point availability in the font. I have a rather plain default font set, and the n-dots and m-dots in Nirvana are displayed by Moz in the also-installed Code2000.
Pjacobi 08:50, 2005 Jan 21 (UTC)
So MSIE/Win just chooses fonts based on the page's charset, or the specified lang? Does it honour lang attributes on DIVs, SPANs or other elements?
In contrast, Moz chooses fonts based on every single character on the page. Once I figure this out, I'll rewrite that paragraph, because the two browsers' behaviour definitely can't be summed up as the same thing. Michael Z. 2005-01-21 17:32Z
It's still a bit of guesswork, so some tests or a really knowledgeable source are needed. My current hypothesis: MSIE/Win can mix different fonts on a page, using explicit fonts and (I suppose) lang tags. And (IMHO) it looks at the actual characters, but not to find a font really including them (I'd say it never asks a font which characters it supports), only to switch to the correct "block". A Chinese character will switch it to the font configured for Chinese (without checking whether that character is really included), but an m-underdot, if anything, only switches to the standard Unicode font. Sorry for the confusion, but at least I haven't programmed it. --Pjacobi 22:52, 2005 Jan 21 (UTC)

"IE5 was the first to use glyphs from 'best available' fonts"[edit]

Mjb, I don't know what Microsoft calls it, but it doesn't pick the right fonts to display all the characters on the page, the way other modern browsers do.

You'll notice that in many places in Wikipedia editors have added code like style="font-family:Arial Unicode MS, Lucida Sans Unicode, sans-serif;" to tables displaying Unicode characters. We have had to develop Template:IPA (documentation) and Template:Polytonic to display IPA and Polytonic Greek characters in MSIE. These are all hacks, aimed only at MSIE on Windows. On a stock Mac or Windows system the necessary fonts are present, and Safari and Firefox display all these characters. But MSIE displays little squares, unless web authors guess which fonts the system might have and specify them in each and every instance where these Unicode characters appear.

Example: Some IPA and obscure Cyrillic characters. Both lines look the same in a stock Mac OS X or Windows XP system in Firefox or Safari. In Explorer, the top line shows squares; the second line works, because Template:IPA explicitly tells it to use the Lucida Sans Unicode font.

un-formatted:
ѫ ѣ ʃ ʒ

in template:IPA:
ѫ ѣ ʃ ʒ
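(The markup behind the two lines is roughly the following; the font list is the one quoted above, and what Template:IPA actually emits may differ in detail:)

    <!-- Unformatted: left to MSIE's own font selection. -->
    <p>ѫ ѣ ʃ ʒ</p>
    <!-- Template:IPA-style: fonts named explicitly so MSIE finds the glyphs. -->
    <p><span style="font-family: Arial Unicode MS, Lucida Sans Unicode, sans-serif;">ѫ ѣ ʃ ʒ</span></p>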

Michael Z. 2005-01-31 07:22 Z

Hi. Yes, I see that IE is having trouble with unformatted text in your example.
I based my assertion on the mention of "font linking" in this paper, presented at the 16th International Unicode Conference back in 2000: New International features of Internet Explorer. I did not research the issue beyond this, but it does appear that IE has at least some support in this regard, and has had it in a less capable form since the IE 4.0 days.
Researching a bit just now, I found another description of the technology: "Font linking is basically the technology that Internet Explorer uses to be able to display characters from multiple languages within a single page at once. So for example, you can have Japanese and Chinese and Korean and Arabic and Devanagari and whatever character set you want, all on the same page. And there are some neat pages of that on the Internet that actually demonstrate this capability. What Internet Explorer does is it looks up certain fonts within the operating system that support this ability called font linking. What that means is that these fonts have the ability, if a character is not within that current font, to be able to look up a character and an associated font. So, for example, you could set your page to display to Japanese and set the font to Mincho, a popular Japanese font. Now let's say you have Korean within the same page. Because of the way Internet Explorer handles this, and the way it keys off this font linking capability, it can identify that the Korean characters aren't within the Mincho font, but it can get references to a Korean font that will handle those characters. And so if you look up a page with both Japanese and Korean, you'll see the Japanese page using the Mincho font and the Korean part of the page will be using GulimChe, or another Korean font." [1]
This makes it sound perfect, and rather automatic, doesn't it? And in fact, on my system, with IE6 on Windows XP SP2, I have no problem rendering this test page. So I would conclude from this that IE is doing the same thing as other browsers; the others apparently just do it better or 'more thoroughly'. Someone will have to do further research in order to determine what the quirks are in IE's built-in font linking. Anyway, I don't think it was correct to assert that IE doesn't do it at all, while these others do.
Various other links I found via Google make it sound like "font linking" is something that one can also do when coding one's own apps (browser-based or standalone) by scripting an IE-specific COM object (MLang) in order to render multilingual text [2]. - mjb 02:34, 1 Feb 2005 (UTC)
I have a stock XP system for testing web sites, and I loaded that test page in my browsers. None displayed the Kanji or Hankaku, presumably because I haven't added any fonts to the system. Firefox displayed the three lines labelled Romanj, but MSIE 6 and Opera 7.5 only showed squares there.
MSIE 6 is ahead of Netscape 4, in that it can display Unicode from multiple encodings on one page. But I have yet to see any instance where it chooses a font other than what is specified in a web page (in very little testing, I admit). I'm curious to know how the font linking works. But in the mean time, in terms of multi-Unicode block display, it's the one browser that I have to do extra work for (as it also is in terms of CSS rendering). Michael Z. 2005-02-1 04:27 Z
Unicode from multiple scripts (writing systems), you mean. Yes, I am curious about it, too. Like I said, "works for me," but I do have Japanese language support installed. (Control Panel > Regional and Language Options > Languages > Install files for East Asian Languages).
The "Romanj" (romaji, I think it's supposed to be… I'm sure there are better example pages out there) lines are using characters from the CJK Fullwidth Forms (U+FF01 to U+FFE5 or so), which are in the Adobe Glyph List. You would think that it is therefore likely that you'd have a font that supports them, but perhaps not. It is possible that you don't, and Firefox is instead "cheating" by substituting glyphs that in the font files are actually mapped to the Latin-1 range.
For purposes of the article, I think we should stop naming and comparing browsers entirely, so as to avoid getting further into advocacy / POV issues, and also because statements about current capabilities of popular browsers do not have much of a shelf life in general. Instead, I think we should just acknowledge that simultaneous display of characters from different scripts is dependent upon the user's installed fonts, and is subject to other technological limitations (e.g., console-based browsers don't even have access to fonts), so naturally, browsers, including the most popular ones, will almost inevitably have varying levels of support for it. - mjb 06:17, 1 Feb 2005 (UTC)
Good points. I was just reacting to a few statements in the article, and hadn't really been thinking of doing any real writing here. You got it right: some browsers are way past their shelf life.
Interestingly, Lynx (browser) has an amazing transliteration engine in it. You can view all kinds of Unicode pages. It does a passable job of rendering Cyrillic and even IPA in Latin characters. I believe it supports straight Unicode too, but I haven't been able to get mine configured right. Michael Z. 2005-02-1 07:45 Z
Don't throw the information out with the POV. We need to stick to facts, and we need to qualify those facts with version numbers. It doesn't help that IE is bloody unpredictable (e.g., I made at least one page that broke on IE for me but didn't for other people).
My guess is IE does use fonts other than those specified when rendering glyphs, but ONLY by mapping specific code points to specific fonts, not by searching for a font that can render the characters it wants. There may well be a configuration setting controlling this, but if there is, I don't know where. Plugwash 19:54, 14 July 2005 (UTC)
I did some more digging and I think what's going on is that it actually does look at other fonts. The problem is that those other fonts must be explicitly associated with the current base font in the registry (look for a key containing 'FontLink', like HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\FontLink\SystemLink). It may be a matter of careful use of language tags in the HTML to make sure a good base font is chosen, and then having the right mappings in the registry to associate that font with the fonts to fall back on. But there are still some nuances to base font selection that I don't understand. Anyone with more experience in this area, please comment! — mjb 20:56, 14 July 2005 (UTC)
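For reference, a sketch of what such a SystemLink entry looks like (the value name and font list here are illustrative; actual entries vary by system and Windows version). Each value is named after a base font and lists "font file,face name" pairs to fall back on, in order:

    HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\FontLink\SystemLink
        Tahoma (REG_MULTI_SZ):
            MSGOTHIC.TTC,MS UI Gothic
            MINGLIU.TTC,PMingLiU
            GULIM.TTC,Gulim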

This sounds promising: per IEBlog, Microsoft's Advanced Technology Center in Beijing is working on improvements to font linking and fallback for IE7. — mjb 08:22, 26 September 2005 (UTC)

Editing Forms and Encoding

Is it true that only Mozilla-based browsers convert characters – not available in the document's encoding – to Unicode entities? Shouldn't this be mentioned in the article? --Hhielscher 09:32, 27 Mar 2005 (UTC)

In Firefox I know I can just paste in characters outside Latin-1 and they end up as entities. I just tried doing the same in IE and it did seem to convert them to entities (though it's not impossible that this was a hack performed by MediaWiki rather than IE doing the right thing).
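A simple way to test this yourself (a sketch; the form target is hypothetical):

    <!-- A page declared as ISO-8859-1 with a plain text field. -->
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    <form method="get" action="/echo">
      <input type="text" name="t">
      <input type="submit">
    </form>
    <!-- Paste a character outside Latin-1 (e.g. the euro sign, U+20AC) and
         submit: browsers that do the entity fallback send t=%26%238364%3B,
         i.e. the numeric reference &#8364;. -->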

Character entity groups

While it is true that the character entities are divided into 3 groups, it does not help the reader's understanding of the relationship between Unicode and HTML to explain this to them. The groupings are basically arbitrary and exist as historical artifacts from the standardization processes that went into defining them (I had a hand in this, albeit very minor). If groupings are to be explained, it'd be better achieved by basing them on the comments from the .ent files, which go to a more precise level of detail that is aligned with the names of Unicode ranges.

I am also tired of cleaning up edits that, while detailed, take a very conversational, not encyclopedic, tone and are rife with errors in spelling, capitalization, punctuation and grammar. If I continue to see these, I am increasingly likely to revert them wholesale, regardless of what useful content they may include. Sorry to be surly, but I get the feeling that some are taking excessive advantage of others' willingness to clean up these mistakes. — mjb 19:45, 14 July 2005 (UTC)

Same characters?

Is the character encoded by an HTML number the same character that is encoded by Unicode by the same number? For example, is character number 2343 in HTML the same as 2343 in Unicode? --Abdull 14:31, 19 August 2005 (UTC)

Yes, it is, by definition (HTML does not define the meaning of the character numbers; it instead defers to Unicode). --cesarb 15:36, 19 August 2005 (UTC)
Yes, the numbers do refer to Unicode code points. However, most HTML entities are decimal (you can do hexadecimal ones, but they aren't seen much), whilst Unicode generally uses hexadecimal when referring to code points. Plugwash 16:30, 19 August 2005 (UTC)
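To make the example above concrete: decimal 2343 is hexadecimal 927, so both of these references name U+0927 DEVANAGARI LETTER DHA:

    <p>&#2343;</p>   <!-- decimal reference -->
    <p>&#x927;</p>   <!-- hexadecimal reference, same code point -->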
Is there any reference showing since which version each browser has supported hexadecimal entities, and which browsers still don't support them? Because hexadecimal is so natural when so many charmap viewers give only hexadecimal Unicode codes… Lacrymocéphale —Preceding unsigned comment added by 217.195.19.145 (talk) 13:02, 3 June 2008 (UTC)

I don't want to use that font!

Some web browsers, such as Mozilla Firefox, Opera, and Safari, are able to display multilingual web pages by intelligently choosing a font to display each individual character on the page. They will correctly display any mix of Unicode blocks, as long as appropriate fonts are present in the operating system.

Code2000 is a great font with many characters, but some characters are rendered pretty badly in Code2000, for example the IPA characters. Since installing Code2000 on my Windows XP, Mozilla Firefox always uses Code2000 for every special character there is to be displayed. How do I tell Firefox to use another font for IPA? Actually, how does Firefox decide which font to use for special characters if there are four different ones to choose from? --Abdull 14:31, 19 August 2005 (UTC)

Try posting on a Mozilla forum; you're likely to get better support there. Plugwash 21:19, 19 August 2005 (UTC)
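For what it's worth, the usual page-side workaround is to name preferred fonts explicitly, so that Code2000 is only reached as a later fallback (the font names here are just examples):

    <span style="font-family: 'Lucida Sans Unicode', 'Arial Unicode MS', 'Code2000', sans-serif;">ʃ ʒ ɲ</span>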

Combiners

Is it possible to express combiners in escaped HTML? E.g., U+0041 followed by U+0308. --207.109.251.117 03:43, 4 November 2005 (UTC)

Of course, why wouldn't it be? Plugwash 00:25, 16 November 2005 (UTC)
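Concretely, for the U+0041 + U+0308 example asked about above:

    <!-- LATIN CAPITAL LETTER A followed by COMBINING DIAERESIS. -->
    <p>&#x41;&#x308;</p>   <!-- hexadecimal references -->
    <p>&#65;&#776;</p>     <!-- the same pair in decimal -->
    <!-- Both render as a decomposed "Ä", font support permitting. -->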

Internet Explorer

On the page it says: "Internet Explorer is capable of displaying the full range of Unicode characters, but can't automatically make the necessary font choice. Web page authors must guess which appropriate fonts might be present on users' systems, and manually specify them for each block of text with a different language or Unicode range. A user may have another font installed which would display some characters, but if the web page author hasn't specified it, then Explorer will fail to display them, and show placeholder squares instead." What is the proper font choice and how would you change it? (I have Internet Explorer and access lots of the math pages on Wikipedia and see lots of "placeholder squares".)--SurrealWarrior 01:33, 4 December 2005 (UTC)

Browser support - entities or no entities?

Is there any difference in browser support for, e.g.,

  1. Character X represented as a named/numeric entity (&mdash;, &#8212;)
    versus
  2. Character X as an actual utf-8 character, in a utf-8 encoded HTML document, properly served?
Not with any modern browser AFAICT, but I think NS4 may have some strange behaviours regarding this. Most of the remaining problems with browser Unicode support have to do with font selection and rendering complex text. Plugwash 17:44, 12 February 2006 (UTC)
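For completeness, the three equivalent ways of putting character X (taking U+2014 EM DASH as the example) into a properly served UTF-8 page:

    <p>&mdash;</p>   <!-- named character entity -->
    <p>&#8212;</p>   <!-- decimal numeric reference -->
    <p>—</p>         <!-- the raw character, UTF-8 bytes 0xE2 0x80 0x94 -->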

What about IE 7?

I've been having a look at this page because I wondered whether the new IE 7 still has that annoying bug of not being able to choose an appropriate font. However, the page only mentions IE 6. Can anybody verify how it is on IE 7? Thanks.


Worst written document I've ever come across

This is by far the worst document I've ever come across. I won't say it's Greek, because I can read Greek; this is just complete utter rubbish. To say an HTML page is Unicode is a bit like saying a cat is a dog. I create HTML pages using Notepad, and I know that my HTML pages can only have a very limited set of characters. At what point does the 8-bit coding of my document turn into the ??-bit coding needed for Unicode? Or does Unicode mean "any character defined by a number", as seems to be the definition used in the opening paragraph?

I was going to make it all a lot more readable by simply removing the whole first paragraph, which is useless to anyone except someone wanting a laugh. —Preceding unsigned comment added by 79.79.206.183 (talk) 11:19, 23 October 2008 (UTC)

Unicode defines a (large) set of numbers known as "code points" and what characters they represent (note however that in some cases Unicode code points do not have a 1:1 mapping to user-visible characters, due to the presence of combining and control characters).
When you write an HTML document you are supposed to* specify what "charset" you are using. This charset defines how the sequence of bytes in your HTML file is interpreted.
The key to understanding the relationship between Unicode and HTML is to understand that HTML regards all charsets as encodings of a subset of Unicode. If you write your HTML documents in WINDOWS-1252 then you can only directly represent characters that are in WINDOWS-1252, but you can still indirectly represent any Unicode character through an entity reference. Alternatively you can write your HTML document in UTF-8 and represent almost all** characters directly.
* If you don't specify what charset you are using, the browser will likely make a default assumption, which may or may not match the charset you actually used.
** A handful of characters (which varies slightly by context) can't be represented directly because they are "markup-sensitive".
-- Plugwash (talk) 01:35, 17 February 2012 (UTC)
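A sketch of the two alternatives described above, using Cyrillic Ж (U+0416) as an example of a character outside WINDOWS-1252:

    <!-- (a) WINDOWS-1252 document: Ж cannot appear directly,
         but a numeric reference still reaches it. -->
    <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
    <p>&#x416;</p>

    <!-- (b) UTF-8 document: the same character can appear directly. -->
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <p>Ж</p>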

Unicode (UTF-8): about 50% of the web in 2010

It looks like Unicode (UTF-8) is about 50% of the web in 2010 (Source: http://3.bp.blogspot.com/_7ZYqYi4xigk/S2Hcx0fITQI/AAAAAAAAFmM/ifZX2Wmv40A/s1600-h/unicode.png and http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html ) — Preceding unsigned comment added by 84.99.17.74 (talk) 20:30, 13 June 2011 (UTC)


Over 60% in 2012, and 80% including ASCII: http://googleblog.blogspot.com/2012/02/unicode-over-60-percent-of-web.html — Preceding unsigned comment added by 86.69.108.41 (talk) 23:53, 16 February 2012 (UTC)

The section Frequency of usage says "the UTF-8 Unicode encoding became the most frequently used encoding on web pages, overtaking both ASCII (US) and 8859-1/1252", which is a quote from the cited Google page. But it doesn't make any sense, given that ASCII is a subset of UTF-8. I'm guessing maybe the Google author was referring to the stated charset of the pages, does anybody know? (Old now, I know...) Mcswell (talk) 01:13, 23 April 2021 (UTC)