Jeppe's Unicode page

This page is UTF-8 encoded. Take a look at the following character(s): ñ

If you see an "n" with a "~" above, your browser understands UTF-8 and you can read this page. If you see something else (typically an "A" with a "~" followed by a plus/minus sign), your browser does not understand UTF-8 and you should find yourself a better browser.

Welcome, καλώς ορίσατε, добро пожаловать

Unicode is a standard that allows you to write in virtually any language in the world, no matter what alphabet that language uses. UTF-8 is a way to put Unicode characters into a document like this one: ASCII characters are represented by just one octet ("byte"), while other characters take two or more octets.
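
To make the octet counts concrete, here is a minimal Python sketch. The sample characters and the helper name utf8_length are my own choices for illustration; the encoding itself is done by Python's built-in UTF-8 codec.

```python
# Count how many octets (bytes) UTF-8 needs for a few sample characters.
# The sample set is arbitrary and chosen only for illustration.
samples = ["A", "\u00f1", "\u03c9", "\u0436", "\u20ac"]  # A, n-tilde, omega, zhe, euro

def utf8_length(ch: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of one character."""
    return len(ch.encode("utf-8"))

lengths = {ch: utf8_length(ch) for ch in samples}

for ch, n in lengths.items():
    # Print the code point rather than the glyph, so the output is plain ASCII.
    print(f"U+{ord(ch):04X}: {n} byte(s)")
```

The ASCII letter takes one byte, the Latin1, Greek and Cyrillic letters take two, and the EURO SIGN takes three.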

Let me show you some examples of what you can do in Unicode:

Here is a Danish word: Blåbærgrød (these are all Latin1 letters).

Here is the French oe ligature: Bœuf.

A Polish l with stroke: Złoty.

Hungarian vowels: Könnyű.

Something bad to say in Turkish: ağır işçi.

The following letters are useful in Esperanto: ĉĝĥĵŝŭ.

This is a famous Greek palindrome: ΝΙΨΟΝ ΑΝΟΜΗΜΑΤΑ ΜΗ ΜΟΝΑΝ ΟΨΙΝ.
The same with small letters may look like: Νίψον ανομήματα μη μόναν όψιν.

But with Unicode you can do much more than this! For example you can use the Greek alphabet or the Cyrillic alphabet (with which Russian, Bulgarian and many other European languages are written). In the heading above, both the Greek καλώς ορίσατε and the Russian добро пожаловать mean "welcome".

So what about all the non-European languages? Arabic and Hebrew? Chinese? Japanese and Korean? And how about very "exotic" (to me) alphabets like Armenian or Tibetan? Oh yes, Unicode supports all these and many more, too.

And you can use Unicode for Mathematics as well!

The cool thing is that you can write all these characters in the same document. Of course, to see everything correctly it does not suffice to have an up-to-date browser: You must also have all the fonts installed properly on your computer. The font installation procedure depends on what operating system your machine is running (Windows, Unix, etc.).

Before Unicode

The "old" standard in HTML and many other text formats is the 8-bit encoding iso-8859-1, also called Latin1. Beyond ASCII, it offers only 96 special characters used in Western Europe. Please note that although Unicode is a superset of Latin1 (the Unicode numbers of these characters equal their Latin1 codes), the UTF-8 byte sequences are different from the 8-bit encoding. The following table gives some useful information:

Character  Latin1    Unicode     UTF-8   Latin1
            code                        interpr.

             A0      00 A0       C2 A0   Â 
   ¡         A1      00 A1       C2 A1   Â¡
   ¢         A2      00 A2       C2 A2   Â¢
   £         A3      00 A3       C2 A3   Â£
   ¤         A4      00 A4       C2 A4   Â¤
   ¥         A5      00 A5       C2 A5   Â¥
   ¦         A6      00 A6       C2 A6   Â¦
   §         A7      00 A7       C2 A7   Â§
   ¨         A8      00 A8       C2 A8   Â¨
   ©         A9      00 A9       C2 A9   Â©
   ª         AA      00 AA       C2 AA   Âª
   «         AB      00 AB       C2 AB   Â«
   ¬         AC      00 AC       C2 AC   Â¬
            AD      00 AD       C2 AD   Â
   ®         AE      00 AE       C2 AE   Â®
   ¯         AF      00 AF       C2 AF   Â¯
   °         B0      00 B0       C2 B0   Â°
   ±         B1      00 B1       C2 B1   Â±
   ²         B2      00 B2       C2 B2   Â²
   ³         B3      00 B3       C2 B3   Â³
   ´         B4      00 B4       C2 B4   Â´
   µ         B5      00 B5       C2 B5   Âµ
   ¶         B6      00 B6       C2 B6   Â¶
   ·         B7      00 B7       C2 B7   Â·
   ¸         B8      00 B8       C2 B8   Â¸
   ¹         B9      00 B9       C2 B9   Â¹
   º         BA      00 BA       C2 BA   Âº
   »         BB      00 BB       C2 BB   Â»
   ¼         BC      00 BC       C2 BC   Â¼
   ½         BD      00 BD       C2 BD   Â½
   ¾         BE      00 BE       C2 BE   Â¾
   ¿         BF      00 BF       C2 BF   Â¿
	   		       
   À         C0      00 C0       C3 80   Ã[80]
   Á         C1      00 C1       C3 81   Ã[81]
   Â         C2      00 C2       C3 82   Ã[82]
   Ã         C3      00 C3       C3 83   Ã[83]
   Ä         C4      00 C4       C3 84   Ã[84]
   Å         C5      00 C5       C3 85   Ã[85]
   Æ         C6      00 C6       C3 86   Ã[86]
   Ç         C7      00 C7       C3 87   Ã[87]
   È         C8      00 C8       C3 88   Ã[88]
   É         C9      00 C9       C3 89   Ã[89]
   Ê         CA      00 CA       C3 8A   Ã[8A]
   Ë         CB      00 CB       C3 8B   Ã[8B]
   Ì         CC      00 CC       C3 8C   Ã[8C]
   Í         CD      00 CD       C3 8D   Ã[8D]
   Î         CE      00 CE       C3 8E   Ã[8E]
   Ï         CF      00 CF       C3 8F   Ã[8F]
   Ð         D0      00 D0       C3 90   Ã[90]
   Ñ         D1      00 D1       C3 91   Ã[91]
   Ò         D2      00 D2       C3 92   Ã[92]
   Ó         D3      00 D3       C3 93   Ã[93]
   Ô         D4      00 D4       C3 94   Ã[94]
   Õ         D5      00 D5       C3 95   Ã[95]
   Ö         D6      00 D6       C3 96   Ã[96]
   ×         D7      00 D7       C3 97   Ã[97]
   Ø         D8      00 D8       C3 98   Ã[98]
   Ù         D9      00 D9       C3 99   Ã[99]
   Ú         DA      00 DA       C3 9A   Ã[9A]
   Û         DB      00 DB       C3 9B   Ã[9B]
   Ü         DC      00 DC       C3 9C   Ã[9C]
   Ý         DD      00 DD       C3 9D   Ã[9D]
   Þ         DE      00 DE       C3 9E   Ã[9E]
   ß         DF      00 DF       C3 9F   Ã[9F]
	   		       
   à         E0      00 E0       C3 A0   Ã 
   á         E1      00 E1       C3 A1   Ã¡
   â         E2      00 E2       C3 A2   Ã¢
   ã         E3      00 E3       C3 A3   Ã£
   ä         E4      00 E4       C3 A4   Ã¤
   å         E5      00 E5       C3 A5   Ã¥
   æ         E6      00 E6       C3 A6   Ã¦
   ç         E7      00 E7       C3 A7   Ã§
   è         E8      00 E8       C3 A8   Ã¨
   é         E9      00 E9       C3 A9   Ã©
   ê         EA      00 EA       C3 AA   Ãª
   ë         EB      00 EB       C3 AB   Ã«
   ì         EC      00 EC       C3 AC   Ã¬
   í         ED      00 ED       C3 AD   Ã
   î         EE      00 EE       C3 AE   Ã®
   ï         EF      00 EF       C3 AF   Ã¯
   ð         F0      00 F0       C3 B0   Ã°
   ñ         F1      00 F1       C3 B1   Ã±
   ò         F2      00 F2       C3 B2   Ã²
   ó         F3      00 F3       C3 B3   Ã³
   ô         F4      00 F4       C3 B4   Ã´
   õ         F5      00 F5       C3 B5   Ãµ
   ö         F6      00 F6       C3 B6   Ã¶
   ÷         F7      00 F7       C3 B7   Ã·
   ø         F8      00 F8       C3 B8   Ã¸
   ù         F9      00 F9       C3 B9   Ã¹
   ú         FA      00 FA       C3 BA   Ãº
   û         FB      00 FB       C3 BB   Ã»
   ü         FC      00 FC       C3 BC   Ã¼
   ý         FD      00 FD       C3 BD   Ã½
   þ         FE      00 FE       C3 BE   Ã¾
   ÿ         FF      00 FF       C3 BF   Ã¿

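The last column shows what the UTF-8 byte sequence would look like if it were mistakenly interpreted as 8-bit encoded iso-8859-1. Bytes in the range 80 to 9F are control codes in iso-8859-1 and have no visible glyph, so they are shown as hex values in square brackets.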
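
The rows of the table can be reproduced with a few lines of Python. This is only a sketch: the function name table_row and the column spacing are my own invention, and it relies on Python's built-in "utf-8" and "latin-1" codecs.

```python
def table_row(code: int) -> str:
    """Build one table row for a Latin1 code in the range 0xA0-0xFF."""
    ch = chr(code)                    # Latin1 codes equal the Unicode code points
    utf8 = ch.encode("utf-8")         # always two bytes for this range
    misread = utf8.decode("latin-1")  # what a Latin1-only reader would display
    utf8_hex = " ".join(f"{b:02X}" for b in utf8)
    return f"{ch}  {code:02X}  00 {code:02X}  {utf8_hex}  {misread}"

print(table_row(0xF1))  # the row for the n with tilde
```

For the rows C0 to DF the second byte decodes to an invisible control code, which is why the table above writes it as [80], [81] and so on instead.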

You can see another table that shows how UTF-8 works.
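
For the curious, here is a hand-written sketch of the bit shuffling behind UTF-8, covering the four lengths used by current UTF-8 (code points up to U+10FFFF). The function encode_utf8 is a hypothetical helper written for this page; in real code you would simply call str.encode("utf-8").

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand.

    Simplification: surrogate code points (U+D800-U+DFFF) are not rejected.
    """
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point < 0x80:            # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:           # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:         # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:       # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of Unicode range")

# Cross-check against Python's built-in encoder for a two-byte character.
print(encode_utf8(0xF1) == "\u00f1".encode("utf-8"))
```

Every byte after the first starts with the bits 10, which is why the second column of UTF-8 bytes in the table above always lies between 80 and BF.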

Addendum: People who want to use the character € (EURO SIGN) should use UTF-8 (as here), not the lame Windows-1252. The EURO SIGN has number U+20AC, which corresponds to the UTF-8 encoding E2 82 AC. In HTML, you can even avoid UTF-8 by using the decimal character reference: &#8364;
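
The numbers for the EURO SIGN can be checked with a short Python sketch. The variable names are mine, and "cp1252" is Python's codec name for Windows-1252.

```python
euro = "\u20ac"  # EURO SIGN, U+20AC

# UTF-8 bytes as space-separated hex.
utf8_hex = " ".join(f"{b:02X}" for b in euro.encode("utf-8"))

# Decimal numeric character reference for use in HTML.
html_ref = f"&#{ord(euro)};"

# The single byte that Windows-1252 uses for the same character.
cp1252_hex = euro.encode("cp1252").hex().upper()

print(utf8_hex, html_ref, cp1252_hex)
```

Windows-1252 puts the EURO SIGN at byte 80, a position that iso-8859-1 reserves for a control code, so the two encodings disagree about what that byte means.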

Links:

The Unicode Consortium
- See graphical representations of all the Unicode characters at charts.unicode.org.
RFC 2279 about the UTF-8 encoding.
Charset names (FTP document).
Specifying the character encoding, a subsection of the HTML 4.0 Specification of W3C.
Roman Czyborra's excellent Unicode/Unix site.
i18n: HTML Character set issues beyond HTML3.2, by A. J. Flavell.
See a UTF-8 invitation/advertisement in many languages! Do your browser and OS understand them all?
