C-Kermit Case Study #08


Article: 10928 of comp.protocols.kermit.misc
From: fdc@watsun.cc.columbia.edu (Frank da Cruz)
Newsgroups: comp.protocols.kermit.misc
Subject: Case Study #8: Unicode
Date: 15 Jan 2000 21:07:28 GMT
Organization: Columbia University

Who doesn't know what Unicode is? Now that computing has become so widespread and Web-centric -- a revolution in itself -- we are on the brink of another revolution in computing, one that can have profound effects on all of us and perhaps even on the course of history.

Until recently, most computer text has been recorded in single-byte 7-bit or 8-bit character sets (1), one per language or language group. For example, the default character set of the Web was originally ISO 8859-1 Latin Alphabet 1, which can encode English plus most West European languages: Italian, Spanish, German, Icelandic, etc. But it can't encode East European languages like Polish, Czech, or Hungarian, even though they use the same alphabet, because the accents are different. Nor can it represent languages like Russian, Arabic, Hebrew, or Japanese that use other writing systems. Therefore, to write in languages other than our own we often have to switch character sets, and as anybody who has tried it can tell you, that's a tricky business. It's even trickier if we need to mix different languages in the same document; for example, Portuguese, Romanian, Russian, and Armenian.
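The coverage problem described above can be seen in a few lines of Python. This is only an illustrative sketch: the sample words and the three character sets are chosen here for demonstration, and Python's built-in codecs stand in for whatever character-set tables a given platform actually uses.

```python
# Sample words chosen for illustration (not from the original article).
SAMPLES = {
    "Spanish": "año",
    "Polish": "Łódź",
    "Russian": "мир",
}

# ISO 8859-1 (Latin-1), ISO 8859-2 (Latin-2), ISO 8859-5 (Latin/Cyrillic).
CHARSETS = ["iso8859_1", "iso8859_2", "iso8859_5"]


def can_encode(text, charset):
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# For each language, list the single-byte sets that can hold its sample word.
coverage = {
    lang: [cs for cs in CHARSETS if can_encode(word, cs)]
    for lang, word in SAMPLES.items()
}

for lang, charsets in coverage.items():
    print(lang, "->", charsets or "none")
```

Each sample word fits in exactly one of the three sets, so a document mixing all three languages cannot be stored in any single one of them.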

The great promise of the Internet is its potential to bring people in all countries together as never before. We can get to know one another and appreciate each other's languages and cultures with unprecedented convenience. And the great lesson of mass computer and Internet culture so far is: for anything to catch on, it has to be easy. Coping with the current Babel of character sets is anything but easy: different platforms use different private character sets (such as PC code pages), which must map to any of an array of standard character sets (such as the ISO Latin alphabets) or to different private character sets on other platforms. If languages are to be mixed, elaborate and often product-specific switching mechanisms are required.

Unicode to the rescue. For more than 10 years, a consortium of corporate, academic, and standards-body representatives has been working to create a single universal character set capable of representing all the world's writing systems. To find out all about Unicode, visit the Unicode Consortium website:

http://www.unicode.org/

Unicode marks a fundamental change in how we compute. A character is no longer represented by a single byte (1); it can occupy one, two, three, four, or more bytes, depending on the Unicode Transformation Format (UTF) used and the specific character involved. But since we have fifty years of software and data using the one-byte-per-character model, the transition to Unicode will be a long process, though one that is well underway.
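To make the variable byte counts concrete, here is a small Python sketch that encodes four characters in three transformation formats and reports how many bytes each takes. The choice of characters is an assumption made for illustration; the big-endian codec variants are used so that no byte-order mark is added to the counts.

```python
# Characters chosen for illustration: a Latin letter, an accented letter,
# the euro sign, and the musical G clef (outside the 16-bit range).
CHARS = ["A", "é", "€", "\U0001D11E"]

FORMATS = ("utf-8", "utf-16-be", "utf-32-be")


def byte_lengths(ch):
    """Return the number of bytes ch occupies in each transformation format."""
    return {fmt: len(ch.encode(fmt)) for fmt in FORMATS}


table = {ch: byte_lengths(ch) for ch in CHARS}

for ch, sizes in table.items():
    print(f"U+{ord(ch):04X}", sizes)
```

In this sample, UTF-8 uses from one to four bytes per character, UTF-16 uses two or four, and UTF-32 always uses four.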

A major part of this transition is the creation of Unicode fonts. The work is being done piecemeal, with each font containing a (perhaps) different subset of Unicode, with additional characters and writing systems added over time. Your computer might already support Unicode to some extent. To check, visit:

http://www.columbia.edu/kermit/utf8.html

This is a no-frills plain-text web page containing text in many languages (2) encoded in Unicode Transformation Format 8 (UTF-8). You might see a lot of "unknown glyph" boxes or gibberish, depending on your browser, font, and locale.
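One common source of the gibberish is a program reading UTF-8 bytes as if they were a single-byte character set. The following Python sketch reproduces the effect under that assumption; the sample word is chosen here for illustration and is not necessarily taken from the page itself.

```python
# The word "Greek" written in Greek (sample chosen for illustration).
text = "Ελληνικά"

# The bytes as a UTF-8 page would deliver them.
raw = text.encode("utf-8")

# Correct interpretation: the original text comes back intact.
as_utf8 = raw.decode("utf-8")

# Wrong interpretation: every byte is treated as one Latin-1 character,
# so each Greek letter turns into two unrelated characters.
as_latin1 = raw.decode("iso8859_1")

print(len(text), "characters,", len(raw), "bytes")
print("read as UTF-8:  ", as_utf8)
print("read as Latin-1:", repr(as_latin1))
```

The bytes are identical in both cases; only the interpretation differs. The "unknown glyph" boxes are a separate problem: there the text was decoded correctly but the font has no shape for the character.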

Now visit:

http://www.hclrss.demon.co.uk/unicode/fonts.html

for a survey of Unicode fonts, to see how you might be able to broaden the horizons of your own computer right now. Try installing an updated font and visiting the UTF-8 Sample page again.

What you see marks a great leap forward: a vendor-neutral, application-independent method for encoding text in many languages -- and some day, we hope, all languages. Unlike other Web pages you might have seen, there are no tricks here -- for example, no GIFs to represent Chinese or Hebrew. It's just plain text. You can select and copy it like any other text, but whether you can paste it into another application depends on the other application. On Windows 95 and later, for example, you can paste it into Microsoft Word if it has a Unicode font such as Arial or Times New Roman selected, and see several of the non-Roman scripts (but not necessarily all of them).

The Kermit Project has been a member of the Unicode Consortium for years, and now C-Kermit and Kermit 95 support Unicode as a transfer character-set, a file character-set, and a terminal character-set. All of a sudden you have convenient cross-platform tools for migration to Unicode and interfacing between Unicode and traditional environments. For example:

C-Kermit's Unicode support is integrated with all its other character-sets, which cover:

Most of what you see on the UTF-8 Sample Page, you can also see on your Kermit 95 screen; it's "just" a matter of having the right font.
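The general idea of converting between a traditional character set and Unicode can be sketched outside Kermit in a few lines of Python. This is not how C-Kermit implements its translations; the `transcode` helper and the sample strings are hypothetical, and Python's built-in codecs do the actual work.

```python
def transcode(data: bytes, source: str, target: str, errors: str = "strict") -> bytes:
    """Decode bytes from the source character set and re-encode them in the target.

    Hypothetical helper for illustration; `errors` is passed to the encoder.
    """
    return data.decode(source).encode(target, errors=errors)


# Migration: a Latin-1 text becomes UTF-8 (sample text chosen for illustration).
latin1_bytes = "Grüße aus Köln".encode("iso8859_1")
utf8_bytes = transcode(latin1_bytes, "iso8859_1", "utf-8")

# Interfacing: the same text goes back to Latin-1 without loss.
round_trip = transcode(utf8_bytes, "utf-8", "iso8859_1")

# Going from Unicode to a smaller set can lose characters; with
# errors="replace" each unrepresentable character becomes "?".
lossy = transcode("Łódź".encode("utf-8"), "utf-8", "iso8859_1", errors="replace")

print(len(latin1_bytes), "bytes in Latin-1,", len(utf8_bytes), "bytes in UTF-8")
print("round trip intact:", round_trip == latin1_bytes)
print("lossy result:", lossy)
```

Converting into Unicode is always safe for characters the source set defines, while converting out of it needs a policy for characters the target set lacks.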

As usual, I've rambled on longer than planned and still only scratched the surface. For greater detail, read Section 6.6 of the C-Kermit 7.0 Update Notes.

Notes:

  1. Oversimplification. Traditional East Asian character sets, among others, use various multibyte encodings.
  2. If you can add languages to this page, please let me know.
  3. To learn about Unicode support in Linux, visit http://www.cl.cam.ac.uk/~mgk25/unicode.html.

- Frank



C-Kermit / Columbia University / kermit@columbia.edu / 15 Jan 2000