Title: Intelligent Character Encoding

Speaker: Dr. Ching-Chun Hsieh
Institute of Information Science, Academia Sinica
P.O. Box 1-7, Nankang, 115 Taipei, Taiwan, ROC
Phone: (O) 886-2-2782-5472, 886-2-2788-3799 X1409 (with voice mail)
Phone: (H) 886-2-2782-4388, Fax: 886-2-2782-4814
E-mail: hsieh@sinica.edu.tw, hsieh@gate.sinica.edu.tw

Summary:
Missing or unencoded characters pose practical problems for language processing applications. The problems are particularly serious for large character sets, such as the Chinese character set and other Han-related character sets. Applications that require a large character set, such as e-libraries and historical article archives, are often hampered by the small character sets currently popular in non-critical applications. This tutorial aims to resolve the problems that such missing characters cause in Chinese language processing. The main problems with missing characters are reviewed first. We then introduce some theoretical approaches, based on the sub-structures of Chinese characters, for encoding those missing characters. The structural hierarchy of the sub-structures in a very large set of Chinese characters is reviewed. Databases for the sub-structures are maintained in compliance with the structural hierarchy of the Chinese characters, as outlined in the theoretical parts. We then demonstrate how the structural components of Chinese characters can be used to encode missing characters in an unambiguous and portable way. Some practical implementations for office and web applications will also be demonstrated.
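The idea of encoding a missing character through its structural components can be illustrated with a small sketch. The tutorial's own encoding scheme and databases are not specified here, so the example below borrows, purely as an assumption, the Unicode Ideographic Description Characters (U+2FF0 to U+2FFB) as composition operators. The names `Node`, `encode` and `decode` are hypothetical. A plain string plays the role of a terminal component, and a `Node` plays the role of a non-terminal one.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# Assumption: Unicode Ideographic Description Characters serve as the
# composition operators. All take two operands except U+2FF2 and U+2FF3.
ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
ARITY["\u2FF2"] = 3  # left-middle-right
ARITY["\u2FF3"] = 3  # above-middle-below


@dataclass(frozen=True)
class Node:
    """Non-terminal: an operator applied to sub-structures (hypothetical type)."""
    op: str
    parts: Tuple["Tree", ...]


# A terminal is a single already-encoded component character.
Tree = Union[str, Node]


def encode(tree: Tree) -> str:
    """Serialize a structure tree as a prefix-notation string."""
    if isinstance(tree, str):
        return tree
    if ARITY.get(tree.op) != len(tree.parts):
        raise ValueError("operator/operand count mismatch")
    return tree.op + "".join(encode(p) for p in tree.parts)


def decode(seq: str) -> Tree:
    """Parse a prefix-notation string back into a structure tree."""
    def parse(i: int):
        if i >= len(seq):
            raise ValueError("truncated sequence")
        ch = seq[i]
        if ch not in ARITY:
            return ch, i + 1          # terminal component
        parts, j = [], i + 1
        for _ in range(ARITY[ch]):    # read as many operands as the operator needs
            part, j = parse(j)
            parts.append(part)
        return Node(ch, tuple(parts)), j

    tree, end = parse(0)
    if end != len(seq):
        raise ValueError("trailing data after a complete sequence")
    return tree


# Usage: describe a character as "left part + right part", nested one level.
xiang = Node("\u2FF0", ("氵", Node("\u2FF0", ("木", "目"))))
sequence = encode(xiang)
assert decode(sequence) == xiang
```

Because each operator has a fixed number of operands, the prefix string parses in only one way, which is one simple way to obtain the unambiguous and portable property the summary asks for. A real system would also need a component database and a rule for choosing one canonical decomposition per character.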
Tutorial Outline:
- Theoretical Issues:
  - missing character problems for large character set apps
  - document archiving, missing fonts, variant characters, document indexing, retrieval and information portability
  - substructures of Chinese characters
  - hierarchy of substructures and characteristics of terminal and non-terminal nodes
  - classification, relationship and statistics
  - missing character encoding
- Implementation Issues:
  - document editing, preview and printing
  - add-ons: MS Excel, FrontPage, Word
  - web pages and Java applets
  - web server, clients, and central server

Bio:

Education:
Ph.D./CS National Chiao-Tung University, Hsinchu, Taiwan, 1972.
M.S./CS National Chiao-Tung University, Hsinchu, Taiwan, 1966.
B.E./EE National Taiwan University, 1963.

Academic Appointments:
Research Fellow, Institute of Information Science, 1983-present.
Adjunct Professor, Department of Library and Information Science, National Taiwan University, 1990-present.
Visiting Scholar, Institute for Research in Humanities, Kyoto University, 1997, 9.1-10.15.
Professor and Head, Department of Electronic Engineering, National Taiwan Institute of Technology, 1977-1983.
Director, Graduate School of Engineering and Technology, National Taiwan Institute of Technology, 1980-1983.
Professor, Computer Science Department, National Chiao-Tung University, 1974-1977.
Associate Professor, Computer Science Department, National Chiao-Tung University, 1971-1974.
Visiting Scientist, RLE, EE Dept., MIT, Cambridge, Massachusetts, 1969-1971.
Instructor, Graduate School of Electronic Engineering, National Chiao-Tung University, 1966-1969.

Professional Appointments:
Researcher, the Science and Technology Advisory Group of the Executive Yuan, 1988-1999.
Member of the At-large Committee of the National Information Infrastructure Initiative, Executive Yuan, 1995-present.
Founder and Chairman, ROC Computational Linguistic Society, 1986-1989.

Major Fields of Interest:
Chinese Language Information Processing System
The Studies on the Font and Glyph of Chinese Character
Information Policy, Information Society, and Information Ethics
Digital Library/Museum for Chinese Classics

About Dr. Ching-Chun Hsieh:
Dr. Ching-Chun Hsieh joined the Institute of Information Science as a research fellow in the summer of 1983. From 1984 to 1990, he served as Director of the Computing Center of Academia Sinica. During those years, he established the computing facilities and the campus network for Academia Sinica, and developed the Chinese-language full-text processing capability for Sinology studies, such as the 25 Dynastic Database.

Dr. Hsieh has extensive experience consulting for governmental organizations. At present, he is a part-time researcher at the Science and Technology Advisory Group of the Executive Yuan, and he holds an adjunct professorship at the Department of Library and Information Science of National Taiwan University. In the information science area, he was invited to conduct two important research projects for government information policy planning in the past two years: "A Preliminary Studies on the Cultural and Social Impact of Information Technology" in 1997 and "Planning Chinese Language Teaching on Internet" in 1998.

In 1988, Dr. Hsieh founded the ROC Computational Linguistic Society (abbr. ROCLing) and was elected president of the society for two years. He later served twice as a member of the ROCLing board of directors. Dr. Hsieh has also actively participated in ISO/IEC JTC1/SC18/WG8 affairs since 1988. In this working group, he helps ISO/IEC develop document processing standards with Chinese language capability. One of his early contributions to standards development was the design of the Chinese Character Code for Information Interchange (abbr. CCCII) in 1980. A proper subset of its original character set, renamed the East Asian Character Code (EACC), has been adopted as a U.S. standard for Chinese, Japanese and Korean processing since 1989.

Dr. Hsieh is now in charge of the Document Processing Laboratory of the Institute of Information Science. Several very successful projects have been completed in this lab since 1990, including a project to develop a new Hypertext Retrieval System and Multi-Version Control for Ancient Chinese Document, a project on the Auto-classification of Chinese Text based on the Term-Document Vector-Space Model, and a project to develop a Glyph Database on Internet for Chinese characters. Ongoing projects in the lab include a project to solve the Missing Character Problem and a project to develop a Workstation for Sinology Studies.

In addition, Dr. Hsieh was invited by the National Science Council to direct the Digital Museum/Library Program from July 1998 to November 1999. The yearly budget of this program is approximately 2 million USD. He has also served as Vice Chairman of the Coordinate Committee for the Digital Resource Development of Classical Chinese Document in Academia Sinica since 1996.

Since October 1999, Dr. Hsieh has been the Coordinator for preparing a 5-year National Digital Archive (NDA) Project. In its present phase, the NDA project includes 7 organizations: the National Palace Museum, the National Library, the National Museum of History, the National Museum of Natural Science, National Taiwan University, the Taiwan Provincial Archive Council, and Academia Sinica. The project is to be launched in January 2001, with an estimated yearly budget of around 8 to 10 million USD.

Updated February 24, 2000