Title: Treebanking and Parsing

Speaker: Dr. Keh-Jiann Chen
Institute of Information Science, Academia Sinica
115 Taipei, Taiwan, R.O.C.
Phone: (886)-2-2788-3799 ext 1510
e-mail: kchen@iis.sinica.edu.tw

Summary:

In this tutorial, issues in Chinese language parsing and its relation to grammars and treebanks will be addressed. In particular, the focus will be on the difficulties of processing the Chinese language.

Parsing natural language sentences makes use of many different knowledge sources, such as lexical, syntactic, semantic, and common-sense knowledge. Preparing a knowledge bank is a very difficult task, since the amount of knowledge is vast and it is not well organized. Corpus-based approaches provide a way of extracting different kinds of knowledge automatically. From part-of-speech tagged corpora to treebanks annotated with syntactic structure, each resource contributes more explicit linguistic knowledge at a different level, allowing better automation of knowledge extraction. Treebanks provide an easy way of extracting grammar rules and their occurrence probabilities. In addition, word-to-word relations are precisely associated. This raises the following important issues. How are treebanks prepared? Which tag sets are suitable for Chinese treebanks? How many annotated tree structures are sufficient in a treebank for the purpose of grammar generation? What are the tradeoffs between grammar coverage and ambiguity? We will try to answer these questions in this tutorial.

Regarding sentence parsing, some special characteristics of the Chinese language cause difficulties and make processing Chinese different from processing other natural languages. For instance, there are no blanks to mark word boundaries, and new words can easily be coined in Chinese text. The lack of inflectional marking for grammatical functions and the relatively free order of constituents cause ambiguities and difficulties in Chinese sentence parsing.
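
As a small illustration of the word-boundary problem described above, the following sketch segments one unspaced sentence with two textbook heuristics, forward and backward maximum matching. It is not taken from the tutorial materials: the toy lexicon, the example sentence, and the function names are assumptions chosen for illustration, and real segmenters use far larger lexicons and statistical disambiguation.

```python
# Toy lexicon (an assumption for this example, not a real dictionary).
LEXICON = {"研究", "研究生", "生命", "起源"}
SENTENCE = "研究生命起源"  # "study the origin of life"


def forward_max_match(text, lexicon, max_len=3):
    """Scan left to right, always taking the longest lexicon word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is accepted as a fallback word.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def backward_max_match(text, lexicon, max_len=3):
    """Scan right to left, always taking the longest lexicon word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                j -= size
                break
    words.reverse()
    return words


fmm = forward_max_match(SENTENCE, LEXICON)
bmm = backward_max_match(SENTENCE, LEXICON)
print(fmm)  # ['研究生', '命', '起源']
print(bmm)  # ['研究', '生命', '起源']
print("ambiguous" if fmm != bmm else "consistent")
```

The two scans disagree on this sentence: the forward scan greedily takes 研究生 ("graduate student") and strands 命, while the backward scan yields the intended reading 研究 / 生命 / 起源. Such disagreement is one simple signal of a segmentation ambiguity that a later stage has to resolve.
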
We will try to provide solutions to the above issues, both those occurring at the word segmentation stage, such as segmentation ambiguities and unknown word identification, and those occurring at the parsing stage, such as insufficient grammar coverage, structural ambiguities, and thematic role assignment.

Tutorial Outline:

1. Introduction to Chinese treebanking and parsing
   a. Preparation of Chinese treebank
   b. Characteristics of Chinese language and their relations to Chinese language processing
2. Uses of treebanks and grammar extraction
   a. Searching tools
   b. Grammar extraction from treebanks
   c. Grammar coverage and ambiguities
3. Lexical analysis
   a. Word segmentation
   b. Unknown word identification
4. Structure analysis and role assignment
   a. Knowledge for sentence parsing
   b. Resolution of structure ambiguities
   c. Thematic-role assignments

Bio:

Dr. Chen's current research interests include Chinese language processing, lexical semantics, and corpus linguistics. He is developing research environments for Chinese natural language processing, including lexical databases, tagged corpora, treebanks, and parsers.