Software Tools for Some Natural Language Texts Computer Processing

Source: Computer Technology and Application
  Abstract: Software tools are developed for the computer implementation of syntactic, semantic, and morphological models of natural language texts, using rule-based programming. The tools are efficient for a language such as Georgian, which has free word order and a rich morphological structure. For instance, a Georgian verb has several thousand verb-forms. Expressing the rules of morphological analysis with a finite automaton is very difficult and would also be inefficient, and some problems of full morphological analysis of Georgian words cannot be solved by a finite automaton at all. Splitting some Georgian verb-forms into morphemes requires a non-deterministic search algorithm, which needs many backtrackings. To minimize backtracking, the constraints that hold among morphemes must be stated and verified as early as possible, so that false search directions are avoided. The tool for syntactic analysis provides means to reduce the number of rules that have the same members in different orders. The authors used the same tool for semantic analysis as well. The proposed tools thus offer many means to construct an efficient parser and to test and correct it. The authors implemented morphological and syntactic analysis of Georgian texts with these tools. This paper describes the software tools and their application to the Georgian language.
  Key words: Software tools, parsing algorithm, backtracking, syntactic analyzer, constraints.
  1. Introduction
  Computer morphological analysis of Georgian words is one of the main components for solving such problems as machine translation from Georgian into other languages, automated spell checking of Georgian texts, and those problems of artificial intelligence that require computer processing of Georgian texts. A full system for computer morphological analysis of Georgian words does not yet exist. If Georgian is to be used for communicating with a computer, solving this problem is urgent. Solving it with a finite automaton, the approach widely used for Western European languages, is not feasible, because recognizing some Georgian verb-forms requires backtracking, which a finite automaton cannot perform. On the other hand, a full search algorithm slows down morphological analysis. For this reason, we proposed a method that makes the analysis faster than full search [1]. The method uses constraints to establish the correct selection of morphemes. Once the morphological analysis tool has separated a presumable morpheme from a word, it checks whether the morpheme satisfies its constraints. If the constraint is satisfied, the tool continues separating further morphemes. Otherwise, it rejects the last separated morpheme and backtracks to search for new alternatives. In this way incorrect alternatives are removed in advance, which speeds up the search. The constraints are logical expressions composed from the features of morphemes. The tool checks whether a feature of the separated morpheme has the particular value that determines the correctness of the separation. We assign the values of morpheme features according to the morphology of the Georgian language.
By full computer morphological analysis we mean finding all valid splittings of a word-form into morphemes and establishing the morphological categories for each splitting. This definition covers ambiguous words. The following kinds of ambiguity are widespread:
  (1) Graphical coincidence of verb-forms in the present circle that differ in meaning but have the same root. For instance, the verb-form “agebs” may mean “loses” (money) or “builds” (a plan), and so on;
  (2) Graphical coincidence of a verb-form with its infinitive; for instance, “amoxsna” may mean “resolution” or “he has resolved”;
  (3) During the splitting of a verb-form, graphical coincidence of morphemes from different neighboring classes; for instance, “a” may be a preverb, a vowel prefix, or the first letter of a verb root in the verb-forms “a-a-alebs”, “a-alebs”, and “aldeba” (inflame). On seeing the first letter of the verb-form “aaalebs”, we cannot say which morpheme it is before we have seen the following two letters. In the first example, the first “a” is a preverb; in the second, it is a vowel prefix; and in the third, it is the first letter of the root “al”. This means that splitting Georgian verbs into morphemes needs at least a parsing algorithm for an LL(2) grammar [2], i.e., full morphological analysis of Georgian words by a finite automaton is impossible.
  In the second case, morphological analysis of the form “amoxsna” must give two different parses: one for the infinitive and one for the verb-form. This requires a non-deterministic algorithm, since a deterministic algorithm cannot give two different parses for the same word-form. Thus, a deterministic algorithm is not adequate for full morphological analysis of Georgian words. Other authors have carried out morphological analysis of Georgian words by finite automata or by deterministic algorithms [3-4].
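  Case (3) can be illustrated with a minimal Python sketch of a left-to-right depth-first search with backtracking. The slot names, the tiny lexicon, and the rule that the ending “ebs” requires a vowel prefix are assumptions made only so that the three forms from the text come out as described; they are not a description of Georgian grammar or of the authors' tool.

```python
from typing import Dict, Iterator, List, Tuple

# Morpheme slots in the order in which they may occur in the toy verb-form.
SLOTS = ["preverb", "vowel_prefix", "root", "ending"]

# Hypothetical toy lexicon; "" means that the slot is left empty.
LEXICON: Dict[str, List[str]] = {
    "preverb": ["a", ""],
    "vowel_prefix": ["a", ""],
    "root": ["al"],
    "ending": ["ebs", "deba"],
}

# Toy constraint (an assumption): does this ending require a vowel prefix?
NEEDS_VOWEL_PREFIX = {"ebs": True, "deba": False}


def split(word: str, slot: int = 0, pos: int = 0,
          chosen: Tuple[str, ...] = ()) -> Iterator[Dict[str, str]]:
    """Yield every splitting of `word` that fills the slots left to right."""
    if slot == len(SLOTS):
        parse = dict(zip(SLOTS, chosen))
        whole_word_used = pos == len(word)
        prefix_ok = NEEDS_VOWEL_PREFIX[parse["ending"]] == bool(parse["vowel_prefix"])
        if whole_word_used and prefix_ok:
            yield parse
        return
    for form in LEXICON[SLOTS[slot]]:
        if word.startswith(form, pos):
            # Try this alternative; returning from the recursion is the backtrack.
            yield from split(word, slot + 1, pos + len(form), chosen + (form,))


def first_letter_roles(word: str) -> List[str]:
    """Name the slot that owns the first letter in each accepted parse."""
    return [next(s for s in SLOTS if p[s]) for p in split(word)]


for w in ("aaalebs", "aalebs", "aldeba"):
    print(w, first_letter_roles(w))
```

  All three words begin with “a”, yet the slot that owns this letter is settled only after later letters have been read, and the search returns a list of parses, so a form with several valid splittings would yield several results.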
  For full morphological analysis, we must apply a non-deterministic algorithm, for instance a left-to-right depth-first search with backtracking. Since backtracking reduces the speed of the algorithm, we must find a method that reduces it. Such a possibility exists. We can exclude morphemes that conflict with the morphemes found so far. Alternatively, we can classify morphemes so that at most one representative of each class occurs in a word-form. Among the morphemes of a verb-form, the root is the most important. We can divide verb-forms into classes so that each morpheme that can occur in a word-form indicates only one morphological category. All this considerably reduces backtracking and makes it easy to establish the morphological categories of a word. We implemented full morphological analysis of Georgian words with the tools [5-7]. We use formal grammar rules with constraints for morphological as well as for syntactic and semantic analysis. Constraints are logical expressions built on the feature values of grammar symbols. A constraint placed on a grammar rule determines whether the rule is applicable in a concrete case: we use the rule if the value of the constraint is true; otherwise we must take another rule. In effect, a constraint defines the context in which the rule is applicable. The parsing algorithm and the rule choice strategy determine which rule is chosen for verification of its applicability. The paper is organized as follows: Section 2 describes the software tools. Section 3 presents the definition of feature structures. Section 4 gives the formal definition of a constraint. Section 5 describes our approach to morphological analysis. Section 6 gives an example of grammar file composition. Section 7 explains the syntactic analyzer. Section 8 is the conclusion and Section 9 is the acknowledgment.
  2. Software Tools
  The “Software Tools for Morphological and Syntactic Analysis of Natural Language Texts” is a software system designed for processing natural language texts. We used the system to find the syntactic and morphological structure of Georgian texts. A specific formalism, which we created for this purpose, allows us to write down the syntactic and morphological rules defined by the grammar of a particular natural language. This formalism represents a new, comprehensive approach to morphological and syntactic analysis for some natural languages. We implemented a software system according to this formalism [1]. With it, one can carry out syntactic analysis of sentences and morphological analysis of word-forms. We designed several special algorithms for this system. The formalisms described in Refs. [8-9] are very difficult to use for Georgian, because expressing some morphological rules in them is very complicated and the resulting notation is hard to understand.
  The system consists of two parts: a syntactic analyzer and a morphological analyzer. The purpose of the syntactic analyzer is to parse an input sentence, to build a parse tree that describes the relations between the individual words within the sentence, and to collect the information about the input sentence that the system derives during analysis. Both analyzers must be given a grammar file, in which the syntactic or morphological rules of the particular natural language are recorded. The syntactic analyzer also needs information about the grammatical categories of the word-forms, which it uses during analysis.
  The basic methods and algorithms used to develop the system are: operations defined on feature structures; a backtracking algorithm (for the morphological analyzer); and a general syntactic parsing algorithm for context-free grammars with constraints. Feature structures are widely used at all levels of analysis. We use them to hold various information about dictionary entries and information obtained during analysis. Each symbol in a morphological or syntactic rule has an associated feature structure, which is filled initially from the dictionary or by the previous levels of analysis. Feature structures and the operations defined on them are used to build feature constraints. With the general parsing algorithm, it is possible to obtain a syntactic analysis of any sentence defined by a context-free grammar and simultaneously to check the feature constraints associated with grammar rules. Feature constraints are logical expressions composed of the operations defined on feature structures. We attach feature constraints to the rules defined in a grammar file. If a constraint is not satisfied during analysis, the system rejects the current rule and the search continues. Feature constraints can also be attached to morphological rules. However, unlike in syntactic rules, in a morphological rule constraints can be attached at any place, not only at the end. This speeds up morphological analysis, because the system checks constraints early and promptly rejects an incorrect division of a word-form into morphemes.
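  The effect of placing a constraint inside a morphological rule rather than at its end can be sketched in Python as follows. The morpheme classes “P”, “R”, “S”, their forms, and the constraint are invented for illustration (they are neither Georgian nor the notation of the tools); the counter records how many lexicon forms the search tries.

```python
from typing import Callable, Dict, List, Optional, Union

Morpheme = Dict[str, str]
Constraint = Callable[[List[Morpheme]], bool]
RuleItem = Union[str, Constraint]

# Invented morpheme classes and forms: class -> {form: features}.
LEXICON: Dict[str, Dict[str, Dict[str, str]]] = {
    "P": {"a": {"type": "x"}, "ab": {"type": "y"}},
    "R": {"b": {}, "bc": {}, "c": {}},
    "S": {"c": {}, "d": {}, "cd": {}},
}


def prefix_is_x(found: List[Morpheme]) -> bool:
    """Toy constraint: the first morpheme found must have type "x"."""
    return found[0]["type"] == "x"


def parse(rule: List[RuleItem], word: str, stats: Dict[str, int],
          pos: int = 0, found: Optional[List[Morpheme]] = None) -> List[List[str]]:
    """Depth-first matching of `rule`; constraints are checked where they stand."""
    found = found or []
    if not rule:
        return [[m["form"] for m in found]] if pos == len(word) else []
    head, rest = rule[0], rule[1:]
    if callable(head):
        # A failed constraint cuts off this branch immediately.
        return parse(rest, word, stats, pos, found) if head(found) else []
    results: List[List[str]] = []
    for form, feats in LEXICON[head].items():
        stats["tried"] += 1
        if word.startswith(form, pos):
            morpheme = dict(feats, form=form)
            results += parse(rest, word, stats, pos + len(form), found + [morpheme])
    return results


early, late = {"tried": 0}, {"tried": 0}
early_parses = parse(["P", prefix_is_x, "R", "S"], "abcd", early)
late_parses = parse(["P", "R", "S", prefix_is_x], "abcd", late)
print(early_parses == late_parses, early["tried"], late["tried"])
```

  Both rules accept the same splittings, but the rule that checks the constraint right after the first morpheme abandons the branch starting with “ab” before any further forms are tried.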
  The formalism we developed for syntactic and morphological analysis is convenient for human authors. It has many constructions that make it easier to write a grammar file. The morphological analyzer has a built-in preprocessor. The program runs under UNIX and Windows. It can be compiled and used on any other platform that has a modern C++ compiler.
  3. Feature Structures
  A feature structure is a specific data structure: essentially, a list of “Attribute-Value” pairs. The value of an attribute (field) may be either atomic or a feature structure itself. The definition is recursive, so we can build complex feature structures with sub-structures nested to any depth.
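  As a rough illustration of this recursive definition, a feature structure can be modelled in Python as a nested dictionary. The attribute names and the helpers `get_path` and `depth` are hypothetical and are not part of the formalism described here.

```python
from typing import Any, Dict, Sequence

FeatureStructure = Dict[str, Any]

# Hypothetical entry: atomic values are strings, nested values are dicts.
word_form: FeatureStructure = {
    "cat": "noun",
    "agr": {"number": "pl", "case": "nom"},
}


def get_path(fs: FeatureStructure, path: Sequence[str]) -> Any:
    """Follow a sequence of attribute names; return None if the path is absent."""
    node: Any = fs
    for attribute in path:
        if not isinstance(node, dict) or attribute not in node:
            return None
        node = node[attribute]
    return node


def depth(fs: FeatureStructure) -> int:
    """Nesting depth: 1 for a flat structure, more for nested sub-structures."""
    nested = [depth(v) for v in fs.values() if isinstance(v, dict)]
    return 1 + max(nested, default=0)


print(get_path(word_form, ["agr", "case"]), depth(word_form))
```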
  Feature structures are widely used in Natural Language Processing. They are commonly used:
  • To hold initial properties of lexical entries in the dictionary;
  • To put constraints on parser rules. Certain operations defined on feature structures are used for this purpose;
  • To pass data across different levels of analysis.
  We use the following notation to represent feature structures in our formalism. The list of “Attribute-Value” pairs is enclosed in square brackets, and attributes are separated from values by a colon “:”. For example:
  A shorthand notation can also be used for constructing feature structures. The above example can be rewritten this way:
  The content of the feature structures listed in parentheses at the beginning is copied to the newly constructed feature structure.
  Below is a fragment of a formal grammar for defining feature structures in our formalism:
   ::= “[”[][] “]”
   ::= “(” {} “)”
   ::= |
   ::= { }
   ::= “:”
   ::=
   ::= “+” | “-” | |
  | |
  . . .
  Several operations are defined on feature structures to perform comparison and/or data manipulation. The best-known operation on feature structures is unification. In addition to unification, we have introduced other useful operations that simplify the composition of a grammar file in practice. The result of each operation is a Boolean constant, “true” or “false”. Below is a list of all implemented operations and their semantics:
  Assignment: the content of the RHS (Right Hand Side) operand (B) is assigned to the LHS (Left Hand Side) operand (A). The assignment operation always returns “true”.
  Check on equality: this operation does not modify the content of the operands. The result is “true” when both operands (A and B) have the same fields (attributes) with identical values. If one feature structure has a field that is not present in the other, or the same fields do not have equal values, the result is “false”.
  Unification: returns “true” when the values of the same field in the two feature structures do not conflict with each other, that is, either the values are equal or one of the values is undefined. Otherwise, the result is “false”. Fields that are not defined in the LHS feature structure but are defined in the RHS feature structure are copied and added to the LHS operand. If a value is undefined in the LHS feature structure and the same field is defined in the RHS feature structure, that value is assigned to the corresponding field of the LHS feature structure.
  Check on unification: returns the same truth value as the unification operator, but the content of the operands is not modified.
  The check operations for equality and unification (“=” and “==”) may take multiple arguments.
  For instance:
  where X, A, B, and C are feature structures. The left-hand side of the operation is checked against each right-hand-side argument in turn. The result is “true” only when all individual operations return “true”; otherwise it is “false”.
  There is also a functional way to write operations. For example, we can write “equals(A, B)” instead of “A = B”. The following functions are defined: “equal” (check on equality), “assign” (assignment), “unify” (unification), “unicheck” (check on unification), “meq” (multiple equality checking), “muc” (multiple unification checking).
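  The semantics of these operations can be sketched in Python over nested dictionaries. The function names follow the list above, but the code is only an illustration: representing an undefined value by `None`, unifying nested structures recursively, and leaving the LHS operand untouched when unification fails are all assumptions not stated in the text.

```python
import copy
from typing import Any, Dict, Optional

FS = Dict[str, Any]
UNDEFINED = None  # assumption: an undefined value is represented by None


def assign(a: FS, b: FS) -> bool:
    """Copy the content of b into a; always true."""
    a.clear()
    a.update(copy.deepcopy(b))
    return True


def equal(a: FS, b: FS) -> bool:
    """True when both structures have the same fields with identical values."""
    return a == b


def _unified(a: FS, b: FS) -> Optional[FS]:
    """Return the unification of a and b as a new structure, or None on conflict."""
    result = copy.deepcopy(a)
    for key, b_val in b.items():
        a_val = result.get(key, UNDEFINED)
        if a_val is UNDEFINED:
            result[key] = copy.deepcopy(b_val)   # missing or undefined in a
        elif b_val is UNDEFINED:
            continue                             # nothing to add from b
        elif isinstance(a_val, dict) and isinstance(b_val, dict):
            sub = _unified(a_val, b_val)
            if sub is None:
                return None
            result[key] = sub
        elif a_val != b_val:
            return None                          # conflicting values
    return result


def unify(a: FS, b: FS) -> bool:
    """Unify b into a; a is left unchanged on failure (an assumption)."""
    merged = _unified(a, b)
    if merged is None:
        return False
    a.clear()
    a.update(merged)
    return True


def unicheck(a: FS, b: FS) -> bool:
    """Same truth value as unify, without modifying either operand."""
    return _unified(a, b) is not None


def meq(x: FS, *others: FS) -> bool:
    """Multiple equality check: x must equal every other argument."""
    return all(equal(x, o) for o in others)


def muc(x: FS, *others: FS) -> bool:
    """Multiple unification check: x must be unifiable with every other argument."""
    return all(unicheck(x, o) for o in others)


a = {"case": None, "number": "pl"}
b = {"case": "nom", "person": "3"}
print(unicheck(a, b), a)
print(unify(a, b), a)
```

  After the call to `unify`, the undefined “case” field of the LHS operand has received the value from the RHS operand and the missing “person” field has been copied over, whereas `unicheck` leaves both operands as they were.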
  4. Constraints
  In our system, we use feature structures and the operations defined on them to put constraints on parser rules. This makes parser rules more suitable for natural language analysis than pure CFG rules. We have generalized the notion of a constraint [1]: a constraint is any logical expression built from operations defined on feature structures together with the basic logical operations and constants, namely & (and), | (or), ~ (not), 0 (false), and 1 (true).
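  One way to picture such logical expressions is a small Python class whose instances combine with the same three connectives. The class `Constraint`, the helper `has`, and the attribute names are hypothetical; the grammar-file syntax of the tools is not reproduced here.

```python
from typing import Callable, Dict

Features = Dict[str, str]


class Constraint:
    """A Boolean test on a feature structure that composes with &, | and ~."""

    def __init__(self, test: Callable[[Features], bool]) -> None:
        self.test = test

    def __call__(self, fs: Features) -> bool:
        return bool(self.test(fs))

    def __and__(self, other: "Constraint") -> "Constraint":
        return Constraint(lambda fs: self(fs) and other(fs))

    def __or__(self, other: "Constraint") -> "Constraint":
        return Constraint(lambda fs: self(fs) or other(fs))

    def __invert__(self) -> "Constraint":
        return Constraint(lambda fs: not self(fs))


TRUE = Constraint(lambda fs: True)    # plays the role of the constant 1
FALSE = Constraint(lambda fs: False)  # plays the role of the constant 0


def has(attribute: str, value: str) -> Constraint:
    """Hypothetical atomic test: the attribute is present with this value."""
    return Constraint(lambda fs: fs.get(attribute) == value)


# An invented compound constraint, used only to exercise the connectives.
rule_applies = (
    has("stem_type", "1")
    & (has("sign", "i") | has("sign", "ma"))
    & ~has("number", "t")
)
print(rule_applies({"stem_type": "1", "sign": "i", "number": "eb"}))
```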
  We can write a parser rule in the following way:
  Here, Ai (i = 1, ..., n) are terminal or non-terminal symbols (for the morphological analyzer only terminal symbols are allowed), and Ci (i = 1, ..., n) are constraints. Each constraint is checked as soon as all the RHS symbols located before it have been matched against the input. If a constraint evaluates to “true”, the parser continues matching; if it evaluates to “false”, the parser rejects this alternative and tries another. A feature structure is associated with each symbol (S and Ai) in a rule. If a symbol is terminal, the initial content of its feature structure is taken from the dictionary or, for the syntactic analyzer, from the morphological analyzer. Constraints are used not only to check the correctness of parsing and to prune unnecessary variants, but also to transfer data to the LHS symbol and thus to pass all necessary information to the next level of analysis. Assignment or unification operations can be used for this purpose. To access the feature structure of a particular symbol, we can use a path notation. A path is written in angle brackets.
  5. Morphological Analyzer
  The purpose of the morphological analyzer is to split an input word into morphemes and to determine the morphological categories of the word. The morphological analyzer may be invoked manually or automatically by the syntactic analyzer.
  6. Example of Grammar File Composition
  Suppose we wish to develop a program for the morphological analysis of Georgian nouns with the morphological analyzer (ma). We should fix the morpheme classes for nouns and enumerate them in the order in which they occur in a noun. To keep the example simple, we consider only stems, number signs, and declension signs. The class of stems consists of all noun stems. The class of number signs consists of the morphemes “eb”, “n”, “t”, and “” (empty). The class of declension signs consists of the declension morphemes of all nouns. These classes are passed to ma as initial data. To recognize the declension category of a noun-form uniquely, we need to classify noun stems by the declension signs they attach. Consider, for instance, non-compressed noun stems ending in a consonant: the attached declension signs uniquely determine the declension of such noun-forms. We attach to such a stem the feature (stem-type = “1”), where stem-type is the attribute and “1” is its value, signifying a non-compressed stem ending in a consonant. Establishing the declension of such noun-forms is then easy. We must compose a rule that can be expressed as follows: separate from the noun-form a stem, a number sign, and a declension sign; if the noun-form coincides completely with the morphemes found, and if stem-type = “1” and the declension sign = “i”, then the noun-form has the declension sign “i” and the declension is nominative; or the declension sign = “ma”, and so on. The rule will be:
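  The logic of this rule can be approximated in Python as follows. This is only an illustrative sketch and not the grammar-file notation of the tools: the transliterated stem “kac” is a hypothetical dictionary entry, and only the pairing of stem-type “1” with the sign “i” that is stated in the text is filled in.

```python
from typing import Dict, List

# Initial data passed to the analyzer. The stem "kac" is a hypothetical
# transliterated entry; stem_type "1" marks a non-compressed stem ending
# in a consonant, as in the text.
STEMS: Dict[str, Dict[str, str]] = {"kac": {"stem_type": "1"}}
NUMBER_SIGNS: List[str] = ["eb", "n", "t", ""]
DECLENSION_SIGNS: List[str] = ["i", "ma"]

# Only the pairing stated in the text is filled in; others would be added here.
DECLENSION = {("1", "i"): "nominative"}


def analyse_noun(word: str) -> List[Dict[str, str]]:
    """Return every splitting stem + number sign + declension sign that is valid."""
    analyses = []
    for stem, features in STEMS.items():
        if not word.startswith(stem):
            continue
        rest = word[len(stem):]
        for number_sign in NUMBER_SIGNS:
            if not rest.startswith(number_sign):
                continue
            declension_sign = rest[len(number_sign):]
            if declension_sign not in DECLENSION_SIGNS:
                continue  # the morphemes do not cover the whole word-form
            declension = DECLENSION.get((features["stem_type"], declension_sign))
            if declension is not None:  # the constraint on stem type and sign holds
                analyses.append({
                    "stem": stem,
                    "number_sign": number_sign,
                    "declension_sign": declension_sign,
                    "declension": declension,
                })
    return analyses


print(analyse_noun("kaci"))
print(analyse_noun("kacebi"))
```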
  7. Syntactic Analyzer
  The purpose of the syntactic analyzer is to analyze sentences of a natural language and to produce a parse tree and information about the sentence. To accomplish this task, the syntactic analyzer needs a grammar file and a dictionary (or it may use the morphological analyzer instead of a complete dictionary). Grammar rules for the syntactic analyzer are written like CFG rules, but they may have constraints and symbol position regulators. A rule can be written according to these conventions:
  symbols, C and Ci (i = 1, ..., n) are constraints, and R is a set of symbol position regulators. Position regulators declare the order of the RHS symbols in a rule, which makes it possible to describe non-fixed word order. There are two types of position regulators:
  A?
  means that symbol Ai must be placed immediately before symbol Aj. There is no formal difference between syntactic and semantic rules. Therefore, syntactic and semantic analyses can be performed in parallel or separately with the same analyzer.
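  The idea of a rule whose RHS symbols may appear in any order, restricted by an “immediately before” regulator, can be sketched in Python as follows. The function `matches`, the symbol names, and the assumption that each symbol occurs at most once in a rule are illustrative choices; only the regulator type described above is modelled.

```python
from collections import Counter
from itertools import permutations
from typing import List, Sequence, Tuple


def matches(rhs: Sequence[str],
            immediately_before: List[Tuple[str, str]],
            sequence: Sequence[str]) -> bool:
    """Check a free-order rule body against a sequence of symbols.

    Assumes every symbol occurs at most once in `rhs`.
    """
    if Counter(rhs) != Counter(sequence):
        return False  # not a reordering of the rule body
    order = list(sequence)
    for first, second in immediately_before:
        i = order.index(first)
        if i + 1 >= len(order) or order[i + 1] != second:
            return False  # `first` is not directly followed by `second`
    return True


# Hypothetical rule body and one regulator: Adjective directly precedes Object.
RHS = ["Subject", "Verb", "Adjective", "Object"]
REGULATORS = [("Adjective", "Object")]

print(matches(RHS, REGULATORS, ["Verb", "Subject", "Adjective", "Object"]))
print(matches(RHS, REGULATORS, ["Adjective", "Verb", "Object", "Subject"]))

# Number of fixed-order rules that this single rule stands for.
accepted_orders = sum(matches(RHS, REGULATORS, p) for p in permutations(RHS))
print(accepted_orders)
```

  A single rule with a regulator thus replaces a whole family of fixed-order CFG rules, which is how rules with the same members in different orders can be reduced.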
  8. Conclusions
  We used the described software tools for morphological and syntactic analysis of Georgian texts. All the problems mentioned in the introduction were resolved. We simplified the composition of the grammar file by using macros with parameters. The software may be used for other natural languages; the method of composing the program file and the initial data remains the same.
  Acknowledgment
  The work was supported by GNSF (Georgian National Scientific Foundation), grant 09-184-1-120.
  References
  [1] J. Antidze, N. Gulua, D. Mishelashvili, L. Nukradze, On complete computer morphological and syntactic analysis of Georgian texts, in: The International Symposium of Natural Language Processing, Georgian Language and Computer Technologies, Institute of Linguistics of Georgian Academy of Sciences, Tbilisi, 2009.
  [2] J. Antidze, Formal Languages and Grammars, Natural Languages Computer Modeling, Nekeri, Tbilisi, 2009.
  [3] K. Datukishvili, M. Loladze, N. Zakalashvili, Georgian language processing (morphological level), in: The International Symposium of Natural Language Processing, Georgian Language and Computer Technologies, Institute of Linguistics of Georgian Academy of Sciences, Tbilisi, 2003.
  [4] L. Margvelani, Machine analysis system of Georgian word forms, in: The International Symposium of Natural Language Processing, Georgian Language and Computer Technologies, Institute of Linguistics of Georgian Academy of Sciences, Tbilisi, 2003.
  [5] J. Antidze, D. Mishelashvili, Instrumental tool for morphological analysis of some natural languages, Reports of Enlarged Session of the Seminar of IAM TSU 19 (2004) 21-24.
  [6] J. Antidze, D. Melikishvili, D. Mishelashvili, Georgian language computer morphology, in: The International Symposium of Natural Language Processing, Georgian Language and Computer Technology, Tbilisi, 2004.
  [7] J. Antidze, N. Gulua, On selection of Georgian texts computer analysis formalism, Bulletin of The Georgian Academy of Sciences 162 (2) (2000) 15-19.
  [8] Available online at: http://www.sil.org/pcpatr/manual/pcpatr.html.
  [9] Available online at: http://www.sil.org/pckimmo/v2/pc-kimmo_v2.html.
  [10] D. Melikishvili, System of Georgian Verbs Conjugation, Logos Press, Tbilisi, 2001.
  [11] D. Melikishvili, The Georgian Verb: A Morphosyntactic Analysis, Dunwoody Press, Dunwoody, 2008.
  [12] D. Melikishvili, On Georgian verb-forms classification and qualification principles, Problems of Linguistics 1 (2008) 30-35.