2015
On the positional distribution of an Armenian auxiliary: Second position clisis, focus and phases (with Arsalan Kahnemuyipour). Syntax (in press)
This paper investigates the positional distribution of an auxiliary clitic in Eastern Armenian in informationally marked sentences. The paper builds on previous work on the distribution of the auxiliary in focus-neutral contexts (Kahnemuyipour and Megerdoomian 2011), where its placement was analysed as second position within the lower phase domain (or vP), thereby extending the inventory of known second position phenomena from the clause to the smaller verbal domain. To account for the distribution of the auxiliary in sentences with focused constituents, it is proposed here that the relevant phase in this context is the Focus Phrase, with the auxiliary appearing in the expected second position. This falls into place under a dynamic view of phasehood which defines the phase as the highest projection of a lexical category and takes low focus to be part of the verbal domain (Boskovic 2014). These proposals rely heavily on the parallelism between clausal and verbal domains, both in terms of their status as phases and in terms of their structural make-up. To the extent that these proposals succeed in accounting for the facts discussed in the paper, they provide further indirect evidence for the dynamic conception of phases as well as the CP-vP parallelism.
- [abstract]
Modeling community resilience for a post-epidemic society (with Shaun Michel). Proceedings of the Computational Social Science Society of the Americas Conference. Santa Fe, New Mexico.
The 2014 Ebola outbreak in West Africa once again reminded the world of the fatal risks of exposure to deadly disease. Commonly evaluated by the number of fatalities estimated during an outbreak, epidemics also have lasting consequences for survivors in the form of community breakdown. Although agent-based models are frequently used to consider the susceptibility of agents to disease and to predict the evolution of epidemics, they rarely attempt to model the interplay of social networks, spatial awareness, and exposure risks. Our response is to construct a model in which agents are instantiated within a social network that influences their movement decisions alongside individually perceived vulnerabilities to exposure-by-contact. We monitor macroscopic behavioral trends and examine community breakdown resulting from fatalities. The model provides an important contribution to modeling social science by exploring individual response to an emerging epidemic and community-level outcomes as a result of those responses.
- [abstract]
[link to article]
Human language reveals a universal positivity bias (with P. S. Dodds, E. M. Clark, S. Desu, M. R. Frank, A. J. Reagan, J. R. Williams, L. Mitchell, K. D. Harris, I. M. Kloumann, J. P. Bagrow, M. T. McMahon, B. F. Tivnan, and C. M. Danforth). Proceedings of the National Academy of Sciences of the United States of America, vol. 112, no. 8, pp. 2389-2394. 2015.
Using human evaluation of 100,000 words spread across 24 corpora in 10 languages diverse in origin and culture, we present evidence of a deep imprint of human sociality in language, observing that (i) the words of natural human language possess a universal positivity bias, (ii) the estimated emotional content of words is consistent between languages under translation, and (iii) this positivity bias is strongly independent of frequency of word use. Alongside these general regularities, we describe interlanguage variations in the emotional spectrum of languages that allow us to rank corpora. We also show how our word evaluations can be used to construct physical-like instruments for both real-time and offline measurement of the emotional content of large-scale texts.
- [abstract]
[link to article]
[pdf]
2014
State of the Research in Human Language Technology: A Study of ACL and NAACL Publications from 2007 through 2014. MITRE Technical Report (MTR140208), Washington DC. June 2014.
The goal of this study is to identify the state of the art in Human Language Technology (HLT), pinpointing potential research directions. We performed our analysis by examining conference publications for recent ACL (Association for Computational Linguistics) and NAACL (North American Chapter of the Association for Computational Linguistics) conferences, identifying terms and topics which reflect the state of HLT research, combined with a network analysis of co-authorships.
All analyses suggest that, in the field as seen through ACL and NAACL publications, statistical machine translation (SMT) dominates and seems to fuel the field and associated research trends, yet there is growing interest in applying knowledge-based approaches that integrate some semantic information or an understanding of concepts and relations within the systems. The co-authorship study demonstrates that the same few people and institutions (often focused on SMT research and typically centered on the East Coast of the US and in China) have dominated the field for the last 8 years. The analysis of the collaboration network shows that the ACL/NAACL community exhibits small world network characteristics, with interconnected groups of authors and a few authors with important central roles.
- [abstract]
[pdf]
2012
The status of the nominal in Persian complex predicates. Natural Language & Linguistic Theory, Volume 30, Issue 1, pp. 179-216. 2012.
The nature of preverbal nominals and their relation to the verb have been the focus of much debate in languages with a productive complex predication process. For Persian, certain analyses have argued that the bare nominals in Complex Predicate constructions are distinct from bare objects, while others have treated the two types of bare nominals uniformly. This paper argues that the two categories of preverbal nouns cannot receive the same analysis since they display distinct syntactic and semantic behavior. The preverbal nominals, unlike the bare object nouns, cannot be questioned, are modified differently, have different interpretations, give rise to distinct case-assignment contexts, and can co-occur with a nonspecific object. The distinct properties of the two nominal categories are captured by positing distinct structural positions for these nouns. Non-specific bare nouns are internal arguments of the thematic verb, while the nominal element of the complex predicate construction is part of the verbal domain with which it combines through a process of conflation, as defined in Hale and Keyser (2002), to form a single predicate.
- [abstract]
[link to article]
Shift happens: Language shift and maintenance in the Persian diaspora. Unpublished manuscript. 2012.
Maintenance of immigrant languages has always been a fundamental challenge among Diaspora communities. The Iranian Diaspora in the United States is facing a similar dilemma with the second generation already showing signs of language loss. This paper presents detailed results of linguistic research on Heritage Persian, the language spoken by the younger generation of Iranians, delineating its strengths and shortcomings. The paper then uses these results to propose a language instruction methodology informed by the characteristics of Heritage Persian speakers that can build upon their language intuitions and cater to their specific needs, within the larger context of diglossia and bilingualism found in the Persian community.
Although Persian instruction for the heritage community is fundamental in preventing language loss in the Iranian Diaspora, it will not be sufficient if applied in isolation. A number of factors play a role in language loss including the lack of teaching material for all levels, ideological issues affecting attitudes towards Persian, isolation from the home country, lack of firmly established Persian language communities outside of a few major cities, and parents' behavior and attitudes. In order for true maintenance to occur, language instruction must be implemented within a strong community of bilingual speakers. The paper discusses the factors contributing to language loss in the Iranian Diaspora and introduces important measures that, along with heritage language instruction, can contribute to Persian language maintenance in the United States.
Even though the main focus of this paper is on the Iranian community, the general linguistic characteristics of Heritage language as well as the language maintenance strategies discussed in this paper can also apply to other immigrant communities in the United States.
- [abstract]
[pdf]
2011
Second position clitics in the vP phase: The case of the Armenian auxiliary (with Arsalan Kahnemuyipour). Linguistic Inquiry, Volume 42, Number 1, Winter 2011, pp. 152-162.
Special clitics appear in a position that is different from the one favored by their associated full forms (Zwicky 1977). Linguistic analyses have identified two main categories of special clitics: (a) second-position or Wackernagel clitics that must appear as the second element in a clause (as in Bosnian-Croatian-Serbian (henceforth, BCS), Czech, Cypriot, Pashto, and Tagalog); and (b) verb-adjacent clitics that take the verb as their host (as in Romance languages such as French, Spanish, and Catalan). The auxiliary verb in Eastern Armenian is a clitic that carries tense and agreement features and appears on seemingly unrelated elements within the clause in focus-neutral sentences. The auxiliary is a special clitic by virtue of the fact that it can appear in varying positions that a full-form verb cannot occupy. However, it appears to defy classification in the major known categories of special clitics: The auxiliary remains low in the main clause in neutral contexts and does not occupy the second position in the sentence. In addition, it does not have to be adjacent to the main verb.
The goal of this squib is to account for the puzzling positional distribution of the Armenian auxiliary clitic in the focus-neutral context. We propose that the auxiliary is a case of a second position clitic in the vP domain, akin to the second position phenomena observed across languages in the CP domain. As such, we draw heavily on the parallel between CP and vP in recent syntactic literature, in particular their status as phases in the minimalist framework (e.g. Chomsky 2001).
- [abstract]
[link to article]
The format of Middle Eastern names in official documents. MITRE Technical Report (MTR110136), Washington DC. April 2011.
This report discusses the ways in which Arabic, Afghan, and Iranian names are written in official documents.
- [abstract]
Issues in Dari transcription. MITRE Product (MP110054), Washington DC. 2011.
Orthographic transcription is used in representing spoken data and translations in the target language when creating parallel corpora. The development of guidelines and standards is important in assuring a consistent set of transcriptions, especially in machine learning applications. However, in the case of languages with a complex writing system and the absence of official orthographic standards, developing transparent and consistent transcription guidelines becomes very difficult. This paper presents the issues encountered in creating transcription guidelines for Dari for speech translation projects.
In developing guidelines, efforts are typically made to maintain consistency in the transcriptions by following the standard orthographic conventions as much as possible. Guidelines should also be transparent so that transcribers can easily follow them, providing accurate transcriptions of the data. These criteria are difficult to achieve in the Persian writing system since it possesses an opaque orthography and there is a large variance in writing affixes and compounds, making word and morpheme boundaries difficult to detect. In addition, Dari displays diglossia, dialectal variation, a lack of standardization, and low societal literacy. All of these conditions affected the quality of the transcriptions in the Dari corpus and hindered the development of transcription standards.
- [abstract]
2010
Mining and classification of neologisms in Persian blogs (with Ali Hadjarian). In the Proceedings of CALC-10, NAACL HLT workshop on Computational Approaches to Linguistic Creativity.
The exponential growth of the Persian blogosphere and the increased number of neologisms create a major challenge in NLP applications of Iranian blogs. This paper describes a method for extracting and classifying newly constructed words and borrowings from Persian blog posts. The analysis of the occurrence of neologisms across six distinct topic categories points to a strong correlation between the topic domain and the type of neologism that is most commonly encountered. The results suggest that different approaches should be implemented for the automatic detection and processing of neologisms depending on the domain of application.
- [abstract]
[pdf]
Extending a Persian morphological analyzer to blogs. In Zabane Farsi va Rayane [Persian Language and Computers]. SAMT Publishers, Organization for Research and Editing of University Publications in the Humanities, Tehran, Iran. 2010.
This paper describes a two-level morphological analyzer for Persian using a system based on the Xerox finite state tools. The Persian language presents certain challenges to computational analysis: There is a complex verbal conjugation paradigm which includes long-distance morphological dependencies; phonological alternations apply at morpheme boundaries; word boundaries are difficult to define since morphemes may be detached from their stems and distinct words can appear without an intervening space. In this work, we discuss these problems and provide solutions in a finite-state morphology system. The paper also presents an overview of new issues that have arisen since the advent of blogs and the propagation of informal Persian text on the web. This new mode of writing provides the computational system with further challenges. The paper proposes approaches for extending the current morphological system to analyze the material found in Persian blogs.
- [abstract]
[pdf]
[slides]
Developing a Persian part-of-speech tagger. In Zabane Farsi va Rayane [Persian Language and Computers]. SAMT Publishers, Organization for Research and Editing of University Publications in the Humanities, Tehran, Iran. 2010.
Assigning grammatical categories to words in a text is an important component of a natural language processing (NLP) system. Corpora tagged with part-of-speech (POS) information are often used as a prerequisite for more complex NLP applications such as information extraction, syntactic parsing, machine translation or semantic field annotation. They are also used to help train statistical models.
Prior to tagging, a natural language processing system generally requires modules for segmenting tokens in the text and providing a morphological analysis. The actual annotation scheme used, however, is often motivated by the system application. This paper outlines some of the main challenges that arise in the development of a Persian POS tagger - such as encoding issues, long-distance dependencies in morphology, recognition of complex tokens, word and phrasal boundaries, and analysis of multiword expressions - and proposes approaches to resolving these issues.
- [abstract]
[pdf]
[slides]
Linguistic patterns in Heritage Persian instruction. In Proceedings of STARTALK Persian Teacher's Workshop for Professional Curriculum and Materials Development, University of Pennsylvania, Philadelphia. July 2010.
There exists a continuing debate in the field of foreign language teaching on the relative viability of implicit vs. explicit approaches in the classroom. While content-based methods typically lead to more advanced communicative abilities, certain language elements and structures benefit most from explicit instruction. I argue that in the case of heritage language learners, their specific needs and linguistic gaps call for the integration of an inductive explicit instruction methodology. Research shows that heritage language speakers differ from second language learners in several aspects: Heritage language learners display fluency in basic conversational contexts, but have difficulties with elements of formal and high register language variants as well as literacy skills. I propose a novel approach to heritage language instruction that takes advantage of the existing knowledge and linguistic intuition of learners to recognize language patterns and analytically discover the underlying principles. By focusing on specific areas that raise difficulties for Persian heritage speakers, I show that the approach successfully integrates leading edge linguistic insight in an interactive classroom.
- [abstract]
[pdf]
2009
Beyond words and phrases: a unified theory of predicate composition. VDM Verlag, Germany, 2009. (Published doctoral dissertation, University of Southern California, 2002)
This volume presents a computational model of predicate composition that derives the distinct properties of "words" and "phrases" within a single component of grammar and its interface conditions, thus providing an insight into the interaction between morphology and syntax. Based on a cross-lingual study of complex predicates and verb phrase aspect, with special focus on Armenian and Persian, Megerdoomian isolates the primitive atoms used to encode meaning in the syntactic code and argues for parallel nominal and verbal structures. The notion of word is thus defined as a level in the syntactic structure and the distinction between a "word" and a "phrase" is characterized by the structural complexity of the constituents involved in the formation of a particular predicate. The distinct syntactic and semantic properties of words and phrases can be captured straightforwardly from the resulting configuration and the interface conditions. In the computational model developed, lexical entries, combined with the spell-out node to the PF component, determine the parameters for language variation and can derive the structure-meaning mismatches observed in verbal predicates.
- Available at Amazon;
[abstract]
Low-density language strategies for Persian and Armenian. In Language Engineering for Lesser-Studied Languages, Sergei Nirenburg (ed). IOS Press of Amsterdam. February 2009.
The paper presents research on the feasibility and development of methods for the rapid creation of stopgap language technology resources for low-density languages: (i) related language bootstrapping can be used to port existing technology from a resource-rich language to its associated lower-density variant; and (ii) clever use of linguistic knowledge can be employed to scale down the need for large amounts of training or development data. Based on the Persian and Armenian languages, the paper illustrates several methods that can be implemented in each instance with the goal of reducing human effort and avoiding the scarce data issue faced by statistical systems.
- [abstract]
[pdf]
Automated metrics for speech translation (with Sherri Condon, Mark Arehart, Christy Doran, Dan Parvaz, John Aberdeen, Beatrice Oshika, Greg Sanders). In PerMIS '09 Proceedings of the 9th Workshop on Performance Metrics for Intelligent Systems, New York. 2009.
In this paper, we describe automated measures used to evaluate machine translation quality in the Defense Advanced Research Projects Agency's Spoken Language Communication and Translation System for Tactical Use program, which is developing speech translation systems for dialogue between English and Iraqi Arabic speakers in military contexts. Limitations of the automated measures are illustrated along with variants of the measures that seek to overcome those limitations. Both the dialogue structure of the data and the Iraqi Arabic language challenge these measures, and the paper presents some solutions adopted by MITRE and NIST to improve confidence in the scores.
- [abstract]
[pdf]
Telicity in Persian complex predicates. Snippets, Issue 19, July 2009.
In their study of complex predicates in Persian, Folli, Harley and Karimi (2005) propose that the nonverbal component (NV) is the sole determiner of telicity in the complex verbal construction. The data from semelfactive verbs in Persian, however, do not support this analysis.
- [abstract]
[pdf]
The structure of Afghan names. MITRE Product (MP090315), Washington DC. November 2009.
This report provides a description of the structure of Afghan names. Person names in Afghanistan often consist of a compound first name. Most people lack a last name and are generally referred to by their tribal affiliation, place of birth, profession, or honorific titles. However, last names are more prevalent in urban and more educated families.
The report discusses the various components of person names such as titles, honorifics, the internal structure of the name, and various forms of address. It also describes some of the issues that arise in the transcription of Afghan names into English due to a lack of standardization and dialectal differences in the pronunciation.
- [abstract]
[pdf]
2008
Parallel nominal and verbal projections. In Foundational Issues in Linguistic Theory: Essays in Honor of Jean-Roger Vergnaud, Robert Freidin, Carlos P. Otero and Maria Luisa Zubizarreta (eds.); MIT Press, May 2008.
Current research on nominal elements has argued that the structure of the noun phrase should reflect the structure of the verbal domain. These conclusions have led to the introduction of several functional projections within the noun phrase providing a decomposed structure in which the various nominal elements are represented in distinct syntactic nodes (Abney 1987, Valois 1991, Ritter 1992). In addition, it has been shown that there is a direct correlation between the functional elements in the noun phrase and verb phrase structures, such as Number and Aspect (Travis 1992, Verkuyl 1993, Borer 1994). Despite arguments that noun phrases are parallel to verbal clauses in many respects, nouns have generally been treated as less complex than verbal projections, and the functional categories within the noun phrase itself have not played a significant role in establishing relations such as case and agreement between the nominal and verbal predicates.
In this paper, I investigate the correlation between the noun phrase and the verb phrase by studying morphological and semantic properties of case and agreement in several languages, and argue that the correspondence between the two phrase types can be captured by establishing a direct relation between the functional categories within two parallel nominal and verbal projections. Following ideas developed in Vergnaud (2000), I suggest a framework in which the verbal predicate and nominal phrase each project their own domain in syntax, and case and agreement are realized when a nominal node enters into a specifier relation with its verbal counterpart. I argue that the two parallel domains can enter into a checking relation at various points in the computation, giving rise to corresponding semantic interpretations as well as case and agreement morphology. The parallel architecture proposed for nominal and verbal projections straightforwardly captures the direct correspondence between meaning and structure and provides a new perspective on the notion of specifier.
- [abstract]
[pdf]
Low-density language bootstrapping: The case of Tajiki Persian (with Dan Parvaz). In Proceedings of LREC 2008 (Language Resources and Evaluation Conference). Marrakech, Morocco, May 2008.
Low-density languages raise difficulties for standard approaches to natural language processing that depend on large online corpora. Using Persian as a case study, we propose a novel method for bootstrapping MT capability for a low-density language in the case where it relates to a higher density variant. Tajiki Persian is a low-density language that uses the Cyrillic alphabet, while Iranian Persian (Farsi) is written in an extended version of the Arabic script and has many computational resources available. Despite the orthographic differences, the two languages have literary written forms that are almost identical. The paper describes the development of a comprehensive finite-state transducer that converts Tajik text to Farsi script and runs the resulting transliterated document through an existing Persian-to-English MT system. Due to divergences that arise in mapping the two writing systems and phonological and lexical distinctions, the system uses contextual cues (such as the position of a phoneme in a word) as well as available Farsi resources (such as a morphological analyzer to deal with differences in the affixal structures and a lexicon to disambiguate the analyses) to control the potential combinatorial explosion. The results point to a valuable strategy for the rapid prototyping of MT packages for languages of similar uneven density.
- [abstract]
[pdf]
The structure of Persian names. MITRE Technical Report (MP080034), Washington DC. February 2008.
This report provides a description of the structure of common Persian names from Iran, with an emphasis on automatic recognition of these personal names. Since the early parts of the 20th century, person proper names of Persian (Farsi) origin have been composed of a first name and a last name. There are no middle names. However, first and last names may each be a compound proper name, consisting of two subparts.
- [abstract]
[pdf]
Persian supplement to the TIDES standard for the annotation of temporal expressions. MITRE Technical Report, Washington DC. July 2008.
This document is a supplement to the TIDES Standard for the Annotation of Temporal Expressions, and is designed to assist system developers and annotators working with Persian (Farsi) language data. It provides detailed guidelines for the annotation of temporal expressions, with Persian language examples.
- [abstract]
[pdf]
Analysis of Farsi weblogs: A Survey of the Literature. MITRE Technical Report (MTR080206), Washington DC. August 2008. (Note: This paper is an excerpt of the complete technical report and includes only Chapter 2.)
The survey of the literature on Persian blogs presents the state of the Iranian Blogosphere (in 2008) and provides a review of the research and publications on the topic: Research on Persian blogs has mainly centered around a socio-political study of this new medium, and several quantitative investigations have provided preliminary studies on sociological characteristics and content analysis in weblogs. However, rigorous research and investigation of the linguistic aspects of Persian blogs and computational analysis of these online resources are lacking. This survey also presents a summary of the existing literature on the language of weblogs in English and discusses how the results may be relevant for a computational study of Persian Blogspeak.
- [abstract]
[pdf]
Argus search and stemming. MITRE Product (MP080256), Washington DC. September 2008.
Information retrieval systems often use stemming, i.e., the mechanical cutting off of inflectional and derivational affixes, to better match index terms to query terms. Stemming results differ, however, based on the affix list used, the depth of linguistic analysis applied, and the complexity of the language being analyzed. This report provides an overview of stemming, including its advantages and disadvantages for search applications, and reviews the query syntax used in the Argus retrieval system. The report also presents guidance for analysts in performing searches in Newsstand, with specific examples from several languages. In addition, the behavior of two specific systems - the stemmers for Arabic and Farsi (Persian) - is discussed in more detail. This report is intended for analysts performing search queries in Argus.
- [abstract]
[pdf]
2007
The structure of Arabic names. MITRE Technical Report, Washington DC. February 2007.
This report provides a description of the structure of common Arabic names, and provides a basic algorithm for identifying the four components of the name.
- [abstract]
2005
Transitivity alternation verbs and causative constructions in Eastern Armenian. Armenian Annual of Linguistics 21. 2005.
In this paper, we investigate the syntactic and semantic properties of the morphological causative and of the analytic causative in Eastern Armenian. Based on evidence from binding, adverbial scope and the interpretation of the causee, we will show that the two causative constructions display distinct clausal properties. In particular, we argue that the morphological causative includes a single event (i.e., is monoclausal) and is formed on predicates that lack external arguments. The analytic causative, however, consists of two events and is formed on predicates that have an external argument.
- [abstract]
[pdf]
2004
Finite-state morphological analysis of Persian. In Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages. Coling 2004, University of Geneva. August 28, 2004.
This paper describes a two-level morphological analyzer for Persian using a system based on the Xerox finite state tools. The Persian language presents certain challenges to computational analysis: There is a complex verbal conjugation paradigm which includes long-distance morphological dependencies; phonological alternations apply at morpheme boundaries; word and noun phrase boundaries are difficult to define since morphemes may be detached from their stems and distinct words can appear without an intervening space. In this work, we discuss these problems and provide solutions in a finite-state morphology system.
- [abstract]
[pdf]
A semantic template for light verb constructions. In Proceedings of the First Workshop on Persian Language and Computers. Invited talk. Tehran University, Iran. May 25-26, 2004.
Multiword Expressions (MWEs) raise an important problem for the development of large-scale NLP systems. The lack of compositionality of the semantics of these expressions has led some researchers to simply list them in the lexicon. However, the components of many MWEs can be separated from each other by intervening material in syntax and are often missed by systems that treat these expressions as single units. In this paper, we will focus on one genre of multiword expressions, namely the light verb constructions (LVCs) in Persian. We argue that the latest developments in analyzing these complex verbal predicates along with the linguistic research performed on Persian light verb constructions can shed some light on a successful computational modeling of multilingual complex predicates in general, and on Persian light verb constructions in particular. The approach proposed here is based on the lexical semantic representation of verbal predicates by providing a template that reflects the combination of the various primitive components and the realization of the verb's argument structure in syntax. By extending the research described in Fong et al. (2000) to Persian verbal constructions, we propose to map the semantic templates as an interlingual representation. It is argued that this approach, which is based on the recent linguistic research, can provide a more efficient NLP system in the long run and will facilitate multilingual computational applications.
- [abstract]
[pdf]
2003
Text Mining, Corpus Building and Testing. In A Handbook for Language Engineers, edited by Ali Farghaly. CSLI Publications, Stanford, CA. 2003.
This chapter presents an introduction to corpus-based approaches in computational linguistics. It is intended for an audience of linguists interested in computational linguistics.
- [abstract]
[pdf]
Asymmetries in form and meaning: Surface realization and interface conditions. Unpublished manuscript. Paper presented at Approaching Asymmetry at the Interfaces, UQAM, Montreal. 2003.
This paper addresses the issue of how languages package the same features of meaning into morphophonological units of different sizes, giving rise to mismatches between surface form and meaning. A contrastive study of causatives in Japanese and Eastern Armenian shows that the same syntactic properties and semantic information are surfaced as a single word in Japanese while they are realized as a phrase in Eastern Armenian. The paper provides an account for the distinct surface realizations of similar causative constructions in these two languages based on interface relations between syntax and the PF and LF components. In particular, it is argued that the different surface realizations of predicates across languages can be captured by a language parameter that determines the spell-out to the PF interface.
- [abstract]
[pdf]
2002
Beyond words and phrases: a unified theory of predicate composition. Doctoral dissertation, University of Southern California. 2002.
This volume presents a computational model of predicate composition that derives the distinct properties of "words" and "phrases" within a single component of grammar and its interface conditions, thus providing an insight into the interaction between morphology and syntax. Based on a cross-lingual study of complex predicates and verb phrase aspect, with special focus on Armenian and Persian, Megerdoomian isolates the primitive atoms used to encode meaning in the syntactic code and argues for parallel nominal and verbal structures. The notion of word is thus defined as a level in the syntactic structure and the distinction between a "word" and a "phrase" is characterized by the structural complexity of the constituents involved in the formation of a particular predicate. The distinct syntactic and semantic properties of words and phrases can be captured straightforwardly from the resulting configuration and the interface conditions. In the computational model developed, lexical entries, combined with the spell-out node to the PF component, determine the parameters for language variation and can derive the structure-meaning mismatches observed in verbal predicates.
- [abstract]
[pdf (compact/easier to print)]
[pdf (original)]
[Book on Amazon]
2001
Event Structure and Complex Predicates in Persian. Canadian Journal of Linguistics 46 (1/2): 97-125. Special issue on the Syntax of Iranian languages. 2001.
This paper investigates the syntactic and semantic properties of complex predicates in Persian in order to isolate the individual contributions of the verbal components. The event structure of causative alternation and unergative verbs is determined, based on a decomposition of the verbal construction into primitive syntactic elements consisting of lexical roots and functional heads, with the latter projecting all arguments of the verbal construction. An analysis is provided whereby the argument structure is not projected from the lexicon but is formed compositionally by the conjunction of the primitive components of the complex predicate in syntax. The dual behavior of Persian complex predicates as lexical and syntactic elements, which has been attested in Persian literature on light verb constructions, follows naturally from the analysis proposed since there is no strict division between the level of word-formation and the component manipulating phrasal constructs.
Cet article étudie les propriétés syntaxiques et sémantiques des verbes composés en persan afin de pouvoir mettre à jour les contributions individuelles des composants de la construction verbale. La structure événementielle de certains prédicats complexes est déterminée et une analyse de la construction verbale est proposée où les composants sont décomposés en éléments syntaxiques primitifs qui consistent en des racines lexicales et des têtes fonctionnelles, ces dernières étant responsables de la projection des arguments de la construction verbale. L'analyse fournie suggère que la structure argumentale n'est pas projetée du lexique, mais est plutôt formée d'une façon compositionnelle en joignant, dans le domaine syntaxique, les composants primitifs du prédicat complexe. Les travaux portant sur les verbes composés en persan ont indiqué que ces constructions montrent un double comportement lexical et syntaxique. L'analyse proposée peut facilement expliquer ces propriétés puisqu'il n'y existe aucune division stricte entre le niveau responsable pour la formation des mots et le module qui manipule les constructions phrastiques.
- [abstract]
[pdf]
2000
Unification-Based Persian Morphology. In Proceedings of CICLing 2000. Alexander Gelbukh, ed. Centro de Investigacion en Computacion-IPN, Mexico.
This paper presents a complete formalization of Persian inflectional morphology using a unification-based framework. The morphological analyzer was developed for use in a Persian-English machine translation system; it computes the part of speech categories and returns all syntactically relevant inflectional features for a word. The morphological analyses are represented as feature structures, which can easily be used by a syntactic parser. The morphological formalism consists of a declarative description of rules utilizing typed feature structures. Persian morphotactics include a few prefixes and sequences of suffixes with co-occurrence constraints between non-adjacent morphemes. The verbal inflectional morphology is rich and is characterized by a complex system of conjugations. A morphological rule associates a regular expression describing a set of character strings to a typed feature structure. Rules can be combined using regular expression operators and they can be factorized in conjugation tables. The morphological engine is implemented as a finite-state transducer where the left projection is the input string and the right projection is a typed feature structure.
- [abstract]
[pdf]
Aspect and partitive objects in Finnish. In Proceedings of WCCFL 2000, Cascadilla Press.
The Finnish case system presents a problem for de Hoop's theory, since it does not show a one-to-one correspondence between case morphology and semantic reading of the objects. Based on the generalization developed by Kiparsky (1998), I argue that the distribution of accusative vs. partitive case on Finnish objects depends on VP aspect. I propose an analysis based on the syntactic approach developed by Borer (1994), in which event structure is encoded in the functional projections, and the distinct aspectual interpretations can be derived from the syntax of the arguments. I propose that the movement of the direct object to the specifier of AspP is allowed if (i) the object represents a specific quantity; and (ii) AspP is available in the syntactic structure. If either of these conditions is not met, the object remains within the Verb Phrase and receives weak case. I suggest that AspP is not projected in irresultative predicates, thus there is no position for the object to move to. Since the object remains within the VP, it receives the weak partitive case regardless of the strength of the determiner or the cardinality denoted in Number Phrase. Hence, in irresultative predicates, the direct object receives a strong reading, while bearing a weak partitive case.
- [abstract]
[pdf]
Against optional wh-movement (with Shadi Ganjavi). In Proceedings of WECOL 2000, Volume 12.
Eastern Armenian and Persian seem to display both overt wh-movement and wh-in situ properties. This paper argues that these languages do not have optional wh-movement. Evidence from the distributional properties of the two constructions shows that wh-in situ and overt wh-extraction are two distinct processes. We argue that overt movement of wh-phrases is not wh-movement but rather an instance of scrambling. Wh-in situ is the strategy for forming wh-questions in these languages and should be analyzed as an operator-variable relation.
- [abstract]
[pdf]
Rapid development of translation tools: Applications to Persian and Turkish (with Jan W. Amtrup and Rémi Zajac). In Proceedings of COLING 2000, Saarbrücken, Germany.
The Computing Research Laboratory (CRL) is developing a machine translation toolkit that allows a rapid deployment of translation capabilities. This toolkit has been used to develop several machine translation systems, including a Persian-English and a Turkish-English system, which will be demonstrated. We present the architecture of these systems as well as the development methodology.
- [abstract]
[ps]
Persian-English machine translation: An overview of the Shiraz project (with Jan W. Amtrup, Hamid Mansouri Rad, and Rémi Zajac). NMSU, CRL, Memoranda in Computer and Cognitive Science (MCCS-00-319); 2000.
This report describes the Shiraz project MT prototype for a Persian to English machine translation system using typed feature structures and unification. An overview of the linguistic properties of Persian is presented and the morphological and syntactic grammars developed within the Shiraz project are discussed. The underlying model for the system is a layered chart, capable of representing heterogeneous types of hypotheses in an integrated way.
- [abstract]
[pdf]
Persian computational morphology: A unification-based approach. NMSU, CRL, Memoranda in Computer and Cognitive Science (MCCS-00-320); 2000.
This report provides a complete descriptive analysis of Persian inflectional morphology from a computational perspective. The parts of speech and the morphemes that appear on them as well as their corresponding morphotactics are presented in detail. The verbal paradigm is also described in this document. Since the morphological analyzer designed for this project uses a unification-based grammar with typed feature structures, the morphological information has been defined in terms of features and values. The report describes the current version of the morphological analyzer used in the Shiraz project and discusses any morphological elements that have not been included in this version, mostly due to the colloquial usage of these morphemes. Sample rules of Samba, the grammar specifying the morphological analyzer, as well as the feature specification for the Persian type definitions module are also described.
- [abstract]
[pdf]
A computational analysis of the Persian noun phrase. NMSU, CRL, Memoranda in Computer and Cognitive Science (MCCS-00-321); 2000.
The highly ambiguous structure of the Persian Noun Phrase (NP) causes immense difficulties for automatic parsing of written text. Factors contributing to the ambiguity of the Persian NP structure include the lack of overt morphology to mark boundaries, the fact that short vowels are not written in Persian text, a relatively free word order and the optionality of the subject. This paper introduces the constituents forming a Noun Phrase in Persian. The syntax of NPs and relative clauses is described. A closer study of the Noun Phrase provides lexical and morphological clues for determining boundaries. To describe the NP rules in the Shiraz machine translation project, developed at CRL, a unification-based syntactic grammar was used, which operates on typed feature structures.
- [abstract]
[pdf]
Processing Persian text: Tokenization in the Shiraz project (with Rémi Zajac). NMSU, CRL, Memoranda in Computer and Cognitive Science (MCCS-00-322); 2000.
Prior to morphological analysis or syntactic parsing, a text needs to undergo tokenization, in order to determine sentence and word boundaries. This report describes the tokenizer used in the Shiraz Persian-English machine translation project at the Computing Research Laboratory. The Persian writing system and the methods that can be used in recognizing token boundaries in written text are presented. The system uses a low-level language-independent tokenizer, which outputs an unambiguous sequence of basic tokens. Difficulties arise in analysis of Persian text since certain detachable morphemes need to be reattached to the word before morphological analysis takes place. In addition, words are often concatenated in written form. These pre-processing tasks are accomplished by a post-tokenizer that contains language-specific information.
- [abstract]
[pdf]
1999
Rapid development of translation tools (with Jan W. Amtrup and Rémi Zajac). In Proceedings of Machine Translation Summit VII, Singapore. September 1999.
The Computing Research Laboratory is currently developing technologies that allow rapid deployment of automatic translation. These technologies are designed to handle low-density languages for which resources, be that human informants or data in electronically readable form, are scarce. All tools are built in an incremental fashion, such that some simple tools (a bilingual dictionary or a glosser) can be delivered early in the development to support initial analysis tasks. More complex applications can be fielded in successive functional versions. The technology we demonstrate has first been applied to Persian-English machine translation within the Shiraz project and is currently being extended to cover languages such as Arabic, Japanese, Korean and others.
- [abstract]
[pdf]
Projection of direct objects. In Proceedings of WECOL 1999.
The existence of two distinct structural positions for the direct object has been argued for in languages such as Hindi, Turkish, Persian and Scottish Gaelic. In all of these approaches, the different structural positions give rise to distinct semantic interpretations. In this paper, I show that Eastern Armenian provides clear evidence for two object positions, displaying a strong correlation between case morphology, specificity, phrasal stress pattern and adjacency to the verb.
- [abstract]
[pdf]
Edited Proceedings
Proceedings of the Workshop on Computational Approaches to Arabic Script-Based Languages (CAASL) (co-editor)
- CAASL1, held at the University of Geneva, COLING 2004, August 28.
- CAASL2, held at the Linguistic Society of America Summer Institute, Stanford University, July 2007.
- CAASL3, held at Machine Translation Summit XII, Ottawa, Canada, August 26, 2009.
- Proceedings available at CAASL website
Proceedings of the 20th West Coast Conference on Formal Linguistics (with Leora Ann Bar-el). - Held at the University of Southern California, Cascadilla Press, 2001.
|