2012
Announcing Prague Czech-English Dependency Treebank 2.0
Jan Hajič | Eva Hajičová | Jarmila Panevová | Petr Sgall | Ondřej Bojar | Silvie Cinková | Eva Fučíková | Marie Mikulová | Petr Pajas | Jan Popelka | Jiří Semecký | Jana Šindlerová | Jan Štěpánek | Josef Toman | Zdeňka Urešová | Zdeněk Žabokrtský
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
We introduce a substantial update of the Prague Czech-English Dependency Treebank, a parallel corpus manually annotated at the deep syntactic layer of linguistic representation. The English part consists of the Wall Street Journal (WSJ) section of the Penn Treebank. The Czech part was translated from the English source sentence by sentence. This paper gives a high-level overview of the underlying linguistic theory (the so-called tectogrammatical annotation), with some details of its most important features, such as valency annotation, ellipsis reconstruction, and coreference.
2006
On Automatic Assignment of Verb Valency Frames in Czech
Jiří Semecký
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
Many recent NLP applications, including machine translation and information retrieval, could benefit from semantic analysis of language data at the sentence level. This paper presents a method for automatic disambiguation of verb valency frames on Czech data. For each verb occurrence, we extracted features describing its local context. We experimented with diverse types of features, including morphological, syntax-based, idiomatic, animacy and WordNet-based features. The main contribution of the paper lies in determining which ones are most useful for the disambiguation task. The considered features were classified using decision trees, rule-based learning and a Naïve Bayes classifier. We evaluated the methods using 10-fold cross-validation on VALEVAL, a manually annotated corpus of frame annotations containing 7,778 sentences. Syntax-based features proved to be the most effective. When we used the full set of features, we achieved an accuracy of 80.55% against a baseline of 67.87% obtained by assigning the most frequent frame.
Constructing an English Valency Lexicon
Jiří Semecký | Silvie Cinková
Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora 2006
2004
Searching for Topics in a Large Collection of Texts
Martin Holub | Jiří Semecký | Jiří Diviš
Proceedings of the ACL Student Research Workshop
Corpus-based Induction of an LFG Syntax-Semantics Interface for Frame Semantic Processing
Anette Frank | Jiří Semecký
Proceedings of the 5th International Workshop on Linguistically Interpreted Corpora