LX-Suite
LX-Suite (beta 2 version) is a freely available online service for the shallow processing of Portuguese. It was developed and is maintained by the NLX-Natural Language and Speech Group at the University of Lisbon, Department of Informatics.
Version history:
- Beta 2: Added lemmatization and morphological analysis.
- Beta 1: Sentence chunking, tokenization and POS.
You may also be interested in our LX-Conjugator and LX-Lemmatizer online services for the conjugation and lemmatization of verbs, and in the LX-Inflector online service for the inflection of nominal classes.
Features and Evaluation
LX-Suite is composed of a set of shallow processing tools:
- LX-Chunker:
  - Marks sentence boundaries with <s>…</s>, and paragraph boundaries with <p>…</p>.
  - Unwraps sentences split over different lines.
  An f-score of 99.94% was obtained when testing on a 12,000 sentence corpus accurately hand tagged with respect to sentence and paragraph boundaries.
- LX-Tokenizer:
  - Segments text into lexically relevant tokens, using whitespace as the separator. Note that, in these examples, the | (vertical bar) symbol is used to mark the token boundaries more clearly:
    um exemplo → |um|exemplo|
  - Expands contractions. Note that the first element of an expanded contraction is marked with an _ (underscore) symbol:
    do → |de_|o|
  - Marks spacing around punctuation or symbols. The \* and the */ symbols indicate a space to the left and a space to the right, respectively:
    um, dois e três → |um|,*/|dois|e|três|
    5.3 → |5|.|3|
    1. 2 → |1|.*/|2|
    8 . 6 → |8|\*.*/|6|
  - Detaches clitic pronouns from the verb. The detached pronoun is marked with a - (hyphen) symbol. When in mesoclisis, a -CL- mark is used to signal the original position of the detached clitic. Additionally, possible vocalic alterations of the verb form are marked with a # (hash) symbol:
    dá-se-lho → |dá|-se|-lhe|-o|
    afirmar-se-ia → |afirmar-CL-ia|-se|
    vê-las → |vê#|-las|
  - This tool also handles ambiguous strings. These are words that, depending on their particular occurrence, can be tokenized in different ways. For instance:
    deste → |deste| when occurring as a Verb
    deste → |de|este| when occurring as a contraction (Preposition + Demonstrative)
  This tool achieves an f-score of 99.72%.
- LX-Tagger:
  - Assigns a single morpho-syntactic tag, from the tagset below, to every token. The tag is attached to the token, using a / (slash) symbol as separator:
    um exemplo → um/IA exemplo/CN
  - Each individual token in multi-token expressions gets the tag of that expression prefixed by "L" and followed by the number of its position within the expression:
    de maneira a que → de/LCJ1 maneira/LCJ2 a/LCJ3 que/LCJ4
  This tagger was developed with TnT software over a small, 260k token, accurately hand tagged corpus. An accuracy of 96.87% was obtained by training the tagger on 90% of the 260k tokens and evaluating it on the held-out 10%, with this repeated over 10 different test runs and the results averaged.
- LX-Featurizer (nominal):
  - Assigns inflection feature values to words from the nominal categories, namely Gender (masculine or feminine), Number (singular or plural) and, when applicable, Person (1st, 2nd and 3rd):
    os/DA gatos/CN → os/DA#mp gatos/CN#mp
  - Assigns degree feature values (diminutive, superlative and comparative) to words from the nominal categories:
    os/DA gatinhos/CN → os/DA#mp gatinhos/CN#mp-dim
  - Sometimes, due to the so-called invariant words, the featurizer is not able to determine a feature value. In those cases, it assigns a g value for an underspecified Gender and an n value for an underspecified Number. Note, however, that if provided with an adequate context, the featurizer might resolve such cases:
    Vi/V pianistas/CN → Vi/V pianistas/CN#gp
    Vi/V as/DA pianistas/CN → Vi/V as/DA#fp pianistas/CN#fp
  This tool has an f-score of 91.07%.
- LX-Lemmatizer (nominal):
  - Assigns a lemma to words from the nominal categories (Adjectives, Common Nouns and Past Participles). This lemma corresponds to the form that one would find in a dictionary, typically the masculine singular form. The lemma is inserted into the token, with / (slash) as a delimiter:
    gatas/CN#fp → gatas/GATO/CN#fp
    normalíssimo/ADJ#ms-sup → normalíssimo/NORMAL/ADJ#ms-sup
  This tool has an f-score of 97.67%.
- LX-Lemmatizer and Featurizer (verbal):
  - Assigns a lemma and inflection feature values to verbs. The lemma corresponds to the infinitive form of the verb. The lemma is inserted into the token, with / (slash) as a delimiter:
    escrevi/V → escrevi/ESCREVER/V#ppi-1s
  The tool disambiguates among the various lemma-inflection pairs that can be assigned to a verb form, achieving 95.96% accuracy.
  For an online service supported by this tool (without performing disambiguation) see LX-Lemmatizer.
These tools work in a pipeline scheme, where each tool takes as input the output of the previous tool.
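Since each tool adds its annotation to the token itself, the output of the last stage can be read back by splitting on the delimiters described above. The following is a minimal Python sketch of such a reader, not part of LX-Suite: the names `Token`, `parse_token` and `parse_line` are hypothetical, and the sketch assumes that tokens are separated by whitespace and that the word form contains no slash (so electronic addresses and a slash punctuation token are not handled).

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    """One annotated token (hypothetical container, not an LX-Suite type)."""
    form: str
    lemma: Optional[str]
    tag: str
    feats: Optional[str]


def parse_token(annotated: str) -> Token:
    """Split 'form/TAG', 'form/TAG#feats' or 'form/LEMMA/TAG#feats'."""
    parts = annotated.split("/")
    if len(parts) == 2:
        form, last = parts
        lemma = None
    elif len(parts) == 3:
        form, lemma, last = parts
    else:
        raise ValueError(f"unexpected token shape: {annotated!r}")
    # Only the last field is split on '#', so a '#' that the tokenizer
    # left on the form itself (as in 'vê#') is kept untouched.
    tag, sep, feats = last.partition("#")
    return Token(form, lemma, tag, feats if sep else None)


def parse_line(line: str) -> List[Token]:
    """Parse a whitespace-separated line of annotated tokens."""
    return [parse_token(item) for item in line.split()]
```

For instance, `parse_token("gatas/GATO/CN#fp")` yields the form `gatas`, the lemma `GATO`, the tag `CN` and the feature string `fp`, while `parse_token("um/IA")` leaves the lemma and features empty.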
Authorship
LX-Suite is being developed by António Branco and João Silva, with the key contribution of Filipe Nunes (verbal lemmatizer), and the help of Francisco Costa, Catarina Ribeiro and Ricardo Santos at the NLX—Natural Language and Speech Group at the University of Lisbon, Department of Informatics.
Acknowledgments
The development of a state-of-the-art, complete suite of shallow processing tools for Portuguese was supported by FCT-Fundação para a Ciência e Tecnologia under the contract POSI/PLP/47058/2002 for the project TagShare and the contract POSI/PLP/61490/2004 for the project QueXting, and the European Commission under the contract FP6/STREP/27391 for the project LT4eL.
This project was developed in cooperation with CLUL—Centro de Linguística da Universidade de Lisboa. The training and test corpora prepared for the development of this demo evolved from a corpus provided by CLUL.
This demo includes a part-of-speech tagger developed with Thorsten Brants' TnT software with his written permission.
White Papers
Branco, António and João Silva, 2006. Dedicated Nominal Featurization of Portuguese. In Proceedings of the VII Encontro para o Processamento Computacional da Língua Portuguesa Escrita e Falada (PROPOR'06).
Barreto, Florbela, António Branco, Eduardo Ferreira, Amália Mendes, Maria Fernanda Bacelar do Nascimento, Filipe Nunes and João Silva, 2006. Open Resources and Tools for the Shallow Processing of Portuguese: The TagShare Project. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'06).
Branco, António and João Silva, 2006. A Suite of Shallow Processing Tools for Portuguese: LX-Suite. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL'06).
Branco, António, Filipe Nunes and João Silva, 2006. Verb Analysis in an Inflective Language: Simpler is better. Internal report, University of Lisbon, Department of Informatics, NLX-Natural Language and Speech Group.
Branco, António and João Silva, 2005. Accurate Annotation: an Efficiency Metric. In Nicolas Nicolov, Kalina Bontcheva, Galia Angelova and Ruslan Mitkov (eds.), Recent Advances in Natural Language Processing III, Amsterdam, John Benjamins, pp.173-182.
Branco, António and João Silva, 2004. Swift Development of State of the Art Taggers for Portuguese. In António Branco, Amália Mendes and Ricardo Ribeiro (orgs.), Language Technology for Portuguese: Shallow Processing Tools and Resources. Lisbon, Edições Colibri, pp. 29-46.
Branco, António and João Silva, 2004. Evaluating Solutions for the Rapid Development of State-of-the-Art POS Taggers for Portuguese. In Maria Teresa Lino, Maria Francisca Xavier, Fátima Ferreira, Rute Costa and Raquel Silva (eds.), Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC'04), Paris, ELRA, pp.507-510.
Branco, António, Amália Mendes and Ricardo Ribeiro (eds.), 2003. Tagging and Shallow Processing of Portuguese: Workshop Notes of TASHA'2003. Lisbon, University of Lisbon, Faculty of Sciences, Department of Informatics, Technical Report TR-2003-28.
Branco, António and João Silva, 2003. Portuguese-specific Issues in the Rapid Development of State of the Art Taggers. In António Branco, Amália Mendes and Ricardo Ribeiro (eds.), 2003, pp.7-10.
Mendes, Amália, Raquel Amaro, M. Fernanda Bacelar do Nascimento, 2004. Morphological Tagging of a Spoken Portuguese Corpus Using Available Resources. In António Branco, Amália Mendes and Ricardo Ribeiro (orgs.), Language Technology for Portuguese: Shallow Processing Tools and Resources. Lisbon, Edições Colibri, pp. 47-62.
Mendes, Amália, Raquel Amaro, M. Fernanda Bacelar do Nascimento, 2003. Reusing Available Resources for Tagging a Spoken Portuguese Corpus. In António Branco, Amália Mendes and Ricardo Ribeiro (eds.), 2003, pp.25-28.
TagShare, 2004, Manual de Etiquetação e Convenções, Internal Report, University of Lisbon, Department of Informatics, NLX-Natural Language and Speech Group.
Contact Us
Contact us using the following email address: 'nlxgroup' concatenated with 'at' concatenated with 'di.fc.ul.pt'.
Why LX-Suite?
LX because LX is the "code" name Lisboners like to use to refer to their hometown.
Tagset: POS
Tag | Category | Examples |
---|---|---|
ADJ | Adjectives | bom, brilhante, eficaz, … |
ADV | Adverbs | hoje, já, sim, felizmente, … |
CARD | Cardinals | zero, dez, cem, mil, … |
CJ | Conjunctions | e, ou, tal como, … |
CL | Clitics | o, lhe, se, … |
CN | Common Nouns | computador, cidade, ideia, … |
DA | Definite Articles | o, os, … |
DEM | Demonstratives | este, esses, aquele, … |
DFR | Denominators of Fractions | meio, terço, décimo, %, … |
DGTR | Roman Numerals | VI, LX, MMIII, MCMXCIX, … |
DGT | Digits | 0, 1, 42, 12345, 67890, … |
DM | Discourse Marker | olá, … |
EADR | Electronic Addresses | http://www.di.fc.ul.pt, … |
EOE | End of Enumeration | etc |
EXC | Exclamative | ah, ei, etc. |
GER | Gerunds | sendo, afirmando, vivendo, … |
GERAUX | Gerund "ter"/"haver" in compound tenses | tendo, havendo … |
IA | Indefinite Articles | uns, umas, … |
IND | Indefinites | tudo, alguém, ninguém, … |
INF | Infinitive | ser, afirmar, viver, … |
INFAUX | Infinitive "ter"/"haver" in compound tenses | ter, haver … |
INT | Interrogatives | quem, como, quando, … |
ITJ | Interjection | bolas, caramba, … |
LTR | Letters | a, b, c, … |
MGT | Magnitude Classes | unidade, dezena, dúzia, resma, … |
MTH | Months | Janeiro, Dezembro, … |
NP | Noun Phrases | idem, … |
ORD | Ordinals | primeiro, centésimo, penúltimo, … |
PADR | Part of Address | Rua, av., rot., … |
PNM | Part of Name | Lisboa, António, João, … |
PNT | Punctuation Marks | ., ?, (, … |
POSS | Possessives | meu, teu, seu, … |
PPA | Past Participles not in compound tenses | afirmados, vivida, … |
PP | Prepositional Phrases | algures, … |
PPT | Past Participle in compound tenses | sido, afirmado, vivido, … |
PREP | Prepositions | de, para, em redor de, … |
PRS | Personals | eu, tu, ele, … |
QNT | Quantifiers | todos, muitos, nenhum, … |
REL | Relatives | que, cujo, tal que, … |
STT | Social Titles | Presidente, drª., prof., … |
SYB | Symbols | @, #, &, … |
TERMN | Optional Terminations | (s), (as), … |
UM | "um" or "uma" | um, uma |
UNIT | Abbreviated Measurement Units | kg., km., … |
VAUX | Finite "ter" or "haver" in compound tenses | temos, haveriam, … |
V | Verbs (other than PPA, PPT, INF or GER) | falou, falaria, … |
WD | Week Days | segunda, terça-feira, sábado, … |
Multi-Word Expressions | ||
LADV1…LADVn | Multi-Word Adverbs | de facto, em suma, um pouco, … |
LCJ1…LCJn | Multi-Word Conjunctions | assim como, já que, … |
LDEM1…LDEMn | Multi-Word Demonstratives | o mesmo, … |
LDFR1…LDFRn | Multi-Word Denominators of Fractions | por cento |
LDM1…LDMn | Multi-Word Discourse Markers | pois não, até logo, … |
LITJ1…LITJn | Multi-Word Interjections | meu Deus |
LPRS1…LPRSn | Multi-Word Personals | a gente, si mesmo, V. Exa., … |
LPREP1…LPREPn | Multi-Word Prepositions | através de, a partir de, … |
LQD1…LQDn | Multi-Word Quantifiers | uns quantos, … |
LREL1…LRELn | Multi-Word Relatives | tal como, … |
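
The multi-word tags above follow a regular shape: an "L" prefix, the tag of the expression, and the position of the token within it. As a minimal Python sketch, under that assumption only, consecutive tokens of one expression can be merged back into a single unit. The function names `split_mwe_tag` and `group_mwes` are hypothetical and not part of LX-Suite. Note that the part after "L" is returned as found, so LQD yields QD rather than the QNT tag listed for single-word quantifiers.

```python
import re
from typing import List, Optional, Tuple

# 'L' + expression tag + position, e.g. LCJ3. Plain tags such as LTR
# carry no position number and therefore do not match.
_MWE_TAG = re.compile(r"^L([A-Z]+)(\d+)$")


def split_mwe_tag(tag: str) -> Optional[Tuple[str, int]]:
    """Return (expression tag, position) or None for a single-token tag."""
    match = _MWE_TAG.match(tag)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def group_mwes(tokens: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge consecutive tokens of one multi-word expression."""
    grouped: List[Tuple[str, str]] = []
    open_base: Optional[str] = None  # tag of the expression being built
    for form, tag in tokens:
        info = split_mwe_tag(tag)
        if info is None:
            grouped.append((form, tag))
            open_base = None
            continue
        base, position = info
        if position > 1 and open_base == base:
            previous_form, _ = grouped[-1]
            grouped[-1] = (previous_form + " " + form, base)
        else:
            grouped.append((form, base))
            open_base = base
    return grouped
```

Applied to the tagger example, the four tokens de/LCJ1 maneira/LCJ2 a/LCJ3 que/LCJ4 are merged into the single pair ("de maneira a que", "CJ").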
Tagset: Other tags
Tag | Description |
---|---|
m | Masculine |
f | Feminine |
s | Singular |
p | Plural |
dim | Diminutive |
sup | Superlative |
comp | Comparative |
1 | First Person |
2 | Second Person |
3 | Third Person |
pi | Presente do Indicativo |
ppi | Pretérito Perfeito do Indicativo |
ii | Pretérito Imperfeito do Indicativo |
mpi | Pretérito Mais que Perfeito do Indicativo |
fi | Futuro do Indicativo |
c | Condicional |
pc | Presente do Conjuntivo |
ic | Pretérito Imperfeito do Conjuntivo |
fc | Futuro do Conjuntivo |
imp | Imperativo |
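
The feature strings that follow the # symbol can be expanded with the table above. The Python sketch below is only an illustration: the layout it assumes (gender and number, then an optional degree after a hyphen, for nominal tokens; tense, then person and number after a hyphen, for verbs) is inferred from the examples on this page, such as mp-dim and ppi-1s, and the function names are hypothetical. The g and n values are the underspecified Gender and Number values described for LX-Featurizer. Person values on nominal tokens are not covered, since no example of them is shown here.

```python
from typing import List

GENDER = {"m": "Masculine", "f": "Feminine", "g": "Underspecified Gender"}
NUMBER = {"s": "Singular", "p": "Plural", "n": "Underspecified Number"}
DEGREE = {"dim": "Diminutive", "sup": "Superlative", "comp": "Comparative"}
PERSON = {"1": "First Person", "2": "Second Person", "3": "Third Person"}
TENSE = {
    "pi": "Presente do Indicativo",
    "ppi": "Pretérito Perfeito do Indicativo",
    "ii": "Pretérito Imperfeito do Indicativo",
    "mpi": "Pretérito Mais que Perfeito do Indicativo",
    "fi": "Futuro do Indicativo",
    "c": "Condicional",
    "pc": "Presente do Conjuntivo",
    "ic": "Pretérito Imperfeito do Conjuntivo",
    "fc": "Futuro do Conjuntivo",
    "imp": "Imperativo",
}


def describe_nominal(feats: str) -> List[str]:
    """Expand a nominal feature string such as 'mp' or 'ms-sup'."""
    head, _, degree = feats.partition("-")
    if len(head) != 2:
        raise ValueError(f"expected gender and number, got {head!r}")
    labels = [GENDER[head[0]], NUMBER[head[1]]]
    if degree:
        labels.append(DEGREE[degree])
    return labels


def describe_verbal(feats: str) -> List[str]:
    """Expand a verbal feature string such as 'ppi-1s'."""
    tense, _, person_number = feats.partition("-")
    if len(person_number) != 2:
        raise ValueError(f"expected person and number, got {person_number!r}")
    return [TENSE[tense], PERSON[person_number[0]], NUMBER[person_number[1]]]
```

With these tables, `describe_nominal("mp-dim")` gives Masculine, Plural, Diminutive, and `describe_verbal("ppi-1s")` gives Pretérito Perfeito do Indicativo, First Person, Singular.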