LX-Tokenizer

LX-Tokenizer segments text into lexically relevant tokens, using whitespace as the separator. In the examples below, the | (vertical bar) symbol is used to mark token boundaries more clearly:
    um exemplo → |um|exemplo|
This tool expands contractions. The first element of an expanded contraction is marked with an _ (underscore) symbol:
    do → |de_|o|
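The two behaviours above (whitespace segmentation and contraction expansion) can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the `CONTRACTIONS` table and the function names `tokenize` and `show` are hypothetical, and the table covers only a few "de" + article forms.

```python
# Hypothetical lookup table, for illustration only: a few "de" + article
# contractions. A real tokenizer needs a much larger inventory.
CONTRACTIONS = {
    "do": ("de", "o"),
    "da": ("de", "a"),
    "dos": ("de", "os"),
    "das": ("de", "as"),
}


def tokenize(text):
    """Split on whitespace and expand the contractions listed above."""
    tokens = []
    for word in text.split():
        if word in CONTRACTIONS:
            first, second = CONTRACTIONS[word]
            # The first element of an expanded contraction gets an underscore.
            tokens.extend([first + "_", second])
        else:
            tokens.append(word)
    return tokens


def show(tokens):
    """Render tokens with | marking the token boundaries."""
    return "|" + "|".join(tokens) + "|"


print(show(tokenize("um exemplo")))  # |um|exemplo|
print(show(tokenize("do")))          # |de_|o|
```

The sketch matches lowercase forms only and expands every listed form unconditionally, so ambiguous strings must be kept out of the table.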
It marks spacing around punctuation or symbols. The \* and the */ symbols indicate a space to the left and a space to the right, respectively:
    um, dois e três → |um|,*/|dois|e|três|
    5.3 → |5|.|3|
    1. 2 → |1|.*/|2|
    8 . 6 → |8|\*.*/|6|
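One way to reproduce the spacing marks in these examples is to look at the characters on either side of each punctuation symbol. The sketch below is only an illustration. The function name `mark_spacing` and the regular expression are assumptions, and it treats the start and end of the string as "no space", which the examples do not settle. It also does not handle contractions or clitics.

```python
import re

# A run of word characters, or a single non-space symbol.
TOKEN = re.compile(r"(\w+)|([^\w\s])")


def mark_spacing(text):
    """Tokenize text and annotate symbols with \\* (space on the left)
    and */ (space on the right)."""
    tokens = []
    for m in TOKEN.finditer(text):
        if m.group(1) is not None:
            tokens.append(m.group(1))
            continue
        # Assumption: string boundaries count as "no space".
        left = m.start() > 0 and text[m.start() - 1].isspace()
        right = m.end() < len(text) and text[m.end()].isspace()
        tokens.append(("\\*" if left else "") + m.group(2) + ("*/" if right else ""))
    return "|" + "|".join(tokens) + "|"


print(mark_spacing("um, dois e três"))  # |um|,*/|dois|e|três|
print(mark_spacing("5.3"))              # |5|.|3|
print(mark_spacing("1. 2"))             # |1|.*/|2|
print(mark_spacing("8 . 6"))            # |8|\*.*/|6|
```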
It detaches clitic pronouns from the verb. Each detached pronoun is marked with a - (hyphen) symbol. In mesoclisis, a -CL- mark signals the original position of the detached clitic. Additionally, possible vocalic alterations of the verb form are marked with a # (hash) symbol:
    dá-se-lho → |dá|-se|-lhe|-o|
    afirmar-se-ia → |afirmar-CL-ia|-se|
    vê-las → |vê#|-las|
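The three clitic examples can be reproduced with a small rule-based sketch. Everything in it is an assumption made for illustration: the word lists are hypothetical and incomplete, the function name `detach_clitics` is invented, and the rule that an l-form clitic (lo, la, los, las) signals an altered verb form is inferred from the vê-las example only. How the real tool combines the # and -CL- marks in a single form is not shown in the examples, so the ordering used here is a guess.

```python
# Hypothetical, incomplete tables for illustration only.
CLITIC_CONTRACTIONS = {"lho": ["lhe", "o"], "lha": ["lhe", "a"]}
L_FORMS = {"lo", "la", "los", "las"}
SIMPLE_CLITICS = {"me", "te", "se", "lhe", "lhes", "nos", "vos",
                  "o", "a", "os", "as"} | L_FORMS
MESOCLISIS_ENDINGS = {"ei", "ás", "á", "emos", "eis", "ão",
                      "ia", "ias", "íamos", "íeis", "iam"}


def detach_clitics(word):
    """Return the verb token followed by its detached clitics."""
    parts = word.split("-")
    if len(parts) < 2:
        return [word]
    verb, rest = parts[0], parts[1:]

    # Mesoclisis: verb stem, clitic(s), then a future/conditional ending.
    ending = None
    if len(rest) >= 2 and rest[-1] in MESOCLISIS_ENDINGS:
        ending = rest.pop()

    # Leave other hyphenated words untouched.
    if not all(c in SIMPLE_CLITICS or c in CLITIC_CONTRACTIONS for c in rest):
        return [word]

    if any(c in L_FORMS for c in rest):
        verb += "#"  # assumed marker of an altered verb form
    if ending is not None:
        verb += "-CL-" + ending  # ordering of # and -CL- is a guess

    tokens = [verb]
    for clitic in rest:
        for piece in CLITIC_CONTRACTIONS.get(clitic, [clitic]):
            tokens.append("-" + piece)
    return tokens


print(detach_clitics("dá-se-lho"))      # ['dá', '-se', '-lhe', '-o']
print(detach_clitics("afirmar-se-ia"))  # ['afirmar-CL-ia', '-se']
print(detach_clitics("vê-las"))         # ['vê#', '-las']
```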
This tool also handles ambiguous strings, that is, words that can be tokenized in different ways depending on the particular occurrence. For instance:
    deste → |deste| when occurring as a Verb
    deste → |de_|este| when occurring as a contraction (Preposition + Demonstrative)
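The page does not describe how the tool decides between the two readings. The sketch below only shows the shape of the problem: the caller supplies the reading and a lookup table returns the matching tokenization. The table `AMBIGUOUS`, the function name `tokenize_ambiguous` and the tag names "V" and "PREP+DEM" are hypothetical.

```python
# Hypothetical table and tag names, for illustration only.
AMBIGUOUS = {
    "deste": {
        "V": ["deste"],               # verb form: kept as one token
        "PREP+DEM": ["de_", "este"],  # contraction: expanded, first part marked
    },
}


def tokenize_ambiguous(word, reading):
    """Return the tokens for `word` under the reading chosen by the caller."""
    readings = AMBIGUOUS.get(word)
    if readings is None:
        return [word]  # not an ambiguous string: leave unchanged
    return readings[reading]


print(tokenize_ambiguous("deste", "V"))         # ['deste']
print(tokenize_ambiguous("deste", "PREP+DEM"))  # ['de_', 'este']
```

In practice the reading would have to come from context, for example from a part-of-speech decision made on the surrounding words.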

This tool achieves an F-score of 99.72% (see the evaluation conditions in the publication below).

Online Demo

For an online demo of this tool, check here

This work was partly supported by FCT, the Foundation for Science and Technology.

When mentioning this tokenizer, this is the reference to be used:

Branco, António and João Silva, 2004. Evaluating Solutions for the Rapid Development of State-of-the-Art POS Taggers for Portuguese. In Maria Teresa Lino, Maria Francisca Xavier, Fátima Ferreira, Rute Costa and Raquel Silva (eds.), Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC2004), Paris, ELRA, ISBN 2-9517408-1-6, pp.507-510.

To use LX-Tokenizer you must accept the terms of this license.

You can download the program here.

The name LX was chosen because LX is the "code" name Lisboners like to use to refer to their hometown.