LX-Tokenizer

Developed at the University of Lisbon, Dept. of Informatics, by the NLX-Natural Language and Speech Group.


LX-Tokenizer

LX-Tokenizer segments text into lexically relevant tokens, using whitespace as the separator. Note that, in these examples, the | (vertical bar) symbol is used to mark the token boundaries more clearly.
um exemplo → |um|exemplo|
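
As a rough illustration of this notation (not the tool's actual implementation), the whitespace step and the vertical-bar display can be sketched in Python. The function names split_on_whitespace and show_tokens are hypothetical and exist only for this example.

```python
def split_on_whitespace(text):
    """Split text into tokens wherever whitespace occurs."""
    return text.split()


def show_tokens(tokens):
    """Render a token list with | (vertical bar) marking the boundaries."""
    return "|" + "|".join(tokens) + "|"


print(show_tokens(split_on_whitespace("um exemplo")))  # |um|exemplo|
```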

This tool expands contractions. Note that the first element of an expanded contraction is marked with an _ (underscore) symbol:
do → |de_|o|
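
A minimal sketch of this expansion, assuming a small lookup table, could look as follows. Only the entry for "do" comes from the example above. The other entries are assumptions based on the standard contractions of "de" with the definite articles, and the names CONTRACTIONS and expand_contractions are hypothetical. The real tool presumably covers many more forms, as well as capitalization and ambiguity.

```python
# Illustrative table only: "do" is taken from the example above,
# the remaining entries are assumed by analogy (de + definite article).
CONTRACTIONS = {
    "do": ["de_", "o"],
    "da": ["de_", "a"],
    "dos": ["de_", "os"],
    "das": ["de_", "as"],
}


def expand_contractions(tokens):
    """Replace each known contraction by its two parts; keep other tokens."""
    expanded = []
    for token in tokens:
        expanded.extend(CONTRACTIONS.get(token, [token]))
    return expanded


tokens = expand_contractions("o livro do autor".split())
print("|" + "|".join(tokens) + "|")  # |o|livro|de_|o|autor|
```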

It marks spacing around punctuation or symbols. The \* and the */ symbols indicate a space to the left and a space to the right, respectively:
um, dois e três → |um|,*/|dois|e|três|
5.3 → |5|.|3|
1. 2 → |1|.*/|2|
8 . 6 → |8|\*.*/|6|
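
The spacing marks in the four examples above can be reproduced with a short Python sketch. It is an assumption-laden illustration, not the tool's own code: the regular expression, the function name mark_spacing, and the choice to add no mark at the very start or end of the input are all guesses made for this example.

```python
import re

# A token is either a run of word characters or one punctuation/symbol character.
TOKEN_RE = re.compile(r"(?P<word>\w+)|(?P<symbol>[^\w\s])")


def mark_spacing(text):
    """Tokenize text and mark spaces adjacent to punctuation or symbols."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        token = match.group()
        if match.lastgroup == "symbol":
            start, end = match.span()
            # \* marks a space to the left of the symbol.
            if start > 0 and text[start - 1].isspace():
                token = "\\*" + token
            # */ marks a space to the right of the symbol.
            if end < len(text) and text[end].isspace():
                token = token + "*/"
        tokens.append(token)
    return "|" + "|".join(tokens) + "|"


print(mark_spacing("um, dois e três"))  # |um|,*/|dois|e|três|
print(mark_spacing("5.3"))              # |5|.|3|
print(mark_spacing("1. 2"))             # |1|.*/|2|
print(mark_spacing("8 . 6"))            # |8|\*.*/|6|
```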

It detaches clitic pronouns from the verb. The detached pronoun is marked with a - (hyphen) symbol. When in mesoclisis, a -CL- mark is used to signal the original position of the detached clitic. Additionally, possible vocalic alterations of the verb form are marked with a # (hash) symbol:
dá-se-lho → |dá|-se|-lhe|-o|
afirmar-se-ia → |afirmar-CL-ia|-se|
vê-las → |vê#|-las|
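
The three examples above can be mimicked by the following Python sketch. It is a simplified illustration under stated assumptions, not the tool's algorithm: the clitic list, the contraction table, the set of future and conditional endings used to spot mesoclisis, and the rule that an l-initial clitic (lo, la, los, las) signals an altered verb form are all supplied here for the example. The sketch does not attempt to mark vocalic alterations in mesoclisis, and the name detach_clitics is hypothetical.

```python
# Assumed inventories, for illustration only.
L_FORMS = {"lo", "la", "los", "las"}
CLITICS = {"me", "te", "se", "nos", "vos", "lhe", "lhes",
           "o", "a", "os", "as"} | L_FORMS
CLITIC_CONTRACTIONS = {"lho": ["lhe", "o"], "lha": ["lhe", "a"],
                       "lhos": ["lhe", "os"], "lhas": ["lhe", "as"]}
# Future and conditional endings that can follow a mesoclitic pronoun.
ENDINGS = {"ei", "ás", "á", "emos", "eis", "ão",
           "ia", "ias", "íamos", "íeis", "iam"}


def detach_clitics(word):
    """Split a hyphenated verb form into the verb and its clitic pronouns."""
    verb, *rest = word.split("-")
    ending = None
    if len(rest) >= 2 and rest[-1] in ENDINGS:
        ending = rest.pop()  # mesoclisis: verb-clitic-ending
    clitics = []
    for part in rest:
        pieces = CLITIC_CONTRACTIONS.get(part, [part])
        if any(piece not in CLITICS for piece in pieces):
            return [word]  # not a verb + clitic sequence; leave untouched
        clitics.extend(pieces)
    if not clitics:
        return [word]
    if ending is not None:
        verb = f"{verb}-CL-{ending}"  # -CL- keeps the clitic's original slot
    elif clitics[0] in L_FORMS:
        verb += "#"  # verb form was altered before an l-initial clitic
    return [verb] + ["-" + clitic for clitic in clitics]


for form in ("dá-se-lho", "afirmar-se-ia", "vê-las"):
    print("|" + "|".join(detach_clitics(form)) + "|")
# |dá|-se|-lhe|-o|
# |afirmar-CL-ia|-se|
# |vê#|-las|
```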

This tool also handles ambiguous strings. These are words that, depending on their particular occurrence, can be tokenized in different ways. For instance:
deste → |deste| when occurring as a Verb
deste → |de_|este| when occurring as a contraction (Preposition + Demonstrative)

This tool achieves an F-score of 99.72% (see the evaluation conditions in the publication cited below).
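
The handling of an ambiguous string such as "deste" can be pictured with the Python sketch below. This page does not say how the tool decides between the readings, so the sketch simply assumes that the caller already knows which reading applies (for instance from context). The table AMBIGUOUS, the reading labels "verb" and "contraction", and the function name tokenize_ambiguous are hypothetical. The contraction reading follows the underscore convention described for expanded contractions.

```python
# Illustrative table with the single ambiguous string discussed above.
AMBIGUOUS = {
    "deste": {
        "verb": ["deste"],
        "contraction": ["de_", "este"],
    },
}


def tokenize_ambiguous(word, reading):
    """Return the tokens for word under the given reading."""
    readings = AMBIGUOUS.get(word)
    if readings is None:
        return [word]  # not ambiguous: keep as a single token
    return readings[reading]


print(tokenize_ambiguous("deste", "verb"))         # ['deste']
print(tokenize_ambiguous("deste", "contraction"))  # ['de_', 'este']
```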

Online Demo

For an online demo of this tool, check here

Authorship

LX-Tokenizer was developed and is maintained at the University of Lisbon by the NLX-Natural Language and Speech Group of the Department of Informatics.

Acknowledgments

This work was partly supported by FCT-Foundation for Science and Technology.

Citation

When mentioning this tokenizer, this is the reference to be used:

License

To use LX-Tokenizer you must accept the terms of this license.

Release

You can download the program here.

Contact Us

Contact us using the following email address: 'nlxgroup' concatenated with 'at' concatenated with 'di.fc.ul.pt'.

Why LX-Tokenizer?

LX because LX is the "code" name Lisboners like to use to refer to their hometown.