Developed at the University of Lisbon, Dept. of Informatics, by the NLX-Natural Language and Speech Group.
The | (vertical bar) symbol is used to mark the token boundaries more clearly:
um exemplo → |um|exemplo|
The first element of an expanded contraction is marked with an _ (underscore) symbol:
do → |de_|o|
The \* and the */ symbols indicate a space to the left and a space to the right, respectively:
um, dois e três → |um|,*/|dois|e|três|
5.3 → |5|.|3|
1. 2 → |1|.*/|2|
8 . 6 → |8|\*.*/|6|
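The spacing marks in the examples above can be reproduced with a short Python sketch. This is not the LX-Tokenizer implementation: the function name `mark_spacing` and the regular expression are assumptions made for illustration, and the sketch only covers the separation of words and symbols, not contractions or clitics.

```python
import re

# Assumption: a word is a run of letters/digits, and every other
# non-space character is a one-character symbol token.
TOKEN_RE = re.compile(r"(?P<word>\w+)|(?P<sym>[^\w\s])")

def mark_spacing(text):
    """Return text in |token|token| notation with \\* and */ marks on symbols."""
    tokens = []
    for m in TOKEN_RE.finditer(text):
        tok = m.group()
        if m.lastgroup == "sym":
            # A symbol records whether whitespace stood directly beside it.
            left = m.start() > 0 and text[m.start() - 1].isspace()
            right = m.end() < len(text) and text[m.end()].isspace()
            tok = ("\\*" if left else "") + tok + ("*/" if right else "")
        tokens.append(tok)
    return "|" + "|".join(tokens) + "|"

print(mark_spacing("um, dois e três"))  # |um|,*/|dois|e|três|
print(mark_spacing("8 . 6"))            # |8|\*.*/|6|
```

Because the marks record the whitespace around each symbol, the original spacing of strings such as 5.3, 1. 2 and 8 . 6 remains distinguishable after tokenization.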
Detached clitics are marked with a - (hyphen) symbol. When in mesoclisis, a -CL- mark is used to signal the original position of the detached clitic. Additionally, possible vocalic alterations of the verb form are marked with a # (hash) symbol:
dá-se-lho → |dá|-se|-lhe|-o|
afirmar-se-ia → |afirmar-CL-ia|-se|
vê-las → |vê#|-las|
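As a rough illustration of the clitic notation, the following Python sketch reproduces the three examples above. It is a toy under stated assumptions, not the tool's actual method: the function name `detach_clitics`, the tiny hand-written lexicons, and the rule that an l-form clitic (-lo, -la, -los, -las) implies an altered verb form are all hypothetical.

```python
# Hypothetical mini-lexicons, for illustration only.
CONTRACTED = {  # contracted clitics and their expansion
    "lho": ("lhe", "o"), "lha": ("lhe", "a"),
    "mo": ("me", "o"), "ma": ("me", "a"),
    "to": ("te", "o"), "ta": ("te", "a"),
}
ENDINGS = {  # future and conditional endings that can follow a mesoclitic
    "ei", "ás", "á", "emos", "eis", "ão",
    "ia", "ias", "íamos", "íeis", "iam",
}
L_FORMS = {"lo", "la", "los", "las"}

def detach_clitics(word):
    """Split a hyphenated verb form into |verb|-clitic|... notation."""
    verb, *rest = word.split("-")
    mesoclisis = len(rest) >= 2 and rest[-1] in ENDINGS
    if mesoclisis:
        # Rejoin stem and ending; -CL- keeps the clitic's original position.
        verb = f"{verb}-CL-{rest[-1]}"
        rest = rest[:-1]
    clitics = []
    for c in rest:
        clitics.extend(CONTRACTED.get(c, (c,)))
    # Assumed heuristic: an l-form clitic signals an altered verb form.
    # The source does not show this case in mesoclisis, so it is skipped there.
    if clitics and clitics[0] in L_FORMS and not mesoclisis:
        verb += "#"
    return "|" + "|".join([verb] + ["-" + c for c in clitics]) + "|"

print(detach_clitics("dá-se-lho"))      # |dá|-se|-lhe|-o|
print(detach_clitics("afirmar-se-ia"))  # |afirmar-CL-ia|-se|
print(detach_clitics("vê-las"))         # |vê#|-las|
```

A real tokenizer would also need to tell verb forms with clitics apart from ordinary hyphenated words, which this sketch does not attempt.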
deste → |deste| when occurring as a Verb
deste → |de|este| when occurring as a contraction (Preposition + Demonstrative)
For an online demo of this tool, check here
LX-Tokenizer was developed and is maintained at University of Lisbon by the NLX-Natural Language and Speech Group of the Department of Informatics.
This work was partly supported by FCT-Foundation of Science and Technology.
When mentioning this tokenizer, this is the reference to be used:
To use LX-Tokenizer you must accept the terms of this license.
You can download the program here.
Contact us using the following email address: 'nlxgroup' concatenated with 'at' concatenated with 'di.fc.ul.pt'.
The name LX was chosen because LX is the "code" name that Lisboners like to use to refer to their hometown.