Developed at the
University of Lisbon, Dept. of Informatics,
NLX-Natural Language and Speech Group.
LX-Parser is a statistical constituency parser for Portuguese. It performs a syntactic analysis of Portuguese sentences in terms of their constituency structure.
For an online demo of this tool, check here.
LX-Parser is being developed by Patricia Gonçalves and João Silva, under the management of António Branco,
at the NLX-Natural Language and Speech Group,
partly in the scope of the SemanticShare Project, funded by FCT-Fundação para a Ciência e Tecnologia.
This work was partly supported by FCT-Foundation for Science and Technology under the grant FCT/PTDC/PLP/81157/2006 for the project
When mentioning this parser, please cite the following reference:
- Silva, João and António Branco and Sérgio Castro and Ruben Reis.
Out-of-the-Box Robust Parsing of Portuguese.
In Proceedings of the 9th International Conference on the Computational Processing of Portuguese (PROPOR'10), pp. 75–85.
To use LX-Parser you must agree to its license.
LX-Parser is made available as a standalone parser that you can download and run locally on your computer. To run it, you need the following:
- The parser model file, cintil.ser.gz
- Stanford Parser (requires Java 5 or later). Note that the model was created with version 1.6.5 of the parser. More recent versions of the software seem to be unable to load the model.
- LX-Tokenizer to tokenize input prior to parsing.
Example command line:
java -Xmx500m -cp /path/to/stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser -tokenized -sentences newline -outputFormat oneline -uwModel edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel cintil.ser.gz input.txt
A quick explanation of the options:
- For some more complex sentences, the default heap size used by Java might not be enough. We increase the maximum heap size to 500 megabytes with the -Xmx500m option.
- The path to the Stanford Parser JAR file is provided with the -cp option.
- The name of the Java class we wish to run (LexicalizedParser).
- The input to the parser must already be tokenized (see LX-Tokenizer for details on tokenization decisions). We indicate this through the -tokenized option.
- Each sentence in the input is separated by newline. We indicate this through the -sentences newline option.
- The output format is one parse per line. We indicate this through the -outputFormat oneline option. NB: The parser always adds a ROOT node. You can remove it in a post-processing step.
- A class (BaseUnknownWordModel, part of the Stanford Parser package) that implements a baseline unknown word model is used to handle unknown words. It is chosen by the -uwModel option.
- The final two arguments are the model file and the input file.
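The post-processing step mentioned above for removing the ROOT node could look roughly like the following Python sketch. It is not part of LX-Parser or the Stanford Parser. It assumes that each output line has the form "(ROOT <tree>)" with a single tree under ROOT, and the function name strip_root and the category labels in the example are purely illustrative.

```python
import sys


def strip_root(parse: str, root_label: str = "ROOT") -> str:
    """Remove an outer (ROOT ...) wrapper from a one-line bracketed parse.

    Assumption: the line looks like "(ROOT <tree>)" with one tree under ROOT.
    Lines that do not match this shape are returned unchanged (stripped of
    surrounding whitespace).
    """
    text = parse.strip()
    prefix = "(" + root_label
    if not text.startswith(prefix) or not text.endswith(")"):
        return text
    rest = text[len(prefix):]
    # The label must end here, so that e.g. "(ROOTED ..." is left alone.
    if not rest or rest[0] not in " \t(":
        return text
    # Drop the closing bracket that belonged to the ROOT node.
    return rest[:-1].strip()


if __name__ == "__main__":
    # Read parser output (one parse per line) on stdin, write cleaned lines.
    for line in sys.stdin:
        print(strip_root(line))
```

Assuming the sketch is saved as strip_root.py (a hypothetical file name), the parser output can be piped through it by appending "| python strip_root.py" to the example command line.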
To be available soon
Contact us using the following email address: 'nlx' concatenated with 'at' concatenated with 'di.fc.ul.pt'.
Why LX? LX is the "code" name that Lisboners like to use to refer to their hometown.