Language resources

In the course of our R&D activities, and as instrumental assets for the execution of our projects, we have developed or are developing the following language resources:


LX-DSemVectors

Distributional semantic representation of Portuguese words (a.k.a. word embeddings).

This resource is available from the PORTULAN CLARIN infrastructure. You can find it here.

Other versions are distributed via GitHub.


LX-4WAnalogies

Test set based on four-word analogies for distributional semantic representation of Portuguese words (a.k.a. word embeddings).

This resource is available from the PORTULAN CLARIN infrastructure. You can find it here.
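
As an illustration of how a four-word analogy test set is typically used to evaluate word embeddings, here is a minimal sketch. It assumes the common vector-offset approach (for "a is to b as c is to d", pick the word closest to b - a + c by cosine similarity); this is not necessarily the evaluation procedure used with LX-4WAnalogies. The vectors are toy values invented for the example, not taken from LX-DSemVectors, and the names `TOY_VECTORS`, `solve_analogy` and `accuracy` are hypothetical.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only;
# they are NOT taken from LX-DSemVectors.
TOY_VECTORS = {
    "rei": [0.9, 0.8, 0.1],
    "rainha": [0.9, 0.1, 0.8],
    "homem": [0.1, 0.9, 0.1],
    "mulher": [0.1, 0.1, 0.9],
    "mesa": [0.5, 0.5, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, vectors):
    """Return the word d that best completes 'a is to b as c is to d'."""
    # Vector-offset method: the target point is b - a + c.
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidate answers.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


def accuracy(analogies, vectors):
    """Fraction of (a, b, c, d) analogies whose predicted d is correct."""
    hits = sum(1 for a, b, c, d in analogies
               if solve_analogy(a, b, c, vectors) == d)
    return hits / len(analogies)


if __name__ == "__main__":
    test_set = [
        ("homem", "rei", "mulher", "rainha"),
        ("rei", "rainha", "homem", "mulher"),
    ]
    print(solve_analogy("homem", "rei", "mulher", TOY_VECTORS))
    print(accuracy(test_set, TOY_VECTORS))
```

With real embeddings, the dictionary would be replaced by the vectors loaded from the distributed model, and the test set by the analogy entries of the resource.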


TimeBankPT

Portuguese corpus annotated with rich temporal annotations, adopting the TimeML conventions. It includes annotations not only of temporal expressions but also of events and temporal relations. This corpus is the result of translating and adapting the English corpus used in the first TempEval challenge to the Portuguese language.

This resource is available from the PORTULAN CLARIN infrastructure. You can find it here.


DeepBankPT

Bank of deep grammatical representations, sentence-aligned with the Penn Treebank of English: corpus of Portuguese sentences annotated with their fully fledged grammatical representations, according to an HPSG grammar. The raw text corpus results from the translation into Portuguese of the WSJ corpus of English.

This resource is available from the PORTULAN CLARIN infrastructure. You can find it here.


LogicalFormBankPT

Bank of logical forms, sentence-aligned with the Penn Treebank of English: corpus of Portuguese sentences annotated with logical forms representing their meaning. The raw text corpus results from the translation into Portuguese of the WSJ corpus of English.

This resource is available from the PORTULAN CLARIN infrastructure. You can find it here.


DependencyBankPT

Dependency bank of Portuguese, sentence-aligned with the Penn Treebank of English: corpus of Portuguese sentences annotated with graphs representing grammatical dependencies, whose arcs are decorated with grammatical functions and semantic roles. The raw text corpus results from the translation into Portuguese of the WSJ corpus of English.

This resource is available from the PORTULAN CLARIN infrastructure. You can find it here.


PropBankPT

PropBank of Portuguese, sentence-aligned with the Penn Treebank of English: corpus of Portuguese sentences annotated with trees representing syntactic constituency, decorated with grammatical functions and semantic roles. The raw text corpus results from the translation into Portuguese of the WSJ corpus of English.

This resource is available from the PORTULAN CLARIN infrastructure. You can find it here.


TreebankPT

Treebank of Portuguese, sentence-aligned with the Penn Treebank of English: corpus of Portuguese sentences annotated with trees of syntactic constituency. The raw text corpus results from the translation into Portuguese of the WSJ corpus of English.

This resource is available from the PORTULAN CLARIN infrastructure. You can find it here.


CINTIL-QATreeBank

Corpus of Portuguese interrogative and imperative sentences. This treebank includes declarative sentences from the pre-existing CINTIL-Treebank whose syntactic structure was manually transformed into their non-declarative counterparts: interrogative and imperative clauses.

This resource is available from the PORTULAN CLARIN infrastructure. You can find it here.


CINTIL-Definitions

Corpus of Portuguese definitions. Collection of annotated corpora (POS tags and morphological information) with an additional layer of annotation marking definitions.

This resource is available from the PORTULAN CLARIN infrastructure. You can find it here.


CINTIL-DeepBank

Bank of deep grammatical representations: corpus of Portuguese sentences annotated with their fully fledged grammatical representations, according to an HPSG grammar.

This resource is available from the PORTULAN CLARIN infrastructure. You can find it here.


CINTIL-LogicalFormBank

Bank of logical forms: corpus of Portuguese sentences annotated with logical forms representing their meaning.

This resource is available from the PORTULAN CLARIN infrastructure. You can find it here.


CINTIL-DependencyBank

Dependency bank of Portuguese: corpus of Portuguese sentences annotated with graphs representing grammatical dependencies, whose arcs are decorated with grammatical functions and semantic roles.

This resource is available from the PORTULAN CLARIN infrastructure. You can find it here.


CINTIL-DependencyBank PREMIUM

Dependency bank of Portuguese: corpus of Portuguese sentences annotated with graphs representing grammatical dependencies, whose arcs are decorated with grammatical functions and semantic roles.

This resource is available from the PORTULAN CLARIN infrastructure. You can find it here.


CINTIL-PropBank

PropBank of Portuguese: corpus of Portuguese sentences annotated with trees representing syntactic constituency, decorated with grammatical functions and semantic roles.

This resource is available from the PORTULAN CLARIN infrastructure. You can find it here.


CINTIL-Treebank

Treebank of Portuguese: corpus of Portuguese sentences annotated with trees of syntactic constituency.

This resource is available from the PORTULAN CLARIN infrastructure. You can find it here.


CINTIL-WordSenses

CINTIL extended by annotating word tokens with the identifiers of the concepts (synsets) they express, with these identifiers belonging to the MWNPT-International WordNet of Portuguese.

This resource is available from the PORTULAN CLARIN infrastructure. You can find it here.


CINTIL-NamedEntities

CINTIL extended by annotating named entities, which are manually disambiguated and annotated with links to the appropriate pages in the Portuguese DBpedia.

This resource is available from the PORTULAN CLARIN infrastructure. You can find it here.


CINTIL-Corpus Internacional do Português

High-quality, linguistically interpreted, accurately hand-tagged, 1-million-token corpus of Portuguese. Annotated with part-of-speech (POS) tags, inflection and named entities (NER). Developed and maintained in cooperation with CLUL-Centro de Linguística da Universidade de Lisboa.

This resource is available from the PORTULAN CLARIN infrastructure. You can find it here.


CINTIL Concordancer

Advanced, freely available online concordancer for the CINTIL corpus. Developed and maintained in cooperation with CLUL-Centro de Linguística da Universidade de Lisboa.


CINTIL TagSet

Exhaustive set of part-of-speech tags for Portuguese, including coverage of transcriptions of verbal productions. This is the tagset used in the annotation of the CINTIL corpus.

You can find it here.


CINTIL Annotation Manual

Companion manual for the CINTIL corpus, with explicit guidelines for annotation and interpretation.

You can find it here.


LX-VerbalInflections

Collection of the verb forms of Portuguese verbs, each associated with information on its inflection features.


LX-Abbreviations

Collection of Portuguese abbreviations of different types. Each type of abbreviation is manually divided and annotated with grammatical categories, gender and number, and, finally, with the respective full expression.

This resource is available from the PORTULAN CLARIN infrastructure. You can find it here.


LX-StopWords

List of Portuguese words comprising 2631 words of 51 types. The words are grouped into three broad classes (closed classes, open classes, and multi-word units), arranged according to their morpho-syntactic category and inflectional feature value.

This resource is available from the PORTULAN CLARIN infrastructure. You can find it here.


MWNPT-International WordNet of Portuguese

WordNet of Portuguese, developed in cooperation with the MultiWordNet project of FBK-Fondazione Bruno Kessler, in Trento, Italy.

This resource is available from the PORTULAN CLARIN infrastructure. You can find it here.


QTLeap WSD/NED Multilingual Parallel Corpora

QTLeap Multilingual Parallel Corpora extended with two layers of automatic annotation: named entities, automatically disambiguated and annotated with links to the appropriate pages in DBpedia, and word tokens, annotated with the identifiers of the concepts (synsets) they express, with these identifiers belonging to wordnets.

This resource is available from the PORTULAN CLARIN infrastructure. You can find it here.


QTLeap Multilingual Parallel Corpora

Collection of queries and the respective replies, as collected from a chat service supporting troubleshooting in the Information Technology domain, together with their translations into Portuguese, English, German, Spanish, Basque, Dutch, Bulgarian and Czech.

This resource is available from the PORTULAN CLARIN infrastructure. You can find it here.


Nexing Corpus

Corpus with the transcriptions of syllogistic reasoning protocols.

This resource is available from the PORTULAN CLARIN infrastructure. You can find it here.