Current List of LREC 2020 Shared LRs

LREC recognises the importance of sharing Language Resources (LRs) and making them available to the community. When submitting a paper, participants were offered the possibility to share their LRs (data, tools, web-services, etc.), uploading them in a special LREC repository set up by ELRA. This effort of sharing LRs, linked to the LRE Map initiative for their description, contributes to creating a common repository where everyone can deposit and share data.

After the conference, the Shared LRs set at LREC 2020 was manually checked and a cleaned version of the list of LRs is now available. The LRs in this list comply with the following criteria:

  • LRs accessible (whether downloadable directly or through an external URL)
  • LRs categorized as Datasets only, i.e. one of:
    • Corpus
    • Evaluation Data
    • Grammar/Language Model
    • Language Resources/Technologies Infrastructure
    • Lexical Conceptual Resource General Information
    • Lexicon
    • Ontology
    • Terminology
    • Treebank

Excluded LRs are:

  • Uploaded LRs with content that does not match the description
  • LRs with no download URL, or whose URL is now a dead link
  • LRs categorized as tools or guidelines
  • LRs associated with rejected papers

 

Shared-LRs @ LREC 2020
  • Name A Chinese Corpus for Fine-grained Entity Typing
    Resource type Corpus
    Size 650 MByte
    Languages Chinese, English
    Production status Newly created-finished
    Resource usage Named Entity Recognition
    License <Not Specified>
    Conditions of use Freely Available
    Description <Not Specified>
    Download from https://drive.google.com/file/d/1xorWUdTi9r43tTEdwJ4tKa9ErvRjossU/view
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.548.pdf
    Edition LREC 2020
  • Name A corpus of precise natural entailment problems
    Resource type Corpus
    Size 150 entailment problems
    Languages English
    Production status Newly created-finished
    Resource usage Textual Entailment and Paraphrasing
    License <Not Specified>
    Conditions of use Freely Available
    Description <Not Specified>
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/172_res_1.zip [31 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.844.pdf
    Edition LREC 2020
  • Name A New Dataset for Sindhi Named Entity Recognition
    Resource type Corpus
    Size 1358724 words
    Languages Sindhi
    Production status Newly created-finished
    Resource usage Named Entity Recognition
    License <Not Specified>
    Conditions of use Freely Available
    Description We present the first large named entity recognition dataset for the low-resource Sindhi language, with quality baselines. Our work consists of three contributions: first, the annotation of a gold-standard SiNER (Sindhi Named Entity Recognition) dataset; second, quality baselines using conditional random field (CRF) and recent state-of-the-art deep bi-directional long short-term memory (Bi-LSTM) neural models for named entity recognition; and third, a performance comparison of Bi-LSTM and Bi-LSTM-CRF models on GloVe and fastText word representations.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1444_res_1.zip [8.88 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.361.pdf
    Edition LREC 2020
  • Name A Seed Corpus of Hindu Temples In India
    Resource type Corpus
    Size 1.7 MByte
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Information Extraction, Information Retrieval
    License <Not Specified>
    Conditions of use Freely Available
    Description This JSON file consists of 4933 high-accuracy facts about 573 temples. The file format is as follows: {"head_line": {"questions": [] }"templ1":{"context": "text","answers": [ [],[]]
    Download from https://github.com/priyaradhakrishnan0/templeKB/blob/master/corpus/WebTempleCorpus.json
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.32.pdf
    Edition LREC 2020
  • Name AbuseEval
    Resource type Corpus
    Size 14100
    Languages English
    Production status Newly created-finished
    Resource usage Corpus Creation/Annotation
    License CC BY-NC-SA 4.0
    Conditions of use Freely Available
    Description Extension of the OLID/OffensEval data set with annotation of abusive language, distinguishing between explicit and implicit offensive messages.
    Download from https://github.com/tommasoc80/AbuseEval
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.760.pdf
    Edition LREC 2020
  • Name Amharic News Corpus
    Resource type Corpus
    Size 16295955 tokens
    Languages Amharic (amh)
    Production status Newly created-finished
    Resource usage Knowledge Discovery/Representation
    License <Not Specified>
    Conditions of use Freely Available
    Description This corpus of Amharic news websites was built by using our own crawler on four Amharic news websites from a variety of genres. As a pre-process, we removed non-Amharic words and letters and replaced all digits with the `#` character. After pre-processing, the corpus contained 16,295,955 tokens consisting of 855,109 unique tokens with average frequency of 19.05; this is similar in size to the English `text8` benchmark (17 million tokens, with an average frequency of 66). This is the corpus used for training embeddings in the publication: Mersha, Amanuel and Wu, Stephen. Morphology-rich Alphasyllabary Embeddings. Proceedings of the 12th Edition of the Language Resources and Evaluation Conference (LREC). Marseilles, France. 2020.
    Download from https://github.com/leobitz/amharic_word_embedding
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.315.pdf
    Edition LREC 2020
  • Name Amharic Treebank
    Resource type Treebank
    Size 1000 sentences
    Languages Amharic (amh)
    Production status <Not Specified>
    Resource usage <Not Specified>
    License <Not Specified>
    Conditions of use Freely Available
    Description The Amharic treebank (ATT) version 1 contains 1,074 manually-annotated sentences (5,245 tokens or 10,010 words). The sentences were collected from grammar books, biographies, news, and fictional and religious texts. The researchers made an effort to include different types of sentences.
    Download from https://github.com/Binyamephrem/Amharic-treebank
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.238.pdf
    Edition RAIL 2020
  • Name Anföranden
    Resource type Corpus
    Size 121987537 tokens
    Languages Swedish
    Production status Available
    Resource usage <Not Specified>
    License CC-BY 4.0
    Conditions of use Freely Available
    Description This resource contains all Swedish parliamentary debates from autumn 1993 to spring 2018. They have been cleaned and linguistically annotated through the corpus pipeline of Språkbanken Text, University of Gothenburg.
    Download from https://spraakbanken.gu.se/en/resources/rd-anf
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/workshops/ParlaCLARIN2/pdf/2020.parlaclarin-1.2.pdf
    Edition ParlaCLARIN2 2020
  • Name Annotated Corpora for Term Extraction Research (ACTER)
    Resource type Corpus
    Size 19002 terms and named entities
    Languages Dutch, English, French
    Production status Version 1.2 completed and validated
    Resource usage Training and evaluation data for automatic term extraction (ATE)
    License CC-BY-NC-SA 4.0
    Conditions of use Freely Available under appropriate license
    Description The Annotated Corpora for Term Extraction Research (ACTER), version 1.2, contain texts in four domains (corruption, dressage (horse riding), heart failure, and wind energy) and three languages (English, French, Dutch). For each corpus (combination of domain & language), around 50k tokens have been manually annotated to identify terminology and named entities (almost 600k annotated tokens in total). The results are presented as lists of annotations per corpus, with one (lowercased, unlemmatised, unique) annotation per line, tab-separated by its label. In total, there are 19k unique annotations. The annotation process is transparent and well-documented, with freely available guidelines (http://hdl.handle.net/1854/LU-8503113) and several published papers for the validation of the dataset. It has also been used for the TermEval 2020 shared task on automatic term extraction, organised at the CompuTerm workshop at LREC 2020.
    Download from https://clarin.eurac.edu/repository/xmlui/handle/20.500.12124/24
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/workshops/COMPUTERM2020/pdf/2020.computerm-1.12.pdf
    Edition COMPUTERM 2020
  • Name Anomalous Lexical Units
    Resource type Lexicon
    Size 300 entries
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Lexicon Creation/Annotation
    License <Not Specified>
    Conditions of use Freely Available
    Description The resource contains 300 WordNet synsets that cannot be characterized by any existing semantic frame in Berkeley FrameNet data release 1.7.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/611_res_1.tsv [20 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.431.pdf
    Edition LREC 2020
  • Name ArSpecEmb
    Resource type Corpus
    Size 3196k words
    Languages Arabic (ara)
    Production status Newly created-finished
    Resource usage Opinion Mining/Sentiment Analysis
    License OpenSource
    Conditions of use Freely Available
    Description ArSpecEmb regroups embeddings that take into account specificity of Arabic language. They are trained on different types of corpora.
    Download from https://lium.univ-lemans.fr/en/arsentimentanalysis/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.610.pdf
    Edition LREC 2020
  • Name Austrian named entities
    Resource type Lexical Conceptual Resource General Information
    Size 15457 named entities
    Languages German
    Production status In use
    Resource usage Machine Translation
    License CC-BY-SA 3.0
    Conditions of use Freely Available
    Description Named entities of Austrian political parties, ministries of the federal administration, names of the federal ministers, name of the president, names of state secretaries, daily newspapers, Austrian common last names, Austrian common first names (male, female), Austrian geographical names (localities, municipalities, states, political districts), names of Austrian stock companies. The dataset "Austrian (German) named entities" is a compilation by University of Vienna of several datasets. Common names and Geographical names data originate from Statistics Austria. Names of Austrian stock companies originate from Wikimedia. Public domain information -- Names of President, State Secretaries and Ministers and Newspaper names are compiled by University of Vienna.
    Download from https://www.elrc-share.eu/repository/browse/austrian-named-entities/b0998b12ab9611e8b7d400155d02670612bad73492934202887a45e227312e0e
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/workshops/LT4Gov/pdf/2020.lt4gov-1.5.pdf
    Edition LT4Gov 2020
  • Name BERT-Base, Multilingual
    Resource type Grammar/Language Model
    Size <Not Specified>
    Languages 104 languages
    Production status Existing-used
    Resource usage Textual Entailment and Paraphrasing
    License <Not Specified>
    Conditions of use Freely Available
    Description This is a release of 24 smaller BERT models (English only, uncased, trained with WordPiece masking) referenced in "Well-Read Students Learn Better: On the Importance of Pre-training Compact Models" (https://arxiv.org/abs/1908.08962).
    Download from https://github.com/google-research/bert
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.669.pdf
    Edition LREC 2020
  • Name BERT-based Abstractive Summarization
    Resource type Grammar/Language Model
    Size <Not Specified>
    Languages <Not Specified>
    Production status Newly created-finished
    Resource usage Summarisation
    License MIT License
    Conditions of use <Not Specified>
    Description Summarization Library.
    Download from https://github.com/axenov/BERT-Summ-OpenNMT
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.825.pdf
    Edition LREC 2020
  • Name Biblical Hebrew Graph Database
    Resource type Lexicon
    Size 10 MByte
    Languages Biblical Hebrew
    Production status In production
    Resource usage <Not Specified>
    License <Not Specified>
    Conditions of use <Not Specified>
    Description A thesaurus of Biblical Hebrew, with headwords linked by semantic, phonetic, and distributional similarities.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/16_res_1.tgz [734 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.660.pdf
    Edition LT4HALA 2020
  • Name Bilingual Corpus of English-Arabic Parallel Tweets
    Resource type Corpus
    Size 3 MByte
    Languages Arabic, English
    Production status Complete
    Resource usage For Research
    License Apache 2.0
    Conditions of use Publicly Available
    Description We provide tweet IDs of 166,706 pairs of English-Arabic parallel tweets. We share a list of 1389 Twitter accounts that post English-Arabic parallel tweets. We also share a subset of Twitter accounts annotated for their countries and the topics they most frequently tweet about.
    Download from http://bit.ly/2xApE8V
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/workshops/BUCC2020/pdf/2020.bucc-1.3.pdf
    Edition BUCC 2020
  • Name Bilingual troll tweets
    Resource type Corpus
    Size 2.3 MByte
    Languages English, Russian (rus)
    Production status Newly created-finished
    Resource usage Text Mining
    License <Not Specified>
    Conditions of use Freely Available
    Description A bilingual dataset with troll tweets. The dataset we built contains tweets from troll and legitimate accounts in two languages (including translation from one language to another) and can be used for training models for automatic detection of troll tweets/accounts in either English, Russian, or both languages.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1055_res_1.zip [2.32 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.766.pdf
    Edition LREC 2020
  • Name Broad-Coverage German Sentiment Classification for Dialog Systems
    Resource type Corpus
    Size 5355043 entries
    Languages German
    Production status Newly created and combined with existing
    Resource usage Document Classification, Text categorisation
    License MIT License
    Conditions of use Freely Available
    Description This paper describes the training of a general-purpose German sentiment classification model. Sentiment classification is an important aspect of general text analytics. Furthermore, it plays a vital role in dialogue systems and voice interfaces that depend on the ability of the system to pick up and understand emotional signals from user utterances. The presented study outlines how we have collected a new German sentiment corpus and then combined this corpus with existing resources to train a broad-coverage German sentiment model. The resulting data set contains 5.4 million labelled samples. We have used the data to train both a simple convolutional and a transformer-based classification model and compared the results achieved on various training configurations. The model and the data set will be published along with this paper.
    Download from https://github.com/oliverguhr/german-sentiment
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.202.pdf
    Edition LREC 2020
  • Name CA-EHN
    Resource type Corpus
    Size 90505 analogies
    Languages Chinese
    Production status Newly created-finished
    Resource usage Evaluation/Validation
    License CC-BY-NC-SA 4.0
    Conditions of use Freely Available
    Description CA-EHN is a commonsense word analogy dataset containing 90,505 analogies covering 5,656 words and 763 relations. Word analogy datasets are typically used to evaluate word embeddings.
    Download from https://github.com/ckiplab/CA-EHN
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.365.pdf
    Edition LREC 2020
  • Name Cairo Student Code-Switch Corpus
    Resource type Corpus
    Size 505 KByte
    Languages Egyptian Arabic (arz), English (eng)
    Production status Newly created-finished
    Resource usage Morphological Analysis
    License OpenSource
    Conditions of use From Owner
    Description A corpus of Egyptian Arabic code-switch speech data that is fully tokenized, lemmatized and annotated for part-of-speech tags.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1030_res_1.txt [517 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.489.pdf
    Edition LREC 2020
  • Name CantoMap
    Resource type Corpus
    Size 768 minutes
    Languages Yue Chinese (yue)
    Production status Newly created-finished
    Resource usage Corpus Creation/Annotation
    License GNU GPL v3.0
    Conditions of use Freely Available
    Description The Cantonese MapTask corpus is a collection of recordings of the MapTask task in contemporary Hong Kong Cantonese. The corpus contains the following: (1) recordings of the MapTask sessions, in the folder "ConversationData", where each sub-folder contains audio files and corresponding ELAN files with the transcription of the audio in Chinese characters (both non-segmented and manually segmented) and Jyutping romanization; (2) recordings of the reading task, in which participants read words aloud from a list, in the folder "ReadingData"; (3) the maps used for the task, in the folder "Maps". In total, the corpus offers 12 hours and 48 minutes of recordings.
    Download from https://github.com/gwinterstein/CantoMap
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.355.pdf
    Edition LREC 2020
  • Name CEASE
    Resource type Corpus
    Size 66.3 KByte
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Emotion Recognition/Generation
    License <Not Specified>
    Conditions of use From Owner
    Description The dataset is in the 'data' folder, which contains 15 '.txt' files, one for each emotion class: Abuse, Anger, Blame, Fear, Forgiveness, Guilt, Hopefulness, Hopelessness, Happiness_Peacefulness, Information, Instruction, Love, Pride, Sorrow, Thankfulness. Each file contains pre-processed sentences from real-life suicide notes. The full corpus contains 2393 annotated instances from 15 emotion classes. Note: To anonymize real-life references to persons, addresses and organizations in this dataset, the authors have used three tags: NAME in place of any person's name, ADDRESS in place of any address of a person or organization, and ORGANIZATION in place of any name of an industry, organization, firm, etc. The authors have made an effort to anonymize all real-life references in the dataset; if any reference was missed, users are advised to anonymize it with the three tags above before using the dataset in their work.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/907_res_1.zip [7 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.201.pdf
    Edition LREC 2020
  • Name CETE codes and datasets
    Resource type Evaluation Data
    Size 68.7 MByte
    Languages English
    Production status Existing-used
    Resource usage Question Answering
    License <Not Specified>
    Conditions of use Freely Available
    Description We share the source code of our models as well as the pre-processed datasets (TREC-QA, WikiQA, SemEval2015-CQA, SemEval2016-CQA, SemEval2017-CQA). Our fine-tuning model is implemented based on this repository: https://github.com/huggingface/transformers and the feature-based model is implemented based on this repository: https://github.com/AnubhavGupta3377/Text-Classification-Models-Pytorch. The datasets are already publicly available. However, as publicly shared question answering datasets are often available only in unstructured text format, the pre-processed versions are shared so that they can be easily adapted for future research and reproduction of results. Each instance in the dataset is separated by " ".
    Download from https://drive.google.com/file/d/1iCm515ZhEN-UmdOR_ievK3EA8iuXacSz/view?usp=sharing
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.676.pdf
    Edition LREC 2020
  • Name Chinese Humor Corpus
    Resource type Corpus
    Size 12.7 MByte
    Languages Chinese (zho)
    Production status Newly created-finished
    Resource usage Document Classification, Text categorisation
    License MIT License
    Conditions of use Freely Available
    Description This Chinese humor corpus contains 3,365 jokes collected from over 40 sources. Each joke was labeled with five levels of funniness, eight skill sets of humor, and six dimensions of intent.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/632_res_1.zip [5.23 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.168.pdf
    Edition LREC 2020
  • Name Chinese PragBank
    Resource type Corpus
    Size 167 KByte
    Languages Mandarin Chinese (cmn)
    Production status Newly created-finished
    Resource usage Corpus Creation/Annotation
    License OpenSource
    Conditions of use From Owner
    Description This corpus collects Chinese readers' veridicality judgments of news events, namely, whether readers believe an event is likely to happen or not. A total of 968 event sentences are included and each sentence is annotated with 6 veridicality judgments. There are 4 columns in the corpus: news sentences, normalized sentences, target events, and veridicality judgments.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/869_res_1.txt [171 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.33.pdf
    Edition LREC 2020
  • Name Cifu
    Resource type Lexicon
    Size 51798 entries
    Languages Yue Chinese (yue)
    Production status Newly created-finished
    Resource usage Psycholinguistics
    License GNU GPL v3.0
    Conditions of use Freely Available
    Description This resource offers a lexical database for Hong Kong Cantonese. Each entry in the lexicon includes the following information: - the term in Chinese characters - a Jyutping romanization of the term - the frequency of the term (occurrence counts and per million words) in four different genres: Written, Spoken adult, Spoken child and Child directed speech - a count of the strokes in each character in the term (as a measure of the character complexity) - 6 measures of the Neighborhood Density of each entry (distinguishing whether insertion/deletion are allowed when searching for neighbors and the genre from which frequencies are retrieved to calculate the Neighborhood Density: Written, Spoken Adult or their average).
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/508_res_1.txt [7.86 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.375.pdf
    Edition LREC 2020
  • Name Ciron: a New Benchmark Dataset for Chinese Irony Detection
    Resource type Corpus
    Size 1 MByte
    Languages Chinese (zho)
    Production status Newly created-finished
    Resource usage Document Classification, Text categorisation
    License <Not Specified>
    Conditions of use Freely Available
    Description Ciron includes more than 8.7k posts collected from Weibo, a microblogging platform. Ciron is one scale larger in size than the previously available dataset. Ciron is collected with no pre-conditions to ensure much wider coverage. It uses five fine-grained classes for ironic ratings: 1 (not ironic), 2 (unlikely ironic), 3 (insufficient evidence), 4 (weakly ironic), 5 (strongly ironic). Classes 2 and 4 are introduced to allow for fine-grained judgements. The inter-rater agreement (Fleiss' kappa) is 0.470. Evaluation of four different deep learning classifiers shows the usefulness of Ciron as a resource for Chinese irony detection.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/105_res_1.zip [1.05 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.701.pdf
    Edition LREC 2020
  • Name cldf-datasets/normansinitic: Structural and lexical data for the paper by Norman (2013) on Chinese dialect classification
    Resource type Lexicon
    Size 400 lexemes
    Languages Chinese (zho), Chinese dialects
    Production status Newly created-finished
    Resource usage Lexicon Creation/Annotation
    License CC-BY 4.0
    Conditions of use Freely Available
    Description See documentation (and main text).
    Download from https://doi.org/10.5281/zenodo.3552559
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.864.pdf
    Edition LREC 2020
  • Name Collaborative Dialogue in Minecraft
    Resource type Corpus
    Size 32 GByte
    Languages English (eng)
    Production status Existing-updated
    Resource usage Knowledge Discovery/Representation
    License CC-BY-SA 4.0
    Conditions of use Freely Available
    Description Collaborative Dialogue in Minecraft: the annotations were added for the corpus sentences.
    Download from https://drive.google.com/drive/folders/16lDzswcQh8DR2jkQJdoVTK-RyVDFPHKa
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.717.pdf
    Edition LREC 2020
  • Name ConceptNet 5
    Resource type Ontology
    Size 34 million concepts
    Languages English, French, German (deu), Italian, Spanish, Russian, Portuguese, Japanese, Dutch, Chinese and many others
    Production status Existing-updated
    Resource usage Semantic Web
    License CC-BY-SA 4.0
    Conditions of use Freely Available
    Description ConceptNet is a freely-available semantic network, designed to help computers understand the meanings of words that people use. ConceptNet originated from the crowdsourcing project Open Mind Common Sense, which was launched in 1999 at the MIT Media Lab. It has since grown to include knowledge from other crowdsourced resources, expert-created resources, and games with a purpose.
    Download from http://conceptnet.io/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.38.pdf
    Edition LREC 2020
  • Name Corpora of Portuguese Humour
    Resource type Corpus
    Size <Not Specified>
    Languages Portuguese
    Production status Newly created-finished
    Resource usage Evaluation/Validation
    License <Not Specified>
    Conditions of use Freely Available
    Description Corpora balanced between two styles of humour (one-liners, headlines) and four styles of non-humour (general-knowledge questions, why questions, news headlines, proverbs), in Portuguese, useful for text classification / humour recognition.
    Download from https://github.com/andreclemencio/Recognizing-Humor-in-Portuguese/tree/master/Datasets/Balanceados
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.160.pdf
    Edition LREC 2020
  • Name Corpus of Chinese Dynastic Histories
    Resource type Corpus
    Size 23000000 tokens
    Languages Literary Chinese (lzh)
    Production status Newly created-finished
    Resource usage Corpus Creation/Annotation
    License CC-BY 4.0
    Conditions of use Freely Available
    Description Newly created, on the basis of Wikisource, Corpus of Chinese Dynastic Histories currently includes twenty-four histories.
    Download from https://osf.io/tp729/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.98.pdf
    Edition LREC 2020
  • Name Corpus of Polish Spatial Texts 2.0 (PST 2.0)
    Resource type Corpus
    Size 61315 tokens
    Languages Polish (pol)
    Production status Existing-updated
    Resource usage Information Extraction, Information Retrieval
    License CC-BY-SA 4.0
    Conditions of use Freely Available
    Description The extended version of the Polish Spatial Texts corpus. Texts derived from Polish travel blogs, manually annotated with spatial expressions. A spatial expression is a text fragment which describes the relative location of two or more physical objects to each other.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/934_res_1.zip [973 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.265.pdf
    Edition LREC 2020
  • Name Corpus REDEWIEDERGABE (Core corpus)
    Resource type Corpus
    Size 489608 tokens
    Languages German (deu)
    Production status Newly created-finished
    Resource usage Machine Learning
    License CC BY-NC-SA 4.0
    Conditions of use Freely Available
    Description Corpus REDEWIEDERGABE is a German-language historical corpus with detailed annotations for speech, thought and writing representation (ST&WR). It consists of narrative texts as well as newspaper and magazine texts published between 1840 and 1920 and has approximately 490,000 tokens. The textual material is balanced over decades and between fictional and non-fictional texts. The annotation system is rooted in narratological theory and specifies the type of ST&WR (direct, free indirect, indirect, reported) and the medium (speech, thought, writing) as well as several more specialized attributes. In addition, the framing phrase, the key word that indicates the speech, thought or writing act, and the source (speaker/author) of the ST&WR are annotated. The annotation is a consensus annotation, based on two independent primary annotations. The corpus is available in a TEI-compliant XML format, UIMA XMI format, and a column-based plain text format (UTF8, CONLL style).
    Download from https://github.com/redewiedergabe/corpus/tree/master/data/main
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.100.pdf
    Edition LREC 2020
  • Name Costra 1.0
    Resource type Corpus
    Size 4262 sentences
    Languages Czech (ces)
    Production status Newly created-finished
    Resource usage Evaluation/Validation
    License CC-BY 4.0
    Conditions of use Freely Available
    Description COSTRA 1.0, a dataset of complex sentence transformations, is intended for the study of sentence-level embeddings beyond simple word alternations or standard paraphrasing. This first version of the dataset is limited to sentences in Czech.
    Download from http://hdl.handle.net/11234/1-3123
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.434.pdf
    Edition LREC 2020
  • Name DaNewsroom
    Resource type Corpus
    Size 1.1M entries
    Languages Danish (dan)
    Production status Newly created-finished
    Resource usage Summarisation
    License OpenSource
    Conditions of use Freely Available
    Description Large-scale automatic summarisation dataset, comprised of articles and their summaries.
    Download from https://github.com/danielvarab/da-newsroom
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.831.pdf
    Edition LREC 2020
  • Name DEbateNet-mig15
    Resource type Corpus
    Size 556185 tokens
    Languages German
    Production status Existing-updated
    Resource usage Political Science
    License CLARIN ACA+BY+NORED (EULA)
    Conditions of use Freely Available
    Description Corpus of manually annotated articles for political science research. Source: die Tageszeitung (reference German newspaper). Year: 2015. Multi-level annotation: claim identification and categorization into 97 categories (annotation schema available at https://github.com/mardy-spp/mardy_acl2019/blob/master/codebook.pdf); actor identification; claim attribution (linking of claims to actors); polarity; claim date. Additionally, an R package is released which can be used to explore the dataset and replicate the analysis of it.
    Download from http://hdl.handle.net/11022/1007-0000-0007-DB07-B
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.115.pdf
    Edition LREC 2020
  • Name Developing an Arabic Infectious Disease Ontology to Include Non-Standard Terminology
    Resource type Ontology
    Size 247 entries
    Languages Arabic (ara)
    Production status Newly created-finished
    Resource usage Text Mining
    License <Not Specified>
    Conditions of use Freely Available
    Description <Not Specified>
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/138_res_1.txt [241 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.596.pdf
    Edition LREC 2020
  • Name DIP-SumEval
    Resource type Corpus
    Size 49 topics (x 10 summaries and 11 dimensions)
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Summarisation
    License CC-BY 4.0
    Conditions of use Freely Available, Attribution Required
    Description This repository contains the summaries and evaluations from the paper 'A Dataset for the Analysis of Text Quality Dimensions in Summarization Evaluation' presented at LREC 2020. In total, the data set contains over 400 automatically generated summaries for 49 topics of a data set for multi-document summarization, 1274 judgements according to 11 text and summary quality criteria on a Likert scale (1 to 5) performed by 26 trained annotators, and 43218 pairwise judgements according to 6 criteria performed by 64 crowd-workers. If you want to use the data, please cite the paper http://www.lrec-conf.org/proceedings/lrec2020/bib/2020.lrec-1.826.bib.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/915_res_1.zip [531 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.826.pdf
    Edition LREC 2020
  • Name DiscSense
    Resource type Lexicon
    Size <Not Specified>
    Languages English
    Production status Newly created-finished
    Resource usage Discourse
    License <Not Specified>
    Conditions of use Freely Available
    Description DiscSense was built automatically: we finetuned a BERT model for discourse marker prediction between sentences (using the Discovery dataset) and used the marker prediction between sentences of existing datasets.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/33_res_1.tsv [289 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.125.pdf
    Edition LREC 2020
  • Name Domain-Specific Dataset of Difficulty Ratings for German Noun Compounds
    Resource type Terminology
    Size 1030 lexemes
    Languages German (deu)
    Production status Newly created-finished
    Resource usage Difficulty of domain-specific German closed compounds
    License CC-BY-SA 4.0
    Conditions of use Freely Available
    Description Dataset with difficulty ratings for 1,030 German closed noun compounds extracted from domain-specific texts for do-it-yourself (DIY), cooking and automotive. It includes two-part compounds for cooking and DIY, and two- to four-part compounds for automotive.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/137_res_1.zip [13 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.537.pdf
    Edition LREC 2020
  • Name dutch-archaeo-NER-dataset
    Resource type Corpus
    Size 8.4 MByte
    Languages Dutch (nld)
    Production status Newly created-finished
    Resource usage Named Entity Recognition
    License GNU GPL v3.0
    Conditions of use Freely Available
    Description A manually annotated NER dataset, consisting of Dutch archaeological excavation reports. The following entity types are labelled: artefacts, time periods, materials, places (geographical locations), archaeological contexts and species. The dataset is provided in the BIO format, with one token per line and empty lines denoting sentence boundaries. Each line contains the token, PoS tag, morphological segmentation and finally the label, separated by spaces. The PoS tag and morphological segmentation are assigned by Frog. This work was funded by a grant provided by the Leiden University Centre For Digital Humanities.
    Download from https://doi.org/10.5281/zenodo.3544543
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.562.pdf
    Edition LREC 2020
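The column layout described for dutch-archaeo-NER-dataset above (token, PoS tag, morphological segmentation and label separated by spaces, with blank lines between sentences) can be read with a few lines of Python. This is only a minimal sketch: the `Token` type, the `read_bio` helper and the sample lines are hypothetical illustrations, the tag and label strings are invented placeholders rather than values taken from the dataset, and the code assumes that no column contains an internal space.

```python
from typing import Iterable, List, NamedTuple


class Token(NamedTuple):
    form: str
    pos: str
    morph: str
    label: str


def read_bio(lines: Iterable[str]) -> List[List[Token]]:
    """Group token lines into sentences; a blank line closes a sentence."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # Assumption: exactly four space-separated columns per token line.
        if len(columns) != 4:
            raise ValueError(f"expected 4 columns, got {len(columns)}: {line!r}")
        current.append(Token(*columns))
    if current:
        sentences.append(current)
    return sentences


# Invented sample in the described layout (placeholder tags and labels).
sample = [
    "bronzen ADJ [bronzen] B-MAT",
    "bijl N [bijl] B-ART",
    "",
    "Leiden N [Leiden] B-LOC",
]
sentences = read_bio(sample)
```

Raising an error on malformed lines, rather than skipping them, makes it obvious when the four-column assumption does not hold for the actual files.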
  • Name ElkarHizketak v1.0
    Resource type Corpus
    Size 1634 questions
    Languages Basque (eus)
    Production status Newly created-finished
    Resource usage Dialogue
    License CC-BY-SA 4.0
    Conditions of use Freely Available
    Description ElkarHizketak is a Conversational Question Answering (CQA) dataset in Basque created by Basque speaker volunteers. The dataset contains close to 400 dialogues and more than 1600 questions and answers, and its small size presents a realistic low-resource scenario for CQA systems. The dataset is built on top of Wikipedia sections about popular people and organizations. The dialogues involve two crowd workers: (1) a student asks questions after reading a small introduction about the person, but without seeing the section text; and (2) a teacher answers the questions selecting a span of text of the section.
    Download from http://ixa2.si.ehu.es/convai/elkarhizketak-v1.0.zip
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.55.pdf
    Edition LREC 2020
  • Name English WordNet Random Walk Pseudo-Corpora
    Resource type Corpus
    Size 81 corpora
    Languages English
    Production status Newly created-finished
    Resource usage Language Modelling
    License CC BY-NC-SA 4.0
    Conditions of use Freely Available
    Description The archive contains a collection of 81 pseudo-corpora (9.9GB). These are text files that contain pseudo-sentences generated artificially from a random walk over the English WordNet taxonomy. The random walk algorithm produces a pseudo-sentence from WordNet by randomly picking a node (SynSet) in WordNet, randomly choosing a word in the SynSet, and then randomly picking a connected node and repeating the process. At every step there is a 15% chance for the random walk to stop; it also stops if it has no more connected nodes to take. Once the walk stops, a sentence is generated, and the same process repeats for each new sentence. Each line in the generated file represents one pseudo-sentence, where words are delimited by spaces. Example sentences: "measure musical notation tonality minor mode"; "Dutch-processed cocoa powder chocolate milk". The corpus files differ in size, as well as in some of the parameters that were used to generate them. The parameters are: size (number of sentences/lines in the corpus); direction (the direction that the random walk over WordNet was allowed to go while generating sentences: up, down or both); minimal sentence length (the shortest sentence length, in number of words).
    Download from https://doi.org/10.21427/he55-6481
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.602.pdf
    Edition LREC 2020
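The random walk described for the English WordNet Random Walk Pseudo-Corpora above can be sketched as follows. This is an illustration under stated assumptions, not the authors' generator: it walks a tiny hand-made graph (`TOY_GRAPH`, hypothetical) instead of WordNet, ignores the direction parameter, and enforces the minimal sentence length by discarding walks that end too early, which is a guess at how that parameter might be applied.

```python
import random
from typing import Dict, List, Tuple

# Toy stand-in for a taxonomy: synset id -> (member words, connected synset ids).
TOY_GRAPH: Dict[str, Tuple[List[str], List[str]]] = {
    "measure.n": (["measure"], ["notation.n"]),
    "notation.n": (["musical_notation", "notation"], ["tonality.n"]),
    "tonality.n": (["tonality", "key"], ["minor_mode.n"]),
    "minor_mode.n": (["minor_mode"], []),
}


def random_walk_sentence(graph, rng, stop_prob=0.15, min_len=1):
    """Generate one pseudo-sentence by a random walk over the graph."""
    while True:
        node = rng.choice(sorted(graph))        # random starting synset
        words = []
        while True:
            members, neighbours = graph[node]
            words.append(rng.choice(members))   # random word of the synset
            # Stop with probability stop_prob, or when there is nowhere to go.
            if not neighbours or rng.random() < stop_prob:
                break
            node = rng.choice(neighbours)       # move to a connected synset
        # Assumption: walks shorter than min_len are discarded and retried.
        if len(words) >= min_len:
            return " ".join(words)


rng = random.Random(0)  # fixed seed so that runs are repeatable
corpus = [random_walk_sentence(TOY_GRAPH, rng, min_len=2) for _ in range(5)]
```

Each element of `corpus` corresponds to one line of a pseudo-corpus file, with words delimited by spaces. Note that `min_len` must not exceed the longest walk the graph allows, otherwise the retry loop never terminates.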
  • Name English-Hiligaynon
    Resource type Lexicon
    Size 1200 entries
    Languages English (eng), Hiligaynon
    Production status Newly created-finished
    Resource usage Machine Translation, SpeechToSpeech Translation
    License <Not Specified>
    Conditions of use From Owner
    Description Word-word translation of 1,200 English words into Hiligaynon.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/163_res_1.txt [20 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.313.pdf
    Edition LREC 2020
  • Name Evaluation Benchmark for Domain Taxonomies from Knowledge Graphs (EBDT-KG)
    Resource type Evaluation Data
    Size 155 MByte
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Knowledge Discovery/Representation
    License CC-BY 4.0
    Conditions of use Freely Available
    Description An application-specific gold standard dataset extracted from the Wikipedia knowledge graph and an evaluation framework to comparatively assess the quality of noisy automatically extracted taxonomies.
    Download from https://sites.google.com/unitelmasapienza.it/wikipediataxonomies/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.285.pdf
    Edition LREC 2020
  • Name Event it Pronouns
    Resource type Corpus
    Size 8 MByte
    Languages English (eng), Estonian (ekk), Finnish (fin), German (deu), Spanish (spa), French (fra), Hungarian (hun), Italian (ita), Latvian (lav), Dutch (nld), Polish (pol), Portuguese (por), Romanian (ron), Slovak (slk), Slovenian (slv), Swedish (swe)
    Production status Newly created-finished
    Resource usage Anaphora, Coreference
    License OpenSource
    Conditions of use Freely Available
    Description This is a multilingual corpus of English 'it' pronouns annotated with their translations in the Europarl languages. The English 'it' pronouns have also been automatically labeled as 'event'.
    Download from https://github.com/sharidloaiciga/multilingual-event-pronouns
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.12.pdf
    Edition LREC 2020
  • Name Fake News Filipino Dataset
    Resource type Corpus
    Size 3206 entries
    Languages English (eng), Filipino (fil)
    Production status Newly created-finished
    Resource usage Document Classification, Text categorisation
    License GNU GPL v3.0
    Conditions of use Freely Available
    Description The first curated fake news detection dataset in Filipino. The dataset has 3206 samples with a perfect 50/50 split between real and fake news articles. Fake news articles are from sites tagged as fake by VeraFiles (an independent news organization) and the National Union of Journalists in the Philippines (NUJP). Real news articles were sourced from mainstream news sites in the Philippines. Data collection was performed across five months, as content and websites that spread fake news are usually taken down or go offline quickly, which makes text-based fake news data in Filipino scarce.
    Download from https://github.com/jcblaisecruz02/Tagalog-fake-news
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.316.pdf
    Edition LREC 2020
  • Name Fake.Br Corpus
    Resource type Corpus
    Size 22.4 MByte
    Languages Portuguese (por)
    Production status Existing-used
    Resource usage Evaluation/Validation
    License <Not Specified>
    Conditions of use Freely Available, Attribution Required
    Description The Fake.Br Corpus is composed of true and fake news items that were manually aligned, focusing only on Brazilian Portuguese. There are 7,200 news items: exactly 3,600 true and 3,600 fake. All of them are in plain text format, with each one in a different file.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/939_res_1.zip [23.45 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.176.pdf
    Edition LREC 2020
  • Name Fakeddit
    Resource type Corpus
    Size 1063106 entries
    Languages English
    Production status Newly created-finished
    Resource usage Evaluation/Validation
    License <Not Specified>
    Conditions of use Freely Available
    Description We present Fakeddit, a novel multimodal dataset consisting of over 1 million samples from multiple categories of fake news. After being processed through several stages of review, the samples are labeled according to 2-way, 3-way, and 6-way classification categories through distant supervision.
    Download from https://github.com/entitize/fakeddit
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.755.pdf
    Edition LREC 2020
  • Name FalseFriendsLexicon
    Resource type Lexicon
    Size 900 KByte
    Languages French (fra), Italian (ita), Portuguese (por), Romanian (ron), Spanish (spa), English (eng)
    Production status Newly created-finished
    Resource usage Lexicon Creation/Annotation
    License CC-BY 4.0
    Conditions of use Freely Available
    Description This lexicon of false-friend pairs in 6 languages includes, for each language pair, automatically detected pairs of false friends along with a falseness score.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/180_res_1.zip [628 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.367.pdf
    Edition LREC 2020
  • Name FandomCorpora
    Resource type Corpus
    Size 1627 Summaries
    Languages English (eng), German (deu)
    Production status Newly created-finished
    Resource usage Summarisation
    License MIT License
    Conditions of use Freely Available
    Description Three sample corpora created with the FandomCorpus Construction Framework.
    Download from https://datamanagementlab.github.io/fandomCorpus/data.html
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.827.pdf
    Edition LREC 2020
  • Name FAPESP-PBSMT-NMT
    Resource type Corpus
    Size 310 KByte
    Languages Brazilian Portuguese, English (eng)
    Production status Newly created-finished
    Resource usage Machine Learning
    License CC BY-NC-SA 4.0
    Conditions of use Freely Available
    Description This resource is a collection of 300 parallel sentences written in Brazilian Portuguese and English annotated with machine translation errors. There are two sets of 300 sentences: one output by a Phrase-Based Statistical Machine Translation system and another one output by a Neural Machine Translation system. Details about the annotation process can be found in the paper (Caseli; Inácio, LREC 2020).
    Download from https://github.com/LALIC-UFSCar/FAPESP-PBSMT-NMT
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.446.pdf
    Edition LREC 2020
  • Name FastText.cc
    Resource type Grammar/Language Model
    Size <Not Specified>
    Languages Portuguese (por)
    Production status Existing-used
    Resource usage Textual Entailment and Paraphrasing
    License <Not Specified>
    Conditions of use Freely Available
    Description FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to even fit on mobile devices.
    Download from https://fasttext.cc/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.669.pdf
    Edition LREC 2020
  • Name FEVER Dev partition delexicalized
    Resource type Corpus
    Size 10 MByte
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Textual Entailment and Paraphrasing
    License <Not Specified>
    Conditions of use Freely Available
    Description <Not Specified>
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1319_res_5.txt [9.84 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.850.pdf
    Edition LREC 2020
  • Name Fin-SoMe
    Resource type Corpus
    Size 10000 sentences
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Corpus Creation/Annotation
    License CC BY-NC-SA 4.0
    Conditions of use From Owner
    Description Both front-desk and middle-desk experts were hired from a bank treasury. The front-desk expert, working in the treasury marketing unit, was assumed to be a risk-neutral individual (RN), and the middle-desk expert from the risk management department was assumed to be a risk-averse individual (RA). The annotators took five months to complete the labeling of 10,000 tweets. Each tweet is examined from four angles: (1) The market sentiment (bullish/bearish) of the tweet; (2) The presence or absence of reasoning in the tweet supporting the investor's analysis; (3) The writer's sentiment (positive/negative); (4) The gain/loss of the writer's trade. Because writers do not always show market sentiment, writer sentiment, or gain/loss in their tweets, annotators assigned a "None" label to tweets without the related narratives.
    Download from http://nlg.csie.ntu.edu.tw/nlpresource/FinSoMe
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.749.pdf
    Edition LREC 2020
  • Name FlauBERT
    Resource type Grammar/Language Model
    Size 552.6 MByte
    Languages French (fra)
    Production status Newly created-finished
    Resource usage Language Modelling
    License OpenSource
    Conditions of use Freely Available
    Description Pre-trained language model for French.
    Download from https://github.com/getalp/Flaubert
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.302.pdf
    Edition LREC 2020
  • Name FloDusTA
    Resource type Corpus
    Size 9000 tweets
    Languages Arabic (ara)
    Production status Newly created-finished
    Resource usage Document Classification, Text categorisation
    License <Not Specified>
    Conditions of use Freely Available
    Description Flood, Dust Storm, Traffic Accident Saudi Event dataset
    Download from https://github.com/BatoolHamawi/FloDusTA
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.174.pdf
    Edition LREC 2020
  • Name fnc
    Resource type Corpus
    Size 42 MByte
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Textual Entailment and Paraphrasing
    License <Not Specified>
    Conditions of use Freely Available
    Description A comprehensive collection of all the urls is also presented in the paper referenced herewith.
    Download from https://osf.io/szdkn/?view_only=4845641a80624ac493ca14df34e68e8c
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.850.pdf
    Edition LREC 2020
  • Name FNC delexicalized training partition
    Resource type Corpus
    Size 95 MByte
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Textual Entailment and Paraphrasing
    License <Not Specified>
    Conditions of use Freely Available
    Description <Not Specified>
    Download from https://osf.io/szdkn/?view_only=4845641a80624ac493ca14df34e68e8c
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.850.pdf
    Edition LREC 2020
  • Name FNC dev partition delexicalized
    Resource type Corpus
    Size 55 MByte
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Textual Entailment and Paraphrasing
    License <Not Specified>
    Conditions of use Freely Available
    Description <Not Specified>
    Download from https://osf.io/szdkn/?view_only=4845641a80624ac493ca14df34e68e8c
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.850.pdf
    Edition LREC 2020
  • Name ForwardQuestions
    Resource type Corpus
    Size 38480 questions
    Languages English
    Production status Complete
    Resource usage Question Generation
    License CC-BY 3.0 Unported
    Conditions of use <Not Specified>
    Description The ForwardQuestions data set is made of human-generated questions related to knowledge triples. This data set results from the conversion and merger of the existing SimpleDBPediaQA and SimpleQuestionsWikidata data sets, including the mapping of predicates from DBPedia to Wikidata, and the selection of 'forward' questions as opposed to 'backward' ones. The new data set can be used to generate novel questions given an unseen Wikidata triple.
    Download from https://github.com/johannamelly/ForwardQuestions
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/workshops/ISA16/pdf/2020.isa-1.7.pdf
    Edition ISA16 2020
  • Name GeBioCorpus_v2
    Resource type Corpus
    Size 2000 sentences
    Languages Catalan (cat), English (eng), Spanish (spa)
    Production status Newly created-finished
    Resource usage Machine Translation, SpeechToSpeech Translation
    License CC-BY-NC 4.0
    Conditions of use Freely Available
    Description Corpus of 2,000 sentences in English, Spanish and Catalan (gender-balanced), which has been post-edited by native speakers to become a high-quality dataset for machine translation evaluation.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/325_res_1.zip [389 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.502.pdf
    Edition LREC 2020
  • Name GeoWAC
    Resource type Corpus
    Size 45 billion words
    Languages ara, aze, bel, bul, cat, ces, dan, deu, ell, eng, est, fas, fin, fra, gle, hbs, hin, hun, ind, isl, ita, jpn, kat, kaz, kor, lav, lit, mkd, mon, nld, nor, pol, por, ron, rus, slk, slv, spa, sqi, swe, tam, tgl, tur, ukr, urd, uzb, vie, zho
    Production status Newly created-finished
    Resource usage Corpus Creation/Annotation
    License GNU GPL v2.0
    Conditions of use Freely Available
    Description Gigaword geo-referenced web corpora for 50 languages (varieties) that are explicitly balanced for geographic and demographic representation.
    Download from https://www.earthlings.io/corpus_download.html
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.308.pdf
    Edition LREC 2020
  • Name German-English parallel data by the Presidency of the Council of the EU held by Austria in 2006
    Resource type Corpus
    Size 4973 Translation Units
    Languages English, German
    Production status In use
    Resource usage Machine Translation
    License Open Under-PSI
    Conditions of use Freely Available
    Description News/media (press releases, interviews and Common Foreign and Security Policy statements) in German and English by the EU-Presidency, Austria, 2006. German-English parallel data containing news and statements (press releases, interviews and Common Foreign and Security Policy statements) unidirectional from German to English by the Presidency of the Council of the EU held by Austria in 2006. Aligned with HunAlign, a language-independent sentence aligner (Varga et al., 2005) and manually evaluated by two evaluators. Format: .tmx. Character encoding: UTF-8.
    Download from https://www.elrc-share.eu/repository/browse/german-english-parallel-data-by-the-presidency-of-the-council-of-the-eu-held-by-austria-in-2006/e38b283eac3e11e8b7d400155d0267062180d233a0fd4e84b8dffb9b25cc1775/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/workshops/LT4Gov/pdf/2020.lt4gov-1.5.pdf
    Edition LT4Gov 2020
  • Name GeWiki
    Resource type Corpus
    Size 240000 entries
    Languages German (deu)
    Production status Newly created-finished
    Resource usage Summarisation
    License OpenSource
    Conditions of use Freely Available
    Description A summarization corpus extracted from Wikipedia articles. The corpus is split into training (220,000 texts), evaluation (10,000 texts) and test (10,000 texts) sets. For each text (text_id.src) there is a corresponding reference summary (text_id.tgt).
    Download from https://github.com/domfr/GeWiki
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.821.pdf
    Edition LREC 2020
  • Name GitHub Typo Corpus
    Resource type Corpus
    Size 353055 entries
    Languages Chinese (zho), English (eng), Japanese (jpn), Others
    Production status Newly created-finished
    Resource usage Spelling Correction, Grammatical Error Correction
    License <Not Specified>
    Conditions of use Freely Available
    Description A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors.
    Download from https://github.com/mhagiwara/github-typo-corpus
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.835.pdf
    Edition LREC 2020
  • Name Glossary "Austrian Administration. German - English"
    Resource type Lexical Conceptual Resource General Information
    Size 673 entries
    Languages English, German (AT)
    Production status In use
    Resource usage Machine Translation
    License Use and non-commercial distribution permitted if resource is unchanged and source stated
    Conditions of use Freely Available
    Description The terminological resource entitled Fachglossar Österreichische Verwaltung. Deutsch - Englisch covers terminology in the public administration domain in German and English. It contains terminology from the areas of Austrian public law, legislation and executive authorities. The terminology is standardised by an informal working group of translators employed with the Austrian federal ministries.
    Download from https://www.sprachressourcen.at/verwaltungsglossar/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/workshops/LT4Gov/pdf/2020.lt4gov-1.5.pdf
    Edition LT4Gov 2020
  • Name Glossary terms in German related to Austria and the topics of the trio presidency
    Resource type Lexical Conceptual Resource General Information
    Size 71000 terms
    Languages German
    Production status <Not Specified>
    Resource usage Machine Translation
    License CC-BY-SA 3.0
    Conditions of use Freely Available
    Description German monolingual terminology and named entities. The dataset was created by the University of Vienna by crawling, extracting and compiling content from Wikipedia. Terms were crawled from de.wikipedia.org on the Austrian and EU-presidency related topics of the EU Council Presidency 2018: agricultural policy, digitalisation, development policy, European Union, Single Market, trade, climate policy, migration, sustainability, Austria, politics, security policy, environment, Western Balkans, economic policy.
    Download from https://www.elrc-share.eu/repository/browse/terms-in-german-related-to-austria-and-the-topics-of-the-trio-presidency/b82781c4ab9e11e8b7d400155d026706f61ef02809fb4748944b1af1b434f0a9/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/workshops/LT4Gov/pdf/2020.lt4gov-1.5.pdf
    Edition LT4Gov 2020
  • Name Gold Standard for Cognate Pairs in English-Dutch and French-Dutch
    Resource type Lexicon
    Size 51220 entries
    Languages Dutch, English, French
    Production status Newly created-finished
    Resource usage Evaluation/Validation
    License <Not Specified>
    Conditions of use From Owner
    Description This resource contains a context-independent gold standard for English-Dutch and French-Dutch cognate detection. To this end, automatic word alignment was applied on the Dutch Parallel Corpus, and all term equivalents with a Normalized Levenshtein distance smaller than 0.5 were extracted. This resulted in a list with 28,503 English-Dutch candidate cognate pairs, and 22,715 French-Dutch candidate cognate pairs, which were subsequently manually labeled according to the guidelines established in Labat et al. 2019. The following labels were annotated: (1) Cognate: words which have a similar form and meaning in all contexts, (2) Partial cognate: words which have a similar form, but only share the same meaning in some contexts, (3) False friend: words which have a similar form but a different meaning, (4) Proper name: proper nouns (e.g. persons, companies, cities, countries, etc.) and their derivations, (5) Error: word alignment errors and compound nouns of which one part is a cognate but the other part is missing in one of the languages, and (6) No standard: words that do not occur in the dictionary of that particular language.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/367_res_1.zip [1.65 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.504.pdf
    Edition LREC 2020
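The candidate filter mentioned in the cognate gold standard entry above (a Normalized Levenshtein distance smaller than 0.5) can be illustrated with a short sketch. The entry does not say how the distance was normalized; dividing by the length of the longer word is an assumption made here, and the function names and word pairs are hypothetical examples rather than items from the resource.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))


def is_candidate(a: str, b: str, threshold: float = 0.5) -> bool:
    """Keep a pair only if its normalized distance is below the threshold."""
    return normalized_levenshtein(a, b) < threshold


# Illustrative English-Dutch pairs (not taken from the resource).
pairs = [("night", "nacht"), ("dog", "hond")]
kept = [pair for pair in pairs if is_candidate(*pair)]
```

With this normalization, "night"/"nacht" scores 2/5 = 0.4 and passes the filter, while "dog"/"hond" scores 3/4 = 0.75 and is dropped. A different normalization (for example, by the sum of the lengths) would shift which pairs pass.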
  • Name Greek POS Tagged Tweets
    Resource type Corpus
    Size 1.9 MByte
    Languages Modern Greek (1453-) (ell)
    Production status Newly created-finished
    Resource usage Corpus Creation/Annotation
    License <Not Specified>
    Conditions of use From Owner
    Description <Not Specified>
    Download from https://hilab.di.ionio.gr/index.php/en/datasets/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.476.pdf
    Edition LREC 2020
  • Name Habibi Corpus
    Resource type Corpus
    Size 378646 MByte
    Languages Arabic (ara), Arabic Dialects, Egyptian, Maghrebi, Iraqi, Gulf, Levantine, Sudanese
    Production status Newly created-finished
    Resource usage Corpus Creation/Annotation
    License <Not Specified>
    Conditions of use Freely Available
    Description Habibi is the first freely available corpus of Arabic song lyrics. The corpus comprises more than 30,000 Arabic songs from 18 different Arab countries. Each song in the corpus comes with a song title along with the singer's full name, country of origin and dialect. The corpus also contains information about the song's writer and composer. Each song has a singer, song title, and lyrics. The corpus is made of 30,071 Arabic songs with their lyrics split into 527,870 sentences (song verses). The corpus contains information about 1,765 singers, 3,789 song writers and 2,463 composers. The singers span 18 Arab countries: Egypt, Saudi Arabia, Lebanon, Iraq, Sudan, Kuwait, Syria, UAE, Morocco, Tunisia, Yemen, Jordan, Algeria, Qatar, Bahrain, Oman, Palestine and Libya. The lyrics cover 6 dialects: Egyptian, Gulf, Levantine, Iraqi, Sudanese and Maghrebi (North African). The corpus in CSV file format is available through ELRA's Share Your LRs initiative (size 10,856 MB). The complete corpus in txt and csv format as well as Habibi's word embeddings are available on http://ucrel-web.lancaster.ac.uk/habibi/.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/450_res_1.zip [11.12 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.165.pdf
    Edition LREC 2020
  • Name Hallig-Wartburg Ontology
    Resource type Ontology
    Size 6.8 MByte
    Languages <Not Specified>
    Production status Newly created-finished
    Resource usage Language Modelling
    License CC0
    Conditions of use Freely Available
    Description This ontology is based on the so-called Hallig-Wartburg: Rudolf Hallig and Walther von Wartburg, Begriffssystem als Grundlage für die Lexikographie / Système raisonné des concepts pour servir de base à la lexicographie, Berlin (Akademie-Verlag), [1952] 1963. Hallig-Wartburg is a conceptual scheme used in historical linguistic resources.
    Download from https://lod.academy/site/vocabs/hallig-wartburg
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/workshops/LDL2020/pdf/2020.ldl-1.1.pdf
    Edition LDL 2020
  • Name HatEval
    Resource type Corpus
    Size 19600 entries
    Languages English, Spanish
    Production status Existing-used
    Resource usage Evaluation/Validation
    License <Not Specified>
    Conditions of use Freely Available
    Description Hate speech detection on Twitter, focused on two specific targets, immigrants and women, from a multilingual perspective covering Spanish and English.
    Download from https://competitions.codalab.org/competitions/19935
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.760.pdf
    Edition LREC 2020
  • Name Hindi Courteous Dataset
    Resource type Corpus
    Size 52.3 MByte
    Languages Hindi
    Production status Newly created-finished
    Resource usage Natural Language Generation
    License <Not Specified>
    Conditions of use From Owner
    Description The conversations between various customers and the trained customer care agents of different companies on their Twitter handles were used for building the dataset. The Twitter data for Hindi was mined from Twitter for different companies. The dataset comprises conversations with their ids followed by the utterance id. We have two sets of responses available in the dataset, i.e., the courteous response as well as the corresponding generic response.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/969_res_1.zip [13.08 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.514.pdf
    Edition LREC 2020
  • Name Hindi-Discourse-Modes
    Resource type Corpus
    Size 1 MByte
    Languages Hindi (hin)
    Production status Newly created-finished
    Resource usage Discourse
    License MIT License
    Conditions of use Freely Available
    Description A new corpus consisting of sentences from Hindi short stories annotated for five different discourse modes: argumentative, narrative, descriptive, dialogic and informative.
    Download from https://github.com/midas-research/hindi-discourse
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.149.pdf
    Edition LREC 2020
  • Name Hindi_Event
    Resource type Corpus
    Size 13.6 MByte
    Languages Hindi
    Production status Newly created-finished
    Resource usage Information Extraction, Information Retrieval
    License <Not Specified>
    Conditions of use From Owner
    Description This is a benchmark dataset in the Hindi language for Event Extraction. All the annotated documents are XML documents in which event triggers, argument triggers, event type information, argument role information and event-argument linking information are tagged. The dataset contains 1709 XML documents.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1292_res_1.zip [3.29 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.273.pdf
    Edition LREC 2020
  • Name HSE Corpus for Discovering Safety in Construction
    Resource type Corpus
    Size 600 sentences
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Named Entity Recognition
    License Open Government Licence 3.0 / CC-BY-SA 4.0 International Licence
    Conditions of use Freely Available
    Description The "HSE Corpus for Discovering Safety in Construction" is a semantically annotated corpus for safety in construction, consisting of 600 sentences from RIDDOR workplace accident reports. The corpus is annotated with six types of named entities relevant to the construction safety domain. The annotation of the corpus was split between four different annotators; 400 of the sentences have been annotated by at least two of the annotators. The average inter-annotator agreement rate is 0.79 F-Score.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/426_res_1.zip [1.05 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.245.pdf
    Edition LREC 2020
  • Name IGC-Parl
    Resource type Corpus
    Size 218889307 tokens
    Languages Icelandic
    Production status 1st version published
    Resource usage <Not Specified>
    License CC-BY 4.0
    Conditions of use <Not Specified>
    Description The 2019 version of the corpus includes speeches from 1911-2019. It comprises 404 thousand speeches and just under 219 million words. The corpus has been automatically part-of-speech tagged and lemmatised. It is annotated with extensive metadata about the speeches, speakers and political parties, including speech topic, whether the speaker is in the government coalition or opposition, age and gender of speaker at the time of delivery and references to sound and video recordings where available. The corpus is distributed in TEI-conformant format.
    Download from http://hdl.handle.net/20.500.12537/14
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/workshops/ParlaCLARIN2/pdf/2020.parlaclarin-1.3.pdf
    Edition ParlaCLARIN2 2020
  • Name IKAT Dataset (German)
    Resource type Corpus
    Size <Not Specified>
    Languages German (deu)
    Production status Existing-used
    Resource usage Corpus Creation/Annotation
    License <Not Specified>
    Conditions of use Freely Available, Attribution Required
    Description This dataset contains annotations about implied information in argumentative texts. The annotations are performed on sentence pairs from the Microtext corpus (the original German version, Peldszus/Stede 2015), that stand in an argumentative relation according to the argumentation graph. There are 464 such sentence pairs in the 112 texts in the corpus, i.e., approx. 4 pairs per microtext.
    Download from https://www.cl.uni-heidelberg.de/english/research/downloads/resource_pages/IKAT/IKAT_data.shtml
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.282.pdf
    Edition LREC 2020
  • Name Imperative Low-level Complete Image Edit Requests
    Resource type Corpus
    Size 2540 sentences
    Languages English
    Production status Newly created-finished
    Resource usage Corpus Creation/Annotation
    License <Not Specified>
    Conditions of use From Owner
    Description 2540 image edit requests with edit type adjust. Each request is annotated with (1) referring expression, (2) attribute and (3) value. (1) The referring expression refers to an object in an image (not accompanied). (2) The attribute belongs to one of five categories: (a) brightness, (b) contrast, (c) hue, (d) saturation, (e) lightness. (3) The value is an integer between -100 and 100. The dataset is randomly split into train/dev/test of 2056/243/241 respectively. There is also an "abstract" version where (2) attribute and (3) value are replaced with tokens <abstract> and <value>.
    Download from https://github.com/tzuhsial/ImageEditingWithDialogue/tree/master/imadial-nlu/data/ILLC-IER
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.51.pdf
    Edition LREC 2020
  • Name imSituVQA
    Resource type Corpus
    Size 52 MByte
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Question Answering
    License <Not Specified>
    Conditions of use Freely Available
    Description imSituVQA is a corpus for Visual Question Answering annotated with semantic frame information, based on the imSitu dataset. The images can be downloaded from http://imsitu.org/download/. Question-answer pairs were created from the image annotations of the imSitu dataset. Each question-answer pair is annotated with frame semantic information.
    Download from https://github.com/givenbysun/imSituVQA/blob/master/data/imSituVQA.json
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.678.pdf
    Edition LREC 2020
  • Name Indian-Language-Dataset
    Resource type Corpus
    Size 461 MByte
    Languages Bengali, Malayalam, Tamil, Telugu, Urdu (urd)
    Production status Existing-updated
    Resource usage Machine Translation, SpeechToSpeech Translation
    License <Not Specified>
    Conditions of use Freely Available
    Description We obtained the data from different resources such as EnTamV2.0 (Ramasamy et al., 2012), Opus (Tiedemann, 2012) and UMC005 (Jawaid and Zeman, 2011). The sentences come from the news, cinema, Bible and movie-subtitle domains. We combined and preprocessed the data of Tamil, Malayalam, Telugu, Bengali, and Urdu. After preprocessing (as described below) and cleaning, we split the dataset into train, test, and validation accordingly.
    Download from https://github.com/himanshudce/Indian-Language-Dataset
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.444.pdf
    Edition LREC 2020
  • Name INEL Dolgan Corpus 1.0
    Resource type Corpus
    Size <Not Specified>
    Languages Dolgan, English, Russian
    Production status Newly created-finished
    Resource usage Corpus Creation/Annotation
    License CC-BY-NC-SA 4.0
    Conditions of use Freely Available
    Description The INEL Dolgan corpus has been created within the long-term INEL project ("Grammatical Descriptions, Corpora and Language Technology for Indigenous Northern Eurasian Languages"), 2016-2033. The corpus makes possible typologically aware corpus-based grammatical research on the Dolgan language and expands the documentation of the lesser described indigenous languages of Northern Eurasia. Dolgan is an endangered Siberian (North-Eastern) Turkic language, which is spoken on the Taymyr peninsula and in adjacent areas in Northern Siberia. The number of speakers is approx. 1,000. Whereas in many settlements, including Dudinka, Dolgan is rarely spoken in everyday life, there are three remote villages (Syndassko, Popigay and Novorybnoe) where Dolgan is still very present and frequently spoken. There is no final agreement about a possible dialectal division of Dolgan. Mostly, two or three varieties of Dolgan are mentioned: Upper (Western) Dolgan, Lower (Eastern) Dolgan and Anabar Dolgan. Upper Dolgan is spoken in the settlements of Volochanka, Ust`-Avam, Katyryk, Xeta, Novaya, Kresty. Lower Dolgan is spoken in the villages Zhdanixa, Novorybnoe, Syndassko and Popigay and seems to be the more vital variety of Dolgan. The settlement of Xatanga is said to form the border between the two dialects. Anabar Dolgan is spoken in the Anabar ulus of the Sakha republic (Yakutia) and is possibly best regarded as a transitional variety between Dolgan and Yakut. Formerly, the territory around Dudinka and Norilsk belonged to the Upper Dolgan area. Nowadays, however, the traditional living places of Dolgans do not exist anymore; the Dolgans living in Dudinka and Norilsk come originally from the whole Taymyr peninsula, and thus from different dialectal regions. The INEL Dolgan corpus contains both Upper and Lower Dolgan material. For several reasons Anabar Dolgan is not included. The INEL Dolgan corpus is composed of texts from different sources: 1. Published folklore texts from an edited volume ("Fol'klor Dolgan", P.E. Efremov 2000), 2. Transcripts of recordings obtained from the Taymyr House of Folk Art (TDNT) in Dudinka (1970s-2000s), 3. Transcripts from the collection of Dr. Eugénie Stapert recorded on several fieldwork trips in 2007-2010, 4. Transcripts of recordings made on a fieldwork trip in 2017. The first group as well as parts of the third group were already transcribed and translated, the rest of the recordings was transcribed and translated within the INEL project. Each text in the corpus is provided with morphological glossing, translation into English, Russian and German, as well as annotation of Russian borrowings. Some texts also have annotations for syntactic functions, semantic roles and information structure/information status. The corpus has been produced in the context of the joint research funding of the German Federal Government and Federal States in the Academies' Programme, with funding from the Federal Ministry of Education and Research and the Free and Hanseatic City of Hamburg. The Academies' Programme is coordinated by the Union of the German Academies of Sciences and Humanities. The corpus could not have been produced without the support of the Taymyr House of Folk Art (TDNT). On the one hand, the TDNT provided valuable audio material (see above). On the other hand, the TDNT helped in all kinds of organizational tasks. Moreover, the sound materials from the TDNT were transcribed and translated by the following native speakers of Dolgan: Nina Semyonovna Kudryakova (TDNT Head of Department of folklore and ethnography), who also worked as editor for transcriptions and translations by other consultants, Svetlana Semyonovna Kudryakova, Egor Kudryakov, Adeya Evdokimovna Eske, Aleksandra Tuprina, Illarion Tuprin. In summer 2017 (06.07.2017 - 20.08.2017), Chris Lasse Däbritz and Dr. Eugénie Stapert had the opportunity to conduct fieldwork with Dolgans in the town of Dudinka. Thanks to the organizational work of the TDNT, it was possible to work many hours each day with very motivated consultants. The fieldwork had two major goals: Firstly, most uncertainties in the analysis of the folklore texts from (Efremov 2000) could be resolved. Secondly, a lot of material obtained from the TDNT could be transcribed, translated and analyzed. Moreover, roughly half an hour of new material was recorded, namely an experiment on Social Cognition. The following local consultants were involved in the fieldwork: Nina Semyonovna Kudryakova, Anna Alekseevna Barbolina, Vera Polikarpovna Bettu, Galina Sidorovna Chuprina, Adeya Evdokimovna Eske, Stepanida Il`inichna Kudryakova, Yuliya Kupchik, Polina Prokop`evna Uodaj. The web-based search interface uses the Tsakonian Corpus platform developed by Dr. Timofey Arkhangelskiy, Humboldt Research Fellow at IFUU, Hamburg University.
    Download from http://hdl.handle.net/11022/0000-0007-CAE7-1
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.322.pdf
    Edition LREC 2020
  • Name INEL Dolgan Corpus 1.0
    Resource type Corpus
    Size 72912 words
    Languages Dolgan (dlg), English (eng), German (deu), Russian (rus)
    Production status Existing-used
    Resource usage Knowledge Discovery/Representation
    License CC BY-NC-SA 4.0
    Conditions of use Freely Available
    Description Dolgan is an endangered Siberian (North-Eastern) Turkic language, which is spoken on the Taymyr peninsula and in adjacent areas in Northern Siberia. The number of speakers is approx. 1,000. Whereas in many settlements, including Dudinka, Dolgan is rarely spoken in everyday life, there are three remote villages (Syndassko, Popigay and Novorybnoe) where Dolgan is still very present and frequently spoken. The INEL Dolgan corpus has been created within the long-term INEL project ("Grammatical Descriptions, Corpora and Language Technology for Indigenous Northern Eurasian Languages"), 2016-2033. The corpus makes possible typologically aware corpus-based grammatical research on the Dolgan language and expands the documentation of the lesser described indigenous languages of Northern Eurasia.
    Download from http://hdl.handle.net/11022/0000-0007-D832-D
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.354.pdf
    Edition LREC 2020
  • Name INEL Kamas Corpus 0.1
    Resource type Corpus
    Size 48293 words
    Languages English (eng), German (deu), Kamas (xas), Russian (rus)
    Production status Existing-used
    Resource usage Knowledge Discovery/Representation
    License CC BY-NC-SA 4.0
    Conditions of use Freely Available
    Description Kamas is an extinct Samoyedic language (Uralic family). The INEL Kamas corpus comprises folklore texts collected by Kai Donner in 1912-1914, before the language shift, and transcribed audio recordings of the last speaker, Klavdiya Plotnikova, made between 1964 and 1970. Each text in the corpus is provided with morphological glossing, translation into English, Russian and German, and annotation of borrowings. Some texts also have annotations for syntactic structure, semantic roles and information status.
    Download from http://hdl.handle.net/11022/0000-0007-CF46-2
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.354.pdf
    Edition LREC 2020
  • Name INEL Selkup Corpus 0.1
    Resource type Corpus
    Size 18673 words
    Languages English (eng), German (deu), Russian (rus), Selkup (sel)
    Production status Existing-used
    Resource usage Knowledge Discovery/Representation
    License CC BY-NC-SA 4.0
    Conditions of use Freely Available
    Description Selkup is an endangered Southern Samoyedic language (Uralic family). The INEL Selkup corpus is composed of texts from the archive of Angelina Ivanovna Kuzmina (1924-2002), who gathered a large amount of material on Selkup in almost all regions where the Selkup people lived in 1962-1977. Most texts in the corpus originate from the handwritten part of the archive; the others come from sound recordings made by A.I. Kuzmina, transcribed and translated within the INEL project. Each text in the corpus is provided with morphological glossing, translation into English, Russian and German, and annotation of borrowings. Some texts also have annotations for syntactic structure, semantic roles and information status.
    Download from http://hdl.handle.net/11022/0000-0007-CC85-D
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.354.pdf
    Edition LREC 2020
  • Name JParaCrawl
    Resource type Corpus
    Size 8763995 sentences
    Languages English (eng), Japanese (jpn)
    Production status Newly created-finished
    Resource usage Machine Translation, SpeechToSpeech Translation
    License Proprietary License (Research purposes - No commercial Use)
    Conditions of use Freely Available
    Description JParaCrawl is a large freely available English-Japanese parallel corpus created by NTT.
    Download from http://www.kecl.ntt.co.jp/icl/lirg/jparacrawl/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.443.pdf
    Edition LREC 2020
  • Name Kannada Event Annotated Treebank
    Resource type Corpus
    Size 37020 lexemes
    Languages Kannada
    Production status <Not Specified>
    Resource usage For Kannada NLP
    License <Not Specified>
    Conditions of use Freely Available
    Description This is an event annotated Kannada Dependency Treebank. The dataset has been annotated based on the guidelines mentioned in the paper.
    Download from https://drive.google.com/drive/folders/11ZXpP4mQcDcM91SKHiSNEtWi_mAkXku7
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/workshops/ISA16/pdf/2020.isa-1.10.pdf
    Edition ISA16 2020
  • Name KORE 50^DYWC
    Resource type Evaluation Data
    Size 50 sentences
    Languages English
    Production status Existing-used
    Resource usage Evaluation/Validation
    License OpenSource
    Conditions of use Freely Available
    Description Entity linking evaluation data set with annotations based on DBpedia, YAGO, Wikidata, and Crunchbase.
    Download from http://people.aifb.kit.edu/mfa/datasets/kore50-lrec2020.zip
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.291.pdf
    Edition LREC 2020
  • Name Korean FrameNet 1.1
    Resource type Corpus
    Size 40.7 MByte
    Languages Korean (kor)
    Production status Newly created-finished
    Resource usage Semantic Role Labeling
    License CC BY-NC-SA 2.0
    Conditions of use Freely Available
    Description Korean FrameNet is a lexical database that has rich annotations to represent the meaning of Korean text using semantic frames.
    Download from https://github.com/machinereading/koreanframenet
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.30.pdf
    Edition LREC 2020
  • Name Korean Movie Review Emotion Dataset
    Resource type Corpus
    Size 11216 KByte
    Languages Korean (kor)
    Production status Newly created-finished
    Resource usage Emotion Recognition/Generation
    License OpenSource
    Conditions of use Freely Available
    Description The Korean Movie Review Emotion (KMRE) dataset was annotated with six types of emotions in the NSMC dataset by following an in-house annotation procedure, which achieves the best performance. Here, the neutral emotion was excluded because there are no neutral reviews in the NSMC. After the construction, the training dataset was split into 80% for training and 20% for development. Specific emotions (i.e. anger, happiness, and sadness) were found to be annotated more frequently in the KMRE dataset. Furthermore, the train, dev, and test sets of the KMRE dataset tend to have similar emotion label distributions. The KMRE dataset has 119,995 sentences for training, 29,999 sentences for development, and 49,997 sentences for testing.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/891_res_1.zip [11.48 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.199.pdf
    Edition LREC 2020
  • Name kosentences
    Resource type Corpus
    Size 31427658 sentences
    Languages Korean
    Production status Newly created-finished
    Resource usage Language Modelling
    License MIT Licence, GNU Free Documentation License, CC-BY-NC-SA 4.0
    Conditions of use Freely Available
    Description kosentences is a large-scale web corpus of Korean text. It can be used for any task which requires a large amount of unsupervised and (relatively) well-formed text, such as language modeling.
    Download from https://github.com/cynthia/kosentences/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.429.pdf
    Edition LREC 2020
  • Name LEDGAR
    Resource type Corpus
    Size 104990418 tokens
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Document Classification, Text categorisation
    License MIT License
    Conditions of use Freely Available
    Description We present LEDGAR, a multilabel corpus of legal provisions in contracts. The corpus was crawled and scraped from the public domain (SEC filings) and is, to the best of our knowledge, the first freely available corpus of its kind. Since the corpus was constructed semi-automatically, we apply and discuss various approaches to noise removal. Due to the rather large labelset of over 12,000 labels annotated in almost 100,000 provisions in over 60,000 contracts, we believe the corpus to be of interest for research in the field of Legal NLP, (large-scale or extreme) text classification, as well as for legal studies.
    Download from https://drive.switch.ch/index.php/s/j9S0GRMAbGZKa1A
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.155.pdf
    Edition LREC 2020
  • Name LibriVoxDeEN
    Resource type Corpus
    Size 62 GByte
    Languages English, German
    Production status Newly created-finished
    Resource usage Machine Translation, SpeechToSpeech Translation
    License CC BY-NC-SA 4.0
    Conditions of use Freely Available
    Description LibriVoxDeEn is a corpus of sentence-aligned triples of German audio, German text, and English translation, based on German audio books. The speech translation data consist of 110 hours of audio material aligned to over 50k parallel sentences. An even larger dataset comprising 547 hours of German speech aligned to German text is available for speech recognition. The audio data is read speech and thus low in disfluencies. The quality of audio and sentence alignments has been checked by a manual evaluation, showing that speech alignment quality is in general very high. The sentence alignment quality is comparable to well-used parallel translation data and can be adjusted by cutoffs on the automatic alignment score. To our knowledge, this corpus is to date the largest resource for German speech recognition and for end-to-end German-to-English speech translation.
    Download from https://www.cl.uni-heidelberg.de/statnlpgroup/librivoxdeen/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.441.pdf
    Edition LREC 2020
  • Name List of English-French Cognates
    Resource type Lexicon
    Size 492 entries
    Languages English (eng), French (fra)
    Production status Newly created-finished
    Resource usage Word Sense Disambiguation
    License CC-BY 4.0
    Conditions of use Freely Available
    Description This is a new database for computational approaches to the synchronized diachronic investigation of language pairs, and for subsequent novel findings stemming from the temporal comparison of cognates in the two chosen languages. The present study is the first in the literature to use computational approaches and large data to make a cross-language temporal analysis. Taken and computed from Google n-grams, a cognate database was constructed with 492 English-French word pairs composed of nouns, adjectives and adverbs. Each pair has between one and four forms in English, and up to ten in French. In English, most words have only one form for adjectives and adverbs, while most nouns have two forms (singular and plural). In French, with masculine and feminine, singular and plural, most nouns and adjectives can be found in four different surface forms.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/584_res_1.txt [76 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.107.pdf
    Edition LREC 2020
  • Name LK Hadith Corpus
    Resource type Language Resources/Technologies Infrastructure
    Size 10266304 tokens
    Languages Arabic, English
    Production status Newly created-finished
    Resource usage Evaluation/Validation
    License OpenSource
    Conditions of use Freely Available, Attribution Required
    Description Well-structured Arabic-English Parallel Corpus of Islamic Hadith with its components (Isnad and Matn) segmented.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/980_res_1.zip [14.54 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.415.pdf
    Edition LREC 2020
  • Name Machine Reading Comprehension Dataset
    Resource type Corpus
    Size <Not Specified>
    Languages English
    Production status Newly created-finished
    Resource usage Question Answering
    License <Not Specified>
    Conditions of use From Owner
    Description The dataset is for building machine reading comprehension models on scholarly articles.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1035_res_1.zip [408 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.675.pdf
    Edition LREC 2020
  • Name MAGPIE Corpus
    Resource type Corpus
    Size 56622 entries
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Machine Learning
    License CC-BY 4.0
    Conditions of use Freely Available
    Description The MAGPIE Corpus is a large sense-annotated corpus of potentially idiomatic expressions (PIEs), based on the British National Corpus (BNC). Potentially idiomatic expressions are like idiomatic expressions, but the term also covers literal uses of idiomatic expressions, such as 'I leave work at the end of the day.' for the idiom 'at the end of the day'. The corpus contains 56,622 instances, covering 1,756 different idiom types, all of which have crowdsourced meaning labels.
    Download from https://github.com/hslh/magpie-corpus
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.35.pdf
    Edition LREC 2020
  • Name Mandarinograd
    Resource type Corpus
    Size 154 entries
    Languages Mandarin Chinese (cmn)
    Production status Newly created-finished
    Resource usage Evaluation/Validation
    License <Not Specified>
    Conditions of use Freely Available
    Description Mandarinograd is a collection of 154 Winograd Schemas in Mandarin Chinese, mainly translated and adapted from English. Winograd Schemas represent particularly hard problems in natural language understanding, related to anaphora resolution. They are designed to involve common sense reasoning and to limit the biases and artefacts commonly found in natural language understanding datasets.
    Download from https://gitlab.com/vanTot/mandarinograd/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.3.pdf
    Edition LREC 2020
  • Name Mapudungun Corpus
    Resource type Corpus
    Size 142 hours
    Languages Mapudungun (arn), Spanish (spa)
    Production status Existing-updated
    Resource usage Speech Synthesis, Speech Recognition and Machine Translation
    License CC-BY-NC-SA 3.0
    Conditions of use Freely Available
    Description 142 hours of conversations in Mapudungun, fully transcribed and translated into Spanish. Training, development, and test splits are provided for speech synthesis, speech recognition, and machine translation.
    Download from https://github.com/mingjund/mapudungun-corpus
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.350.pdf
    Edition LREC 2020
  • Name MASRI-HEADSET Corpus
    Resource type Corpus
    Size 8 hours
    Languages Maltese (mlt)
    Production status Newly created-finished
    Resource usage Speech Recognition/Understanding
    License <Not Specified>
    Conditions of use From Owner. The corpus is free for Research/Academic purposes. This corpus belongs to the University of Malta. All rights reserved.
    Description The MASRI-HEADSET CORPUS (MHC) is suitable for speech recognition and has the following characteristics: - The MHC has an exact duration of 8 hours and 6 minutes. It has 3864 audio files. - The MHC has recordings from 25 different speakers: 13 women and 12 men. - Data in the MHC is classified by speaker; that is, all the recordings of a single speaker are stored in a single directory. - Data is also classified according to the gender (male/female) of the speakers. - Audio files in the MHC are distributed in a 16kHz@16bit mono format. - Every audio file has an ID that is compatible with ASR engines such as Kaldi and CMU-Sphinx. - Transcriptions in the MHC are lowercase. No punctuation marks are permitted except dashes (-) and apostrophes (') because they belong to the Maltese orthography.
    Download from https://www.um.edu.mt/projects/masri/index.html
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.784.pdf
    Edition LREC 2020
  • Name MEL dataset
    Resource type Corpus
    Size 51 MByte
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Information Extraction, Information Retrieval
    License CC BY-NC-SA 3.0
    Conditions of use Freely Available
    Description This is an evaluation corpus for multimedia entity linking as described in (Adjali et al., 2020). It is made of tweet identifiers separated into 4 files: - kb: knowledge base containing 2657213 ids - mel_dev_ids: development dataset for evaluation (16599 ids) - mel_test_ids: test dataset for evaluation (36521 ids) - mel_train_ids: train dataset for evaluation (35976 ids) More information at https://github.com/OA256864/MEL_Tweets
    Download from https://drive.google.com/open?id=1kkRpVJpo-U6Gt_r4Ly-ciq4pAY03CoTg
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.528.pdf
    Edition LREC 2020
  • Name Minho Word Pool
    Resource type Lexicon
    Size 700 KByte
    Languages Portuguese
    Production status Existing-used
    Resource usage Acquisition
    License <Not Specified>
    Conditions of use Freely Available
    Description Minho Word Pool (MWP) is a dataset that provides normative values of imageability, concreteness and subjective frequency for 3,800 (European) Portuguese words, three subjective measures, which in spite of being extensively used in research, were still scarce for Portuguese. Data were collected with 2,357 college students who were native speakers of European Portuguese. Participants rated 100 words drawn randomly from the full set in each of the three subjective indices using a web survey procedure (via a URL link). Analyses comparing the MWP ratings with those obtained for the same words from other national and international databases showed that the MWP norms are reliable and valid, thus providing researchers with a useful tool to support research in all neuroscientific areas using verbal stimuli.
    Download from http://p-pal.di.uminho.pt/about/databases
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.160.pdf
    Edition LREC 2020
  • Name MNLI DEV partition matched delexicalized
    Resource type Corpus
    Size 12 NULL
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Textual Entailment and Paraphrasing
    License <Not Specified>
    Conditions of use Freely Available
    Description <Not Specified>
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1319_res_4.txt [12.92 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.850.pdf
    Edition LREC 2020
  • Name MNLI DEV partition mis matched delexicalized
    Resource type Corpus
    Size 2.2 MByte
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Textual Entailment and Paraphrasing
    License <Not Specified>
    Conditions of use Freely Available
    Description <Not Specified>
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1319_res_7.txt [2.33 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.850.pdf
    Edition LREC 2020
  • Name MNLI training delexicalized
    Resource type Corpus
    Size 492 MByte
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Textual Entailment and Paraphrasing
    License <Not Specified>
    Conditions of use Freely Available
    Description This is the masked version of the datasets FEVER, FNC, MEDNLI and MNLI, delexicalized with the OA-NER technique presented in the paper referenced herewith.
    Download from https://osf.io/szdkn/?view_only=4845641a80624ac493ca14df34e68e8c
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.850.pdf
    Edition LREC 2020
  • Name Monsanto Trial Document
    Resource type Treebank
    Size 15000 sentences
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Sensitive Information Detection
    License CC-BY 4.0
    Conditions of use Freely Available
    Description Constituent parse trees over labeled sentences. The corpus consists of publicized documents from the Monsanto trial.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/219_res_1.zip [4.54 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.158.pdf
    Edition LREC 2020
  • Name Movie MPAA Information Corpus
    Resource type Corpus
    Size 5562 entries
    Languages English (eng)
    Production status Existing-updated
    Resource usage Machine Learning
    License <Not Specified>
    Conditions of use From Owner
    Description The data set of this work is an expansion of the existing movie script dataset (http://ritual.uh.edu/1493-2/) to include the MPAA rating for the movies. Other important features of our dataset are: (1) 50 more movies compared to the original dataset; (2) up to 12 most similar movies for each movie in the dataset along with the MPAA rating of those similar movies; (3) users' vote for MPAA components (violence, profanity, sex, drug/alcohol, frightening); (4) reason of the MPAA rating published by the MPAA organization; (5) poster images for movies. PS: The whole dataset is a 5G file. So, we uploaded the meta-information (similar movies, metadata file, reasons, MPAA component) directly to this page, and we provided the whole dataset (meta-information + posters + script) through a link.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/548_res_1.zip [1.27 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.166.pdf
    Edition LREC 2020
  • Name Multi sentenced Algebraic Mathematical word problems in English
    Resource type Corpus
    Size NULL sentences
    Languages English
    Production status Newly created-finished
    Resource usage Machine Learning
    License <Not Specified>
    Conditions of use Freely Available
    Description <Not Specified>
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1220_res_2.txt [234 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.579.pdf
    Edition LREC 2020
  • Name Multi sentenced simple Mathematical word problems in English
    Resource type Corpus
    Size NULL sentences
    Languages English
    Production status Newly created-finished
    Resource usage Machine Learning
    License <Not Specified>
    Conditions of use Freely Available
    Description <Not Specified>
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1220_res_1.txt [140 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.579.pdf
    Edition LREC 2020
  • Name Multi sentenced simple Mathematical word problems in Sinhala
    Resource type Corpus
    Size NULL sentences
    Languages Sinhala
    Production status Newly created-finished
    Resource usage Machine Learning
    License <Not Specified>
    Conditions of use Freely Available
    Description <Not Specified>
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1220_res_3.txt [250 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.579.pdf
    Edition LREC 2020
  • Name Multi sentenced simple Mathematical word problems in Tamil
    Resource type Corpus
    Size NULL sentences
    Languages Tamil
    Production status Newly created-finished
    Resource usage Machine Learning
    License <Not Specified>
    Conditions of use Freely Available
    Description <Not Specified>
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1220_res_4.txt [234 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.579.pdf
    Edition LREC 2020
  • Name Multi-Party Dialogue Dataset
    Resource type Corpus
    Size 6.7 MByte
    Languages Chinese
    Production status Newly created-finished
    Resource usage Dialogue
    License <Not Specified>
    Conditions of use Freely Available
    Description The Multi-Party Dialogue Dataset (MPDD) consists of two files, dialogue.json and metadata.json. The file dialogue.json contains the dialogues. Each dialogue has a unique case index value in the json file, and is a list composed of the utterances in speaking order. Every utterance in the list contains the speaker, content, and annotated labels shown in data format. The list of the listener in the utterance contains all listeners in this utterance with their relation type. The data format of dialogue.json is shown as follows. {case index:[{"speaker": speaker's name,"utterance": utterance, "listener": [{"name": listener's name,"relation": relation type between speaker and listener}, ...],"emotion": speaker's emotion type}, ...]} The metadata is given in metadata.json. The file defines all the emotion, relation types, and the sub-classes in the two perspectives, position, and field. The data format of metadata.json is shown as follows. {"relation":["parent", ...],"field":{"family":["parent", ...],"school":["teacher", ...],"company":["boss", ...],"others":["couple", ...]},"position":{"superior":["parent",...],"peer":["spouse", ...],"inferior":["child",...]},"emotion":["fear",...]}
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/641_res_1.zip [953 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.76.pdf
    Edition LREC 2020
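
The MPDD record above describes the layout of dialogue.json and metadata.json. As a rough illustration only, the sketch below parses two small invented samples that follow that described layout and counts how often listeners fall into each "position" class. The sample contents, the function names and the "unknown" fallback label are assumptions made for this example and are not part of the dataset; real files would be read with open(path, encoding="utf-8") and json.load.

```python
import json
from collections import Counter

# Invented sample in the dialogue.json layout described for MPDD.
# Speaker names, utterances and labels are placeholders, not real data.
SAMPLE_DIALOGUE = """
{
  "1": [
    {"speaker": "Mother", "utterance": "Where were you?",
     "listener": [{"name": "Son", "relation": "child"}],
     "emotion": "fear"},
    {"speaker": "Son", "utterance": "At school.",
     "listener": [{"name": "Mother", "relation": "parent"},
                  {"name": "Teacher", "relation": "teacher"}],
     "emotion": "fear"}
  ]
}
"""

# Invented sample in the metadata.json layout described for MPDD.
SAMPLE_METADATA = """
{
  "relation": ["parent", "child", "teacher"],
  "field": {"family": ["parent", "child"], "school": ["teacher"]},
  "position": {"superior": ["parent", "teacher"], "peer": [], "inferior": ["child"]},
  "emotion": ["fear"]
}
"""


def relation_to_position(metadata):
    """Invert the "position" grouping into a relation -> position lookup."""
    lookup = {}
    for position, relations in metadata["position"].items():
        for relation in relations:
            lookup[relation] = position
    return lookup


def count_listener_positions(dialogues, metadata):
    """Count listener mentions per position class over all utterances."""
    lookup = relation_to_position(metadata)
    counts = Counter()
    for utterances in dialogues.values():      # one list per case index
        for utterance in utterances:           # utterances in speaking order
            for listener in utterance["listener"]:
                # "unknown" is a hypothetical fallback for unlisted relations
                counts[lookup.get(listener["relation"], "unknown")] += 1
    return counts


dialogues = json.loads(SAMPLE_DIALOGUE)
metadata = json.loads(SAMPLE_METADATA)
position_counts = count_listener_positions(dialogues, metadata)
print(position_counts)
```

The same traversal (case index, then utterance, then listener) can be reused to tally the "emotion" labels or the "field" grouping instead.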
  • Name Multi-Sense Dataset (MSD-1030)
    Resource type Evaluation Data
    Size 1030 word pairs
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Evaluation/Validation
    License <Not Specified>
    Conditions of use Freely Available
    Description MSD-1030 is an English semantic similarity dataset designed to facilitate more reliable evaluations of sense embeddings. Unlike previous word similarity datasets, most words in MSD-1030 are multi-sense words, which requires sense embedding models to capture multiple meanings instead of only the most dominant sense. In addition, because of the carefully designed word pair selection process, the distribution of the similarity scores is balanced over the entire range. MSD-1030 also does not contain phrases, which are usually skipped in evaluations of word and sense embedding models. MSD-1030 is more suitable for evaluating multi-sense representations and may shed light on room for improvement in related research.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/567_res_1.zip [19 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.711.pdf
    Edition LREC 2020
  • Name multiCEFRLex
    Resource type Lexicon
    Size 41425 entries
    Languages English (eng), French (fra), Swedish (swe)
    Production status Newly created-finished
    Resource usage Language Learning/Grading
    License <Not Specified>
    Conditions of use Freely Available
    Description Lexical resources for foreign language (L2) learning, teaching and research graded following the Common European Framework of Reference for Languages (CEFR).
    Download from http://pub.cl.uzh.ch/purl/multiCEFRLex
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.43.pdf
    Edition LREC 2020
  • Name Multilingual Hate Speech Dataset for Fairness Evaluation
    Resource type Corpus
    Size 7.4 MByte
    Languages English (eng), Italian (ita), Polish (pol), Portuguese (por), Spanish (spa)
    Production status Newly created-finished
    Resource usage Document Classification, Text categorisation
    License Apache 2.0
    Conditions of use Freely Available
    Description First corpus for fairness evaluation of document classifiers in the field of hate speech detection.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1174_res_1.zip [7.39 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.180.pdf
    Edition LREC 2020
  • Name Multilingual Task-Oriented Dialog Data-Basque
    Resource type Corpus
    Size 6404 sentences
    Languages Basque (eus)
    Production status Newly created-finished
    Resource usage Dialogue
    License CC-BY-NC-SA 4.0
    Conditions of use Freely Available
    Description The resource contains manually and automatically generated and annotated utterances for Basque from the Multilingual Task-Oriented Dialog Dataset. 3,418 utterances were manually translated and annotated, and 2,986 were automatically translated from Spanish by a neural MT system and annotated by projecting the slot tags using a word aligner.
    Download from https://hizkuntzateknologiak.elhuyar.eus/assets/files/fmtodelh.tgz
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.340.pdf
    Edition LREC 2020
  • Name MultiMWE corpora
    Resource type Corpus
    Size 3302268 entries
    Languages Chinese, English, German
    Production status Newly created-finished
    Resource usage Corpus Creation/Annotation
    License <Not Specified>
    Conditions of use Freely Available. Free for Research Only
    Description The size of the bilingual MWE pairs is 3,159,226 and 143,042 for German-English and Chinese-English respectively after filtering.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/78_res_1.zip [14.95 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.363.pdf
    Edition LREC 2020
  • Name Multiword Expressions dataset with type and complexity annotation
    Resource type Evaluation Data
    Size 1.3 MByte
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Evaluation/Validation
    License CC-BY-NC-SA 4.0
    Conditions of use Freely Available
    Description We release a re-annotated Complex Word Identification Shared Task 2018 dataset originally provided by Yimam et al. (2017). The multiword expressions (MWEs) in this data are annotated with MWE type, which helps draw conclusions about the complexity of each MWE type for native and non-native readers. This work demonstrates that incorporating the type of MWE based on this dataset into a lexical complexity assessment system improves its performance. Together with the original complexity labels, this dataset represents a valuable resource for the text simplification community.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1446_res_1.tsv [1.30 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.545.pdf
    Edition LREC 2020
  • Name MyFixit Dataset
    Resource type Corpus
    Size 313 MByte
    Languages English
    Production status Newly created-finished
    Resource usage Information Extraction, Information Retrieval
    License CC-BY-NC-SA 3.0
    Conditions of use Freely Available
    Description MyFixit is a collection of repair manuals, collected from the iFixit website. There are in total 31,601 repair manuals in 15 device categories. Each step in the manuals of the 'Mac Laptop' category is annotated with the required tool, disassembled parts, and the removal verbs (1,497 manuals with 36,659 steps).
    Download from https://github.com/rub-ksv/MyFixit-Dataset
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.260.pdf
    Edition LREC 2020
  • Name Norm It
    Resource type Corpus
    Size 12822 words
    Languages Italian
    Production status Existing-updated
    Resource usage Lexical Normalization
    License <Not Specified>
    Conditions of use Freely Available
    Description <Not Specified>
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1079_res_1.tgz [69 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.769.pdf
    Edition LREC 2020
  • Name North Sámi active learning morphological segmentation annotations
    Resource type Evaluation Data
    Size 796 types
    Languages Northern Sami (sme)
    Production status Existing-used
    Resource usage Morphological Analysis
    License CC-BY 4.0
    Conditions of use Freely Available
    Description Morphological segmentation data set, for unsupervised and weakly supervised learning.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/872_res_4.tgz [45 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.486.pdf
    Edition LREC 2020
  • Name NTCIR14-QALab-PoliInfo-FormalRunDataset
    Resource type Evaluation Data
    Size 45 MByte
    Languages Japanese (jpn)
    Production status Newly created-finished
    Resource usage Information Extraction, Information Retrieval
    License <Not Specified>
    Conditions of use Freely Available
    Description NTCIR14-QALab-PoliInfo-FormalRunDataset.
    Download from https://github.com/kmr-y/NTCIR14-QALab-PoliInfo-FormalRunDataset
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.253.pdf
    Edition LREC 2020
  • Name NZZ Black Letter Ground Truth
    Resource type Corpus
    Size 304286 words
    Languages German
    Production status Newly created-finished
    Resource usage Machine Learning
    License CC-BY-NC 4.0
    Conditions of use Freely Available
    Description The Neue Zürcher Zeitung (NZZ) was published in black letter from its very first issue in 1780 until 1947. From this time period, one frontpage per year was randomly sampled, resulting in a total of 167 pages. Frontpages were chosen because they typically contain highly relevant material and to make sure not to sample pages containing exclusively advertisements or stock information. During certain periods, the NZZ was published several times a day, and there were supplements, too. Due to incomplete metadata, the sampling included frontpages from supplements.
    Download from https://github.com/impresso/NZZ-black-letter-ground-truth
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.436.pdf
    Edition LREC 2020
  • Name Odia Sentiment Dataset
    Resource type Corpus
    Size 582 KByte
    Languages Odia
    Production status Newly created-finished
    Resource usage Machine Learning
    License <Not Specified>
    Conditions of use From Owner
    Description This dataset consists of 2045 sentences in Odia annotated with sentiments labelled as +1 (positive), 0 (neutral), -1 (negative). These sentences were picked from the news domain, the source being the "Samaja" Archives.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/789_res_1.txt [582 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.339.pdf
    Edition LREC 2020
  • Name OLID
    Resource type Corpus
    Size 14100 entries
    Languages English
    Production status Existing-used
    Resource usage Evaluation/Validation
    License <Not Specified>
    Conditions of use Freely Available
    Description OLID contains a collection of annotated tweets using an annotation model that encompasses the following three levels: (A) Offensive Language Detection; (B) Categorization of Offensive Language; (C) Offensive Language Target Identification. OLID was used in the OffensEval: Identifying and Categorizing Offensive Language in Social Media (SemEval 2019 - Task 6) shared task.
    Download from https://competitions.codalab.org/competitions/20011
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.760.pdf
    Edition LREC 2020
  • Name OpenWordNet.PT
    Resource type Lexicon
    Size 39000 synsets
    Languages Portuguese
    Production status Existing-used
    Resource usage Acquisition
    License CC-BY 4.0
    Conditions of use Freely Available
    Description The OpenWordnet-PT (abbreviated as OpenWN-PT or simply OWN-PT) is an open access wordnet for Portuguese.
    Download from http://openwordnet-pt.org/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.160.pdf
    Edition LREC 2020
  • Name Optimized Twitter Search Phrases
    Resource type Corpus
    Size 20000 phrases
    Languages <Not Specified>
    Production status Finished
    Resource usage <Not Specified>
    License <Not Specified>
    Conditions of use <Not Specified>
    Description This resource contains optimized search phrases for the 50 most prevalent languages on Twitter. Each list consists of key phrases that identify the most target language tweets while limiting the number of other-language tweets retrieved. Use a list in combination with the Twitter Streaming API to obtain the language-specific stream of tweets. When collecting Twitter data in this way, please refer to the accompanying paper to know the limits of your collection.
    Download from https://www.clips.uantwerpen.be/twitter/phrases.zip
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/workshops/WAC-II/pdf/2020.wac-1.8.pdf
    Edition WACXII 2020
  • Name PAN-SemEval-Hyperpartisan-News-Detection-19
    Resource type Evaluation Data
    Size 1,000,000 documents
    Languages English
    Production status Newly created-finished
    Resource usage Document Classification, Text categorisation
    License CC-BY 4.0
    Conditions of use From Owner
    Description <Not Specified>
    Download from https://pan.webis.de/semeval19/semeval19-web/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.159.pdf
    Edition LREC 2020
  • Name Parallel Sentiment
    Resource type Corpus
    Size 300 KByte
    Languages ar,cs,de,en,es,fi,fr,hi,id,it,ja,ko,pl,pt,ru,sv,th,tr,zh
    Production status Newly created-finished
    Resource usage Emotion Recognition/Generation
    License <Not Specified>
    Conditions of use Freely Available
    Description Sentence-level sentiment annotations (positive or negative) for 19 languages semi-automatically generated from PUD (parallel corpora in Universal Dependencies).
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/288_res_1.zip [13 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.500.pdf
    Edition LREC 2020
  • Name ParaPat
    Resource type Corpus
    Size 68000000 sentences
    Languages Chinese (zho), English (eng), French (fra), Japanese (jpn), Russian (rus), Korean, German, Spanish
    Production status Newly created-finished
    Resource usage Machine Translation, SpeechToSpeech Translation
    License GNU-GPL v3.0
    Conditions of use Freely Available
    Description ParaPat is a corpus of parallel sentences from patents extracted from Google Patents.
    Download from https://github.com/soares-f/parapat
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.465.pdf
    Edition LREC 2020
  • Name PO-EMO
    Resource type Corpus
    Size 4191 lines
    Languages English, German
    Production status Newly created-finished
    Resource usage Emotion Recognition/Generation
    License <Not Specified>
    Conditions of use Freely Available
    Description PO-EMO is a corpus of German and English Poetry, with fine-grained multi-label annotation of aesthetic emotions.
    Download from https://github.com/tnhaider/poetry-emotion
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.205.pdf
    Edition LREC 2020
  • Name Prague Czech English Dependency Treebank 2.0
    Resource type Treebank
    Size 50000 sentences
    Languages Czech, English
    Production status Existing-updated
    Resource usage <Not Specified>
    License CC BY-NC-SA 3.0
    Conditions of use Freely Available
    Description The Prague Czech-English Dependency Treebank 2.0 (PCEDT 2.0) is a major update of the Prague Czech-English Dependency Treebank 1.0 (LDC2004T25). It is a manually parsed Czech-English parallel corpus sized over 1.2 million running words in almost 50,000 sentences for each part.
    Download from http://hdl.handle.net/11858/00-097C-0000-0015-8DAF-4
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.641.pdf
    Edition LREC 2020
  • Name Prague Dependency Treebank - Consolidated 1.0
    Resource type Treebank
    Size 4000000 tokens
    Languages Czech
    Production status Newly created-finished
    Resource usage <Not Specified>
    License <Not Specified>
    Conditions of use Restricted Access
    Description The Prague Dependency Treebank - Consolidated (PDT-C) is a consolidated release of the existing PDT-corpora of Czech data with manual annotation at all three PDT-annotation layers (morphological, surface syntax and deep syntax layer).
    Download from http://hdl.handle.net/11234/1-3185
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.641.pdf
    Edition LREC 2020
  • Name Prague Dependency Treebank 3.5
    Resource type Treebank
    Size 50000 sentences
    Languages Czech
    Production status Existing-updated
    Resource usage <Not Specified>
    License CC BY-NC-SA 4.0
    Conditions of use Freely Available
    Description The Prague Dependency Treebank 3.5 is the 2018 edition of the core Prague Dependency Treebank (PDT). It contains all PDT annotation made at the Institute of Formal and Applied Linguistics under various projects between 1996 and 2018 on the original texts, i.e., all annotation from PDT 1.0, PDT 2.0, PDT 2.5, PDT 3.0, PDiT 1.0 and PDiT 2.0, plus corrections, new structure of basic documentation and new list of authors covering all previous editions. The Prague Dependency Treebank 3.5 (PDT 3.5) contains the same texts as the previous versions since 2.0; there are 49,431 annotated sentences (over 800 thousand nodes) on all layers, from tectogrammatical to words, and additional sentences on the analytical (surface dependency syntax) and morphological layers of annotation (approx. 2 million words in total). Closely linked to the tectogrammatical layer is the annotation of sentence information structure, multiword expressions, coreference, bridging relations and discourse relations.
    Download from http://hdl.handle.net/11234/1-2621
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.641.pdf
    Edition LREC 2020
  • Name PT Lexical-Semantic Relations
    Resource type Lexicon
    Size NULL NULL
    Languages Portuguese
    Production status Existing-used
    Resource usage Acquisition
    License <Not Specified>
    Conditions of use Freely Available
    Description Lexical-Semantic relations acquired from ten computational lexical resources in Portuguese (PAPEL, Dicionário Aberto, Wiktionary.PT, TeP, OpenThesaurus.PT, OpenWordnet-PT, PULO, WordNet.Br, Port4Nooj, ConceptNet) together with the number of resources where they were extracted from.
    Download from http://ontopt.dei.uc.pt/index.php?sec=download_outros
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.160.pdf
    Edition LREC 2020
  • Name pwnalign
    Resource type Evaluation Data
    Size 1M entries
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Evaluation/Validation
    License <Not Specified>
    Conditions of use Freely Available
    Description A set of 3 gold standards for the evaluation of alignments between WordNet and Wikipedia.
    Download from https://sourceforge.net/projects/pwnalign/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.597.pdf
    Edition LREC 2020
  • Name Read Speech Corpus (RSC)
    Resource type Corpus
    Size 8.3 GByte
    Languages Romanian (ron)
    Production status Newly created-finished
    Resource usage Speech Recognition/Understanding
    License CC-BY-NC-ND 3.0
    Conditions of use From Owner
    Description Read Speech Corpus (RSC) is the largest publicly available Romanian speech corpus for training and evaluating Automatic Speech Recognition (ASR) systems, comprising 100 hours from 164 speakers.
    Download from https://speed.pub.ro/downloads/speech-datasets/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.814.pdf
    Edition LREC 2020
  • Name Renmin-OCR-NER
    Resource type Evaluation Data
    Size 438249 tokens
    Languages Chinese (cmn)
    Production status Newly created-finished
    Resource usage Named Entity Recognition
    License <Not Specified>
    Conditions of use Freely Available
    Description Newspaper images transcribed with text locations and transcriptions for OCR evaluation and annotated for named entities.
    Download from https://github.com/hltcoe/cmn-renmin-ocr-ner-dataset
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.570.pdf
    Edition LREC 2020
  • Name Resources for Academic Writing
    Resource type Corpus
    Size 8.6 MByte
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Textual Entailment and Paraphrasing
    License <Not Specified>
    Conditions of use Freely Available
    Description The dataset contains keyphrases and n-grams extracted from the reference corpus, training and test datasets for the informal word identification task, and training and test datasets for the academic paraphrasing task. A README file is included explaining the details.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1020_res_1.zip [8.59 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.722.pdf
    Edition LREC 2020
  • Name Rigor-Mortis-LREC2020
    Resource type Corpus
    Size 41820 tokens
    Languages French (fra)
    Production status Newly created-finished
    Resource usage Multiword Expressions
    License CC-BY-NC 4.0
    Conditions of use Freely Available
    Description The resource is an export of the multiword expressions annotated by players of the Rigor-Mortis game. See https://github.com/gwaps4nlp/rigor-mortis/blob/master/LREC_2020/LREC2020_RM.pdf
    Download from https://github.com/gwaps4nlp/rigor-mortis/tree/master/LREC_2020
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.541.pdf
    Edition LREC 2020
  • Name Romanian Named Entity Corpus
    Resource type Corpus
    Size 26377 entities
    Languages Romanian
    Production status Newly created-finished
    Resource usage Named Entity Recognition
    License MIT License
    Conditions of use Freely Available
    Description RONEC - the Romanian Named Entity Corpus contains over 26000 entities in ~5000 annotated sentences, belonging to 16 distinct classes. The sentences have been extracted from a copyright-free newspaper, covering several styles. This corpus represents the first initiative in the Romanian language space specifically targeted at named entity recognition. It is available in BRAT and CoNLL-U Plus formats, and it is free to use and extend at github.com/dumitrescustefan/ronec.
    Download from https://github.com/dumitrescustefan/ronec
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.546.pdf
    Edition LREC 2020
  • Name RusLTC
    Resource type Corpus
    Size 2.3 million tokens
    Languages English (eng), Russian (rus)
    Production status Existing-updated
    Resource usage Document Classification, Text categorisation
    License CC-BY-SA 3.0
    Conditions of use Freely Available
    Description Russian Learner Translator Corpus (RusLTC) is a bi-directional parallel corpus, which stores English-Russian translations produced by university students majoring in translation studies in Russia (Russian L1). The English-Russian subcorpus has multiple translations to 402 sources, while the Russian-English part includes 125 source-targets sets. The number of translations to a single source averages 8 (min=1, max=67). Many texts are translations of excerpts from larger sources; the median length of translations in the English-to-Russian subcorpus is 380 words. RusLTC is downloadable as a customized TMX file and a plaintext archive (https://www.rus-ltc.org/static/html/about.html). The TMX contains the sentence-aligned version of the corpus with the automatic alignment errors corrected manually. It also carries 10 metadata fields, including the translator's gender and affiliation, education type and level, grade for the translation, year and conditions of translation (routine/exam; home/classroom) and source text genre. RusLTC includes texts in 10 genres: academic, informational, educational, essay, technical, fiction, educational, speech, letters, advertisement. Detailed and automatically updated statistics are available from https://dev.rus-ltc.org/statistics/.
    Download from https://www.rus-ltc.org/static/html/about.html
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.505.pdf
    Edition LREC 2020
  • Name Scholarly Documents Deduplication Dataset
    Resource type Corpus
    Size 204 MByte
    Languages English
    Production status Newly created-finished
    Resource usage Document Classification, Text categorisation
    License <Not Specified>
    Conditions of use Freely Available
    Description A labelled dataset of 100K scholarly documents categorised into non-duplicate, near-duplicate and exact-duplicate groups. 50K examples are non-duplicates; 25K examples belong to the exact-duplicate groups and the remaining 25K examples belong to the near-duplicate groups.
    Download from https://core.ac.uk/documentation/dataset/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.113.pdf
    Edition LREC 2020
  • Name Seed corpus for entity coreference in email conversations
    Resource type Corpus
    Size 866 coreference chains
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Anaphora, Coreference
    License Apache 2.0
    Conditions of use Freely Available
    Description Corpus for the submission entitled: A study on Entity Resolution in Email Conversations. The seed corpus contains 46 email threads comprising 245 email messages. The annotations are provided in CoNLL format.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/899_res_1.tar.xz [83 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.8.pdf
    Edition LREC 2020
  • Name Semantic verb classes
    Resource type Evaluation Data
    Size 17 entries
    Languages English
    Production status Newly created-finished
    Resource usage Evaluation/Validation
    License CC BY 4.0
    Conditions of use Freely Available
    Description 17 semantic verb classes collected for English via a semantic clustering task, where 825 verbs are placed one by one into circles visible on the computer screen by an annotator, each circle corresponding to a grouping of similar and related verbs.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/444_res_2.txt [6 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.705.pdf
    Edition LREC 2020
  • Name SemEU-R CMV Corpus
    Resource type Corpus
    Size <Not Specified>
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Corpus Creation/Annotation
    License <Not Specified>
    Conditions of use From Owner
    Description <Not Specified>
    Download from http://katfuji.lab.tuat.ac.jp/nlp_datasets/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.143.pdf
    Edition LREC 2020
  • Name SentiEcon
    Resource type Lexicon
    Size 6470 entries
    Languages English
    Production status Newly created-finished
    Resource usage Opinion Mining/Sentiment Analysis
    License CC BY-NC 3.0
    Conditions of use Freely Available for Academic Purposes
    Description Domain-specific computational lexicon designed for sentiment analysis applications, compiled from our own corpus of online business news. It was created as a plug-in lexicon for the sentiment analysis tool Lingmotif, and thus it follows its data structure requirements and presupposes the availability of a general-language core sentiment lexicon that covers non-specific sentiment-carrying terms and phrases. It contains 6,470 entries, both single and multi-word expressions, each with tags denoting their semantic orientation and intensity.
    Download from http://tecnolengua.uma.es/sentiecon
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.623.pdf
    Edition LREC 2020
  • Name Serial Speakers: a Dataset of TV Series
    Resource type Corpus
    Size 45 MByte
    Languages English
    Production status Newly created-finished
    Resource usage Corpus Creation/Annotation
    License CC-BY 4.0
    Conditions of use Freely Available
    Description "Serial Speakers" is a dataset of 161 annotated episodes from three popular American TV serials, "Breaking Bad" (62 annotated episodes), "Game of Thrones" (73), and "House of Cards". "Serial Speakers" is suitable for addressing both high level multimedia retrieval tasks in real world scenarios, and lower level speech processing tasks in challenging conditions. The boundaries, speaker and textual content of every speech turn, along with all scene boundaries, have been manually annotated for the whole set of episodes; the shot boundaries and the recurring shots for the first season of each of the three series; and the interacting speakers for a subset of 10 episodes. Because of copyright restrictions, the textual content is encrypted in the public version of the dataset, but we provide the users with a simple tool for recovering the text from their own subtitle files (https://github.com/bostxavier/Serial-Speakers).
    Download from https://figshare.com/articles/TV_Series_Corpus/3471839
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.525.pdf
    Edition LREC 2020
  • Name SpA-Verb
    Resource type Evaluation Data
    Size 29721 entries
    Languages English
    Production status Newly created-finished
    Resource usage Evaluation/Validation
    License CC BY 4.0
    Conditions of use Freely Available
    Description An English intrinsic evaluation verb similarity dataset containing similarity scores for 29,721 unique verb pairs and 825 target verbs, obtained via a two-phase data collection design: a rough semantic clustering phase, where verbs are placed in broad semantic classes, and a spatial multi-arrangement phase, where fine-grained spatial similarity judgments are elicited on related verbs, clustered together in Phase 1.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/444_res_1.txt [746 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.705.pdf
    Edition LREC 2020
  • Name SQuAD-es v1.1
    Resource type Corpus
    Size 87595 (context, question, answer) tuples
    Languages Spanish (spa)
    Production status Newly created-finished
    Resource usage Question Answering
    License MIT License
    Conditions of use Freely Available
    Description Automatic Spanish translation of the training set of the Stanford Question Answering Dataset (SQuAD) v1.1
    Download from https://github.com/ccasimiro88/TranslateAlignRetrieve/tree/master/SQuAD-es-v1.1
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.677.pdf
    Edition LREC 2020
  • Name Statement Classification Dataset, arXMLiv 08.2018
    Resource type Corpus
    Size 10,555,689 paragraphs
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Document Classification, Text categorisation
    License SIGMathLing Non-Disclosure-Agreement (research-only use)
    Conditions of use Access is restricted to SIGMathLing members under the SIGMathLing Non-Disclosure-Agreement because, for most arXiv articles, the right of distribution was only given (or assumed) to arXiv itself. Membership is free and granted on the honor system.
    Description This resource contains 10,555,689 paragraphs with associated statement labels, realized as one paragraph per file, one sentence per line. Each file is placed in a subdirectory named after its annotated class. The statements were extracted from author-annotated environments, where we only selected the first paragraph immediately following the heading. Headings include both structural sections (e.g. Introduction) and scholarly statement annotations (e.g. Definition, Proof, Remark). A control dataset of the same statements is also included with all mathematical symbolism omitted (nomath), numbering 10,137,007 paragraphs. This math-free resource is smaller because omitting the formulas results in fewer unique paragraphs. It may be considered a useful benchmark when trying to evaluate the specific impact of mathematical expressions on classification performance. Community feedback is welcome on data quality, representation issues, and organization and archival best practices. New versions of this data are planned for joint release with new releases of the arXMLiv corpus.
    Download from https://sigmathling.kwarc.info/resources/arxmliv-statements-082018/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.153.pdf
    Edition LREC 2020
  • Name STEM-ECR
    Resource type Corpus
    Size 3.3 MByte
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Information Extraction, Information Retrieval
    License CC-BY-SA 3.0
    Conditions of use Freely Available
    Description The STEM ECR v1.0 dataset introduces the task of Scientific Entity Extraction, Classification, and Resolution on scholarly publications in STEM (Science, Technology, Engineering, and Medicine) disciplines. It comprises annotated scholarly abstracts from 10 STEM disciplines that were found to be the most prolific ones on a major publishing platform. The annotated data includes: phrase-based scientific entities, and their corresponding disambiguated references in Wikipedia and Wiktionary as applicable. The purpose of the dataset is to provide a benchmark for the evaluation of scientific entity extraction, classification, and resolution tasks in a domain-independent fashion.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1130_res_1.zip [4.93 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.268.pdf
    Edition LREC 2020
  • Name SVALex
    Resource type Lexicon
    Size 15619 entries
    Languages Swedish (swe)
    Production status Existing-used
    Resource usage Language Learning/Grading
    License CC BY-NC-SA 4.0
    Conditions of use Freely Available
    Description Lexical resources for foreign language (L2) learning, teaching and research graded following the Common European Framework of Reference for Languages (CEFR)
    Download from http://pub.cl.uzh.ch/purl/multiCEFRLex
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.43.pdf
    Edition LREC 2020
  • Name TableBank
    Resource type Evaluation Data
    Size 26.6 GByte
    Languages <Not Specified>
    Production status Newly created-finished
    Resource usage Information Extraction, Information Retrieval
    License CC-BY-NC-ND 4.0
    Conditions of use From Owner
    Description TableBank is a new image-based table detection and recognition dataset built with novel weak supervision from Word and LaTeX documents on the internet. It contains 417K high-quality labeled tables.
    Download from https://github.com/doc-analysis/TableBank
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.236.pdf
    Edition LREC 2020
  • Name TamilMemes
    Resource type Corpus
    Size <Not Specified>
    Languages English, Tamil, Tamil-English code mixed
    Production status <Not Specified>
    Resource usage <Not Specified>
    License <Not Specified>
    Conditions of use <Not Specified>
    Description TamilMemes folder contains troll and non-troll images in the directory with the same name.
    Download from https://github.com/sharduls007/TamilMemes
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/workshops/WILDRE-5/pdf/2020.wildre-1.2.pdf
    Edition WILDRE-5 2020
  • Name TaPaCo
    Resource type Corpus
    Size 2789631 sentences
    Languages English, Esperanto, Italian, Russian, Turkish, French, German, Portuguese, Spanish + 88 other languages
    Production status Newly created-finished
    Resource usage Textual Entailment and Paraphrasing
    License CC-BY 2.0
    Conditions of use Freely Available
    Description A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database.
    Download from https://doi.org/10.5281/zenodo.3707949
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.848.pdf
    Edition LREC 2020
  • Name Taxonomic Word Embeddings - Trained on English WordNet Random Walk Pseudo-Corpora
    Resource type Corpus
    Size 71 models
    Languages English
    Production status Newly created-finished
    Resource usage Language Modelling
    License CC BY-NC-SA 4.0
    Conditions of use Freely Available
    Description This archive contains a collection of computational models called word embeddings. These are vectors that contain numerical representations of words. They have been trained on pseudo-sentences generated artificially from a random walk over the English WordNet taxonomy, and thus reflect taxonomic knowledge about words (rather than contextual knowledge). A separate embedding model was trained for each of the 72 random walk corpora that were generated, and thus 72 different embedding models are made available (4.4GB). For training, an off-the-shelf implementation of pytorch was used and no major parameters were changed, essentially using it 'as is'. Each model has been trained for 30 epochs. As the corpus files differed with regard to the parameters used in their generation, these are also reflected in the models. The parameters are: - size: number of sentences/lines in the training corpus - direction: the direction that the random walk over WordNet was allowed to go while generating sentences (possibilities are up/down/both) - minimal sentence length: the shortest sentence length (in number of words)
    Download from https://arrow.dit.ie/datas/12/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.602.pdf
    Edition LREC 2020
  • Name The Alice Datasets: fMRI & EEG Observations of Natural Language Comprehension
    Resource type Corpus
    Size 75 participants
    Languages English
    Production status Newly created-finished
    Resource usage Corpus Creation/Annotation
    License CC0
    Conditions of use Freely Available
    Description The Alice Datasets are a set of datasets based on magnetic resonance data and electrophysiological data, collected while participants heard a story in English. Along with the datasets and the text of the story, there exists a variety of different linguistic and computational measures ranging from prosodic predictors to predictors capturing hierarchical syntactic information included in the supplementary materials. These ecologically valid datasets can be easily reused to replicate prior work and to test new hypotheses about natural language comprehension in the brain.
    Download from https://openneuro.org/datasets/ds002322/versions/1.0.3
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.15.pdf
    Edition LREC 2020
  • Name Time-Aware Word Embeddings for Three Lebanese News Archives
    Resource type Corpus
    Size 3.7 GByte
    Languages Arabic
    Production status Newly created-finished
    Resource usage Text Mining
    License CC-BY 4.0
    Conditions of use Freely Available
    Description The resource consists of various Arabic word embeddings trained using three large Lebanese news archives, namely Assafir, Annahar and Alhayat. The archives consist of 609,386 scanned newspaper images spanning a total of 151 years, ranging from 1933 to 2011. To train the word embeddings, Google's Tesseract 4.0 OCR engine was employed to transcribe the scanned news archives, and various archive-level as well as decade-level word embeddings were learnt.
    Download from https://zenodo.org/record/3538880
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.580.pdf
    Edition LREC 2020
  • Name TOROT
    Resource type Treebank
    Size 1155577 words
    Languages Church Slavic (chu), Old East Slavonic (orv), Russian (rus)
    Production status Existing-updated
    Resource usage Corpus Creation/Annotation
    License CC-BY-NC-SA 3.0
    Conditions of use Freely Available
    Description Tromsø Old Russian and Old Church Slavonic Treebank (TOROT) spans from the earliest Old Church Slavonic to modern Russian texts, covering more than a thousand years of continuous language history. It is a member of the larger PROIEL family of treebanks of ancient languages.
    Download from http://torottreebank.github.io/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.646.pdf
    Edition LREC 2020
  • Name TRANSLIT
    Resource type Corpus
    Size 1.6M entries
    Languages <Not Specified>
    Production status Newly created-finished
    Resource usage Machine Translation, SpeechToSpeech Translation
    License CC0
    Conditions of use Freely Available
    Description A large-scale name transliteration corpus. It merges large transliteration resources such as JRC, Geonames, Amazon transliteration and wiki-lang-all (new) names. The resources were unified and the name variants were merged.
    Download from https://github.com/fbenites/TRANSLIT
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.399.pdf
    Edition LREC 2020
  • Name Turku NER corpus
    Resource type Corpus
    Size 202116 tokens
    Languages Finnish (fin)
    Production status Newly created-finished
    Resource usage Named Entity Recognition
    License CC-BY-SA 4.0
    Conditions of use Freely Available
    Description The Turku NER corpus is a manually annotated broad-coverage corpus for Finnish NER, covering over 10,000 annotations for person, organization, location, product and event mentions as well as dates in 754 documents (approx. 200,000 tokens) representing ten different genres of text. The corpus builds on and extends the existing annotations of the Turku Dependency Treebank corpus included in the Universal Dependencies repository.
    Download from https://turkunlp.org/turku-ner-corpus
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.567.pdf
    Edition LREC 2020
  • Name TutorialVQA
    Resource type Corpus
    Size 4 GByte
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Question Answering
    License CC-BY-NC 4.0
    Conditions of use Freely Available
    Description The VideoQA corpus contains questions and answers pertaining to Photoshop helpx tutorial videos. The questions were collected on Amazon Mechanical Turk (AMT) and pertain to helpx tutorial video segments. These video segments were manually annotated. In total there are approximately 6,000 questions collected, pertaining to 408 total video segments, and relating to 76 videos. Turkers were shown the video segments and asked to input questions which the video segment can answer. To expand the question set, a separate task asked Turkers to paraphrase the original questions. The videos originate from https://helpx.adobe.com/.
    Download from https://github.com/acolas1/TutorialVQAData
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.670.pdf
    Edition LREC 2020
  • Name TV-AfD_Imperative_Corpus
    Resource type Corpus
    Size 3 MByte
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Language Modelling
    License GNU GPL v3.0
    Conditions of use Freely Available
    Description This is a corpus with imperative and non-imperative sentences from the TV show The Big Bang Theory and Wikipedia Articles for Deletion. We used rule-based classification and manual annotation to classify imperative sentences. 1. Data from the TV show: Both raw data and classified data are included in this repository. Each file in the raw data is from one episode, with each line representing one dialogue from that specific episode. In the tab-separated classification data file, each line represents one dialogue. The columns, in order, are: data source (in the format nxmm, with 'n' indicating the season of the show and 'mm' the episode), text, imperative classification (1 for imperative and 0 for non-imperative), imperative category, and whether the imperative sentence has affixal negative markers or not. 2. Data from Wikipedia Articles for Deletion: Only classified data for sentences are included from this source. The file is tab-separated with two columns, text and classification (1 for imperative and 0 for non-imperative). Each line represents one instance of data.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/235_res_1.zip [1.07 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.805.pdf
    Edition LREC 2020
  • Name Twitter Datasets
    Resource type Corpus
    Size 1904 stance-annotated tweets
    Languages German (deu)
    Production status Newly created-finished
    Resource usage Opinion Mining/Sentiment Analysis
    License CC-BY 4.0
    Conditions of use Freely Available
    Description The German Titling Twitter Corpus consists of 1904 stance-annotated tweets (125KB) collected in June/July 2018 mentioning 24 German politicians with a doctoral degree. The Addendum contains an additional 296 stance-annotated tweets from each month of 2018 mentioning 6 left-leaning and 4 right-leaning politicians with a doctoral degree.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/181_res_1.zip [125 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.606.pdf
    Edition LREC 2020
  • Name Twitter Users with City Population Size
    Resource type Corpus
    Size 2000000 sentences
    Languages English
    Production status Newly created-finished
    Resource usage Document Classification, Text categorisation
    License <Not Specified>
    Conditions of use Freely Available
    Description 2 million Twitter user IDs and tweet IDs for tweets written by each user. They can be used to retrieve the original data from the Twitter API, subject to Twitter's terms of service.
    Download from http://lit.eecs.umich.edu/downloads.html
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.771.pdf
    Edition LREC 2020
  • Name Universal Dependencies
    Resource type Treebank
    Size 25 million words
    Languages Afrikaans, Akkadian, Amharic, Ancient Greek, Arabic, Armenian, Assyrian, Bambara, Basque, Belarusian, Bhojpuri, Breton, Bulgarian, Buryat, Cantonese, Catalan, Chinese, Classical Chinese, Coptic, Croatian, Czech, Danish, Dutch, English, Erzya, Estonian, Faroese, Finnish, French, Galician, German, Gothic, Greek, Hebrew, Hindi, Hindi English, Hungarian, Indonesian, Irish, Italian, Japanese, Karelian, Kazakh, Komi Permyak, Komi Zyrian, Korean, Kurmanji, Latin, Latvian, Lithuanian, Livvi, Maltese, Marathi, Mbya Guarani, Moksha, Naija, North Sami, Norwegian, Old Church Slavonic, Old French, Old Russian, Persian, Polish, Portuguese, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Skolt Sami, Slovak, Slovenian, Spanish, Swedish, Swedish Sign Language, Swiss German, Tagalog, Tamil, Telugu, Thai, Turkish, Ukrainian, Upper Sorbian, Urdu, Uyghur, Vietnamese, Warlpiri, Welsh, Wolof, Yoruba
    Production status Existing-updated
    Resource usage Parsing and Tagging
    License Universal Dependencies 2.6 License Agreement
    Conditions of use Freely Available
    Description Universal Dependencies is an initiative to create cross-linguistically consistent morphosyntactic annotation for many languages. The latest release, UD v2.5, contains 157 treebanks in 90 languages. The resources are available from https://lindat.mff.cuni.cz/. Documentation can be found at https://universaldependencies.org.
    Download from https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3105
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.497.pdf
    Edition LREC 2020
  • Name Urban Dictionary Embeddings
    Resource type Corpus
    Size 3 GByte
    Languages English
    Production status Newly created-finished
    Resource usage Document Classification, Text categorisation
    License <Not Specified>
    Conditions of use Freely Available
    Description Word embeddings trained on the entirety of the content of Urban Dictionary as of October 16, 2019. A skip-gram architecture with negative sampling was used within the fastText framework.
    Download from http://smash.inf.ed.ac.uk/ud-embeddings/
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.586.pdf
    Edition LREC 2020
  • Name Urdu-News-Augmented-Dataset
    Resource type Corpus
    Size 5.04 MByte
    Languages English, Urdu (urd)
    Production status Newly created-finished
    Resource usage Document Classification, Text categorisation
    License <Not Specified>
    Conditions of use Freely Available
    Description This is a language resource which contains a dataset of 900 news articles originally in Urdu annotated as real or fake. Additionally, it contains 400 news articles as an augmentation dataset generated using the Google Translate MT system from English to Urdu, as well as a number of combinations of these datasets for exploration of the augmentation effect. The original English Fake News dataset is available from https://web.eecs.umich.edu/~mihalcea/downloads.html#FakeNews.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/66_res_1.zip [5.29 Mb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.309.pdf
    Edition LREC 2020
  • Name WFM mathematical definitions dataset
    Resource type Corpus
    Size <Not Specified>
    Languages English
    Production status Newly created-finished
    Resource usage Definition extraction
    License GNU GPL v3.0
    Conditions of use Freely Available
    Description The WFM dataset (Vanetik et al., 2019) was created by collecting and processing 2352 articles from Wolfram Mathworld (Weisstein and others, 2007). The dataset contains 1793 sentences, of which 811 are definitions and 982 are non-definitions. Sentences were extracted automatically and then manually separated into two categories: definitions and statements (non-definitions).
    Download from https://github.com/uplink007/FinalProject/tree/master/data/wolfram
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.256.pdf
    Edition LREC 2020
  • Name WikiFactCheck-English
    Resource type Corpus
    Size 124821 entries
    Languages English (eng)
    Production status Newly created-finished
    Resource usage Fact checking
    License <Not Specified>
    Conditions of use Freely Available
    Description WikiFactCheck-English is a dataset of 124k triples consisting of a claim, context and an evidence document extracted from English Wikipedia articles and citations, as well as 34k manually written claims that are refuted by the evidence documents. This is the largest fact checking dataset consisting of real claims and evidence to date; it will allow the development of fact checking systems that can better process claims and evidence in the real world.
    Download from https://github.com/WikiFactCheck-English
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.849.pdf
    Edition LREC 2020
  • Name word2word
    Resource type Lexicon
    Size 41 GByte
    Languages af,ar,bg,bn,br,bs,ca,cs,da,de,el,en,eo,es,et,eu,fa,fi,fr,gl,he,hi,hr,hu,hy,id,is,it,ja,ka,kk,ko,lt,lv,mk,ml,ms,nl,no,pl,pt,pt_br,ro,ru,si,sk,sl,sq,sr,sv,ta,te,th,tl,tr,uk,ur,vi,ze_en,ze_zh,zh_cn,zh_tw
    Production status Newly created-finished
    Resource usage Machine Translation, SpeechToSpeech Translation
    License GNU GPL v3.0
    Conditions of use Freely Available
    Description A collection of bilingual lexicons for 3,564 language pairs.
    Download from https://github.com/Kyubyong/word2word
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.371.pdf
    Edition LREC 2020
  • Name Yarega
    Resource type Corpus
    Size 92848 words
    Languages Telugu
    Production status Newly created-finished
    Resource usage Opinion Mining/Sentiment Analysis
    License <Not Specified>
    Conditions of use Freely Available
    Description The sentences in the dataset are in Telugu script. The annotated data for aspect term identification is in the form of word/tag. Annotated data for aspect polarity classification and aspect term categorisation is in JSON format. Each JSON file contains an array of JSON objects. Each object consists of two properties, "sentence" and "aspectTerms". The property "sentence" has the review sentence. The property "aspectTerms" has an array of JSON objects where each object has the properties "aspectTerm", "start", "end", "polarity" and "category". "aspectTerm" has the aspect term itself. "start" and "end" have the beginning and ending indexes of the aspect term. "polarity" and "category" are the annotated polarity and the annotated category of the aspect term, respectively. The dataset contains 5027 sentences and 92848 tokens. The total number of aspect terms in all those sentences is 7130. Of these, 3521 aspect terms are of positive polarity, 2480 are of negative polarity and 1129 are of neutral polarity.
    Download from http://lrec2020.lrec-conf.org/sharedlrs2020/706_res_1.zip [53 Kb]
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.617.pdf
    Edition LREC 2020
  • Name Yorùbá and Twi Corpus
    Resource type Corpus
    Size <Not Specified>
    Languages Twi, Yorùbá
    Production status Newly created-finished
    Resource usage Corpus Creation/Annotation
    License <Not Specified>
    Conditions of use Freely Available
    Description This is a collection of some noisy and fully diacritized Yorùbá texts from different sources. It also includes some Twi texts, as well as annotated Yorùbá texts for Named Entity Recognition tasks.
    Download from https://github.com/ajesujoba/YorubaTwi-Embedding
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.335.pdf
    Edition LREC 2020
  • Name Yorùbá and Twi FastText Embedding
    Resource type Corpus
    Size <Not Specified>
    Languages Twi, Yorùbá
    Production status Newly created-finished
    Resource usage Natural Language Processing Downstream tasks
    License <Not Specified>
    Conditions of use Freely Available
    Description Collection of FastText pre-trained embeddings for Yorùbá and Twi which can be used for downstream tasks in Natural Language Processing.
    Download from https://drive.google.com/drive/folders/1jkwLBkxJhnfVvf1yd7PyZw0nY8aNYaNN?usp=sharing
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.335.pdf
    Edition LREC 2020
  • Name Yorùbá and Twi WordSim-353 dataset
    Resource type Evaluation Data
    Size <Not Specified>
    Languages Twi, Yorùbá
    Production status Newly created-finished
    Resource usage Evaluation/Validation
    License <Not Specified>
    Conditions of use Freely Available
    Description This is the collection of the WordSim-353 word pairs dataset for Yorùbá and Twi. All the already existing WordSim-353 word pairs were translated from English to Yorùbá and Twi.
    Download from https://github.com/ajesujoba/YorubaTwi-Embedding
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.335.pdf
    Edition LREC 2020
  • Name Zamboanga Chabacano Spelling Errors Ontology
    Resource type Ontology
    Size <Not Specified>
    Languages Chavacano (cbk)
    Production status Newly created-finished
    Resource usage Corpus Creation/Annotation
    License <Not Specified>
    Conditions of use Freely Available
    Description A Zamboanga Chabacano spelling errors ontology in Web Ontology Language (OWL) format.
    Download from https://research.chavacano.org/cwzcc.owl
    Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.327.pdf
    Edition LREC 2020
Important dates
  • 31 October 2019: Submission of proposals for panels, workshops and tutorials
  • 2 December 2019: Submission of proposals for oral and poster papers
  • 13 February 2020: Notification of acceptance for oral and poster/demo papers
  • 13 March 2020: Final Submission of accepted oral and poster/demo papers
  • 13-14-15 May 2020: Main Conference
  • 11-12-16 May 2020: Workshops & Tutorials