LREC 2020 - Current List of LREC 2020 Shared LRs

LREC recognises the importance of sharing Language Resources (LRs) and making them available to the community. When submitting a paper, participants were offered the possibility to share their LRs (data, tools, web-services, etc.), uploading them in a special LREC repository set up by ELRA. This effort of sharing LRs, linked to the LRE Map initiative for their description, contributes to creating a common repository where everyone can deposit and share data.

After the conference, the Shared LRs set at LREC 2020 was manually checked and a cleaned version of the list of LRs is now available. The LRs in this list comply with the following criteria:

LRs that are accessible (whether downloadable directly or through an external URL)
LRs categorized as datasets only. A dataset can be any of the following:
- Corpus
- Evaluation Data
- Grammar/Language Model
- Language Resources/Technologies Infrastructure
- Lexical Conceptual Resource General Information
- Lexicon
- Ontology
- Terminology
- Treebank

Excluded LRs are:

Uploaded LRs whose content does not match the description
LRs with no download URL, or whose URL is now a dead link
LRs categorized as tools or guidelines
LRs associated with rejected papers

Name	A Chinese Corpus for Fine-grained Entity Typing
Resource type	Corpus
Size	650 MByte
Languages	Chinese, English
Production status	Newly created-finished
Resource usage	Named Entity Recognition
License	<Not Specified>
Conditions of use	Freely Available
Description	<Not Specified>
Download from	https://drive.google.com/file/d/1xorWUdTi9r43tTEdwJ4tKa9ErvRjossU/view
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.548.pdf
Edition	LREC 2020

Name	A corpus of precise natural entailment problems
Resource type	Corpus
Size	150 entailment problems
Languages	English
Production status	Newly created-finished
Resource usage	Textual Entailment and Paraphrasing
License	<Not Specified>
Conditions of use	Freely Available
Description	<Not Specified>
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/172_res_1.zip [31 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.844.pdf
Edition	LREC 2020

Name	A New Dataset for Sindhi Named Entity Recognition
Resource type	Corpus
Size	1358724 words
Languages	Sindhi
Production status	Newly created-finished
Resource usage	Named Entity Recognition
License	<Not Specified>
Conditions of use	Freely Available
Description	We present the first large named entity recognition dataset for the low-resource Sindhi language, with quality baselines. Our work consists of three main contributions: first, the annotation of a gold-standard SiNER (Sindhi Named Entity Recognition) dataset; second, quality baselines for named entity recognition using conditional random fields (CRF) and recent state-of-the-art deep bi-directional long short-term memory (Bi-LSTM) neural models; and third, a performance comparison of Bi-LSTM and Bi-LSTM-CRF models on GloVe and fastText word representations.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/1444_res_1.zip [8.88 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.361.pdf
Edition	LREC 2020

Name	A Seed Corpus of Hindu Temples In India
Resource type	Corpus
Size	1.7 MByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Information Extraction, Information Retrieval
License	<Not Specified>
Conditions of use	Freely Available
Description	This JSON file consists of 4933 high-accuracy facts about 573 temples. The file format is as follows: {"head_line": {"questions": [] }"templ1":{"context": "text","answers": [ [],[]]
Download from	https://github.com/priyaradhakrishnan0/templeKB/blob/master/corpus/WebTempleCorpus.json
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.32.pdf
Edition	LREC 2020

Name	AbuseEval
Resource type	Corpus
Size	14100
Languages	English
Production status	Newly created-finished
Resource usage	Corpus Creation/Annotation
License	CC BY-NC-SA 4.0
Conditions of use	Freely Available
Description	Extension of OLID/OffensEval data set with distinction of explicit vs implicit offensive messages. Annotation of Abusive Language, distinguishing also between explicit vs implicit offensive messages.
Download from	https://github.com/tommasoc80/AbuseEval
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.760.pdf
Edition	LREC 2020

Name	Amharic News Corpus
Resource type	Corpus
Size	16295955 tokens
Languages	Amharic (amh)
Production status	Newly created-finished
Resource usage	Knowledge Discovery/Representation
License	<Not Specified>
Conditions of use	Freely Available
Description	This corpus of Amharic news websites was built by using our own crawler on four Amharic news websites from a variety of genres. As a pre-process, we removed non-Amharic words and letters and replaced all digits with the `#` character. After pre-processing, the corpus contained 16,295,955 tokens consisting of 855,109 unique tokens with average frequency of 19.05; this is similar in size to the English `text8` benchmark (17 million tokens, with an average frequency of 66). This is the corpus used for training embeddings in the publication: Mersha, Amanuel and Wu, Stephen. Morphology-rich Alphasyllabary Embeddings. Proceedings of the 12th Edition of the Language Resources and Evaluation Conference (LREC). Marseilles, France. 2020.
Download from	https://github.com/leobitz/amharic_word_embedding
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.315.pdf
Edition	LREC 2020
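
The pre-processing described for the Amharic News Corpus above (removing non-Amharic words and letters, replacing digits with `#`) can be approximated with the minimal sketch below. It is not the authors' code. It assumes that "Amharic" characters are those in the Unicode Ethiopic block (U+1200 to U+137F) and that only ASCII digits are replaced. The function name `preprocess` is hypothetical.

```python
import re

# Assumption: ASCII digits only; each digit becomes one '#'.
_DIGITS = re.compile(r"[0-9]")
# Assumption: Amharic text is approximated by the Unicode Ethiopic block.
# Everything else (except '#' and whitespace) is dropped.
_NON_ETHIOPIC = re.compile(r"[^\u1200-\u137F#\s]")


def preprocess(text: str) -> str:
    """Replace digits with '#', remove non-Ethiopic characters, tidy spaces."""
    text = _DIGITS.sub("#", text)
    text = _NON_ETHIOPIC.sub(" ", text)
    # Collapse the runs of whitespace left behind by removed words.
    return " ".join(text.split())


print(preprocess("ሰላም hello 2020"))
```

Replacing removed characters with a space rather than deleting them keeps neighbouring Amharic tokens from being glued together.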

Name	Amharic Treebank
Resource type	Treebank
Size	1000 sentences
Languages	Amharic (amh)
Production status	<Not Specified>
Resource usage	<Not Specified>
License	<Not Specified>
Conditions of use	Freely Available
Description	The Amharic treebank (ATT) version 1 contains 1,074 manually-annotated sentences (5,245 tokens or 10,010 words). The sentences were collected from grammar books, biographies, news, and fictional and religious texts. The researchers made an effort to include different types of sentences.
Download from	https://github.com/Binyamephrem/Amharic-treebank
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.238.pdf
Edition	RAIL 2020

Name	Anföranden
Resource type	Corpus
Size	121987537 tokens
Languages	Swedish
Production status	Available
Resource usage	<Not Specified>
License	CC-BY 4.0
Conditions of use	Freely Available
Description	This resource contains all Swedish parliamentary debates from autumn 1993 to spring 2018. They have been cleaned and linguistically annotated through the corpus pipeline of Språkbanken Text, University of Gothenburg.
Download from	https://spraakbanken.gu.se/en/resources/rd-anf
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/workshops/ParlaCLARIN2/pdf/2020.parlaclarin-1.2.pdf
Edition	ParlaCLARIN2 2020

Name	Annotated Corpora for Term Extraction Research (ACTER)
Resource type	Corpus
Size	19002 terms and named entities
Languages	Dutch, English, French
Production status	Version 1.2 completed and validated
Resource usage	Training and evaluation data for automatic term extraction (ATE)
License	CC-BY-NC-SA 4.0
Conditions of use	Freely Available under appropriate license
Description	The Annotated Corpora for Term Extraction Research (ACTER), version 1.2, contain texts in four domains (corruption, dressage (horse riding), heart failure, and wind energy) and three languages (English, French, Dutch). For each corpus (combination of domain & language), around 50k tokens have been manually annotated to identify terminology and named entities (almost 600k annotated tokens in total). The results are presented as lists of annotations per corpus, with one (lowercased, unlemmatised, unique) annotation per line, tab-separated by its label. In total, there are 19k unique annotations. The annotation process is transparent and well-documented, with freely available guidelines (http://hdl.handle.net/1854/LU-8503113) and several published papers for the validation of the dataset. It has also been used for the TermEval 2020 shared task on automatic term extraction, organised at the CompuTerm workshop at LREC 2020.
Download from	https://bit.ly/3lNHdr2
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/workshops/COMPUTERM2020/pdf/2020.computerm-1.12.pdf
Edition	COMPUTERM 2020
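
The ACTER description above says that each annotation list has one annotation per line, tab-separated from its label. A minimal sketch of reading such a list might look like this. The function name `parse_annotations`, the sample terms, and the label names are invented for illustration and are not taken from the dataset.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def parse_annotations(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse 'annotation<TAB>label' lines into (annotation, label) pairs."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the first tab; multi-word terms contain spaces, not tabs.
        term, _, label = line.partition("\t")
        pairs.append((term, label))
    return pairs


# Hypothetical sample lines; the real label inventory may differ.
sample = [
    "heart failure\tTerm\n",
    "ejection fraction\tTerm\n",
    "new york heart association\tNamed_Entity\n",
]
pairs = parse_annotations(sample)
label_counts = Counter(label for _, label in pairs)
print(label_counts)
```

To read a real file, pass an open file handle instead of `sample`, for example `parse_annotations(open(path, encoding="utf-8"))`, where `path` points at one of the per-corpus annotation lists.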

Name	Anomalous Lexical Units
Resource type	Lexicon
Size	300 entries
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Lexicon Creation/Annotation
License	<Not Specified>
Conditions of use	Freely Available
Description	The resource contains 300 WordNet synsets that cannot be characterized by any existing semantic frame in Berkeley FrameNet data release 1.7.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/611_res_1.tsv [20 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.431.pdf
Edition	LREC 2020

Name	AR-ASAG Dataset
Resource type	Corpus
Size	4.46 MByte
Languages	Algerian Arabic (arq)
Production status	Newly created-finished
Resource usage	Evaluation/Validation
License	<Not Specified>
Conditions of use	Freely Available
Description	The Dataset consists of reported evaluations related to automatic short answer grading in the Arabic language. The exams were conducted under natural conditions of evaluation. Each test consists of 16 short answer questions (a total of 48 questions). Students submitted answers to these questions. The number of answers obtained differs from one question to another. The Dataset includes a total of 2133 pairs (model answer, student answer). For each pair, two grades (Mark1 and Mark2) are associated with a manual Average Gold Score. Both manual grades are available in the Dataset. The inter-annotator agreement is Pearson = 0.8384, with a Root Mean Square Error (RMSE) of 0.8381. The Dataset encompasses 5 types of questions: "Define", "Explain", "What consequences?", "Justify", and "What is the difference?". The AR-ASAG Dataset is available in different versions: TXT, XML, XML-MOODLE and Database (.DB). The .DB format allows the necessary exports to be made according to specific analysis needs. The XML-MOODLE format is used on Moodle e-learning platforms.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/345_res_1.rar [751.75 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.321.pdf
Edition	LREC 2020

Name	Arabic In-Domain Cyber Text Corpus
Resource type	Corpus
Size	22.9 MByte
Languages	Algerian Arabic (arq)
Production status	Newly created-finished
Resource usage	Corpus Creation/Annotation
License	<Not Specified>
Conditions of use	Freely Available
Description	This is an Arabic text corpus covering the specific field of cybercrimes. Documents: 1273. Words: 2009110. The corpus was used to explore the Automatic Short Answer Grading (ASAG) field for the Arabic language. The corpus was automatically obtained from texts extracted from a collection of URLs according to a list of key terms. Key terms are combined and submitted as queries to a search engine, which returns a list of potentially relevant URLs. The URLs are then inspected and validated. Relevant web pages are retrieved and automatically cleaned of HTML tags. The text is extracted and added to the corpus.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/345_res_2.rar [6.15 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.321.pdf
Edition	LREC 2020

Name	ArSpecEmb
Resource type	Corpus
Size	3196k words
Languages	Arabic (ara)
Production status	Newly created-finished
Resource usage	Opinion Mining/Sentiment Analysis
License	OpenSource
Conditions of use	Freely Available
Description	ArSpecEmb regroups embeddings that take into account specificity of Arabic language. They are trained on different types of corpora.
Download from	https://lium.univ-lemans.fr/en/arsentimentanalysis/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.610.pdf
Edition	LREC 2020

Name	Austrian named entities
Resource type	Lexical Conceptual Resource General Information
Size	15457 named entities
Languages	German
Production status	In use
Resource usage	Machine Translation
License	CC-BY-SA 3.0
Conditions of use	Freely Available
Description	Named entities of Austrian political parties, ministries of the federal administration, names of the federal ministers, name of the president, names of state secretaries, daily newspapers, Austrian common last names, Austrian common first names (male, female), Austrian geographical names (localities, municipalities, states, political districts), names of Austrian stock companies. The dataset "Austrian (German) named entities" is a compilation by University of Vienna of several datasets. Common names and Geographical names data originate from Statistics Austria. Names of Austrian stock companies originate from Wikimedia. Public domain information -- Names of President, State Secretaries and Ministers and Newspaper names are compiled by University of Vienna.
Download from	https://www.elrc-share.eu/repository/browse/austrian-named-entities/b0998b12ab9611e8b7d400155d02670612bad73492934202887a45e227312e0e
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/workshops/LT4Gov/pdf/2020.lt4gov-1.5.pdf
Edition	LT4Gov 2020

Name	BERT-Base, Multilingual
Resource type	Grammar/Language Model
Size	<Not Specified>
Languages	104 languages
Production status	Existing-used
Resource usage	Textual Entailment and Paraphrasing
License	<Not Specified>
Conditions of use	Freely Available
Description	This is a release of 24 smaller BERT models (English only, uncased, trained with WordPiece masking) referenced in "Well-Read Students Learn Better: On the Importance of Pre-training Compact Models" (https://arxiv.org/abs/1908.08962).
Download from	https://github.com/google-research/bert
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.669.pdf
Edition	LREC 2020

Name	BERT-based Abstractive Summarization
Resource type	Grammar/Language Model
Size	<Not Specified>
Languages	<Not Specified>
Production status	Newly created-finished
Resource usage	Summarisation
License	MIT License
Conditions of use	<Not Specified>
Description	Summarization Library.
Download from	https://github.com/axenov/BERT-Summ-OpenNMT
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.825.pdf
Edition	LREC 2020

Name	Biblical Hebrew Graph Database
Resource type	Lexicon
Size	10 MByte
Languages	Biblical Hebrew
Production status	In production
Resource usage	<Not Specified>
License	<Not Specified>
Conditions of use	<Not Specified>
Description	A thesaurus of Biblical Hebrew, with headwords linked by semantic, phonetic, and distributional similarities.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/16_res_1.tgz [734 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.660.pdf
Edition	LT4HALA 2020

Name	Bilingual Corpus of English-Arabic Parallel Tweets
Resource type	Corpus
Size	3 MByte
Languages	Arabic, English
Production status	Complete
Resource usage	For Research
License	Apache 2.0
Conditions of use	Publicly Available
Description	We provide tweet IDs of 166,706 pairs of English-Arabic parallel tweets. We share a list of 1389 Twitter accounts that post English-Arabic parallel tweets. We also share a subset of Twitter accounts annotated for their countries and the topics they most frequently tweet about.
Download from	http://bit.ly/2xApE8V
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/workshops/BUCC2020/pdf/2020.bucc-1.3.pdf
Edition	BUCC 2020

Name	Bilingual troll tweets
Resource type	Corpus
Size	2.3 MByte
Languages	English, Russian (rus)
Production status	Newly created-finished
Resource usage	Text Mining
License	<Not Specified>
Conditions of use	Freely Available
Description	A bilingual dataset with troll tweets. The dataset we built contains tweets from troll and legitimate accounts in two languages (including translation from one language to another) and can be used for training models for automatic detection of troll tweets/accounts in either English, Russian, or both languages.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/1055_res_1.zip [2.32 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.766.pdf
Edition	LREC 2020

Name	Broad-Coverage German Sentiment Classification for Dialog Systems
Resource type	Corpus
Size	5355043 entries
Languages	German
Production status	Newly created and combined with existing
Resource usage	Document Classification, Text categorisation
License	MIT License
Conditions of use	Freely Available
Description	This paper describes the training of a general-purpose German sentiment classification model. Sentiment classification is an important aspect of general text analytics. Furthermore, it plays a vital role in dialogue systems and voice interfaces that depend on the ability of the system to pick up and understand emotional signals from user utterances. The presented study outlines how we have collected a new German sentiment corpus and then combined this corpus with existing resources to train a broad-coverage German sentiment model. The resulting data set contains 5.4 million labelled samples. We have used the data to train both a simple convolutional and a transformer-based classification model, and compared the results achieved on various training configurations. The model and the data set will be published along with this paper.
Download from	https://github.com/oliverguhr/german-sentiment
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.202.pdf
Edition	LREC 2020

Name	CA-EHN
Resource type	Corpus
Size	90505 analogies
Languages	Chinese
Production status	Newly created-finished
Resource usage	Evaluation/Validation
License	CC-BY-NC-SA 4.0
Conditions of use	Freely Available
Description	CA-EHN is a commonsense word analogy dataset containing 90,505 analogies covering 5,656 words and 763 relations. Word analogy datasets are typically used to evaluate word embeddings.
Download from	https://github.com/ckiplab/CA-EHN
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.365.pdf
Edition	LREC 2020

Name	Cairo Student Code-Switch Corpus
Resource type	Corpus
Size	505 KByte
Languages	Egyptian Arabic (arz), English (eng)
Production status	Newly created-finished
Resource usage	Morphological Analysis
License	OpenSource
Conditions of use	From Owner
Description	A corpus of Egyptian Arabic code-switch speech data that is fully tokenized, lemmatized and annotated for part-of-speech tags.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/1030_res_1.txt [517 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.489.pdf
Edition	LREC 2020

Name	CantoMap
Resource type	Corpus
Size	768 minutes
Languages	Yue Chinese (yue)
Production status	Newly created-finished
Resource usage	Corpus Creation/Annotation
License	GNU GPL v3.0
Conditions of use	Freely Available
Description	The Cantonese MapTask corpus is a collection of recordings of the MapTask task in contemporary Hong Kong Cantonese. The corpus contains the following: - Recordings of the sessions of the MapTask are in the folder "ConversationData". Each sub-folder contains audio files and corresponding ELAN files that contain the transcription of the audio in Chinese characters (both non-segmented and manually segmented) and jyutping romanization. - Recordings of the reading task where participants read words aloud from a list are in the folder "ReadingData". - The maps used for the task are in the folder "Maps". In total, the corpus offers 12 hours and 48 minutes of recordings.
Download from	https://github.com/gwinterstein/CantoMap
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.355.pdf
Edition	LREC 2020

Name	CEASE
Resource type	Corpus
Size	66.3 KByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Emotion Recognition/Generation
License	<Not Specified>
Conditions of use	From Owner
Description	The dataset is present in the 'data' folder which contains 15 '.txt' files, one for each emotion class: Emotion Classes considered are: Abuse, Anger, Blame, Fear, Forgiveness, Guilt, Hopefulness, Hopelessness, Happiness_Peacefulness, Information, Instruction, Love, Pride, Sorrow, Thankfulness. Each file contains pre-processed sentences from real-life suicide notes. The full corpus contains 2393 annotated instances from 15 emotion classes. Note: To anonymize any real-life reference of persons, address and organization in this dataset, the authors have used three tags: NAME in place of any person's name ADDRESS in place of any address of a person or organization ORGANIZATION in place of any name of an industry, organization, firm, etc. The authors have taken the effort to anonymize any real-life references in the dataset, but still, if any reference is missed, it is advised to anyone using this dataset to anonymize them using the above mentioned three tags before using it in their work.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/907_res_1.zip [7 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.201.pdf
Edition	LREC 2020

Name	CETE codes and datasets
Resource type	Evaluation Data
Size	68.7 MByte
Languages	English
Production status	Existing-used
Resource usage	Question Answering
License	<Not Specified>
Conditions of use	Freely Available
Description	We share the source code of our models as well as the pre-processed datasets (TREC-QA, WikiQA, SemEval2015-CQA, SemEval2016-CQA, SemEval2017-CQA). Our fine-tuning model is implemented based on this repository: https://github.com/huggingface/transformers and the feature-based model is implemented based on this repository: https://github.com/AnubhavGupta3377/Text-Classification-Models-Pytorch. The datasets are already publicly available. However, as publicly shared question answering datasets are often available only in unstructured text format, the pre-processed versions are shared so that they can be easily adapted for future research and reproduction of results. Each instance in the dataset is separated by " ".
Download from	https://drive.google.com/file/d/1iCm515ZhEN-UmdOR_ievK3EA8iuXacSz/view?usp=sharing
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.676.pdf
Edition	LREC 2020

Name	Chinese Humor Corpus
Resource type	Corpus
Size	12.7 MByte
Languages	Chinese (zho)
Production status	Newly created-finished
Resource usage	Document Classification, Text categorisation
License	MIT License
Conditions of use	Freely Available
Description	This Chinese humor corpus contains 3,365 jokes collected from over 40 sources. Each joke was labeled with five levels of funniness, eight skill sets of humor, and six dimensions of intent.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/632_res_1.zip [5.23 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.168.pdf
Edition	LREC 2020

Name	Chinese PragBank
Resource type	Corpus
Size	167 KByte
Languages	Mandarin Chinese (cmn)
Production status	Newly created-finished
Resource usage	Corpus Creation/Annotation
License	OpenSource
Conditions of use	From Owner
Description	This corpus collects Chinese readers' veridicality judgments of news events, namely, whether readers believe an event is likely to happen or not. A total of 968 event sentences are included and each sentence is annotated with 6 veridicality judgments. There are 4 columns in the corpus: news sentences, normalized sentences, target events, and veridicality judgments.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/869_res_1.txt [171 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.33.pdf
Edition	LREC 2020

Name	Cifu
Resource type	Lexicon
Size	51798 entries
Languages	Yue Chinese (yue)
Production status	Newly created-finished
Resource usage	Psycholinguistics
License	GNU GPL v3.0
Conditions of use	Freely Available
Description	This resource offers a lexical database for Hong Kong Cantonese. Each entry in the lexicon includes the following information: - the term in Chinese characters - a Jyutping romanization of the term - the frequency of the term (occurrence counts and per million words) in four different genres: Written, Spoken adult, Spoken child and Child directed speech - a count of the strokes in each character in the term (as a measure of the character complexity) - 6 measures of the Neighborhood Density of each entry (distinguishing whether insertion/deletion are allowed when searching for neighbors and the genre from which frequencies are retrieved to calculate the Neighborhood Density: Written, Spoken Adult or their average).
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/508_res_1.txt [7.86 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.375.pdf
Edition	LREC 2020

Name	Ciron: a New Benchmark Dataset for Chinese Irony Detection
Resource type	Corpus
Size	1 MByte
Languages	Chinese (zho)
Production status	Newly created-finished
Resource usage	Document Classification, Text categorisation
License	<Not Specified>
Conditions of use	Freely Available
Description	Ciron includes more than 8.7k posts, collected from Weibo, a microblogging platform. Ciron is one scale larger in size than the previously available dataset. Ciron is collected with no pre-conditions to ensure much wider coverage. There are five fine-grained classes for ironic ratings: 1 (not ironic), 2 (unlikely ironic), 3 (insufficient evidence), 4 (weakly ironic), 5 (strongly ironic). Classes 2 and 4 are introduced to allow for fine-grained distinctions in the judgement. The inter-rater agreement (Fleiss' Kappa) is 0.470. Evaluation of four different deep learning classifiers proves the usefulness of Ciron as an important resource for Chinese irony detection.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/105_res_1.zip [1.05 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.701.pdf
Edition	LREC 2020

Name	cldf-datasets/normansinitic: Structural and lexical data for the paper by Norman (2013) on Chinese dialect classification
Resource type	Lexicon
Size	400 lexemes
Languages	Chinese (zho), Chinese dialects
Production status	Newly created-finished
Resource usage	Lexicon Creation/Annotation
License	CC-BY 4.0
Conditions of use	Freely Available
Description	See documentation (and main text).
Download from	https://doi.org/10.5281/zenodo.3552559
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.864.pdf
Edition	LREC 2020

Name	Collaborative Dialogue in Minecraft
Resource type	Corpus
Size	32 GByte
Languages	English (eng)
Production status	Existing-updated
Resource usage	Knowledge Discovery/Representation
License	CC-BY-SA 4.0
Conditions of use	Freely Available
Description	Collaborative Dialogue in Minecraft: the annotations were added for the corpus sentences.
Download from	https://drive.google.com/drive/folders/16lDzswcQh8DR2jkQJdoVTK-RyVDFPHKa
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.717.pdf
Edition	LREC 2020

Name	ConceptNet 5
Resource type	Ontology
Size	34 million concepts
Languages	English, French, German (deu), Italian, Spanish, Russian, Portuguese, Japanese, Dutch, Chinese and many others
Production status	Existing-updated
Resource usage	Semantic Web
License	CC-BY-SA 4.0
Conditions of use	Freely Available
Description	ConceptNet is a freely-available semantic network, designed to help computers understand the meanings of words that people use. ConceptNet originated from the crowdsourcing project Open Mind Common Sense, which was launched in 1999 at the MIT Media Lab. It has since grown to include knowledge from other crowdsourced resources, expert-created resources, and games with a purpose.
Download from	http://conceptnet.io/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.38.pdf
Edition	LREC 2020

Name	Corpora of Portuguese Humour
Resource type	Corpus
Size	<Not Specified>
Languages	Portuguese
Production status	Newly created-finished
Resource usage	Evaluation/Validation
License	<Not Specified>
Conditions of use	Freely Available
Description	Corpora balanced between two styles of humour (one-liners, headlines) and four styles of non-humour (general-knowledge questions, why questions, news headlines, proverbs), in Portuguese, useful for text classification / humour recognition.
Download from	https://github.com/andreclemencio/Recognizing-Humor-in-Portuguese/tree/master/Datasets/Balanceados
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.160.pdf
Edition	LREC 2020

Name	Corpus of Chinese Dynastic Histories
Resource type	Corpus
Size	23000000 tokens
Languages	Literary Chinese (lzh)
Production status	Newly created-finished
Resource usage	Corpus Creation/Annotation
License	CC-BY 4.0
Conditions of use	Freely Available
Description	Newly created, on the basis of Wikisource, Corpus of Chinese Dynastic Histories currently includes twenty-four histories.
Download from	https://osf.io/tp729/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.98.pdf
Edition	LREC 2020

Name	Corpus of Polish Spatial Texts 2.0 (PST 2.0)
Resource type	Corpus
Size	61315 tokens
Languages	Polish (pol)
Production status	Existing-updated
Resource usage	Information Extraction, Information Retrieval
License	CC-BY-SA 4.0
Conditions of use	Freely Available
Description	The extended version of the Polish Spatial Text corpus. Texts derived from Polish travel blogs, manually annotated with spatial expressions. A spatial expression is a text fragment which describes the location of two or more physical objects relative to each other.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/934_res_1.zip [973 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.265.pdf
Edition	LREC 2020

Name	Corpus REDEWIEDERGABE (Core corpus)
Resource type	Corpus
Size	489608 tokens
Languages	German (deu)
Production status	Newly created-finished
Resource usage	Machine Learning
License	CC BY-NC-SA 4.0
Conditions of use	Freely Available
Description	Corpus REDEWIEDERGABE is a German-language historical corpus with detailed annotations for speech, thought and writing representation (ST&WR). It consists of narrative texts as well as newspaper and magazine texts published between 1840 and 1920 and has approximately 490,000 tokens. The textual material is balanced over decades and between fictional and non-fictional texts. The annotation system is rooted in narratological theory and specifies the type of ST&WR (direct, free indirect, indirect, reported) and the medium (speech, thought, writing) as well as several more specialized attributes. In addition, the framing phrase, the key word that indicates the speech, thought or writing act, and the source (speaker/author) of the ST&WR are annotated. The annotation is a consensus annotation, based on two independent primary annotations. The corpus is available in a TEI-compliant XML format, UIMA XMI format, and a column-based plain text format (UTF8, CONLL style).
Download from	https://github.com/redewiedergabe/corpus/tree/master/data/main
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.100.pdf
Edition	LREC 2020

Name	Costra 1.0
Resource type	Corpus
Size	4262 sentences
Languages	Czech (ces)
Production status	Newly created-finished
Resource usage	Evaluation/Validation
License	CC-BY 4.0
Conditions of use	Freely Available
Description	COSTRA 1.0, a dataset of complex sentence transformations, is intended for the study of sentence-level embeddings beyond simple word alternations or standard paraphrasing. This first version of the dataset is limited to sentences in Czech.
Download from	http://hdl.handle.net/11234/1-3123
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.434.pdf
Edition	LREC 2020

Name	DaNewsroom
Resource type	Corpus
Size	1.1M entries
Languages	Danish (dan)
Production status	Newly created-finished
Resource usage	Summarisation
License	OpenSource
Conditions of use	Freely Available
Description	Large-scale automatic summarisation dataset, comprised of articles and their summaries.
Download from	https://github.com/danielvarab/da-newsroom
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.831.pdf
Edition	LREC 2020

Name	DEbateNet-mig15
Resource type	Corpus
Size	556185 tokens
Languages	German
Production status	Existing-updated
Resource usage	Political Science
License	CLARIN ACA+BY+NORED (EULA)
Conditions of use	Freely Available
Description	Corpus of manually annotated articles for political science research. Source: die Tageszeitung (reference German newspaper). Year: 2015. Multi-level annotation: claim identification and categorization into 97 categories (annotation schema available at https://github.com/mardy-spp/mardy_acl2019/blob/master/codebook.pdf); actor identification; claim attribution (linking of claims to actors); polarity; claim date. Additionally, an R package is released which can be used to explore the dataset and replicate the analysis of it.
Download from	http://hdl.handle.net/11022/1007-0000-0007-DB07-B
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.115.pdf
Edition	LREC 2020

Name	Developing an Arabic Infectious Disease Ontology to Include Non-Standard Terminology
Resource type	Ontology
Size	247 entries
Languages	Arabic (ara)
Production status	Newly created-finished
Resource usage	Text Mining
License	<Not Specified>
Conditions of use	Freely Available
Description	<Not Specified>
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/138_res_1.txt [241 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.596.pdf
Edition	LREC 2020

Name	DIP-SumEval
Resource type	Corpus
Size	49 topics (x 10 summaries and 11 dimensions)
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Summarisation
License	CC-BY 4.0
Conditions of use	Freely Available, Attribution Required
Description	This repository contains the summaries and evaluations from the paper 'A Dataset for the Analysis of Text Quality Dimensions in Summarization Evaluation' presented at LREC 2020. In total, the data set contains over 400 automatically generated summaries for 49 topics of a data set for multi-document summarization, 1274 judgements according to 11 text and summary quality criteria on a Likert scale (1 to 5) performed by 26 trained annotators, and 43218 pairwise judgements according to 6 criteria performed by 64 crowd-workers. If you want to use the data, please cite the paper http://www.lrec-conf.org/proceedings/lrec2020/bib/2020.lrec-1.826.bib.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/915_res_1.zip [531 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.826.pdf
Edition	LREC 2020

Name	DiscSense
Resource type	Lexicon
Size	NULL MByte
Languages	English
Production status	Newly created-finished
Resource usage	Discourse
License	<Not Specified>
Conditions of use	Freely Available
Description	DiscSense was built automatically: we finetuned a BERT model for discourse marker prediction between sentences (using the Discovery dataset) and used the marker prediction between sentences of existing datasets.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/33_res_1.tsv [289 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.125.pdf
Edition	LREC 2020

Name	Domain-Specific Dataset of Difficulty Ratings for German Noun Compounds
Resource type	Terminology
Size	1030 lexemes
Languages	German (deu)
Production status	Newly created-finished
Resource usage	Difficulty of domain-specific German closed compounds
License	CC-BY-SA 4.0
Conditions of use	Freely Available
Description	Dataset with difficulty ratings for 1,030 German closed noun compounds extracted from domain-specific texts for do-it-yourself (DIY), cooking and automotive. It includes two-part compounds for cooking and DIY, and two- to four-part compounds for automotive.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/137_res_1.zip [13 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.537.pdf
Edition	LREC 2020

Name	dutch-archaeo-NER-dataset
Resource type	Corpus
Size	8.4 MByte
Languages	Dutch (nld)
Production status	Newly created-finished
Resource usage	Named Entity Recognition
License	GNU GPL v3.0
Conditions of use	Freely Available
Description	A manually annotated NER dataset, consisting of Dutch archaeological excavation reports. The following entity types are labelled: artefacts, time periods, materials, places (geographical locations), archaeological contexts and species. The dataset is provided in the BIO format, with one token per line and empty lines denoting sentence boundaries. Each line contains the token, PoS tag, morphological segmentation and finally the label, separated by spaces. The PoS tag and morphological segmentation are assigned by Frog. This work was funded by a grant provided by the Leiden University Centre for Digital Humanities.
Download from	https://doi.org/10.5281/zenodo.3544543
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.562.pdf
Edition	LREC 2020
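
The line layout described for dutch-archaeo-NER-dataset (token, PoS tag, morphological segmentation and label separated by spaces, with empty lines between sentences) could be read along the following lines. This is only a sketch: the sample rows, the tag strings and the label names such as B-ART are invented for illustration and are not taken from the dataset, and the reader assumes exactly four space-separated fields per line.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    form: str   # surface token
    pos: str    # PoS tag (assigned by Frog in the real data)
    morph: str  # morphological segmentation
    label: str  # BIO label


# Hypothetical sample; tags and labels are illustrative only.
SAMPLE = """\
bijl N(soort,ev) [bijl] B-ART
uit VZ(init) [uit] O
brons N(soort,ev) [brons] B-MAT

De LID(bep) [de] O
kuil N(soort,ev) [kuil] B-CON
"""


def read_bio(text: str) -> List[List[Token]]:
    """Split BIO-formatted text into sentences of Token tuples."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            # An empty line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split(" ")
        if len(parts) != 4:
            raise ValueError(f"line {line_no}: expected 4 fields, got {len(parts)}")
        current.append(Token(*parts))
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    for sentence in read_bio(SAMPLE):
        print([(tok.form, tok.label) for tok in sentence])
```

If the real files contain fields with embedded spaces, the strict four-field check would need to be relaxed.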

Name	ElkarHizketak v1.0
Resource type	Corpus
Size	1634 questions
Languages	Basque (eus)
Production status	Newly created-finished
Resource usage	Dialogue
License	CC-BY-SA 4.0
Conditions of use	Freely Available
Description	ElkarHizketak is a Conversational Question Answering (CQA) dataset in Basque created by Basque-speaking volunteers. The dataset contains close to 400 dialogues and more than 1600 questions and answers, and its small size presents a realistic low-resource scenario for CQA systems. The dataset is built on top of Wikipedia sections about popular people and organizations. The dialogues involve two crowd workers: (1) a student asks questions after reading a small introduction about the person, but without seeing the section text; and (2) a teacher answers the questions by selecting a span of text from the section.
Download from	http://ixa2.si.ehu.es/convai/elkarhizketak-v1.0.zip
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.55.pdf
Edition	LREC 2020

Name	English WordNet Random Walk Pseudo-Corpora
Resource type	Corpus
Size	81 corpora
Languages	English
Production status	Newly created-finished
Resource usage	Language Modelling
License	CC BY-NC-SA 4.0
Conditions of use	Freely Available
Description	The archive contains a collection of 81 pseudo-corpora (9.9GB). These are text files that contain pseudo-sentences generated artificially from a random walk over the English WordNet taxonomy. The random walk algorithm produces a pseudo-sentence from WordNet by randomly picking a node (SynSet) in WordNet, randomly choosing a word in the SynSet, and then randomly picking a connected node and repeating the process. At every step there is a 15% chance for the random walk to stop; it also stops if it has no more connected nodes to take. Once the walk stops, a sentence is generated, and the same process repeats for each new sentence. Each line in the generated file represents one pseudo-sentence, where words are delimited by spaces. Example sentences: (1) "measure musical notation tonality minor mode"; (2) "Dutch-processed cocoa powder chocolate milk". The corpus files differ in size, as well as in some parameters that were used to generate them. The parameters are: size (number of sentences/lines in the corpus); direction (the direction that the random walk over WordNet was allowed to go while generating sentences; possibilities are up/down/both); and minimal sentence length (the shortest sentence length, in number of words).
Download from	https://doi.org/10.21427/he55-6481
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.602.pdf
Edition	LREC 2020
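
The random walk described for the English WordNet Random Walk Pseudo-Corpora can be illustrated with a small sketch. It does not use WordNet: the taxonomy below is a hypothetical toy graph walked in the "down" direction only, and the way the minimal sentence length is enforced (discarding shorter walks) is an assumption, since the description does not say how that parameter is applied. Only the 15% stop probability and the stop-at-dead-end rule come from the description.

```python
import random
from typing import Dict, List

# Hypothetical toy taxonomy (not WordNet): node -> words, node -> child nodes.
TOY_WORDS: Dict[str, List[str]] = {
    "food": ["food", "nutrient"],
    "beverage": ["beverage", "drink"],
    "cocoa": ["cocoa", "chocolate"],
    "milk": ["milk"],
}
TOY_EDGES: Dict[str, List[str]] = {
    "food": ["beverage"],
    "beverage": ["cocoa", "milk"],
    "cocoa": [],
    "milk": [],
}


def random_walk_sentence(words, edges, rng, stop_prob=0.15):
    """Generate one pseudo-sentence by walking the graph from a random node."""
    node = rng.choice(sorted(words))
    sentence = []
    while True:
        # Emit one randomly chosen word of the current node.
        sentence.append(rng.choice(words[node]))
        neighbours = edges[node]
        # Stop at a dead end, or with probability stop_prob at every step.
        if not neighbours or rng.random() < stop_prob:
            break
        node = rng.choice(neighbours)
    return " ".join(sentence)


def generate_corpus(n_sentences, min_length=1, seed=0):
    """Collect n_sentences walks; shorter walks are discarded (an assumption)."""
    rng = random.Random(seed)
    lines = []
    while len(lines) < n_sentences:
        sentence = random_walk_sentence(TOY_WORDS, TOY_EDGES, rng)
        if len(sentence.split()) >= min_length:
            lines.append(sentence)
    return lines


if __name__ == "__main__":
    # One pseudo-sentence per line, words delimited by spaces.
    print("\n".join(generate_corpus(5, min_length=2, seed=1)))
```

Fixing the seed makes a run repeatable, which helps when comparing corpora generated with different size or length settings.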

Name	English-Hiligaynon
Resource type	Lexicon
Size	1200 entries
Languages	English (eng), Hiligaynon
Production status	Newly created-finished
Resource usage	Machine Translation, SpeechToSpeech Translation
License	<Not Specified>
Conditions of use	From Owner
Description	Word-to-word translation of 1,200 English words into Hiligaynon.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/163_res_1.txt [20 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.313.pdf
Edition	LREC 2020

Name	Evaluation Benchmark for Domain Taxonomies from Knowledge Graphs (EBDT-KG)
Resource type	Evaluation Data
Size	155 MByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Knowledge Discovery/Representation
License	CC-BY 4.0
Conditions of use	Freely Available
Description	An application-specific gold standard dataset extracted from the Wikipedia knowledge graph and an evaluation framework to comparatively assess the quality of noisy automatically extracted taxonomies.
Download from	https://sites.google.com/unitelmasapienza.it/wikipediataxonomies/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.285.pdf
Edition	LREC 2020

Name	Event it Pronouns
Resource type	Corpus
Size	8 MByte
Languages	English (eng), Estonian (ekk), Finnish (fin), German (deu), Spanish (spa), French (fra), Hungarian (hun), Italian (ita), Latvian (lav), Dutch (nld), Polish (pol), Portuguese (por), Romanian (ron), Slovak (slk), Slovenian (slv), Swedish (swe)
Production status	Newly created-finished
Resource usage	Anaphora, Coreference
License	OpenSource
Conditions of use	Freely Available
Description	This is a multilingual corpus of English 'it' pronouns annotated with their translations in the Europarl languages. The English 'it' pronouns have also been automatically labeled as 'event'.
Download from	https://github.com/sharidloaiciga/multilingual-event-pronouns
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.12.pdf
Edition	LREC 2020

Name	Fake News Filipino Dataset
Resource type	Corpus
Size	3206 entries
Languages	English (eng), Filipino (fil)
Production status	Newly created-finished
Resource usage	Document Classification, Text categorisation
License	GNU GPL v3.0
Conditions of use	Freely Available
Description	The first curated fake news detection dataset in Filipino. The dataset has 3206 samples with a perfect 50/50 split between real and fake news articles. Fake news articles are from sites tagged as fake by VeraFiles (an independent news organization) and the National Union of Journalists in the Philippines (NUJP). Real news articles were sourced from mainstream news sites in the Philippines. Data collection was performed across five months, as content and websites that spread fake news are usually taken down or go offline quickly, which makes text-based fake news data in Filipino scarce.
Download from	https://github.com/jcblaisecruz02/Tagalog-fake-news
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.316.pdf
Edition	LREC 2020

Name	Fake.Br Corpus
Resource type	Corpus
Size	22.4 MByte
Languages	Portuguese (por)
Production status	Existing-used
Resource usage	Evaluation/Validation
License	<Not Specified>
Conditions of use	Freely Available, Attribution Required
Description	The Fake.Br Corpus is composed of true and fake news items that were manually aligned, focusing only on Brazilian Portuguese. There are 7,200 news items: exactly 3,600 true and 3,600 fake. All of them are in plain text format, with each one in a different file.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/939_res_1.zip [23.45 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.176.pdf
Edition	LREC 2020

Name	Fakeddit
Resource type	Corpus
Size	1063106 entries
Languages	English
Production status	Newly created-finished
Resource usage	Evaluation/Validation
License	<Not Specified>
Conditions of use	Freely Available
Description	We present Fakeddit, a novel multimodal dataset consisting of over 1 million samples from multiple categories of fake news. After being processed through several stages of review, the samples are labeled according to 2-way, 3-way, and 6-way classification categories through distant supervision.
Download from	https://github.com/entitize/fakeddit
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.755.pdf
Edition	LREC 2020

Name	FalseFriendsLexicon
Resource type	Lexicon
Size	900 KByte
Languages	French (fra), Italian (ita), Portuguese (por), Romanian (ron), Spanish (spa), English (eng)
Production status	Newly created-finished
Resource usage	Lexicon Creation/Annotation
License	CC-BY 4.0
Conditions of use	Freely Available
Description	This lexicon of false-friend pairs in 6 languages includes, for each language pair, automatically detected pairs of false friends along with a falseness score.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/180_res_1.zip [628 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.367.pdf
Edition	LREC 2020

Name	FandomCorpora
Resource type	Corpus
Size	1627 Summaries
Languages	English (eng), German (deu)
Production status	Newly created-finished
Resource usage	Summarisation
License	MIT License
Conditions of use	Freely Available
Description	Three sample corpora created with the FandomCorpus Construction Framework.
Download from	https://datamanagementlab.github.io/fandomCorpus/data.html
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.827.pdf
Edition	LREC 2020

Name	FAPESP-PBSMT-NMT
Resource type	Corpus
Size	310 KByte
Languages	Brazilian Portuguese, English (eng)
Production status	Newly created-finished
Resource usage	Machine Learning
License	CC BY-NC-SA 4.0
Conditions of use	Freely Available
Description	This resource is a collection of 300 parallel sentences written in Brazilian Portuguese and English annotated with machine translation errors. There are two sets of 300 sentences: one output by a Phrase-Based Statistical Machine Translation system and another one output by a Neural Machine Translation system. Details about the annotation process can be found in the paper (Caseli; Inácio, LREC 2020).
Download from	https://github.com/LALIC-UFSCar/FAPESP-PBSMT-NMT
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.446.pdf
Edition	LREC 2020

Name	FastText.cc
Resource type	Grammar/Language Model
Size	NULL NULL
Languages	Portuguese (por)
Production status	Existing-used
Resource usage	Textual Entailment and Paraphrasing
License	<Not Specified>
Conditions of use	Freely Available
Description	FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to even fit on mobile devices.
Download from	https://fasttext.cc/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.669.pdf
Edition	LREC 2020

Name	FEVER Dev partition delexicalized
Resource type	Corpus
Size	10 MByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Textual Entailment and Paraphrasing
License	<Not Specified>
Conditions of use	Freely Available
Description	<Not Specified>
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/1319_res_5.txt [9.84 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.850.pdf
Edition	LREC 2020

Name	Fin-SoMe
Resource type	Corpus
Size	10000 sentences
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Corpus Creation/Annotation
License	CC BY-NC-SA 4.0
Conditions of use	From Owner
Description	Both front-desk and middle-desk experts were hired from a bank treasury. The front-desk expert, working in the treasury marketing unit, was assumed to be a risk-neutral individual (RN), and the middle-desk expert from the risk management department was assumed to be a risk-averse individual (RA). Annotators took five months to complete the labeling of 10,000 tweets. Each tweet is examined from four angles: (1) The market sentiment (bullish/bearish) of the tweet; (2) The presence or absence of reasoning in the tweet supporting the investor's analysis; (3) The writer's sentiment (positive/negative); (4) The gain/loss of the writer's trade. Because writers do not always show market sentiment, writer sentiment, or gain/loss in their tweets, annotators assigned a "None" label to tweets without the related narratives.
Download from	http://nlg.csie.ntu.edu.tw/nlpresource/FinSoMe
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.749.pdf
Edition	LREC 2020

Name	FlauBERT
Resource type	Grammar/Language Model
Size	552.6 MByte
Languages	French (fra)
Production status	Newly created-finished
Resource usage	Language Modelling
License	OpenSource
Conditions of use	Freely Available
Description	Pre-trained language model for French.
Download from	https://github.com/getalp/Flaubert
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.302.pdf
Edition	LREC 2020

Name	FloDusTA
Resource type	Corpus
Size	9000 tweets
Languages	Arabic (ara)
Production status	Newly created-finished
Resource usage	Document Classification, Text categorisation
License	<Not Specified>
Conditions of use	Freely Available
Description	Flood, Dust Storm, Traffic Accident Saudi Event dataset
Download from	https://github.com/BatoolHamawi/FloDusTA
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.174.pdf
Edition	LREC 2020

Name	fnc
Resource type	Corpus
Size	42 MByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Textual Entailment and Paraphrasing
License	<Not Specified>
Conditions of use	Freely Available
Description	A comprehensive collection of all the URLs is also presented in the paper referenced herewith.
Download from	https://osf.io/szdkn/?view_only=4845641a80624ac493ca14df34e68e8c
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.850.pdf
Edition	LREC 2020

Name	FNC delexicalized training partition
Resource type	Corpus
Size	95 MByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Textual Entailment and Paraphrasing
License	<Not Specified>
Conditions of use	Freely Available
Description	<Not Specified>
Download from	https://osf.io/szdkn/?view_only=4845641a80624ac493ca14df34e68e8c
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.850.pdf
Edition	LREC 2020

Name	FNC dev partition delexicalized
Resource type	Corpus
Size	55 MByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Textual Entailment and Paraphrasing
License	<Not Specified>
Conditions of use	Freely Available
Description	<Not Specified>
Download from	https://osf.io/szdkn/?view_only=4845641a80624ac493ca14df34e68e8c
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.850.pdf
Edition	LREC 2020

Name	ForwardQuestions
Resource type	Corpus
Size	38480 questions
Languages	English
Production status	Complete
Resource usage	Question Generation
License	CC-BY 3.0 Unported
Conditions of use	<Not Specified>
Description	The ForwardQuestions data set is made of human-generated questions related to knowledge triples. This data set results from the conversion and merger of the existing SimpleDBPediaQA and SimpleQuestionsWikidata data sets, including the mapping of predicates from DBPedia to Wikidata, and the selection of 'forward' questions as opposed to 'backward' ones. The new data set can be used to generate novel questions given an unseen Wikidata triple.
Download from	https://github.com/johannamelly/ForwardQuestions
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/workshops/ISA16/pdf/2020.isa-1.7.pdf
Edition	ISA16 2020

Name	GeBioCorpus_v2
Resource type	Corpus
Size	2000 sentences
Languages	Catalan (cat), English (eng), Spanish (spa)
Production status	Newly created-finished
Resource usage	Machine Translation, SpeechToSpeech Translation
License	CC-BY-NC 4.0
Conditions of use	Freely Available
Description	Corpus of 2,000 sentences in English, Spanish and Catalan (gender-balanced), which has been post-edited by native speakers to become a high-quality dataset for machine translation evaluation.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/325_res_1.zip [389 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.502.pdf
Edition	LREC 2020

Name	GeoWAC
Resource type	Corpus
Size	45 billion words
Languages	ara, aze, bel, bul, cat, ces, dan, deu, ell, eng, est, fas, fin, fra, gle, hbs, hin, hun, ind, isl, ita, jpn, kat, kaz, kor, lav, lit, mkd, mon, nld, nor, pol, por, ron, rus, slk, slv, spa, sqi, swe, tam, tgl, tur, ukr, urd, uzb, vie, zho
Production status	Newly created-finished
Resource usage	Corpus Creation/Annotation
License	GNU GPL v2.0
Conditions of use	Freely Available
Description	Gigaword geo-referenced web corpora for 50 language (varieties) that are explicitly balanced for geographic and demographic representation.
Download from	https://www.earthlings.io/corpus_download.html
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.308.pdf
Edition	LREC 2020

Name	German-English parallel data by the Presidency of the Council of the EU held by Austria in 2006
Resource type	Corpus
Size	4973 Translation Units
Languages	English, German
Production status	In use
Resource usage	Machine Translation
License	Open Under-PSI
Conditions of use	Freely Available
Description	News/media (press releases, interviews and Common Foreign and Security Policy statements) in German and English by the EU-Presidency, Austria, 2006. German-English parallel data containing news and statements (press releases, interviews and Common Foreign and Security Policy statements) unidirectional from German to English by the Presidency of the Council of the EU held by Austria in 2006. Aligned with HunAlign, a language-independent sentence aligner (Varga et al., 2005) and manually evaluated by two evaluators. Format: .tmx. Character encoding: UTF-8.
Download from	https://www.elrc-share.eu/repository/browse/german-english-parallel-data-by-the-presidency-of-the-council-of-the-eu-held-by-austria-in-2006/e38b283eac3e11e8b7d400155d0267062180d233a0fd4e84b8dffb9b25cc1775/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/workshops/LT4Gov/pdf/2020.lt4gov-1.5.pdf
Edition	LT4Gov 2020

Name	GeWiki
Resource type	Corpus
Size	240000 entries
Languages	German (deu)
Production status	Newly created-finished
Resource usage	Summarisation
License	OpenSource
Conditions of use	Freely Available
Description	A summarization corpus extracted from Wikipedia articles. The corpus is split into training (220'000 texts), evaluation (10'000 texts) and test (10'000 texts) sets. For each text (text_id.src) there is a corresponding reference summary (text_id.tgt).
Download from	https://github.com/domfr/GeWiki
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.821.pdf
Edition	LREC 2020

Name	GitHub Typo Corpus
Resource type	Corpus
Size	353055 entries
Languages	Chinese (zho), English (eng), Japanese (jpn), Others
Production status	Newly created-finished
Resource usage	Spelling Correction, Grammatical Error Correction
License	<Not Specified>
Conditions of use	Freely Available
Description	A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors.
Download from	https://github.com/mhagiwara/github-typo-corpus
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.835.pdf
Edition	LREC 2020

Name	Glossary "Austrian Administration. German - English"
Resource type	Lexical Conceptual Resource General Information
Size	673 entries
Languages	English, German (AT)
Production status	In use
Resource usage	Machine Translation
License	Use and non-commercial distribution permitted if resource is unchanged and source stated
Conditions of use	Freely Available
Description	The terminological resource entitled Fachglossar Österreichische Verwaltung. Deutsch - Englisch covers terminology in the public administration domain in German and English. It contains terminology from the areas of Austrian public law, legislation and executive authorities. The terminology is standardised by an informal working group of translators employed with the Austrian federal ministries.
Download from	https://www.sprachressourcen.at/verwaltungsglossar/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/workshops/LT4Gov/pdf/2020.lt4gov-1.5.pdf
Edition	LT4Gov 2020

Name	Glossary terms in German related to Austria and the topics of the trio presidency
Resource type	Lexical Conceptual Resource General Information
Size	71000 terms
Languages	German
Production status	<Not Specified>
Resource usage	Machine Translation
License	CC-BY-SA 3.0
Conditions of use	Freely Available
Description	German monolingual terminology and named entities. The dataset was created by University of Vienna by crawling, extracting and compiling content from Wikipedia. Terms crawled from de.wikipedia.org on the Austrian and EU-presidency related topics of the EU Council Presidency 2018: agricultural policy, digitalisation, development policy, European Union, Single Market, trade, climate policy, migration, sustainability, Austria, politics, security policy, environment, Western Balkans, economic policy.
Download from	https://www.elrc-share.eu/repository/browse/terms-in-german-related-to-austria-and-the-topics-of-the-trio-presidency/b82781c4ab9e11e8b7d400155d026706f61ef02809fb4748944b1af1b434f0a9/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/workshops/LT4Gov/pdf/2020.lt4gov-1.5.pdf
Edition	LT4Gov 2020

Name	Gold Standard for Cognate Pairs in English-Dutch and French-Dutch
Resource type	Lexicon
Size	51220 entries
Languages	Dutch, English, French
Production status	Newly created-finished
Resource usage	Evaluation/Validation
License	<Not Specified>
Conditions of use	From Owner
Description	This resource contains a context-independent gold standard for English-Dutch and French-Dutch cognate detection. To this end, automatic word alignment was applied on the Dutch Parallel Corpus, and all term equivalents with a Normalized Levenshtein distance smaller than 0.5 were extracted. This resulted in a list with 28,503 English-Dutch candidate cognate pairs, and 22,715 French-Dutch candidate cognate pairs, which were subsequently manually labeled according to the guidelines established in Labat et al. 2019. The following labels were annotated: (1) Cognate: words which have a similar form and meaning in all contexts, (2) Partial cognate: words which have a similar form, but only share the same meaning in some contexts, (3) False friend: words which have a similar form but a different meaning, (4) Proper name: proper nouns (e.g. persons, companies, cities, countries, etc.) and their derivations, (5) Error: word alignment errors and compound nouns of which one part is a cognate but the other part is missing in one of the languages, and (6) No standard: words that do not occur in the dictionary of that particular language.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/367_res_1.zip [1.65 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.504.pdf
Edition	LREC 2020
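
The extraction filter mentioned for the cognate gold standard (Normalized Levenshtein distance smaller than 0.5) can be sketched as follows. The description does not say how the distance is normalized; dividing by the length of the longer word is one common choice and is an assumption here. The function names and the example word pairs are illustrative and are not taken from the resource.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def is_candidate_cognate(a: str, b: str, threshold: float = 0.5) -> bool:
    """Keep a pair only if its normalized distance is below the threshold."""
    return normalized_levenshtein(a, b) < threshold


if __name__ == "__main__":
    for pair in [("night", "nacht"), ("water", "water"), ("abc", "xyz")]:
        print(pair, normalized_levenshtein(*pair), is_candidate_cognate(*pair))
```

A pair that passes this filter is only a candidate; in the resource the final label was assigned manually.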

Name	Greek POS Tagged Tweets
Resource type	Corpus
Size	1.9 MByte
Languages	Modern Greek (1453-) (ell)
Production status	Newly created-finished
Resource usage	Corpus Creation/Annotation
License	<Not Specified>
Conditions of use	From Owner
Description	<Not Specified>
Download from	https://hilab.di.ionio.gr/index.php/en/datasets/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.476.pdf
Edition	LREC 2020

Name	Habibi Corpus
Resource type	Corpus
Size	378646 MByte
Languages	Arabic (ara), Arabic Dialects, Egyptian, Maghrebi, Iraqi, Gulf, Levantine, Sudanese
Production status	Newly created-finished
Resource usage	Corpus Creation/Annotation
License	<Not Specified>
Conditions of use	Freely Available
Description	Habibi is the first freely available corpus of Arabic song lyrics. The corpus comprises more than 30,000 Arabic songs from 18 different Arab countries. Each song in the corpus comes with a song title along with the singer's full name, country of origin and dialect. The corpus also contains information about the song's writer and composer. Each song has a singer, song title, and lyrics. The corpus is made of 30,071 Arabic songs with their lyrics split into 527,870 sentences (song verses). The corpus contains information about 1,765 singers, 3,789 song writers and 2,463 composers. The singers span 18 Arab countries: Egypt, Saudi Arabia, Lebanon, Iraq, Sudan, Kuwait, Syria, UAE, Morocco, Tunisia, Yemen, Jordan, Algeria, Qatar, Bahrain, Oman, Palestine and Libya. The lyrics cover 6 dialects: Egyptian, Gulf, Levantine, Iraqi, Sudanese and Maghrebi (North African). The corpus in CSV file format is available through ELRA's Share Your LRs initiative (size 10,856 MB). The complete corpus in txt and csv format as well as Habibi's word embeddings are available at http://ucrel-web.lancaster.ac.uk/habibi/.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/450_res_1.zip [11.12 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.165.pdf
Edition	LREC 2020

Name	Hallig-Wartburg Ontology
Resource type	Ontology
Size	6.8 MByte
Languages	<Not Specified>
Production status	Newly created-finished
Resource usage	Language Modelling
License	CC0
Conditions of use	Freely Available
Description	This ontology is based on the so-called Hallig-Wartburg: Rudolf Hallig and Walther von Wartburg, Begriffssystem als Grundlage für die Lexikographie / Système raisonné des concepts pour servir de base à la lexicographie, Berlin (Akademie-Verlag), [1952] 1963. Hallig-Wartburg is a conceptual scheme used in historical linguistic resources.
Download from	https://lod.academy/site/vocabs/hallig-wartburg
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/workshops/LDL2020/pdf/2020.ldl-1.1.pdf
Edition	LDL 2020

Name	HatEval
Resource type	Corpus
Size	19600 entries
Languages	English, Spanish
Production status	Existing-used
Resource usage	Evaluation/Validation
License	<Not Specified>
Conditions of use	Freely Available
Description	Hate speech detection on Twitter, focused on two specific targets, immigrants and women, from a multilingual perspective covering Spanish and English.
Download from	https://competitions.codalab.org/competitions/19935
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.760.pdf
Edition	LREC 2020

Name	Hindi Courteous Dataset
Resource type	Corpus
Size	52.3 MByte
Languages	Hindi
Production status	Newly created-finished
Resource usage	Natural Language Generation
License	<Not Specified>
Conditions of use	From Owner
Description	The conversations between various customers and the trained customer care agents of different companies on their Twitter handles were used for building the dataset. The Twitter data for Hindi was mined from Twitter for different companies. The dataset comprises conversations with their ids followed by the utterance id. Two sets of responses are available in the dataset: the courteous response and the corresponding generic response.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/969_res_1.zip [13.08 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.514.pdf
Edition	LREC 2020

Name	Hindi-Discourse-Modes
Resource type	Corpus
Size	1 MByte
Languages	Hindi (hin)
Production status	Newly created-finished
Resource usage	Discourse
License	MIT License
Conditions of use	Freely Available
Description	A new corpus consisting of sentences from Hindi short stories annotated for five different discourse modes: argumentative, narrative, descriptive, dialogic and informative.
Download from	https://github.com/midas-research/hindi-discourse
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.149.pdf
Edition	LREC 2020

Name	Hindi_Event
Resource type	Corpus
Size	13.6 MByte
Languages	Hindi
Production status	Newly created-finished
Resource usage	Information Extraction, Information Retrieval
License	<Not Specified>
Conditions of use	From Owner
Description	This is a benchmark dataset for Event Extraction in the Hindi language. All the annotated documents are XML documents in which event triggers, argument triggers, event type information, argument role information and event-argument linking information are tagged. The dataset contains 1709 XML documents.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/1292_res_1.zip [3.29 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.273.pdf
Edition	LREC 2020

Name	HSE Corpus for Discovering Safety in Construction
Resource type	Corpus
Size	600 sentences
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Named Entity Recognition
License	Open Government Licence 3.0 / CC-BY-SA 4.0 International Licence
Conditions of use	Freely Available
Description	The "HSE Corpus for Discovering Safety in Construction" is a semantically annotated corpus for safety in construction, consisting of 600 sentences from RIDDOR workplace accident reports. The corpus is annotated with six types of named entities relevant to the construction safety domain. The annotation of the corpus was split between four different annotators; 400 of the sentences have been annotated by at least two of the annotators. The average inter-annotator agreement rate is 0.79 F-Score.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/426_res_1.zip [1.05 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.245.pdf
Edition	LREC 2020

Name	IGC-Parl
Resource type	Corpus
Size	218889307 tokens
Languages	Icelandic
Production status	1st version published
Resource usage	<Not Specified>
License	CC-BY 4.0
Conditions of use	<Not Specified>
Description	The 2019 version of the corpus includes speeches from 1911-2019. It comprises 404 thousand speeches and just under 219 million words. The corpus has been automatically part-of-speech tagged and lemmatised. It is annotated with extensive metadata about the speeches, speakers and political parties, including speech topic, whether the speaker is in the government coalition or opposition, age and gender of speaker at the time of delivery and references to sound and video recordings where available. The corpus is distributed in TEI-conformant format.
Download from	http://hdl.handle.net/20.500.12537/14
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/workshops/ParlaCLARIN2/pdf/2020.parlaclarin-1.3.pdf
Edition	ParlaCLARIN2 2020

Name	IKAT Dataset (German)
Resource type	Corpus
Size	<Not Specified>
Languages	German (deu)
Production status	Existing-used
Resource usage	Corpus Creation/Annotation
License	<Not Specified>
Conditions of use	Freely Available, Attribution Required
Description	This dataset contains annotations about implied information in argumentative texts. The annotations are performed on sentence pairs from the Microtext corpus (the original German version, Peldszus/Stede 2015), that stand in an argumentative relation according to the argumentation graph. There are 464 such sentence pairs in the 112 texts in the corpus, i.e., approx. 4 pairs per microtext.
Download from	https://www.cl.uni-heidelberg.de/english/research/downloads/resource_pages/IKAT/IKAT_data.shtml
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.282.pdf
Edition	LREC 2020

Name	Imperative Low-level Complete Image Edit Requests
Resource type	Corpus
Size	2540 sentences
Languages	English
Production status	Newly created-finished
Resource usage	Corpus Creation/Annotation
License	<Not Specified>
Conditions of use	From Owner
Description	2540 image edit requests with edit type adjust. Each request is annotated with (1) a referring expression, (2) an attribute and (3) a value. (1) The referring expression refers to an object in an image (not accompanied). (2) The attribute belongs to one of five categories: (a) brightness, (b) contrast, (c) hue, (d) saturation, (e) lightness. (3) The value is an integer between -100 and 100. The dataset is randomly split into train/dev/test sets of 2056/243/241 requests respectively. There is also an "abstract" version where (2) attribute and (3) value are replaced with the tokens <abstract> and <value>.
Download from	https://github.com/tzuhsial/ImageEditingWithDialogue/tree/master/imadial-nlu/data/ILLC-IER
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.51.pdf
Edition	LREC 2020

Name	imSituVQA
Resource type	Corpus
Size	52 MByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Question Answering
License	<Not Specified>
Conditions of use	Freely Available
Description	imSituVQA is a corpus for Visual Question Answering annotated with semantic frame information, based on the imSitu dataset. The images are available at http://imsitu.org/download/ and the question-answer pairs were created from the image annotations of the imSitu dataset. Each question-answer pair is annotated with frame semantic information.
Download from	https://github.com/givenbysun/imSituVQA/blob/master/data/imSituVQA.json
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.678.pdf
Edition	LREC 2020

Name	Indian-Language-Dataset
Resource type	Corpus
Size	461 MByte
Languages	Bengali, Malayalam, Tamil, Telugu, Urdu (urd)
Production status	Existing-updated
Resource usage	Machine Translation, SpeechToSpeech Translation
License	<Not Specified>
Conditions of use	Freely Available
Description	We obtained the data from different resources such as EnTamV2.0 (Ramasamy et al., 2012), Opus (Tiedemann, 2012) and UMC005 (Jawaid and Zeman, 2011). The sentences come from the news, cinema, Bible and movie subtitle domains. We combined and preprocessed the data for Tamil, Malayalam, Telugu, Bengali, and Urdu. After preprocessing and cleaning, we split the dataset into train, test, and validation sets.
Download from	https://github.com/himanshudce/Indian-Language-Dataset
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.444.pdf
Edition	LREC 2020

Name	INEL Dolgan Corpus 1.0
Resource type	Corpus
Size	<Not Specified>
Languages	Dolgan, English, Russian
Production status	Newly created-finished
Resource usage	Corpus Creation/Annotation
License	CC-BY-NC-SA 4.0
Conditions of use	Freely Available
Description	The INEL Dolgan corpus has been created within the long-term INEL project ("Grammatical Descriptions, Corpora and Language Technology for Indigenous Northern Eurasian Languages"), 2016-2033. The corpus makes possible typologically aware corpus-based grammatical research on the Dolgan language and expands the documentation of the lesser described indigenous languages of Northern Eurasia. Dolgan is an endangered Siberian (North-Eastern) Turkic language, which is spoken on the Taymyr peninsula and in adjacent areas in Northern Siberia. The number of speakers is approx. 1,000. Whereas in many settlements, including Dudinka, Dolgan is rarely spoken in everyday life, there are three remote villages (Syndassko, Popigay and Novorybnoe) where Dolgan is still very present and frequently spoken. There is no final agreement about a possible dialectal division of Dolgan. Mostly, two or three varieties of Dolgan are mentioned: Upper (Western) Dolgan, Lower (Eastern) Dolgan and Anabar Dolgan. Upper Dolgan is spoken in the settlements of Volochanka, Ust`-Avam, Katyryk, Xeta, Novaya, Kresty. Lower Dolgan is spoken in the villages Zhdanixa, Novorybnoe, Syndassko and Popigay and seems to be the more vital variety of Dolgan. The settlement of Xatanga is said to form the border between the two dialects. Anabar Dolgan is spoken in the Anabar ulus of the Sakha republic (Yakutia) and is possibly best regarded as a transitional variety between Dolgan and Yakut. Formerly, the territory around Dudinka and Norilsk belonged to the Upper Dolgan area. Nowadays, however, the traditional living places of Dolgans do not exist anymore; the Dolgans living in Dudinka and Norilsk come originally from the whole Taymyr peninsula, and thus from different dialectal regions. The INEL Dolgan corpus contains both Upper and Lower Dolgan material. For several reasons Anabar Dolgan is not included. The INEL Dolgan corpus is composed of texts from different sources: 1. Published folklore texts from an edited volume ("Fol'klor Dolgan", P.E. Efremov 2000), 2. Transcripts of recordings obtained from the Taymyr House of Folk Art (TDNT) in Dudinka (1970s-2000s), 3. Transcripts from the collection of Dr. Eugénie Stapert recorded on several fieldwork trips in 2007-2010, 4. Transcripts of recordings made on a fieldwork trip in 2017. The first group as well as parts of the third group were already transcribed and translated; the rest of the recordings were transcribed and translated within the INEL project. Each text in the corpus is provided with morphological glossing, translation into English, Russian and German, as well as annotation of Russian borrowings. Some texts also have annotations for syntactic functions, semantic roles and information structure/information status. The corpus has been produced in the context of the joint research funding of the German Federal Government and Federal States in the Academies' Programme, with funding from the Federal Ministry of Education and Research and the Free and Hanseatic City of Hamburg. The Academies' Programme is coordinated by the Union of the German Academies of Sciences and Humanities. The corpus could not have been produced without the support of the Taymyr House of Folk Art (TDNT). On the one hand, the TDNT provided valuable audio material (see above). On the other hand, the TDNT helped in all kinds of organizational tasks. Moreover, the sound materials from the TDNT were transcribed and translated by the following native speakers of Dolgan: Nina Semyonovna Kudryakova (TDNT Head of Department of folklore and ethnography), who also worked as editor for transcriptions and translations by other consultants, Svetlana Semyonovna Kudryakova, Egor Kudryakov, Adeya Evdokimovna Eske, Aleksandra Tuprina, Illarion Tuprin. In summer 2017 (06.07.2017 - 20.08.2017), Chris Lasse Däbritz and Dr. Eugénie Stapert had the opportunity to conduct fieldwork with Dolgans in the town of Dudinka. Thanks to the organizational work of the TDNT, it was possible to work many hours each day with very motivated consultants. The fieldwork had two major goals: Firstly, most uncertainties in the analysis of the folklore texts from (Efremov 2000) could be resolved. Secondly, a lot of material obtained from the TDNT could be transcribed, translated and analyzed. Moreover, roughly half an hour of new material was recorded, namely an experiment on Social Cognition. The following local consultants were involved in the fieldwork: Nina Semyonovna Kudryakova, Anna Alekseevna Barbolina, Vera Polikarpovna Bettu, Galina Sidorovna Chuprina, Adeya Evdokimovna Eske, Stepanida Il`inichna Kudryakova, Yuliya Kupchik, Polina Prokop`evna Uodaj. The web-based search interface uses the Tsakonian Corpus platform developed by Dr. Timofey Arkhangelskiy, Humboldt Research Fellow at IFUU, Hamburg University.
Download from	http://hdl.handle.net/11022/0000-0007-CAE7-1
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.322.pdf
Edition	LREC 2020

Name	INEL Dolgan Corpus 1.0
Resource type	Corpus
Size	72912 words
Languages	Dolgan (dlg), English (eng), German (deu), Russian (rus)
Production status	Existing-used
Resource usage	Knowledge Discovery/Representation
License	CC BY-NC-SA 4.0
Conditions of use	Freely Available
Description	Dolgan is an endangered Siberian (North-Eastern) Turkic language, which is spoken on the Taymyr peninsula and in adjacent areas in Northern Siberia. The number of speakers is approx. 1,000. Whereas in many settlements, including Dudinka, Dolgan is rarely spoken in everyday life, there are three remote villages (Syndassko, Popigay and Novorybnoe) where Dolgan is still very present and frequently spoken. The INEL Dolgan corpus has been created within the long-term INEL project ("Grammatical Descriptions, Corpora and Language Technology for Indigenous Northern Eurasian Languages"), 2016-2033. The corpus makes possible typologically aware corpus-based grammatical research on the Dolgan language and expands the documentation of the lesser described indigenous languages of Northern Eurasia.
Download from	http://hdl.handle.net/11022/0000-0007-D832-D
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.354.pdf
Edition	LREC 2020

Name	INEL Kamas Corpus 0.1
Resource type	Corpus
Size	48293 words
Languages	English (eng), German (deu), Kamas (xas), Russian (rus)
Production status	Existing-used
Resource usage	Knowledge Discovery/Representation
License	CC BY-NC-SA 4.0
Conditions of use	Freely Available
Description	Kamas is an extinct Samoyedic language (Uralic family). The INEL Kamas corpus comprises folklore texts collected by Kai Donner in 1912-1914, before the language shift, and transcribed audio recordings of the last speaker, Klavdiya Plotnikova, made between 1964 and 1970. Each text in the corpus is provided with morphological glossing, translation into English, Russian and German, and annotation of borrowings. Some texts also have annotations for syntactic structure, semantic roles and information status.
Download from	http://hdl.handle.net/11022/0000-0007-CF46-2
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.354.pdf
Edition	LREC 2020

Name	INEL Selkup Corpus 0.1
Resource type	Corpus
Size	18673 words
Languages	English (eng), German (deu), Russian (rus), Selkup (sel)
Production status	Existing-used
Resource usage	Knowledge Discovery/Representation
License	CC BY-NC-SA 4.0
Conditions of use	Freely Available
Description	Selkup is an endangered Southern Samoyedic language (Uralic family). The INEL Selkup corpus is composed of texts from the archive of Angelina Ivanovna Kuzmina (1924-2002), who gathered a large amount of material on Selkup in almost all regions where the Selkup people lived in 1962-1977. Most texts in the corpus originate from the handwritten part of the archive; the others come from sound recordings made by A.I. Kuzmina, transcribed and translated within the INEL project. Each text in the corpus is provided with morphological glossing, translation into English, Russian and German, and annotation of borrowings. Some texts also have annotations for syntactic structure, semantic roles and information status.
Download from	http://hdl.handle.net/11022/0000-0007-CC85-D
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.354.pdf
Edition	LREC 2020

Name	JParaCrawl
Resource type	Corpus
Size	8763995 sentences
Languages	English (eng), Japanese (jpn)
Production status	Newly created-finished
Resource usage	Machine Translation, SpeechToSpeech Translation
License	Proprietary License (Research purposes - No commercial Use)
Conditions of use	Freely Available
Description	JParaCrawl is a large freely available English-Japanese parallel corpus created by NTT.
Download from	http://www.kecl.ntt.co.jp/icl/lirg/jparacrawl/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.443.pdf
Edition	LREC 2020

Name	Kannada Event Annotated Treebank
Resource type	Corpus
Size	37020 lexemes
Languages	Kannada
Production status	<Not Specified>
Resource usage	For Kannada NLP
License	<Not Specified>
Conditions of use	Freely Available
Description	This is an event annotated Kannada Dependency Treebank. The dataset has been annotated based on the guidelines mentioned in the paper.
Download from	https://drive.google.com/drive/folders/11ZXpP4mQcDcM91SKHiSNEtWi_mAkXku7
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/workshops/ISA16/pdf/2020.isa-1.10.pdf
Edition	ISA16 2020

Name	KORE 50^DYWC
Resource type	Evaluation Data
Size	50 sentences
Languages	English
Production status	Existing-used
Resource usage	Evaluation/Validation
License	OpenSource
Conditions of use	Freely Available
Description	Entity linking evaluation data set with annotations based on DBpedia, YAGO, Wikidata, and Crunchbase.
Download from	http://people.aifb.kit.edu/mfa/datasets/kore50-lrec2020.zip
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.291.pdf
Edition	LREC 2020

Name	Korean FrameNet 1.1
Resource type	Corpus
Size	40.7 MByte
Languages	Korean (kor)
Production status	Newly created-finished
Resource usage	Semantic Role Labeling
License	CC BY-NC-SA 2.0
Conditions of use	Freely Available
Description	Korean FrameNet is a lexical database that has rich annotations to represent the meaning of Korean text using semantic frames.
Download from	https://github.com/machinereading/koreanframenet
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.30.pdf
Edition	LREC 2020

Name	Korean Movie Review Emotion Dataset
Resource type	Corpus
Size	11216 KByte
Languages	Korean (kor)
Production status	Newly created-finished
Resource usage	Emotion Recognition/Generation
License	OpenSource
Conditions of use	Freely Available
Description	The Korean Movie Review Emotion (KMRE) dataset was built by annotating the NSMC dataset with six types of emotions, following an in-house annotation procedure, which achieves the best performance. The neutral emotion was excluded because there are no neutral reviews in the NSMC. After construction, the training dataset was split into 80% for training and 20% for development. Specific emotions (i.e. anger, happiness, and sadness) were found to be annotated more frequently in the KMRE dataset. Furthermore, the train, dev, and test sets of the KMRE dataset tend to have similar emotion label distributions. The KMRE dataset has 119,995 training sentences, 29,999 development sentences, and 49,997 test sentences.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/891_res_1.zip [11.48 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.199.pdf
Edition	LREC 2020

Name	kosentences
Resource type	Corpus
Size	31427658 sentences
Languages	Korean
Production status	Newly created-finished
Resource usage	Language Modelling
License	MIT Licence, GNU Free Documentation License, CC-BY-NC-SA 4.0
Conditions of use	Freely Available
Description	kosentences is a large-scale web corpus of Korean text. It can be used for any task which requires a large amount of unsupervised and (relatively) well-formed text, such as language modeling.
Download from	https://github.com/cynthia/kosentences/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.429.pdf
Edition	LREC 2020

Name	LEDGAR
Resource type	Corpus
Size	104990418 tokens
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Document Classification, Text categorisation
License	MIT License
Conditions of use	Freely Available
Description	We present LEDGAR, a multilabel corpus of legal provisions in contracts. The corpus was crawled and scraped from the public domain (SEC filings) and is, to the best of our knowledge, the first freely available corpus of its kind. Since the corpus was constructed semi-automatically, we apply and discuss various approaches to noise removal. Due to the rather large labelset of over 12,000 labels annotated in almost 100,000 provisions in over 60,000 contracts, we believe the corpus to be of interest for research in the field of Legal NLP, (large-scale or extreme) text classification, as well as for legal studies.
Download from	https://drive.switch.ch/index.php/s/j9S0GRMAbGZKa1A
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.155.pdf
Edition	LREC 2020

Name	LibriVoxDeEN
Resource type	Corpus
Size	62 GByte
Languages	English, German
Production status	Newly created-finished
Resource usage	Machine Translation, SpeechToSpeech Translation
License	CC BY-NC-SA 4.0
Conditions of use	Freely Available
Description	LibriVoxDeEn is a corpus of sentence-aligned triples of German audio, German text, and English translation, based on German audio books. The speech translation data consist of 110 hours of audio material aligned to over 50k parallel sentences. An even larger dataset comprising 547 hours of German speech aligned to German text is available for speech recognition. The audio data is read speech and thus low in disfluencies. The quality of audio and sentence alignments has been checked by a manual evaluation, showing that speech alignment quality is in general very high. The sentence alignment quality is comparable to well-used parallel translation data and can be adjusted by cutoffs on the automatic alignment score. To our knowledge, this corpus is to date the largest resource for German speech recognition and for end-to-end German-to-English speech translation.
Download from	https://www.cl.uni-heidelberg.de/statnlpgroup/librivoxdeen/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.441.pdf
Edition	LREC 2020

Name	List of English-French Cognates
Resource type	Lexicon
Size	492 entries
Languages	English (eng), French (fra)
Production status	Newly created-finished
Resource usage	Word Sense Disambiguation
License	CC-BY 4.0
Conditions of use	Freely Available
Description	This is a new database for computational approaches to the synchronized diachronic investigation of language pairs, and for subsequent novel findings stemming from the temporal comparison of cognates in the two chosen languages. The present study is the first in the literature to use computational approaches and large data to make a cross-language temporal analysis. Derived from Google n-grams, the cognate database was constructed with 492 English-French word pairs composed of nouns, adjectives and adverbs. Each pair has between one and four forms in English, and up to ten in French. In English, most words have only one form for adjectives and adverbs, while most nouns have two forms (singular and plural). In French, with masculine and feminine, singular and plural, most nouns and adjectives can be found in four different surface forms.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/584_res_1.txt [76 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.107.pdf
Edition	LREC 2020

Name	LK Hadith Corpus
Resource type	Language Resources/Technologies Infrastructure
Size	10266304 tokens
Languages	Arabic, English
Production status	Newly created-finished
Resource usage	Evaluation/Validation
License	OpenSource
Conditions of use	Freely Available, Attribution Required
Description	Well-structured Arabic-English Parallel Corpus of Islamic Hadith with its components (Isnad and Matn) segmented.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/980_res_1.zip [14.54 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.415.pdf
Edition	LREC 2020

Name	Machine Reading Comprehension Dataset
Resource type	Corpus
Size	<Not Specified>
Languages	English
Production status	Newly created-finished
Resource usage	Question Answering
License	<Not Specified>
Conditions of use	From Owner
Description	The dataset is for building a Machine Reading Comprehension model on scholarly articles.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/1035_res_1.zip [408 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.675.pdf
Edition	LREC 2020

Name	MAGPIE Corpus
Resource type	Corpus
Size	56622 entries
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Machine Learning
License	CC-BY 4.0
Conditions of use	Freely Available
Description	The MAGPIE Corpus is a large sense-annotated corpus of potentially idiomatic expressions (PIEs), based on the British National Corpus (BNC). Potentially idiomatic expressions are like idiomatic expressions, but the term also covers literal uses of idiomatic expressions, such as 'I leave work at the end of the day.' for the idiom 'at the end of the day'. The corpus contains 56,622 instances, covering 1,756 different idiom types, all of which have crowdsourced meaning labels.
Download from	https://github.com/hslh/magpie-corpus
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.35.pdf
Edition	LREC 2020

Name	Mandarinograd
Resource type	Corpus
Size	154 entries
Languages	Mandarin Chinese (cmn)
Production status	Newly created-finished
Resource usage	Evaluation/Validation
License	<Not Specified>
Conditions of use	Freely Available
Description	Mandarinograd is a collection of 154 Winograd Schemas in Mandarin Chinese, mainly translated and adapted from English. Winograd Schemas represent particularly hard problems in natural language understanding, related to anaphora resolution. They are designed to involve common sense reasoning and to limit the biases and artefacts commonly found in natural language understanding datasets.
Download from	https://gitlab.com/vanTot/mandarinograd/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.3.pdf
Edition	LREC 2020

Name	Mapudungun Corpus
Resource type	Corpus
Size	142 hours
Languages	Mapudungun (arn), Spanish (spa)
Production status	Existing-updated
Resource usage	Speech Synthesis, Speech Recognition and Machine Translation
License	CC-BY-NC-SA 3.0
Conditions of use	Freely Available
Description	142 hours of conversations in Mapudungun, fully transcribed and translated into Spanish. Training, development, and test splits are provided for speech synthesis, speech recognition, and machine translation.
Download from	https://github.com/mingjund/mapudungun-corpus
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.350.pdf
Edition	LREC 2020

Name	MASRI-HEADSET Corpus
Resource type	Corpus
Size	8 hours
Languages	Maltese (mlt)
Production status	Newly created-finished
Resource usage	Speech Recognition/Understanding
License	<Not Specified>
Conditions of use	From Owner. The corpus is free for Research/Academic purposes. This corpus belongs to the University of Malta. All rights reserved.
Description	The MASRI-HEADSET CORPUS (MHC) is suitable for speech recognition and has the following characteristics: - The MHC has an exact duration of 8 hours and 6 minutes. It has 3864 audio files. - The MHC has recordings from 25 different speakers: 13 women and 12 men. - Data in MHC is classified by speaker, that is, all the recordings of a single speaker are stored in a single directory. - Data is also classified according to the gender (male/female) of the speakers. - Audio files in the MHC are distributed in a 16 kHz, 16-bit mono format. - Every audio file has an ID that is compatible with ASR engines such as Kaldi and CMU-Sphinx. - Transcriptions in MHC are lowercase. No punctuation marks are permitted except dashes (-) and apostrophes (') because they belong to the Maltese orthography.
Download from	https://www.um.edu.mt/projects/masri/index.html
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.784.pdf
Edition	LREC 2020

Name	MEL dataset
Resource type	Corpus
Size	51 MByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Information Extraction, Information Retrieval
License	CC BY-NC-SA 3.0
Conditions of use	Freely Available
Description	This is an evaluation corpus for multimedia entity linking as described in (Adjali et al., 2020). It is made of tweet identifiers separated into 4 files: - kb: knowledge base containing 2657213 ids - mel_dev_ids: development dataset for evaluation (16599 ids) - mel_test_ids: test dataset for evaluation (36521 ids) - mel_train_ids: train dataset for evaluation (35976 ids) More information at https://github.com/OA256864/MEL_Tweets
Download from	https://drive.google.com/open?id=1kkRpVJpo-U6Gt_r4Ly-ciq4pAY03CoTg
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.528.pdf
Edition	LREC 2020

Name	Minho Word Pool
Resource type	Lexicon
Size	700 KByte
Languages	Portuguese
Production status	Existing-used
Resource usage	Acquisition
License	<Not Specified>
Conditions of use	Freely Available
Description	Minho Word Pool (MWP) is a dataset that provides normative values of imageability, concreteness and subjective frequency for 3,800 (European) Portuguese words. These three subjective measures, in spite of being extensively used in research, were still scarce for Portuguese. Data were collected from 2,357 college students who were native speakers of European Portuguese. Participants rated 100 words drawn randomly from the full set on each of the three subjective indices using a web survey procedure (via a URL link). Analyses comparing the MWP ratings with those obtained for the same words from other national and international databases showed that the MWP norms are reliable and valid, thus providing researchers with a useful tool to support research in all neuroscientific areas using verbal stimuli.
Download from	http://p-pal.di.uminho.pt/about/databases
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.160.pdf
Edition	LREC 2020

Name	MNLI DEV partition matched delexicalized
Resource type	Corpus
Size	12 MByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Textual Entailment and Paraphrasing
License	<Not Specified>
Conditions of use	Freely Available
Description	<Not Specified>
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/1319_res_4.txt [12.92 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.850.pdf
Edition	LREC 2020

Name	MNLI DEV partition mis matched delexicalized
Resource type	Corpus
Size	2.2 MByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Textual Entailment and Paraphrasing
License	<Not Specified>
Conditions of use	Freely Available
Description	<Not Specified>
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/1319_res_7.txt [2.33 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.850.pdf
Edition	LREC 2020

Name	MNLI training delexicalized
Resource type	Corpus
Size	492 MByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Textual Entailment and Paraphrasing
License	<Not Specified>
Conditions of use	Freely Available
Description	This is the masked version of the datasets FEVER, FNC, MEDNLI and MNLI, delexicalized with the OA-NER technique presented in the referring paper.
Download from	https://osf.io/szdkn/?view_only=4845641a80624ac493ca14df34e68e8c
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.850.pdf
Edition	LREC 2020

Name	Monsanto Trial Document
Resource type	Treebank
Size	15000 sentences
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Sensitive Information Detection
License	CC-BY 4.0
Conditions of use	Freely Available
Description	Constituent parse trees over labeled sentences. The corpus consists of publicly released documents from the Monsanto trial.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/219_res_1.zip [4.54 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.158.pdf
Edition	LREC 2020

Name	Movie MPAA Information Corpus
Resource type	Corpus
Size	5562 entries
Languages	English (eng)
Production status	Existing-updated
Resource usage	Machine Learning
License	<Not Specified>
Conditions of use	From Owner
Description	This dataset expands the existing movie script dataset (http://ritual.uh.edu/1493-2/) to include the MPAA rating for the movies. Other important features of our dataset are: (1) 50 more movies compared to the original dataset, (2) up to 12 most similar movies for each movie in the dataset, along with the MPAA rating of those similar movies, (3) users' votes for MPAA components (violence, profanity, sex, drug/alcohol, frightening), (4) the reason for the MPAA rating published by the MPAA organization, (5) poster images for movies. PS: The whole dataset is a 5 GB file, so we uploaded the meta-information (similar movies, metadata file, reasons, MPAA components) directly to this page and provided the whole dataset (meta-information + posters + script) through a link.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/548_res_1.zip [1.27 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.166.pdf
Edition	LREC 2020

Name	Multi sentenced Algebraic Mathematical word problems in English
Resource type	Corpus
Size	NULL sentences
Languages	English
Production status	Newly created-finished
Resource usage	Machine Learning
License	<Not Specified>
Conditions of use	Freely Available
Description	<Not Specified>
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/1220_res_2.txt [234 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.579.pdf
Edition	LREC 2020

Name	Multi sentenced simple Mathematical word problems in English
Resource type	Corpus
Size	NULL sentences
Languages	English
Production status	Newly created-finished
Resource usage	Machine Learning
License	<Not Specified>
Conditions of use	Freely Available
Description	<Not Specified>
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/1220_res_1.txt [140 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.579.pdf
Edition	LREC 2020

Name	Multi sentenced simple Mathematical word problems in Sinhala
Resource type	Corpus
Size	NULL sentences
Languages	Sinhala
Production status	Newly created-finished
Resource usage	Machine Learning
License	<Not Specified>
Conditions of use	Freely Available
Description	<Not Specified>
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/1220_res_3.txt [250 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.579.pdf
Edition	LREC 2020

Name	Multi sentenced simple Mathematical word problems in Tamil
Resource type	Corpus
Size	NULL sentences
Languages	Tamil
Production status	Newly created-finished
Resource usage	Machine Learning
License	<Not Specified>
Conditions of use	Freely Available
Description	<Not Specified>
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/1220_res_4.txt [234 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.579.pdf
Edition	LREC 2020

Name	Multi-Party Dialogue Dataset
Resource type	Corpus
Size	6.7 MByte
Languages	Chinese
Production status	Newly created-finished
Resource usage	Dialogue
License	<Not Specified>
Conditions of use	Freely Available
Description	The Multi-Party Dialogue Dataset (MPDD) consists of two files, dialogue.json and metadata.json. The file dialogue.json contains the dialogues. Each dialogue has a unique case index in the JSON file and is a list of utterances in speaking order. Every utterance in the list contains the speaker, the content, and the annotated labels. The listener list of an utterance contains all listeners of that utterance together with their relation type. The data format of dialogue.json is as follows: {case index:[{"speaker": speaker's name,"utterance": utterance, "listener": [{"name": listener's name,"relation": relation type between speaker and listener}, ...],"emotion": speaker's emotion type}, ...]} The metadata is given in metadata.json. The file defines all emotion and relation types, as well as the sub-classes of relations under two perspectives, position and field. The data format of metadata.json is as follows: {"relation":["parent", ...],"field":{"family":["parent", ...],"school":["teacher", ...],"company":["boss", ...],"others":["couple", ...]},"position":{"superior":["parent",...],"peer":["spouse", ...],"inferior":["child",...]},"emotion":["fear",...]}
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/641_res_1.zip [953 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.76.pdf
Edition	LREC 2020
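
Note: the dialogue.json layout described in the MPDD entry above can be traversed with a few lines of Python. The following is a minimal sketch, assuming the file follows exactly the format quoted in the description. The sample data, the helper names iter_utterances and relation_counts, and the speaker names are hypothetical and only serve the illustration.

```python
import json
from collections import Counter

# Hypothetical two-utterance sample that follows the dialogue.json layout
# quoted in the description; names and label values are invented.
SAMPLE = """
{
  "1": [
    {"speaker": "A", "utterance": "Hello.",
     "listener": [{"name": "B", "relation": "parent"}],
     "emotion": "fear"},
    {"speaker": "B", "utterance": "Hi.",
     "listener": [{"name": "A", "relation": "child"}],
     "emotion": "fear"}
  ]
}
"""


def iter_utterances(dialogues):
    """Yield (case_index, turn_number, utterance) in speaking order."""
    for case_index, utterances in dialogues.items():
        for turn, utterance in enumerate(utterances):
            yield case_index, turn, utterance


def relation_counts(dialogues):
    """Count speaker-to-listener relation types over all utterances."""
    counts = Counter()
    for _, _, utterance in iter_utterances(dialogues):
        for listener in utterance["listener"]:
            counts[listener["relation"]] += 1
    return counts


dialogues = json.loads(SAMPLE)
counts = relation_counts(dialogues)
for relation, n in sorted(counts.items()):
    print(relation, n)
```

To run the same helpers on the released data, the SAMPLE string would be replaced by the contents of dialogue.json, read with UTF-8 encoding.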

Name	Multi-Sense Dataset (MSD-1030)
Resource type	Evaluation Data
Size	1030 word pairs
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Evaluation/Validation
License	<Not Specified>
Conditions of use	Freely Available
Description	MSD-1030 is an English semantic similarity dataset designed to facilitate more reliable evaluations of sense embeddings. Unlike previous word similarity datasets, most words in MSD-1030 are multi-sense words, which requires sense embedding models to capture multiple meanings instead of only the most dominant sense. In addition, because of the carefully designed word pair selection process, the distribution of the similarity scores is balanced over the entire range. MSD-1030 also does not contain phrases, which are usually skipped in evaluations of word and sense embedding models. MSD-1030 is more suitable for evaluating multi-sense representations and may shed light on room for improvement in related research.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/567_res_1.zip [19 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.711.pdf
Edition	LREC 2020

Name	multiCEFRLex
Resource type	Lexicon
Size	41425 entries
Languages	English (eng), French (fra), Swedish (swe)
Production status	Newly created-finished
Resource usage	Language Learning/Grading
License	<Not Specified>
Conditions of use	Freely Available
Description	Lexical resources for foreign language (L2) learning, teaching and research graded following the Common European Framework of Reference for Languages (CEFR).
Download from	http://pub.cl.uzh.ch/purl/multiCEFRLex
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.43.pdf
Edition	LREC 2020

Name	Multilingual Hate Speech Dataset for Fairness Evaluation
Resource type	Corpus
Size	7.4 MByte
Languages	English (eng), Italian (ita), Polish (pol), Portuguese (por), Spanish (spa)
Production status	Newly created-finished
Resource usage	Document Classification, Text categorisation
License	Apache 2.0
Conditions of use	Freely Available
Description	First corpus for fairness evaluation of document classifiers in the field of hate speech detection.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/1174_res_1.zip [7.39 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.180.pdf
Edition	LREC 2020

Name	Multilingual Task-Oriented Dialog Data-Basque
Resource type	Corpus
Size	6404 sentences
Languages	Basque (eus)
Production status	Newly created-finished
Resource usage	Dialogue
License	CC-BY-NC-SA 4.0
Conditions of use	Freely Available
Description	The resource contains manually and automatically generated and annotated utterances for Basque from the Multilingual Task-Oriented Dialog Dataset. 3,418 utterances were manually translated and annotated, and 2,986 were automatically translated from Spanish by a neural MT system and annotated by projecting the slot tags using a word aligner.
Download from	https://hizkuntzateknologiak.elhuyar.eus/assets/files/fmtodelh.tgz
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.340.pdf
Edition	LREC 2020

Name	MultiMWE corpora
Resource type	Corpus
Size	3302268 entries
Languages	Chinese, English, German
Production status	Newly created-finished
Resource usage	Corpus Creation/Annotation
License	<Not Specified>
Conditions of use	Freely Available. Free for Research Only
Description	After filtering, the corpus contains 3,159,226 German-English and 143,042 Chinese-English bilingual MWE pairs.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/78_res_1.zip [14.95 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.363.pdf
Edition	LREC 2020

Name	Multiword Expressions dataset with type and complexity annotation
Resource type	Evaluation Data
Size	1.3 MByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Evaluation/Validation
License	CC-BY-NC-SA 4.0
Conditions of use	Freely Available
Description	We release a re-annotated version of the Complex Word Identification Shared Task 2018 dataset originally provided by Yimam et al. (2017). The multiword expressions (MWEs) in this data are annotated with MWE type, which helps draw conclusions about the complexity of each MWE type for native and non-native readers. This work demonstrates that incorporating the type of MWE based on this dataset into a lexical complexity assessment system improves its performance. Together with the original complexity labels, this dataset represents a valuable resource for the text simplification community.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/1446_res_1.tsv [1.30 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.545.pdf
Edition	LREC 2020

Name	MyFixit Dataset
Resource type	Corpus
Size	313 MByte
Languages	English
Production status	Newly created-finished
Resource usage	Information Extraction, Information Retrieval
License	CC-BY-NC-SA 3.0
Conditions of use	Freely Available
Description	MyFixit is a collection of repair manuals, collected from the iFixit website. There are in total 31,601 repair manuals in 15 device categories. Each step in the manuals of the 'Mac Laptop' category is annotated with the required tool, disassembled parts, and the removal verbs (1,497 manuals with 36,659 steps).
Download from	https://github.com/rub-ksv/MyFixit-Dataset
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.260.pdf
Edition	LREC 2020

Name	Norm It
Resource type	Corpus
Size	12822 words
Languages	Italian
Production status	Existing-updated
Resource usage	Lexical Normalization
License	<Not Specified>
Conditions of use	Freely Available
Description	<Not Specified>
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/1079_res_1.tgz [69 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.769.pdf
Edition	LREC 2020

Name	North Sámi active learning morphological segmentation annotations
Resource type	Evaluation Data
Size	796 types
Languages	Northern Sami (sme)
Production status	Existing-used
Resource usage	Morphological Analysis
License	CC-BY 4.0
Conditions of use	Freely Available
Description	Morphological segmentation data set, for unsupervised and weakly supervised learning.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/872_res_4.tgz [45 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.486.pdf
Edition	LREC 2020

Name	NTCIR14-QALab-PoliInfo-FormalRunDataset
Resource type	Evaluation Data
Size	45 MByte
Languages	Japanese (jpn)
Production status	Newly created-finished
Resource usage	Information Extraction, Information Retrieval
License	<Not Specified>
Conditions of use	Freely Available
Description	NTCIR14-QALab-PoliInfo-FormalRunDataset.
Download from	https://github.com/kmr-y/NTCIR14-QALab-PoliInfo-FormalRunDataset
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.253.pdf
Edition	LREC 2020

Name	NZZ Black Letter Ground Truth
Resource type	Corpus
Size	304286 words
Languages	German
Production status	Newly created-finished
Resource usage	Machine Learning
License	CC-BY-NC 4.0
Conditions of use	Freely Available
Description	The Neue Zürcher Zeitung (NZZ) was published in black letter from its very first issue in 1780 until 1947. From this period, one front page per year was randomly sampled, resulting in a total of 167 pages. Front pages were chosen because they typically contain highly relevant material and to avoid sampling pages that contain exclusively advertisements or stock information. During certain periods, the NZZ was published several times a day, and there were supplements, too. Due to incomplete metadata, the sampling included front pages from supplements.
Download from	https://github.com/impresso/NZZ-black-letter-ground-truth
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.436.pdf
Edition	LREC 2020

Name	Odia Sentiment Dataset
Resource type	Corpus
Size	582 KByte
Languages	Odia
Production status	Newly created-finished
Resource usage	Machine Learning
License	<Not Specified>
Conditions of use	From Owner
Description	This dataset consists of 2045 sentences in Odia annotated with sentiment labels: +1 (positive), 0 (neutral), -1 (negative). These sentences were picked from the news domain, the source being the "Samaja" archives.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/789_res_1.txt [582 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.339.pdf
Edition	LREC 2020

Name	OLID
Resource type	Corpus
Size	14100 entries
Languages	English
Production status	Existing-used
Resource usage	Evaluation/Validation
License	<Not Specified>
Conditions of use	Freely Available
Description	OLID contains a collection of tweets annotated with a model that encompasses the following three levels: (A) Offensive Language Detection; (B) Categorization of Offensive Language; (C) Offensive Language Target Identification. OLID was used in the OffensEval: Identifying and Categorizing Offensive Language in Social Media (SemEval 2019 - Task 6) shared task.
Download from	https://competitions.codalab.org/competitions/20011
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.760.pdf
Edition	LREC 2020

Name	OpenWordNet.PT
Resource type	Lexicon
Size	39000 synsets
Languages	Portuguese
Production status	Existing-used
Resource usage	Acquisition
License	CC-BY 4.0
Conditions of use	Freely Available
Description	The OpenWordnet-PT (abbreviated as OpenWN-PT or simply OWN-PT) is an open access wordnet for Portuguese.
Download from	http://openwordnet-pt.org/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.160.pdf
Edition	LREC 2020

Name	Optimized Twitter Search Phrases
Resource type	Corpus
Size	20000 phrases
Languages	<Not Specified>
Production status	Finished
Resource usage	<Not Specified>
License	<Not Specified>
Conditions of use	<Not Specified>
Description	This resource contains optimized search phrases for the 50 most prevalent languages on Twitter. Each list consists of key phrases that identify the most target language tweets while limiting the number of other-language tweets retrieved. Use a list in combination with the Twitter Streaming API to obtain the language-specific stream of tweets. When collecting Twitter data in this way, please refer to the accompanying paper to know the limits of your collection.
Download from	https://www.clips.uantwerpen.be/twitter/phrases.zip
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/workshops/WAC-II/pdf/2020.wac-1.8.pdf
Edition	WACXII 2020

Name	PAN-SemEval-Hyperpartisan-News-Detection-19
Resource type	Evaluation Data
Size	1000000 documents
Languages	English
Production status	Newly created-finished
Resource usage	Document Classification, Text categorisation
License	CC-BY 4.0
Conditions of use	From Owner
Description	<Not Specified>
Download from	https://pan.webis.de/semeval19/semeval19-web/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.159.pdf
Edition	LREC 2020

Name	Parallel Sentiment
Resource type	Corpus
Size	300 KByte
Languages	ar,cs,de,en,es,fi,fr,hi,id,it,ja,ko,pl,pt,ru,sv,th,tr,zh
Production status	Newly created-finished
Resource usage	Emotion Recognition/Generation
License	<Not Specified>
Conditions of use	Freely Available
Description	Sentence-level sentiment annotations (positive or negative) for 19 languages semi-automatically generated from PUD (parallel corpora in Universal Dependencies).
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/288_res_1.zip [13 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.500.pdf
Edition	LREC 2020

Name	ParaPat
Resource type	Corpus
Size	68000000 sentences
Languages	Chinese (zho), English (eng), French (fra), Japanese (jpn), Russian (rus), Korean, German, Spanish
Production status	Newly created-finished
Resource usage	Machine Translation, SpeechToSpeech Translation
License	GNU-GPL v3.0
Conditions of use	Freely Available
Description	ParaPat is a corpus of parallel sentences from patents extracted from Google Patents.
Download from	https://github.com/soares-f/parapat
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.465.pdf
Edition	LREC 2020

Name	PO-EMO
Resource type	Corpus
Size	4191 lines
Languages	English, German
Production status	Newly created-finished
Resource usage	Emotion Recognition/Generation
License	<Not Specified>
Conditions of use	Freely Available
Description	PO-EMO is a corpus of German and English Poetry, with fine-grained multi-label annotation of aesthetic emotions.
Download from	https://github.com/tnhaider/poetry-emotion
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.205.pdf
Edition	LREC 2020

Name	Prague Czech English Dependency Treebank 2.0
Resource type	Treebank
Size	50000 sentences
Languages	Czech, English
Production status	Existing-updated
Resource usage	<Not Specified>
License	CC BY-NC-SA 3.0
Conditions of use	Freely Available
Description	The Prague Czech-English Dependency Treebank 2.0 (PCEDT 2.0) is a major update of the Prague Czech-English Dependency Treebank 1.0 (LDC2004T25). It is a manually parsed Czech-English parallel corpus sized over 1.2 million running words in almost 50,000 sentences for each part.
Download from	http://hdl.handle.net/11858/00-097C-0000-0015-8DAF-4
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.641.pdf
Edition	LREC 2020

Name	Prague Dependency Treebank - Consolidated 1.0
Resource type	Treebank
Size	4000000 tokens
Languages	Czech
Production status	Newly created-finished
Resource usage	<Not Specified>
License	<Not Specified>
Conditions of use	Restricted Access
Description	The Prague Dependency Treebank - Consolidated (PDT-C) is a consolidated release of the existing PDT-corpora of Czech data with manual annotation at all three PDT-annotation layers (morphological, surface syntax and deep syntax layer).
Download from	http://hdl.handle.net/11234/1-3185
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.641.pdf
Edition	LREC 2020

Name	Prague Dependency Treebank 3.5
Resource type	Treebank
Size	50000 sentences
Languages	Czech
Production status	Existing-updated
Resource usage	<Not Specified>
License	CC BY-NC-SA 4.0
Conditions of use	Freely Available
Description	The Prague Dependency Treebank 3.5 is the 2018 edition of the core Prague Dependency Treebank (PDT). It contains all PDT annotation made at the Institute of Formal and Applied Linguistics under various projects between 1996 and 2018 on the original texts, i.e., all annotation from PDT 1.0, PDT 2.0, PDT 2.5, PDT 3.0, PDiT 1.0 and PDiT 2.0, plus corrections, new structure of basic documentation and new list of authors covering all previous editions. The Prague Dependency Treebank 3.5 (PDT 3.5) contains the same texts as the previous versions since 2.0; there are 49,431 annotated sentences (over 800 thousand nodes) on all layers, from tectogrammatical to words, and additional sentences on the analytical (surface dependency syntax) and morphological layers of annotation (approx. 2 million words in total). Closely linked to the tectogrammatical layer is the annotation of sentence information structure, multiword expressions, coreference, bridging relations and discourse relations.
Download from	http://hdl.handle.net/11234/1-2621
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.641.pdf
Edition	LREC 2020

Name	PT Lexical-Semantic Relations
Resource type	Lexicon
Size	NULL NULL
Languages	Portuguese
Production status	Existing-used
Resource usage	Acquisition
License	<Not Specified>
Conditions of use	Freely Available
Description	Lexical-semantic relations acquired from ten computational lexical resources in Portuguese (PAPEL, Dicionário Aberto, Wiktionary.PT, TeP, OpenThesaurus.PT, OpenWordnet-PT, PULO, WordNet.Br, Port4Nooj, ConceptNet), together with the number of resources from which each relation was extracted.
Download from	http://ontopt.dei.uc.pt/index.php?sec=download_outros
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.160.pdf
Edition	LREC 2020

Name	pwnalign
Resource type	Evaluation Data
Size	1M entries
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Evaluation/Validation
License	<Not Specified>
Conditions of use	Freely Available
Description	A set of three gold standards for the evaluation of alignments between WordNet and Wikipedia.
Download from	https://sourceforge.net/projects/pwnalign/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.597.pdf
Edition	LREC 2020

Name	Read Speech Corpus (RSC)
Resource type	Corpus
Size	8.3 GByte
Languages	Romanian (ron)
Production status	Newly created-finished
Resource usage	Speech Recognition/Understanding
License	CC-BY-NC-ND 3.0
Conditions of use	From Owner
Description	Read Speech Corpus (RSC) is the largest publicly available Romanian speech corpus for training and evaluating Automatic Speech Recognition (ASR) systems, comprising 100 hours from 164 speakers.
Download from	https://speed.pub.ro/downloads/speech-datasets/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.814.pdf
Edition	LREC 2020

Name	Renmin-OCR-NER
Resource type	Evaluation Data
Size	438249 tokens
Languages	Chinese (cmn)
Production status	Newly created-finished
Resource usage	Named Entity Recognition
License	<Not Specified>
Conditions of use	Freely Available
Description	Newspaper images transcribed with text locations and transcriptions for OCR evaluation and annotated for named entities.
Download from	https://github.com/hltcoe/cmn-renmin-ocr-ner-dataset
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.570.pdf
Edition	LREC 2020

Name	Resources for Academic Writing
Resource type	Corpus
Size	8.6 MByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Textual Entailment and Paraphrasing
License	<Not Specified>
Conditions of use	Freely Available
Description	The dataset contains keyphrases and n-grams extracted from the reference corpus, training and test datasets for the informal word identification task, and training and test datasets for the academic paraphrasing task. A README file explaining the details is included.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/1020_res_1.zip [8.59 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.722.pdf
Edition	LREC 2020

Name	Rigor-Mortis-LREC2020
Resource type	Corpus
Size	41820 tokens
Languages	French (fra)
Production status	Newly created-finished
Resource usage	Multiword Expressions
License	CC-BY-NC 4.0
Conditions of use	Freely Available
Description	The resource is an export of the multiword expressions annotated by players of the Rigor-Mortis game. See https://github.com/gwaps4nlp/rigor-mortis/blob/master/LREC_2020/LREC2020_RM.pdf
Download from	https://github.com/gwaps4nlp/rigor-mortis/tree/master/LREC_2020
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.541.pdf
Edition	LREC 2020

Name	Romanian Named Entity Corpus
Resource type	Corpus
Size	26377 entities
Languages	Romanian
Production status	Newly created-finished
Resource usage	Named Entity Recognition
License	MIT License
Conditions of use	Freely Available
Description	RONEC, the Romanian Named Entity Corpus, contains over 26000 entities in ~5000 annotated sentences, belonging to 16 distinct classes. The sentences have been extracted from a copyright-free newspaper, covering several styles. This corpus represents the first initiative in the Romanian language space specifically targeted at named entity recognition. It is available in BRAT and CoNLL-U Plus formats, and it is free to use and extend at github.com/dumitrescustefan/ronec.
Download from	https://github.com/dumitrescustefan/ronec
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.546.pdf
Edition	LREC 2020

Name	RusLTC
Resource type	Corpus
Size	2.3 million tokens
Languages	English (eng), Russian (rus)
Production status	Existing-updated
Resource usage	Document Classification, Text categorisation
License	CC-BY-SA 3.0
Conditions of use	Freely Available
Description	Russian Learner Translator Corpus (RusLTC) is a bi-directional parallel corpus, which stores English-Russian translations produced by university students majoring in translation studies in Russia (Russian L1). The English-Russian subcorpus has multiple translations for 402 sources, while the Russian-English part includes 125 source-target sets. The number of translations per source averages 8 (min=1, max=67). Many texts are translations of excerpts from larger sources; the median length of translations in the English-to-Russian subcorpus is 380 words. RusLTC is downloadable as a customized TMX file and a plaintext archive (https://www.rus-ltc.org/static/html/about.html). The TMX contains the sentence-aligned version of the corpus with the automatic alignment errors corrected manually. It also carries 10 metadata fields, including the translator's gender and affiliation, education type and level, grade for the translation, year and conditions of translation (routine/exam; home/classroom) and source text genre. RusLTC includes texts in 10 genres: academic, informational, educational, essay, technical, fiction, educational, speech, letters, advertisement. Detailed and automatically updated statistics are available from https://dev.rus-ltc.org/statistics/.
Download from	https://www.rus-ltc.org/static/html/about.html
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.505.pdf
Edition	LREC 2020
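
Note: the RusLTC entry above mentions a sentence-aligned TMX file. The following is a minimal sketch of how aligned segments can be read from a TMX document with the Python standard library. It assumes the standard TMX element layout (tu, tuv, seg, with the language given in xml:lang). The entry calls the RusLTC file "customized", so its language codes and the location of its metadata fields may differ. The sample document and the helper name iter_pairs are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented one-unit TMX sample; not taken from the RusLTC release.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="demo" creationtoolversion="0" segtype="sentence"
          o-tmf="demo" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="ru"><seg>Доброе утро.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def iter_pairs(tmx_text):
    """Yield one {language: segment text} dict per translation unit."""
    root = ET.fromstring(tmx_text)
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested in inline markup.
                segments[tuv.get(XML_LANG)] = "".join(seg.itertext())
        yield segments


pairs = list(iter_pairs(SAMPLE_TMX))
for pair in pairs:
    print(pair)
```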

Name	Scholarly Documents Deduplication Dataset
Resource type	Corpus
Size	204 MByte
Languages	English
Production status	Newly created-finished
Resource usage	Document Classification, Text categorisation
License	<Not Specified>
Conditions of use	Freely Available
Description	A labelled dataset of 100K scholarly documents categorised into non-duplicate, near-duplicate and exact-duplicate groups. 50K examples are non-duplicates, 25K examples belong to the exact-duplicate groups, and the remaining 25K examples belong to the near-duplicate groups.
Download from	https://core.ac.uk/documentation/dataset/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.113.pdf
Edition	LREC 2020

Name	Seed corpus for entity coreference in email conversations
Resource type	Corpus
Size	866 coreference chains
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Anaphora, Coreference
License	Apache 2.0
Conditions of use	Freely Available
Description	Corpus for the submission entitled "A study on Entity Resolution in Email Conversations". The seed corpus contains 46 email threads comprising 245 email messages. The annotations are provided in CoNLL format.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/899_res_1.tar.xz [83 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.8.pdf
Edition	LREC 2020

Name	Semantic verb classes
Resource type	Evaluation Data
Size	17 entries
Languages	English
Production status	Newly created-finished
Resource usage	Evaluation/Validation
License	CC BY 4.0
Conditions of use	Freely Available
Description	17 semantic verb classes collected for English via a semantic clustering task, where 825 verbs are placed one by one into circles visible on the computer screen by an annotator, each circle corresponding to a grouping of similar and related verbs.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/444_res_2.txt [6 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.705.pdf
Edition	LREC 2020

Name	SemEU-R CMV Corpus
Resource type	Corpus
Size	NULL NULL
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Corpus Creation/Annotation
License	<Not Specified>
Conditions of use	From Owner
Description	<Not Specified>
Download from	http://katfuji.lab.tuat.ac.jp/nlp_datasets/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.143.pdf
Edition	LREC 2020

Name	SentiEcon
Resource type	Lexicon
Size	6470 entries
Languages	English
Production status	Newly created-finished
Resource usage	Opinion Mining/Sentiment Analysis
License	CC BY-NC 3.0
Conditions of use	Freely Available for Academic Purposes
Description	Domain-specific computational lexicon designed for sentiment analysis applications, which we compiled from our own corpus of online business news. It was created as a plug-in lexicon for the sentiment analysis tool Lingmotif, and thus it follows its data structure requirements and presupposes the availability of a general-language core sentiment lexicon that covers non-specific sentiment-carrying terms and phrases. It contains 6,470 entries, both single and multi-word expressions, each with tags denoting their semantic orientation and intensity.
Download from	http://tecnolengua.uma.es/sentiecon
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.623.pdf
Edition	LREC 2020

Name	Serial Speakers: a Dataset of TV Series
Resource type	Corpus
Size	45 MByte
Languages	English
Production status	Newly created-finished
Resource usage	Corpus Creation/Annotation
License	CC-BY 4.0
Conditions of use	Freely Available
Description	"Serial Speakers" is a dataset of 161 annotated episodes from three popular American TV serials, "Breaking Bad" (62 annotated episodes), "Game of Thrones" (73), and "House of Cards" (26). "Serial Speakers" is suitable for addressing both high level multimedia retrieval tasks in real world scenarios, and lower level speech processing tasks in challenging conditions. The boundaries, speaker and textual content of every speech turn, along with all scene boundaries, have been manually annotated for the whole set of episodes; the shot boundaries and the recurring shots for the first season of each of the three series; and the interacting speakers for a subset of 10 episodes. Because of copyright restrictions, the textual content is encrypted in the public version of the dataset, but we provide the users with a simple tool for recovering the text from their own subtitle files (https://github.com/bostxavier/Serial-Speakers).
Download from	https://figshare.com/articles/TV_Series_Corpus/3471839
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.525.pdf
Edition	LREC 2020

Name	SpA-Verb
Resource type	Evaluation Data
Size	29721 entries
Languages	English
Production status	Newly created-finished
Resource usage	Evaluation/Validation
License	CC BY 4.0
Conditions of use	Freely Available
Description	An English intrinsic evaluation verb similarity dataset containing similarity scores for 29,721 unique verb pairs and 825 target verbs, obtained via a two-phase data collection design: a rough semantic clustering phase, where verbs are placed in broad semantic classes, and a spatial multi-arrangement phase, where fine-grained spatial similarity judgments are elicited on related verbs, clustered together in Phase 1.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/444_res_1.txt [746 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.705.pdf
Edition	LREC 2020

Name	SQuAD-es v1.1
Resource type	Corpus
Size	87595 (context, question, answer) tuples
Languages	Spanish (spa)
Production status	Newly created-finished
Resource usage	Question Answering
License	MIT License
Conditions of use	Freely Available
Description	Automatic Spanish translation of the training set of the Stanford Question Answering Dataset (SQuAD) v1.1
Download from	https://github.com/ccasimiro88/TranslateAlignRetrieve/tree/master/SQuAD-es-v1.1
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.677.pdf
Edition	LREC 2020
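
Note: the SQuAD-es entry above gives its size in (context, question, answer) tuples. The following is a minimal sketch of how such tuples can be enumerated, assuming the translated files keep the nested JSON layout of the original SQuAD v1.1 release (data, paragraphs, qas, answers). The sample record and the helper name iter_tuples are hypothetical.

```python
# Invented one-question sample in the SQuAD v1.1 layout; not taken from
# the SQuAD-es release.
SAMPLE = {
    "version": "1.1",
    "data": [{
        "title": "Ejemplo",
        "paragraphs": [{
            "context": "Madrid es la capital de España.",
            "qas": [{
                "id": "q1",
                "question": "¿Cuál es la capital de España?",
                "answers": [{"text": "Madrid", "answer_start": 0}],
            }],
        }],
    }],
}


def iter_tuples(squad):
    """Yield (context, question, answer_text, answer_start) tuples."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield (context, qa["question"],
                           answer["text"], answer["answer_start"])


tuples = list(iter_tuples(SAMPLE))
for context, question, text, start in tuples:
    # In a well-formed record the answer text is a span of the context.
    span_ok = context[start:start + len(text)] == text
    print(question, "->", text, "| span matches context:", span_ok)
```

Checking that each answer is a span of its context is a cheap sanity test for automatically translated data, where character offsets can drift.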

Name	Statement Classification Dataset, arXMLiv 08.2018
Resource type	Corpus
Size	10555689 paragraphs
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Document Classification, Text categorisation
License	SIGMathLing Non-Disclosure-Agreement (research-only use)
Conditions of use	Access is restricted to SIGMathLing members under the SIGMathLing Non-Disclosure-Agreement because, for most arXiv articles, the right of distribution was only given (or assumed) to arXiv itself. Membership is free and granted on the honor system.
Description	This resource contains 10,555,689 paragraphs with associated statement labels, realized as one paragraph per file, one sentence per line. Each file is placed in a subdirectory named after its annotated class. The statements were extracted from author-annotated environments, where we only selected the first paragraph, immediately following the heading. Headings include both structural sections (e.g. Introduction) and scholarly statement annotations (e.g. Definition, Proof, Remark). A control dataset of the same statements is also included with all mathematical symbolism omitted (nomath), numbering 10,137,007 paragraphs. This math-free resource is smaller, as omitting the formulas results in fewer unique paragraphs. It may serve as a useful benchmark when trying to evaluate the specific impact of mathematical expressions on classification performance. Community feedback is welcome on data quality, representation issues, and organization and archival best practices. Joint releases of new versions of this data are planned together with new releases of the arXMLiv corpus.
Download from	https://sigmathling.kwarc.info/resources/arxmliv-statements-082018/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.153.pdf
Edition	LREC 2020
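
The layout described for this resource (one subdirectory per annotated class, one paragraph per file, one sentence per line) can be read with a short loop. The following is only a minimal sketch under stated assumptions: the function name, the file name "0001.txt", the class name "definition", and the UTF-8 encoding are hypothetical and are not confirmed by the resource description.

```python
import tempfile
from pathlib import Path
from typing import Iterator, List, Tuple


def iter_paragraphs(root: Path) -> Iterator[Tuple[str, List[str]]]:
    """Yield (class_label, sentences) for every paragraph file under root.

    Assumes each immediate subdirectory of root is named after a class and
    each file inside it holds one paragraph, one sentence per line.
    """
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for para_file in sorted(class_dir.iterdir()):
            if not para_file.is_file():
                continue
            text = para_file.read_text(encoding="utf-8")  # encoding is an assumption
            sentences = [line for line in text.splitlines() if line.strip()]
            yield class_dir.name, sentences


# Build a tiny stand-in directory tree (hypothetical names) and read it back.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "definition").mkdir()
    (root / "definition" / "0001.txt").write_text(
        "A group is a set.\nIt has an operation.\n", encoding="utf-8"
    )
    examples = list(iter_paragraphs(root))
```

Yielding paragraphs lazily keeps memory use flat, which matters for a collection with millions of files.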

Name	STEM-ECR
Resource type	Corpus
Size	3.3 MByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Information Extraction, Information Retrieval
License	CC-BY-SA 3.0
Conditions of use	Freely Available
Description	The STEM ECR v1.0 dataset introduces the task of Scientific Entity Extraction, Classification, and Resolution on scholarly publications in STEM (Science, Technology, Engineering, and Medicine) disciplines. It comprises annotated scholarly abstracts from 10 STEM disciplines that were found to be the most prolific ones on a major publishing platform. The annotated data includes: phrase-based scientific entities, and their corresponding disambiguated references in Wikipedia and Wiktionary as applicable. The purpose of the dataset is to provide a benchmark for the evaluation of scientific entity extraction, classification, and resolution tasks in a domain-independent fashion.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/1130_res_1.zip [4.93 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.268.pdf
Edition	LREC 2020

Name	SVALex
Resource type	Lexicon
Size	15619 entries
Languages	Swedish (swe)
Production status	Existing-used
Resource usage	Language Learning/Grading
License	CC BY-NC-SA 4.0
Conditions of use	Freely Available
Description	Lexical resources for foreign language (L2) learning, teaching and research graded following the Common European Framework of Reference for Languages (CEFR)
Download from	http://pub.cl.uzh.ch/purl/multiCEFRLex
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.43.pdf
Edition	LREC 2020

Name	TableBank
Resource type	Evaluation Data
Size	26.6 GByte
Languages	<Not Specified>
Production status	Newly created-finished
Resource usage	Information Extraction, Information Retrieval
License	CC-BY-NC-ND 4.0
Conditions of use	From Owner
Description	TableBank is a new image-based table detection and recognition dataset built with novel weak supervision from Word and LaTeX documents on the internet. It contains 417K high-quality labeled tables.
Download from	https://github.com/doc-analysis/TableBank
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.236.pdf
Edition	LREC 2020

Name	TamilMemes
Resource type	Corpus
Size	<Not Specified>
Languages	English, Tamil, Tamil-English code mixed
Production status	<Not Specified>
Resource usage	<Not Specified>
License	<Not Specified>
Conditions of use	<Not Specified>
Description	TamilMemes folder contains troll and non-troll images in the directory with the same name.
Download from	https://github.com/sharduls007/TamilMemes
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/workshops/WILDRE-5/pdf/2020.wildre-1.2.pdf
Edition	WILDRE-5 2020

Name	TaPaCo
Resource type	Corpus
Size	2789631 sentences
Languages	English, Esperanto, Italian, Russian, Turkish, French, German, Portuguese, Spanish + 88 other languages
Production status	Newly created-finished
Resource usage	Textual Entailment and Paraphrasing
License	CC-BY 2.0
Conditions of use	Freely Available
Description	A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database.
Download from	https://doi.org/10.5281/zenodo.3707949
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.848.pdf
Edition	LREC 2020

Name	Taxonomic Word Embeddings - Trained on English WordNet Random Walk Pseudo-Corpora
Resource type	Corpus
Size	71 models
Languages	English
Production status	Newly created-finished
Resource usage	Language Modelling
License	CC BY-NC-SA 4.0
Conditions of use	Freely Available
Description	This archive contains a collection of computational models called word embeddings. These are vectors that contain numerical representations of words. They have been trained on pseudo-sentences generated artificially from a random walk over the English WordNet taxonomy, and thus reflect taxonomic knowledge about words (rather than contextual knowledge). A separate embedding model was trained for each of the 72 random walk corpora that were generated, and thus 72 different embedding models are made available (4.4GB). For training, an off-the-shelf implementation of pytorch was used and no major parameters were changed, essentially using it 'as is'. Each model has been trained for 30 epochs. As the corpus files differed with regard to the parameters used in their generation, these are also reflected in the models. The parameters are: - size: number of sentences/lines in the training corpus - direction: the direction that the random walk over WordNet was allowed to go while generating sentences (possibilities are up/down/both) - minimal sentence length: the shortest sentence length (in number of words)
Download from	https://arrow.dit.ie/datas/12/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.602.pdf
Edition	LREC 2020

Name	The Alice Datasets: fMRI & EEG Observations of Natural Language Comprehension
Resource type	Corpus
Size	75 participants
Languages	English
Production status	Newly created-finished
Resource usage	Corpus Creation/Annotation
License	CC0
Conditions of use	Freely Available
Description	The Alice Datasets are a set of datasets based on magnetic resonance data and electrophysiological data, collected while participants heard a story in English. Along with the datasets and the text of the story, there exists a variety of different linguistic and computational measures ranging from prosodic predictors to predictors capturing hierarchical syntactic information included in the supplementary materials. These ecologically valid datasets can be easily reused to replicate prior work and to test new hypotheses about natural language comprehension in the brain.
Download from	https://openneuro.org/datasets/ds002322/versions/1.0.3
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.15.pdf
Edition	LREC 2020

Name	Time-Aware Word Embeddings for Three Lebanese News Archives
Resource type	Corpus
Size	3.7 GByte
Languages	Arabic
Production status	Newly created-finished
Resource usage	Text Mining
License	CC-BY 4.0
Conditions of use	Freely Available
Description	The resource consists of various Arabic word embeddings trained using three large Lebanese news archives, namely Assafir, Annahar and Alhayat. The archives consist of 609,386 scanned newspaper images and span a total of 151 years, ranging from 1933 to 2011. To train the word embeddings, Google's Tesseract 4.0 OCR engine was employed to transcribe the scanned news archives, and various archive-level as well as decade-level word embeddings were learnt.
Download from	https://zenodo.org/record/3538880
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.580.pdf
Edition	LREC 2020

Name	TOROT
Resource type	Treebank
Size	1155577 words
Languages	Church Slavic (chu), Old East Slavonic (orv), Russian (rus)
Production status	Existing-updated
Resource usage	Corpus Creation/Annotation
License	CC-BY-NC-SA 3.0
Conditions of use	Freely Available
Description	Tromsø Old Russian and Old Church Slavonic Treebank (TOROT) spans from the earliest Old Church Slavonic to modern Russian texts, covering more than a thousand years of continuous language history. It is a member of the larger PROIEL family of treebanks of ancient languages.
Download from	http://torottreebank.github.io/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.646.pdf
Edition	LREC 2020

Name	TRANSLIT
Resource type	Corpus
Size	1.6M entries
Languages	<Not Specified>
Production status	Newly created-finished
Resource usage	Machine Translation, SpeechToSpeech Translation
License	CC0
Conditions of use	Freely Available
Description	A large-scale name transliteration corpus. It merges large transliteration resources such as JRC, Geonames, Amazon transliteration and wiki-lang-all (new) names. The resources were unified and the name variants were merged.
Download from	https://github.com/fbenites/TRANSLIT
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.399.pdf
Edition	LREC 2020

Name	Turku NER corpus
Resource type	Corpus
Size	202116 tokens
Languages	Finnish (fin)
Production status	Newly created-finished
Resource usage	Named Entity Recognition
License	CC-BY-SA 4.0
Conditions of use	Freely Available
Description	The Turku NER corpus is a manually annotated broad-coverage corpus for Finnish NER, covering over 10,000 annotations for person, organization, location, product and event mentions as well as dates in 754 documents (approx. 200,000 tokens) representing ten different genres of text. The corpus builds on and extends the existing annotations of the Turku Dependency Treebank corpus included in the Universal Dependencies repository.
Download from	https://turkunlp.org/turku-ner-corpus
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.567.pdf
Edition	LREC 2020

Name	TutorialVQA
Resource type	Corpus
Size	4 GByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Question Answering
License	CC-BY-NC 4.0
Conditions of use	Freely Available
Description	The VideoQA corpus contains questions and answers pertaining to Photoshop helpx tutorial videos. The questions were collected on Amazon Mechanical Turk (AMT) and pertain to helpx tutorial video segments. These video segments were manually annotated. In total there are approximately 6,000 questions collected, pertaining to 408 total video segments, and relating to 76 videos. Turkers were shown the video segments and asked to input questions which the video segment can answer. To expand the question set, a separate task asked Turkers to paraphrase the original questions. The videos originate from https://helpx.adobe.com/.
Download from	https://github.com/acolas1/TutorialVQAData
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.670.pdf
Edition	LREC 2020

Name	TV-AfD_Imperative_Corpus
Resource type	Corpus
Size	3 MByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Language Modelling
License	GNU GPL v3.0
Conditions of use	Freely Available
Description	This is a corpus with imperative and non-imperative sentences from the TV show The Big Bang Theory and Wikipedia Articles for Deletion. We used rule-based classification and manual annotation to classify imperative sentences. 1. Data from the TV show: Both raw data and classified data are included in this repository. Each file in the raw data is from one episode, with each line representing one dialogue from that specific episode. In the tab-separated classification data file, each line represents one dialogue. The columns respectively are: data source (in the format of nxmm, with 'n' indicating the season of the show and 'mm' representing the episode), text, imperative classification (1 for imperative and 0 for non-imperative), imperative category, and whether the imperative sentence has affixal negative markers or not. 2. Data from Wikipedia Articles for Deletion: Only classified data for sentences are included from this source. The file is tab-separated with two columns, text and classification (1 for imperative and 0 for non-imperative) respectively. Each line represents one instance of data.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/235_res_1.zip [1.07 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.805.pdf
Edition	LREC 2020
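
The five-column, tab-separated classification file described for the TV show data can be read roughly as follows. This is a sketch only: the sample rows, the category value "command", the "no" values in the last column, and the absence of a header row are all assumptions made for illustration and are not taken from the corpus.

```python
import csv
import io
import re
from typing import List, NamedTuple, Tuple


class Dialogue(NamedTuple):
    season: int
    episode: int
    text: str
    is_imperative: bool
    category: str
    affixal_negation: str


def parse_source(code: str) -> Tuple[int, int]:
    """Split an 'nxmm' source code such as '1x05' into (season, episode)."""
    match = re.fullmatch(r"(\d+)x(\d{2})", code.strip())
    if match is None:
        raise ValueError(f"unexpected source code: {code!r}")
    return int(match.group(1)), int(match.group(2))


def read_dialogues(handle) -> List[Dialogue]:
    """Parse rows of: source, text, label (1/0), category, negation marker."""
    rows = []
    for record in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(record) != 5:
            continue  # skip lines that do not match the described layout
        source, text, label, category, negation = record
        season, episode = parse_source(source)
        rows.append(Dialogue(season, episode, text, label == "1", category, negation))
    return rows


# Hypothetical sample rows in the described column order.
sample = io.StringIO(
    "1x05\tPass me the salt.\t1\tcommand\tno\n"
    "1x05\tI like trains.\t0\t\tno\n"
)
dialogues = read_dialogues(sample)
```

Quoting is disabled so that quotation marks inside dialogue text are kept as literal characters.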

Name	Twitter Datasets
Resource type	Corpus
Size	1904 stance-annotated tweets
Languages	German (deu)
Production status	Newly created-finished
Resource usage	Opinion Mining/Sentiment Analysis
License	CC-BY 4.0
Conditions of use	Freely Available
Description	The German Titling Twitter Corpus consists of 1904 stance-annotated tweets (125KB) collected in June/July 2018 mentioning 24 German politicians with a doctoral degree. The Addendum contains an additional 296 stance-annotated tweets from each month of 2018 mentioning 6 left-leaning and 4 right-leaning politicians with a doctoral degree.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/181_res_1.zip [125 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.606.pdf
Edition	LREC 2020

Name	Twitter Users with City Population Size
Resource type	Corpus
Size	2000000 sentences
Languages	English
Production status	Newly created-finished
Resource usage	Document Classification, Text categorisation
License	<Not Specified>
Conditions of use	Freely Available
Description	2 million Twitter user IDs and tweet IDs for tweets written by each user. Can be used to retrieve original data from the Twitter API subject to Twitter's terms of service.
Download from	http://lit.eecs.umich.edu/downloads.html
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.771.pdf
Edition	LREC 2020

Name	Universal Dependencies
Resource type	Treebank
Size	25 million words
Languages	Afrikaans, Akkadian, Amharic, Ancient Greek, Arabic, Armenian, Assyrian, Bambara, Basque, Belarusian, Bhojpuri, Breton, Bulgarian, Buryat, Cantonese, Catalan, Chinese, Classical Chinese, Coptic, Croatian, Czech, Danish, Dutch, English, Erzya, Estonian, Faroese, Finnish, French, Galician, German, Gothic, Greek, Hebrew, Hindi, Hindi English, Hungarian, Indonesian, Irish, Italian, Japanese, Karelian, Kazakh, Komi Permyak, Komi Zyrian, Korean, Kurmanji, Latin, Latvian, Lithuanian, Livvi, Maltese, Marathi, Mbya Guarani, Moksha, Naija, North Sami, Norwegian, Old Church Slavonic, Old French, Old Russian, Persian, Polish, Portuguese, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Skolt Sami, Slovak, Slovenian, Spanish, Swedish, Swedish Sign Language, Swiss German, Tagalog, Tamil, Telugu, Thai, Turkish, Ukrainian, Upper Sorbian, Urdu, Uyghur, Vietnamese, Warlpiri, Welsh, Wolof, Yoruba
Production status	Existing-updated
Resource usage	Parsing and Tagging
License	Universal Dependencies 2.6 License Agreement
Conditions of use	Freely Available
Description	Universal Dependencies is an initiative to create cross-linguistically consistent morphosyntactic annotation for many languages. The latest release, UD v2.5, contains 157 treebanks in 90 languages. The resources are available from https://lindat.mff.cuni.cz/. Documentation can be found at https://universaldependencies.org.
Download from	https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3105
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.497.pdf
Edition	LREC 2020

Name	Urban Dictionary Embeddings
Resource type	Corpus
Size	3 GByte
Languages	English
Production status	Newly created-finished
Resource usage	Document Classification, Text categorisation
License	<Not Specified>
Conditions of use	Freely Available
Description	Word embeddings trained on the entirety of the content of Urban Dictionary as of October 16, 2019. A skip-gram architecture with negative sampling was used within the fastText framework.
Download from	http://smash.inf.ed.ac.uk/ud-embeddings/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.586.pdf
Edition	LREC 2020

Name	Urdu-News-Augmented-Dataset
Resource type	Corpus
Size	5.04 MByte
Languages	English, Urdu (urd)
Production status	Newly created-finished
Resource usage	Document Classification, Text categorisation
License	<Not Specified>
Conditions of use	Freely Available
Description	This is a language resource which contains a dataset of 900 news articles originally in Urdu annotated as real or fake. Additionally, it contains 400 news articles as an augmentation dataset generated using the Google Translate MT system from English to Urdu, as well as a number of combinations of these datasets for exploration of the augmentation effect. The original English Fake News dataset is available from https://web.eecs.umich.edu/~mihalcea/downloads.html#FakeNews.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/66_res_1.zip [5.29 Mb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.309.pdf
Edition	LREC 2020

Name	WFM mathematical definitions dataset
Resource type	Corpus
Size	<Not Specified>
Languages	English
Production status	Newly created-finished
Resource usage	Definition extraction
License	GNU GPL v3.0
Conditions of use	Freely Available
Description	The WFM dataset (Vanetik et al., 2019) was created by collecting and processing 2352 articles from Wolfram Mathworld (Weisstein and others, 2007). The dataset contains 1793 sentences, of which 811 are definitions and 982 are non-definitions. Sentences were extracted automatically and then manually separated into two categories: definitions and statements (non-definitions).
Download from	https://github.com/uplink007/FinalProject/tree/master/data/wolfram
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.256.pdf
Edition	LREC 2020

Name	WikiFactCheck-English
Resource type	Corpus
Size	124821 entries
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Fact checking
License	<Not Specified>
Conditions of use	Freely Available
Description	WikiFactCheck-English is a dataset of 124k triples consisting of a claim, context and an evidence document extracted from English Wikipedia articles and citations, as well as 34k manually written claims that are refuted by the evidence documents. This is the largest fact checking dataset consisting of real claims and evidence to date; it will allow the development of fact checking systems that can better process claims and evidence in the real world.
Download from	https://github.com/WikiFactCheck-English
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.849.pdf
Edition	LREC 2020

Name	word2word
Resource type	Lexicon
Size	41 GByte
Languages	af,ar,bg,bn,br,bs,ca,cs,da,de,el,en,eo,es,et,eu,fa,fi,fr,gl,he,hi,hr,hu,hy,id,is,it,ja,ka,kk,ko,lt,lv,mk,ml,ms,nl,no,pl,pt,pt_br,ro,ru,si,sk,sl,sq,sr,sv,ta,te,th,tl,tr,uk,ur,vi,ze_en,ze_zh,zh_cn,zh_tw
Production status	Newly created-finished
Resource usage	Machine Translation, SpeechToSpeech Translation
License	GNU GPL v3.0
Conditions of use	Freely Available
Description	A collection of bilingual lexicons for 3,564 language pairs.
Download from	https://github.com/Kyubyong/word2word
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.371.pdf
Edition	LREC 2020

Name	Yarega
Resource type	Corpus
Size	92848 words
Languages	Telugu
Production status	Newly created-finished
Resource usage	Opinion Mining/Sentiment Analysis
License	<Not Specified>
Conditions of use	Freely Available
Description	The sentences in the dataset are in Telugu script. The annotated data for aspect term identification is in the form of word/tag. Annotated data for aspect polarity classification and aspect term categorisation is in JSON format. Each JSON file contains an array of JSON objects. Each object consists of two properties, "sentence" and "aspectTerms". The property "sentence" has the review sentence. The property "aspectTerms" has an array of JSON objects where each object has the properties "aspectTerm", "start", "end", "polarity" and "category". "aspectTerm" has the aspect term itself. "start" and "end" have the beginning and ending indexes of the aspect term. "polarity" and "category" are the annotated polarity of the aspect term and the annotated category of the aspect term respectively. The dataset contains 5027 sentences and 92848 tokens. The total number of aspect terms in all those sentences is 7130. Of these, 3521 aspect terms are of positive polarity, 2480 are of negative polarity and 1129 are of neutral polarity.
Download from	http://lrec2020.lrec-conf.org/sharedlrs2020/706_res_1.zip [53 Kb]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.617.pdf
Edition	LREC 2020
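
The JSON layout described for the polarity and category annotations can be explored with a few lines of code. The record below is a hypothetical English placeholder (the real sentences are in Telugu script), the label strings "positive" and "general" are assumed values, and treating "start"/"end" as end-exclusive character offsets is also an assumption, since the description does not say how the indexes are counted.

```python
import json
from collections import Counter

# Hypothetical record following the described property names.
sample = json.loads("""
[
  {
    "sentence": "example review sentence",
    "aspectTerms": [
      {"aspectTerm": "review", "start": 8, "end": 14,
       "polarity": "positive", "category": "general"}
    ]
  }
]
""")


def polarity_counts(records):
    """Count aspect terms per polarity label across all records."""
    counts = Counter()
    for record in records:
        for term in record["aspectTerms"]:
            counts[term["polarity"]] += 1
    return counts


def span_matches(record, term):
    """Check the span against the sentence, assuming end-exclusive offsets."""
    return record["sentence"][term["start"]:term["end"]] == term["aspectTerm"]


counts = polarity_counts(sample)
spans_ok = all(span_matches(r, t) for r in sample for t in r["aspectTerms"])

# Figures reported in the description: the three polarity classes sum to 7130.
reported = {"positive": 3521, "negative": 2480, "neutral": 1129}
reported_total = sum(reported.values())
```

Running the span check over the real files would be a quick way to find out whether the indexes are in fact end-exclusive character offsets.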

Name	Yorùbá and Twi Corpus
Resource type	Corpus
Size	<Not Specified>
Languages	Twi, Yorùbá
Production status	Newly created-finished
Resource usage	Corpus Creation/Annotation
License	<Not Specified>
Conditions of use	Freely Available
Description	This is a collection of some noisy and fully diacritized Yorùbá texts from different sources. It also includes some Twi texts, as well as annotated Yorùbá texts for Named Entity Recognition tasks.
Download from	https://github.com/ajesujoba/YorubaTwi-Embedding
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.335.pdf
Edition	LREC 2020

Name	Yorùbá and Twi FastText Embedding
Resource type	Corpus
Size	<Not Specified>
Languages	Twi, Yorùbá
Production status	Newly created-finished
Resource usage	Natural Language Processing Downstream tasks
License	<Not Specified>
Conditions of use	Freely Available
Description	Collection of FastText pre-trained embeddings for Yorùbá and Twi which can be used for downstream tasks in Natural Language Processing.
Download from	https://drive.google.com/drive/folders/1jkwLBkxJhnfVvf1yd7PyZw0nY8aNYaNN?usp=sharing
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.335.pdf
Edition	LREC 2020

Name	Yorùbá and Twi WordSim-353 dataset
Resource type	Evaluation Data
Size	<Not Specified>
Languages	Twi, Yorùbá
Production status	Newly created-finished
Resource usage	Evaluation/Validation
License	<Not Specified>
Conditions of use	Freely Available
Description	This is the collection of the WordSim-353 word pairs dataset for Yorùbá and Twi. All the already existing WordSim-353 word pairs were translated from English to Yorùbá and Twi.
Download from	https://github.com/ajesujoba/YorubaTwi-Embedding
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.335.pdf
Edition	LREC 2020

Name	Zamboanga Chabacano Spelling Errors Ontology
Resource type	Ontology
Size	<Not Specified>
Languages	Chavacano (cbk)
Production status	Newly created-finished
Resource usage	Corpus Creation/Annotation
License	<Not Specified>
Conditions of use	Freely Available
Description	A Zamboanga Chabacano spelling errors ontology in Web Ontology Language (OWL) format.
Download from	https://research.chavacano.org/cwzcc.owl
Referring paper	http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.327.pdf
Edition	LREC 2020
