LREC recognises the importance of sharing Language Resources (LRs) and making them available to the community. When submitting a paper, participants were offered the possibility to share their LRs (data, tools, web-services, etc.), uploading them in a special LREC repository set up by ELRA. This effort of sharing LRs, linked to the LRE Map initiative for their description, contributes to creating a common repository where everyone can deposit and share data.
After the conference, the Shared LRs set at LREC 2020 was manually checked and a cleaned version of the list of LRs is now available. The LRs in this list comply with the following criteria:
- LRs accessible (whether downloadable directly or through an external URL)
- LRs categorized as Datasets only. A dataset can be a:
  - Corpus
  - Evaluation Data
  - Grammar/Language Model
  - Language Resources/Technologies Infrastructure
  - Lexical Conceptual Resource General Information
  - Lexicon
  - Ontology
  - Terminology
  - Treebank
Excluded LRs are:
- Uploaded LRs whose content does not match the description
- LRs with no download URL or whose URL is now a dead link
- LRs categorized as tools or guidelines
- LRs associated with rejected papers
-
Name A Chinese Corpus for Fine-grained Entity Typing Resource type Corpus Size 650 MByte Languages Chinese, English Production status Newly created-finished Resource usage Named Entity Recognition License <Not Specified> Conditions of use Freely Available Description <Not Specified> Download from https://drive.google.com/file/d/1xorWUdTi9r43tTEdwJ4tKa9ErvRjossU/view Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.548.pdf Edition LREC 2020 -
Name A corpus of precise natural entailment problems Resource type Corpus Size 150 entailment problems Languages English Production status Newly created-finished Resource usage Textual Entailment and Paraphrasing License <Not Specified> Conditions of use Freely Available Description <Not Specified> Download from http://lrec2020.lrec-conf.org/sharedlrs2020/172_res_1.zip [31 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.844.pdf Edition LREC 2020 -
Name A New Dataset for Sindhi Named Entity Recognition Resource type Corpus Size 1358724 words Languages Sindhi Production status Newly created-finished Resource usage Named Entity Recognition License <Not Specified> Conditions of use Freely Available Description We present the first large named entity recognition dataset for low-resourced Sindhi language with quality baselines. Our work mainly consists of three novel contributions. Firstly, the annotation of a gold-standard SiNER (Sindhi Named Entity Recognition) dataset, secondly we present quality baselines using conditional random field (CRF) and recent state-of-the-art deep bi-directional long short term memory networks (Bi-LSTM) neural models in named entity recognition and thirdly the performance comparison of Bi-LSTM, Bi-LSTM-CRF models on GloVe and fastText word representations. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1444_res_1.zip [8.88 Mb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.361.pdf Edition LREC 2020 -
Name A Seed Corpus of Hindu Temples In India Resource type Corpus Size 1.7 MByte Languages English (eng) Production status Newly created-finished Resource usage Information Extraction, Information Retrieval License <Not Specified> Conditions of use Freely Available Description This json file consists of 4933 high accuracy facts about 573 temples. The file format is as below -------------------------------- {"head_line": {"questions": [] }"templ1":{"context": "text","answers": [ [],[]] ------------------------------------ Download from https://github.com/priyaradhakrishnan0/templeKB/blob/master/corpus/WebTempleCorpus.json Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.32.pdf Edition LREC 2020 -
Name AbuseEval Resource type Corpus Size 14100 NULL Languages English Production status Newly created-finished Resource usage Corpus Creation/Annotation License CC BY-NC-SA 4.0 Conditions of use Freely Available Description Extension of OLID/OffensEval data set with distinction of explicit vs implicit offensive messages. Annotation of Abusive Language, distinguishing also between explicit vs implicit offensive messages. Download from https://github.com/tommasoc80/AbuseEval Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.760.pdf Edition LREC 2020 -
Name Amharic News Corpus Resource type Corpus Size 16295955 tokens Languages Amharic (amh) Production status Newly created-finished Resource usage Knowledge Discovery/Representation License <Not Specified> Conditions of use Freely Available Description This corpus of Amharic news websites was built by using our own crawler on four Amharic news websites from a variety of genres. As a pre-process, we removed non-Amharic words and letters and replaced all digits with the `#` character. After pre-processing, the corpus contained 16,295,955 tokens consisting of 855,109 unique tokens with average frequency of 19.05; this is similar in size to the English `text8` benchmark (17 million tokens, with an average frequency of 66). This is the corpus used for training embeddings in the publication: Mersha, Amanuel and Wu, Stephen. Morphology-rich Alphasyllabary Embeddings. Proceedings of the 12th Edition of the Language Resources and Evaluation Conference (LREC). Marseilles, France. 2020. Download from https://github.com/leobitz/amharic_word_embedding Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.315.pdf Edition LREC 2020 -
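Note on the Amharic News Corpus entry: the pre-processing and the average-frequency figure in its description can be illustrated with a minimal Python sketch. This is not the authors' crawler or cleaning code. The function names are hypothetical, "non-Amharic" is approximated here by the Unicode Ethiopic block (U+1200 to U+137F), and only ASCII digits are mapped to `#`; all three are assumptions made for illustration.

```python
import re

# Assumption: anything outside the Ethiopic block, '#', and whitespace
# counts as "non-Amharic" and is dropped.
NON_ETHIOPIC = re.compile(r"[^\u1200-\u137F#\s]")


def preprocess(text):
    """Replace each ASCII digit with '#', strip non-Ethiopic characters, tokenize."""
    text = re.sub(r"[0-9]", "#", text)
    text = NON_ETHIOPIC.sub(" ", text)
    return text.split()


def average_frequency(tokens):
    """Average number of occurrences per unique token."""
    return len(tokens) / len(set(tokens))


# Token counts quoted in the description: roughly 19 occurrences per unique token.
reported = 16_295_955 / 855_109

tokens = preprocess("ሰላም hello 2020 ዓለም")
print(tokens)
print(round(reported, 2))
```

Splitting on whitespace is the simplest possible tokenizer; the published corpus may have been tokenized differently.
-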
Name Amharic Treebank Resource type Treebank Size 1000 sentences Languages Amharic (amh) Production status <Not Specified> Resource usage <Not Specified> License <Not Specified> Conditions of use Freely Available Description The Amharic treebank (ATT) version 1 contains 1,074 manually-annotated sentences (5,245 tokens or 10,010 words). The sentences were collected from grammar books, biographies, news, and fictional and religious texts. The researchers made an effort to include different types of sentences. Download from https://github.com/Binyamephrem/Amharic-treebank Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.238.pdf Edition RAIL 2020 -
Name Anföranden Resource type Corpus Size 121987537 tokens Languages Swedish Production status Available Resource usage <Not Specified> License CC-BY 4.0 Conditions of use Freely Available Description This resource contains all Swedish parliamentary debates from autumn 1993 to spring 2018. They have been cleaned and linguistically annotated through the corpus pipeline of Språkbanken Text, University of Gothenburg. Download from https://spraakbanken.gu.se/en/resources/rd-anf Referring paper http://www.lrec-conf.org/proceedings/lrec2020/workshops/ParlaCLARIN2/pdf/2020.parlaclarin-1.2.pdf Edition ParlaCLARIN2 2020 -
Name Annotated Corpora for Term Extraction Research (ACTER) Resource type Corpus Size 19002 terms and named entities Languages Dutch, English, French Production status Version 1.2 completed and validated Resource usage Training and evaluation data for automatic term extraction (ATE) License CC-BY-NC-SA 4.0 Conditions of use Freely Available under appropriate license Description The Annotated Corpora for Term Extraction Research (ACTER), version 1.2, contain texts in four domains (corruption, dressage (horse riding), heart failure, and wind energy) and three languages (English, French, Dutch). For each corpus (combination of domain & language), around 50k tokens have been manually annotated to identify terminology and named entities (almost 600k annotated tokens in total). The results are presented as lists of annotations per corpus, with one (lowercased, unlemmatised, unique) annotation per line, tab-separated by its label. In total, there are 19k unique annotations. The annotation process is transparent and well-documented, with freely available guidelines (http://hdl.handle.net/1854/LU-8503113) and several published papers for the validation of the dataset. It has also been used for the TermEval 2020 shared task on automatic term extraction, organised at the CompuTerm workshop at LREC 2020. Download from https://bit.ly/3lNHdr2 Referring paper http://www.lrec-conf.org/proceedings/lrec2020/workshops/COMPUTERM2020/pdf/2020.computerm-1.12.pdf Edition COMPUTERM 2020 -
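Note on the ACTER entry: the annotation lists it describes (one annotation per line, followed by a tab and its label) can be read with a few lines of Python. The sketch below is illustrative only: the function names and the sample labels `TERM` and `NAMED_ENTITY` are placeholders, not the label set actually used in the dataset.

```python
from collections import Counter


def parse_annotations(lines):
    """Return (annotation, label) pairs from 'annotation<TAB>label' lines."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if "\t" not in line:
            raise ValueError(f"missing tab separator: {line!r}")
        # Split on the last tab so the label is always the final field.
        annotation, label = line.rsplit("\t", 1)
        pairs.append((annotation, label))
    return pairs


def count_labels(pairs):
    """Count how many annotations carry each label."""
    return Counter(label for _, label in pairs)


# Hypothetical sample in the described layout.
sample = [
    "heart failure\tTERM\n",
    "ejection fraction\tTERM\n",
    "world health organization\tNAMED_ENTITY\n",
]
pairs = parse_annotations(sample)
print(count_labels(pairs))
```

For a real file, passing an open file handle (opened with `encoding="utf-8"`) to `parse_annotations` works the same way as passing a list of lines.
-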
Name Anomalous Lexical Units Resource type Lexicon Size 300 entries Languages English (eng) Production status Newly created-finished Resource usage Lexicon Creation/Annotation License <Not Specified> Conditions of use Freely Available Description The resource contains 300 WordNet synsets that cannot be characterized by any existing semantic frame in Berkeley FrameNet data release 1.7. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/611_res_1.tsv [20 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.431.pdf Edition LREC 2020 -
Name AR-ASAG Dataset Resource type Corpus Size 4.46 MByte Languages Algerian Arabic (arq) Production status Newly created-finished Resource usage Evaluation/Validation License <Not Specified> Conditions of use Freely Available Description The Dataset consists of reported evaluations related to automatic short answer grading in the Arabic language. The exams were conducted under natural conditions of evaluation. Each test consists of 16 short answer questions (a total of 48 questions). Students submitted answers to these questions. The number of answers obtained differs from one question to another. The Dataset includes a total of 2133 pairs (model answer, student answer). For each pair, two grades (Mark1 and Mark2) are associated with a manual Average Gold Score. Both manual grades are available in the Dataset. Inter-annotator agreement: Pearson = 0.8384 and Root Mean Square Error (RMSE) = 0.8381. The Dataset encompasses 5 types of questions: "Define", "Explain", "What consequences?", "Justify", and "What is the difference?".
AR-ASAG Dataset is available in different versions: TXT, XML, XML-MOODLE and Database (.DB). The .DB format allows the necessary exports to be made according to specific analysis needs. The XML-MOODLE format is used on Moodle e-learning platforms. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/345_res_1.rar [751.75 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.321.pdf Edition LREC 2020 -
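Note on the AR-ASAG entry: the two agreement figures it reports (Pearson correlation and RMSE between the two manual grades) follow the standard definitions, sketched below in plain Python. The grade lists are toy values invented for illustration, not data from the corpus, and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length grade lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def rmse(xs, ys):
    """Root mean square error between two equal-length grade lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


# Toy grades from two hypothetical annotators.
mark1 = [1, 2, 3, 4]
mark2 = [2, 2, 4, 4]
print(pearson(mark1, mark2), rmse(mark1, mark2))
```

A higher Pearson value means the two graders rank answers similarly, while a lower RMSE means their absolute marks are closer.
-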
Name Arabic In-Domain Cyber Text Corpus Resource type Corpus Size 22.9 MByte Languages Algerian Arabic (arq) Production status Newly created-finished Resource usage Corpus Creation/Annotation License <Not Specified> Conditions of use Freely Available Description This is an Arabic text corpus covering the specific field of cybercrimes. Documents: 1273. Words: 2009110. The corpus was used to explore the Automatic Short Answer Grading (ASAG) field for the Arabic language. The corpus was automatically obtained from texts extracted from a collection of URLs according to a list of key terms. Key terms are combined and submitted as queries to a search engine, which returns a list of potentially relevant URLs. The URLs are then inspected and validated. Relevant web pages are retrieved and automatically cleaned of HTML tags. The text is extracted and added to the corpus. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/345_res_2.rar [6.15 Mb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.321.pdf Edition LREC 2020 -
Name ArSpecEmb Resource type Corpus Size 3196k words Languages Arabic (ara) Production status Newly created-finished Resource usage Opinion Mining/Sentiment Analysis License OpenSource Conditions of use Freely Available Description ArSpecEmb regroups embeddings that take into account specificity of Arabic language. They are trained on different types of corpora. Download from https://lium.univ-lemans.fr/en/arsentimentanalysis/ Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.610.pdf Edition LREC 2020 -
Name Austrian named entities Resource type Lexical Conceptual Resource General Information Size 15457 named entities Languages German Production status In use Resource usage Machine Translation License CC-BY-SA 3.0 Conditions of use Freely Available Description Named entities of Austrian political parties, ministries of the federal administration, names of the federal ministers, name of the president, names of state secretaries, daily newspapers, Austrian common last names, Austrian common first names (male, female), Austrian geographical names (localities, municipalities, states, political districts), names of Austrian stock companies. The dataset "Austrian (German) named entities" is a compilation by University of Vienna of several datasets. Common names and Geographical names data originate from Statistics Austria. Names of Austrian stock companies originate from Wikimedia. Public domain information -- Names of President, State Secretaries and Ministers and Newspaper names are compiled by University of Vienna. Download from https://www.elrc-share.eu/repository/browse/austrian-named-entities/b0998b12ab9611e8b7d400155d02670612bad73492934202887a45e227312e0e Referring paper http://www.lrec-conf.org/proceedings/lrec2020/workshops/LT4Gov/pdf/2020.lt4gov-1.5.pdf Edition LT4Gov 2020 -
Name BERT-Base, Multilingual Resource type Grammar/Language Model Size NULL NULL Languages 104 languages Production status Existing-used Resource usage Textual Entailment and Paraphrasing License <Not Specified> Conditions of use Freely Available Description This is a release of 24 smaller BERT models (English only, uncased, trained with WordPiece masking) referenced in "Well-Read Students Learn Better: On the Importance of Pre-training Compact Models" (https://arxiv.org/abs/1908.08962). Download from https://github.com/google-research/bert Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.669.pdf Edition LREC 2020 -
Name BERT-based Abstractive Summarization Resource type Grammar/Language Model Size NULL NULL Languages <Not Specified> Production status Newly created-finished Resource usage Summarisation License MIT License Conditions of use <Not Specified> Description Summarization Library. Download from https://github.com/axenov/BERT-Summ-OpenNMT Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.825.pdf Edition LREC 2020 -
Name Biblical Hebrew Graph Database Resource type Lexicon Size 10 MByte Languages Biblical Hebrew Production status In production Resource usage <Not Specified> License <Not Specified> Conditions of use <Not Specified> Description A thesaurus of Biblical Hebrew, with headwords linked by semantic, phonetic, and distributional similarities. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/16_res_1.tgz [734 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.660.pdf Edition LT4HALA 2020 -
Name Bilingual Corpus of English-Arabic Parallel Tweets Resource type Corpus Size 3 MByte Languages Arabic, English Production status Complete Resource usage For Research License Apache 2.0 Conditions of use Publicly Available Description We provide tweet IDs of 166,706 pairs of English-Arabic parallel tweets. We share a list of 1389 Twitter accounts that post English-Arabic parallel tweets. We also share a subset of Twitter accounts annotated for their countries and the topics they most frequently tweet about. Download from http://bit.ly/2xApE8V Referring paper http://www.lrec-conf.org/proceedings/lrec2020/workshops/BUCC2020/pdf/2020.bucc-1.3.pdf Edition BUCC 2020 -
Name Bilingual troll tweets Resource type Corpus Size 2.3 MByte Languages English, Russian (rus) Production status Newly created-finished Resource usage Text Mining License <Not Specified> Conditions of use Freely Available Description A bilingual dataset with troll tweets. The dataset we built contains tweets from troll and legitimate accounts in two languages (including translation from one language to another) and can be used for training models for automatic detection of troll tweets/accounts in either English, Russian, or both languages. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1055_res_1.zip [2.32 Mb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.766.pdf Edition LREC 2020 -
Name Broad-Coverage German Sentiment Classification for Dialog Systems Resource type Corpus Size 5355043 entries Languages German Production status Newly created and combined with existing Resource usage Document Classification, Text categorisation License MIT License Conditions of use Freely Available Description This paper describes the training of a general-purpose German sentiment classification model. Sentiment classification is an important aspect of general text analytics. Furthermore, it plays a vital role in dialogue systems and voice interfaces that depend on the ability of the system to pick up and understand emotional signals from user utterances. The presented study outlines how we have collected a new German sentiment corpus and then combined this corpus with existing resources to train a broad-coverage German sentiment model. The resulting data set contains 5.4 million labelled samples. We have used the data to train both a simple convolutional and a transformer-based classification model and compared the results achieved on various training configurations. The model and the data set will be published along with this paper. Download from https://github.com/oliverguhr/german-sentiment Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.202.pdf Edition LREC 2020 -
Name CA-EHN Resource type Corpus Size 90505 analogies Languages Chinese Production status Newly created-finished Resource usage Evaluation/Validation License CC-BY-NC-SA 4.0 Conditions of use Freely Available Description CA-EHN is a commonsense word analogy dataset containing 90,505 analogies covering 5,656 words and 763 relations. The typical use of word analogy datasets is to evaluate word embeddings. Download from https://github.com/ckiplab/CA-EHN Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.365.pdf Edition LREC 2020 -
Name Cairo Student Code-Switch Corpus Resource type Corpus Size 505 KByte Languages Egyptian Arabic (arz), English (eng) Production status Newly created-finished Resource usage Morphological Analysis License OpenSource Conditions of use From Owner Description A corpus of Egyptian-Arabic code-switch speech data that is fully tokenized, lemmatized and annotated for part-of-speech tags. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1030_res_1.txt [517 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.489.pdf Edition LREC 2020 -
Name CantoMap Resource type Corpus Size 768 minutes Languages Yue Chinese (yue) Production status Newly created-finished Resource usage Corpus Creation/Annotation License GNU GPL v3.0 Conditions of use Freely Available Description The Cantonese MapTask corpus is a collection of recordings of the MapTask task in contemporary Hong Kong Cantonese. The corpus contains the following: - Recordings of the sessions of the MapTask are in the folder "ConversationData". Each sub-folder contains audio files and corresponding ELAN files that contain the transcription of the audio in Chinese characters (both non-segmented and manually segmented) and jyutping romanization. - Recordings of the reading task where participants read words aloud from a list are in the folder "ReadingData". - The maps used for the task are in the folder "Maps". In total, the corpus offers 12 hours and 48 minutes of recordings. Download from https://github.com/gwinterstein/CantoMap Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.355.pdf Edition LREC 2020 -
Name CEASE Resource type Corpus Size 66.3 KByte Languages English (eng) Production status Newly created-finished Resource usage Emotion Recognition/Generation License <Not Specified> Conditions of use From Owner Description The dataset is present in the 'data' folder which contains 15 '.txt' files, one for each emotion class: Emotion Classes considered are: Abuse, Anger, Blame, Fear, Forgiveness, Guilt, Hopefulness, Hopelessness, Happiness_Peacefulness, Information, Instruction, Love, Pride, Sorrow, Thankfulness. Each file contains pre-processed sentences from real-life suicide notes. The full corpus contains 2393 annotated instances from 15 emotion classes. Note: To anonymize any real-life reference of persons, address and organization in this dataset, the authors have used three tags: NAME in place of any person's name ADDRESS in place of any address of a person or organization ORGANIZATION in place of any name of an industry, organization, firm, etc. The authors have taken the effort to anonymize any real-life references in the dataset, but still, if any reference is missed, it is advised to anyone using this dataset to anonymize them using the above mentioned three tags before using it in their work. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/907_res_1.zip [7 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.201.pdf Edition LREC 2020 -
Name CETE codes and datasets Resource type Evaluation Data Size 68.7 MByte Languages English Production status Existing-used Resource usage Question Answering License <Not Specified> Conditions of use Freely Available Description We share the source code of our models as well as the pre-processed datasets (TREC-QA, WikiQA, SemEval2015-CQA, SemEval2016-CQA, SemEval2017-CQA). Our fine-tuning model is implemented based on this repository: https://github.com/huggingface/transformers and the feature-based model is implemented based on this repository: https://github.com/AnubhavGupta3377/Text-Classification-Models-Pytorch. The datasets are already publicly available. However, as publicly shared question answering datasets are often available only in unstructured text format, the pre-processed versions are shared so that they can be easily adapted for future research and reproduction of results. Each instance in the dataset is separated by " ". Download from https://drive.google.com/file/d/1iCm515ZhEN-UmdOR_ievK3EA8iuXacSz/view?usp=sharing Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.676.pdf Edition LREC 2020 -
Name Chinese Humor Corpus Resource type Corpus Size 12.7 MByte Languages Chinese (zho) Production status Newly created-finished Resource usage Document Classification, Text categorisation License MIT License Conditions of use Freely Available Description This Chinese humor corpus contains 3,365 jokes collected from over 40 sources. Each joke was labeled with five levels of funniness, eight skill sets of humor, and six dimensions of intent. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/632_res_1.zip [5.23 Mb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.168.pdf Edition LREC 2020 -
Name Chinese PragBank Resource type Corpus Size 167 KByte Languages Mandarin Chinese (cmn) Production status Newly created-finished Resource usage Corpus Creation/Annotation License OpenSource Conditions of use From Owner Description This corpus collects Chinese readers' veridicality judgments of news events, namely, whether readers believe an event is likely to happen or not. A total of 968 event sentences are included and each sentence is annotated with 6 veridicality judgments. There are 4 columns in the corpus: news sentences, normalized sentences, target events, and veridicality judgments. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/869_res_1.txt [171 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.33.pdf Edition LREC 2020 -
Name Cifu Resource type Lexicon Size 51798 entries Languages Yue Chinese (yue) Production status Newly created-finished Resource usage Psycholinguistics License GNU GPL v3.0 Conditions of use Freely Available Description This resource offers a lexical database for Hong Kong Cantonese. Each entry in the lexicon includes the following information: - the term in Chinese characters - a Jyutping romanization of the term - the frequency of the term (occurrence counts and per million words) in four different genres: Written, Spoken adult, Spoken child and Child directed speech - a count of the strokes in each character in the term (as a measure of the character complexity) - 6 measures of the Neighborhood Density of each entry (distinguishing whether insertion/deletion are allowed when searching for neighbors and the genre from which frequencies are retrieved to calculate the Neighborhood Density: Written, Spoken Adult or their average). Download from http://lrec2020.lrec-conf.org/sharedlrs2020/508_res_1.txt [7.86 Mb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.375.pdf Edition LREC 2020 -
Name Ciron: a New Benchmark Dataset for Chinese Irony Detection Resource type Corpus Size 1 MByte Languages Chinese (zho) Production status Newly created-finished Resource usage Document Classification, Text categorisation License <Not Specified> Conditions of use Freely Available Description Ciron includes more than 8.7k posts collected from Weibo, a microblogging platform. Ciron is one scale larger in size than the previously available dataset. Ciron is collected with no pre-conditions to ensure much wider coverage. There are five fine-grained classes for ironic ratings: 1 (not ironic), 2 (unlikely ironic), 3 (insufficient evidence), 4 (weakly ironic), 5 (strongly ironic). Classes 2 and 4 are introduced to allow for fine-grained distinctions in the judgement. The inter-rater agreement (Fleiss' Kappa) is 0.470. Evaluation of four different deep learning classifiers proves the usefulness of Ciron as an important resource for Chinese irony detection. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/105_res_1.zip [1.05 Mb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.701.pdf Edition LREC 2020 -
Name cldf-datasets/normansinitic: Structural and lexical data for the paper by Norman (2013) on Chinese dialect classification Resource type Lexicon Size 400 lexemes Languages Chinese (zho), Chinese dialects Production status Newly created-finished Resource usage Lexicon Creation/Annotation License CC-BY 4.0 Conditions of use Freely Available Description See documentation (and main text). Download from https://doi.org/10.5281/zenodo.3552559 Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.864.pdf Edition LREC 2020 -
Name Collaborative Dialogue in Minecraft Resource type Corpus Size 32 GByte Languages English (eng) Production status Existing-updated Resource usage Knowledge Discovery/Representation License CC-BY-SA 4.0 Conditions of use Freely Available Description Collaborative Dialogue in Minecraft: the annotations were added for the corpus sentences. Download from https://drive.google.com/drive/folders/16lDzswcQh8DR2jkQJdoVTK-RyVDFPHKa Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.717.pdf Edition LREC 2020 -
Name ConceptNet 5 Resource type Ontology Size 34 million concepts Languages English, French, German (deu), Italian, Spanish, Russian, Portuguese, Japanese, Dutch, Chinese and many others Production status Existing-updated Resource usage Semantic Web License CC-BY-SA 4.0 Conditions of use Freely Available Description ConceptNet is a freely-available semantic network, designed to help computers understand the meanings of words that people use. ConceptNet originated from the crowdsourcing project Open Mind Common Sense, which was launched in 1999 at the MIT Media Lab. It has since grown to include knowledge from other crowdsourced resources, expert-created resources, and games with a purpose. Download from http://conceptnet.io/ Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.38.pdf Edition LREC 2020 -
Name Corpora of Portuguese Humour Resource type Corpus Size NULL NULL Languages Portuguese Production status Newly created-finished Resource usage Evaluation/Validation License <Not Specified> Conditions of use Freely Available Description Corpora balanced between two styles of humour (one-liners, headlines) and four styles of non-humour (general-knowledge questions, why questions, news headlines, proverbs), in Portuguese, useful for text classification / humour recognition. Download from https://github.com/andreclemencio/Recognizing-Humor-in-Portuguese/tree/master/Datasets/Balanceados Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.160.pdf Edition LREC 2020 -
Name Corpus of Chinese Dynastic Histories Resource type Corpus Size 23000000 tokens Languages Literary Chinese (lzh) Production status Newly created-finished Resource usage Corpus Creation/Annotation License CC-BY 4.0 Conditions of use Freely Available Description Newly created, on the basis of Wikisource, Corpus of Chinese Dynastic Histories currently includes twenty-four histories. Download from https://osf.io/tp729/ Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.98.pdf Edition LREC 2020 -
Name Corpus of Polish Spatial Texts 2.0 (PST 2.0) Resource type Corpus Size 61315 tokens Languages Polish (pol) Production status Existing-updated Resource usage Information Extraction, Information Retrieval License CC-BY-SA 4.0 Conditions of use Freely Available Description The extended version of the Polish Spatial Text corpus. Texts derived from Polish travel blogs, manually annotated with spatial expressions. A spatial expression is a text fragment that describes the location of two or more physical objects relative to each other. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/934_res_1.zip [973 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.265.pdf Edition LREC 2020 -
Name Corpus REDEWIEDERGABE (Core corpus) Resource type Corpus Size 489608 tokens Languages German (deu) Production status Newly created-finished Resource usage Machine Learning License CC BY-NC-SA 4.0 Conditions of use Freely Available Description Corpus REDEWIEDERGABE is a German-language historical corpus with detailed annotations for speech, thought and writing representation (ST&WR).
It consists of narrative texts as well as newspaper and magazine texts published between 1840 and 1920 and has approximately 490,000 tokens. The textual material is balanced over decades and between fictional and non-fictional texts.
The annotation system is rooted in narratological theory and specifies the type of ST&WR (direct, free indirect, indirect, reported) and the medium (speech, thought, writing) as well as several more specialized attributes. In addition, the framing phrase, the key word that indicates the speech, thought or writing act, and the source (speaker/author) of the ST&WR are annotated.
The annotation is a consensus annotation, based on two independent primary annotations.
The corpus is available in a TEI-compliant XML format, UIMA XMI format, and a column-based plain text format (UTF8, CONLL style). Download from https://github.com/redewiedergabe/corpus/tree/master/data/main Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.100.pdf Edition LREC 2020 -
Name Costra 1.0 Resource type Corpus Size 4262 sentences Languages Czech (ces) Production status Newly created-finished Resource usage Evaluation/Validation License CC-BY 4.0 Conditions of use Freely Available Description COSTRA 1.0, a dataset of complex sentence transformations, is intended for the study of sentence-level embeddings beyond simple word alternations or standard paraphrasing. This first version of the dataset is limited to sentences in Czech. Download from http://hdl.handle.net/11234/1-3123 Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.434.pdf Edition LREC 2020 -
Name DaNewsroom Resource type Corpus Size 1.1M entries Languages Danish (dan) Production status Newly created-finished Resource usage Summarisation License OpenSource Conditions of use Freely Available Description Large-scale automatic summarisation dataset, comprised of articles and their summaries. Download from https://github.com/danielvarab/da-newsroom Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.831.pdf Edition LREC 2020 -
Name DEbateNet-mig15 Resource type Corpus Size 556185 tokens Languages German Production status Existing-updated Resource usage Political Science License CLARIN ACA+BY+NORED (EULA) Conditions of use Freely Available Description Corpus of manually annotated articles for political science research. Source: die Tageszeitung (reference German newspaper). Year: 2015. Multi-level annotation: claim identification and categorization into 97 categories (annotation schema available at https://github.com/mardy-spp/mardy_acl2019/blob/master/codebook.pdf); actor identification; claim attribution (linking of claims to actors); polarity; claim date. Additionally, an R package is released which can be used to explore the dataset and replicate the analysis of it. Download from http://hdl.handle.net/11022/1007-0000-0007-DB07-B Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.115.pdf Edition LREC 2020 -
Name Developing an Arabic Infectious Disease Ontology to Include Non-Standard Terminology Resource type Ontology Size 247 entries Languages Arabic (ara) Production status Newly created-finished Resource usage Text Mining License <Not Specified> Conditions of use Freely Available Description <Not Specified> Download from http://lrec2020.lrec-conf.org/sharedlrs2020/138_res_1.txt [241 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.596.pdf Edition LREC 2020 -
Name DIP-SumEval Resource type Corpus Size 49 topics (x 10 summaries and 11 dimensions) Languages English (eng) Production status Newly created-finished Resource usage Summarisation License CC-BY 4.0 Conditions of use Freely Available, Attribution Required Description This repository contains the summaries and evaluations from the paper 'A Dataset for the Analysis of Text Quality Dimensions in Summarization Evaluation' presented at LREC 2020. In total, the data set contains over 400 automatically generated summaries for 49 topics of a data set for multi-document summarization, 1274 judgements according to 11 text and summary quality criteria on a Likert scale (1 to 5) performed by 26 trained annotators, and 43218 pairwise judgements according to 6 criteria performed by 64 crowd-workers. If you use the data, please cite the paper: http://www.lrec-conf.org/proceedings/lrec2020/bib/2020.lrec-1.826.bib. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/915_res_1.zip [531 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.826.pdf Edition LREC 2020 -
Name DiscSense Resource type Lexicon Size NULL MByte Languages English Production status Newly created-finished Resource usage Discourse License <Not Specified> Conditions of use Freely Available Description DiscSense was built automatically: we finetuned a BERT model for discourse marker prediction between sentences (using the Discovery dataset) and used the marker prediction between sentences of existing datasets. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/33_res_1.tsv [289 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.125.pdf Edition LREC 2020 -
Name Domain-Specific Dataset of Difficulty Ratings for German Noun Compounds Resource type Terminology Size 1030 lexemes Languages German (deu) Production status Newly created-finished Resource usage Difficulty of domain-specific German closed compounds License CC-BY-SA 4.0 Conditions of use Freely Available Description Dataset with difficulty ratings for 1,030 German closed noun compounds extracted from domain-specific texts for do-it-yourself (DIY), cooking and automotive. It includes two-part compounds for cooking and DIY, and two- to four-part compounds for automotive. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/137_res_1.zip [13 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.537.pdf Edition LREC 2020 -
Name dutch-archaeo-NER-dataset Resource type Corpus Size 8.4 MByte Languages Dutch (nld) Production status Newly created-finished Resource usage Named Entity Recognition License GNU GPL v3.0 Conditions of use Freely Available Description A manually annotated NER dataset, consisting of Dutch archaeological excavation reports. The following entity types are labelled: artefacts, time periods, materials, places (geographical locations), archaeological contexts, and species. The dataset is provided in the BIO format, with one token per line and empty lines denoting sentence boundaries. Each line contains the token, PoS tag, morphological segmentation and finally the label, separated by spaces. The PoS tag and morphological segmentation are assigned by Frog. This work was funded by a grant provided by the Leiden University Centre For Digital Humanities. Download from https://doi.org/10.5281/zenodo.3544543 Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.562.pdf Edition LREC 2020 -
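The column layout described for dutch-archaeo-NER-dataset (token, PoS tag, morphological segmentation, label, separated by spaces, with blank lines between sentences) could be read along the following lines. This is a minimal sketch, not the authors' loader: the sample rows, the tag names (`ART`, `MAT`, `PER`) and the bracketed segmentation strings are invented for illustration and may not match the labels used in the actual files.

```python
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class Token(NamedTuple):
    form: str
    pos: str
    morph: str
    label: str


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences; a blank line marks a sentence boundary."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        # Assumes exactly four space-separated columns per token line.
        form, pos, morph, label = line.split()
        sentence.append(Token(form, pos, morph, label))
    if sentence:
        yield sentence


def extract_entities(sentence: List[Token]) -> List[Tuple[str, str]]:
    """Collect (entity_type, text) spans from B-/I-/O labels."""
    entities: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []
    for tok in sentence:
        if tok.label.startswith("B-"):
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tok.label[2:], [tok.form]
        elif tok.label.startswith("I-") and current_type == tok.label[2:]:
            current_tokens.append(tok.form)
        else:
            # "O" label, or an I- tag that does not continue the open span.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Invented sample rows (hypothetical tags and segmentations).
SAMPLE = """\
bronzen ADJ [bronz][en] B-MAT
bijl N [bijl] B-ART
uit VZ [uit] O
de LID [de] O
late ADJ [laat][e] B-PER
bronstijd N [brons][tijd] I-PER

Geen VNW [geen] O
vondsten N [vondst][en] O
"""

sentences = list(read_sentences(SAMPLE.splitlines()))
```

For a real file, `read_sentences(open(path, encoding="utf-8"))` would stream sentences without loading the whole corpus, provided the four-column assumption holds.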
Name ElkarHizketak v1.0 Resource type Corpus Size 1634 questions Languages Basque (eus) Production status Newly created-finished Resource usage Dialogue License CC-BY-SA 4.0 Conditions of use Freely Available Description ElkarHizketak is a Conversational Question Answering (CQA) dataset in Basque created by Basque speaker volunteers. The dataset contains close to 400 dialogues and more than 1600 questions and answers, and its small size presents a realistic low-resource scenario for CQA systems. The dataset is built on top of Wikipedia sections about popular people and organizations. The dialogues involve two crowd workers: (1) a student asks questions after reading a small introduction about the person, but without seeing the section text; and (2) a teacher answers the questions selecting a span of text of the section. Download from http://ixa2.si.ehu.es/convai/elkarhizketak-v1.0.zip Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.55.pdf Edition LREC 2020 -
Name English WordNet Random Walk Pseudo-Corpora Resource type Corpus Size 81 corpora Languages English Production status Newly created-finished Resource usage Language Modelling License CC BY-NC-SA 4.0 Conditions of use Freely Available Description The archive contains a collection of 81 pseudo-corpora (9.9GB). These are text files that contain pseudo-sentences generated artificially from a random walk over the English WordNet taxonomy. The random walk algorithm produces a pseudo-sentence from WordNet by randomly picking a node (SynSet) in WordNet, randomly choosing a word in the SynSet, and then randomly picking a connected node and repeating the process. At every step there is a 15% chance for the random walk to stop; it also stops if it has no more connected nodes to take. Once the walk stops, a sentence is generated, and the same process repeats for each new sentence. Each line in the generated file represents one pseudo-sentence, where words are delimited by spaces. Example sentences: - measure musical notation tonality minor mode - Dutch-processed cocoa powder chocolate milk The corpus files are different in size, as well as in some parameters that were used to generate them. The parameters are: - size: number of sentences/lines in the corpus - direction: the direction that the random walk over WordNet was allowed to go while generating sentences (possibilities are up/down/both) - minimal sentence length: the shortest length sentence (in number of words) Download from https://doi.org/10.21427/he55-6481 Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.602.pdf Edition LREC 2020 -
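The random walk described for the English WordNet Random Walk Pseudo-Corpora can be sketched roughly as below. The sketch runs on a tiny hand-made graph, not on WordNet, and several details are assumptions: `TOY_GRAPH`, its node names and words are hypothetical; already visited nodes are not revisited (one reading of "no more connected nodes to take"); and the minimal sentence length is enforced by discarding shorter walks, which the description does not specify.

```python
import random

# Hypothetical toy taxonomy: node -> words in the "synset" plus up/down links.
TOY_GRAPH = {
    "food": {"words": ["food", "nutrient"], "up": [], "down": ["beverage", "chocolate"]},
    "beverage": {"words": ["beverage", "drink"], "up": ["food"], "down": ["milk", "cocoa"]},
    "chocolate": {"words": ["chocolate"], "up": ["food"], "down": ["cocoa"]},
    "milk": {"words": ["milk"], "up": ["beverage"], "down": []},
    "cocoa": {"words": ["cocoa", "chocolate"], "up": ["beverage", "chocolate"], "down": []},
}


def neighbours(graph, node, direction):
    """Connected nodes reachable in the allowed direction (up/down/both)."""
    entry = graph[node]
    if direction == "up":
        return entry["up"]
    if direction == "down":
        return entry["down"]
    if direction == "both":
        return entry["up"] + entry["down"]
    raise ValueError(f"unknown direction: {direction!r}")


def random_walk_sentence(graph, rng, direction="both", stop_prob=0.15):
    """One pseudo-sentence: pick a start node, emit a word, move on, repeat."""
    node = rng.choice(sorted(graph))
    visited = {node}
    words = []
    while True:
        words.append(rng.choice(graph[node]["words"]))
        # At every step the walk stops with probability stop_prob (15% in the description).
        if rng.random() < stop_prob:
            break
        # Assumption: nodes already visited are not taken again.
        candidates = [n for n in neighbours(graph, node, direction) if n not in visited]
        if not candidates:
            break
        node = rng.choice(candidates)
        visited.add(node)
    return words


def generate_corpus(graph, size, direction="both", min_length=1, seed=0):
    """Build `size` lines, one pseudo-sentence per line, words separated by spaces.

    Assumption: walks shorter than min_length are discarded and redrawn, so
    min_length must be reachable in the graph or this loop will not finish.
    """
    rng = random.Random(seed)
    lines = []
    while len(lines) < size:
        words = random_walk_sentence(graph, rng, direction)
        if len(words) >= min_length:
            lines.append(" ".join(words))
    return lines
```

Calling `generate_corpus(TOY_GRAPH, size=5, direction="both", min_length=2)` returns five space-delimited lines, mirroring the three parameters listed in the description (size, direction, minimal sentence length).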
Name English-Hiligaynon Resource type Lexicon Size 1200 entries Languages English (eng), Hiligaynon Production status Newly created-finished Resource usage Machine Translation, SpeechToSpeech Translation License <Not Specified> Conditions of use From Owner Description Word-word translation of 1,200 English words into Hiligaynon. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/163_res_1.txt [20 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.313.pdf Edition LREC 2020 -
Name Evaluation Benchmark for Domain Taxonomies from Knowledge Graphs (EBDT-KG) Resource type Evaluation Data Size 155 MByte Languages English (eng) Production status Newly created-finished Resource usage Knowledge Discovery/Representation License CC-BY 4.0 Conditions of use Freely Available Description An application-specific gold standard dataset extracted from the Wikipedia knowledge graph and an evaluation framework to comparatively assess the quality of noisy automatically extracted taxonomies. Download from https://sites.google.com/unitelmasapienza.it/wikipediataxonomies/ Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.285.pdf Edition LREC 2020 -
Name Event it Pronouns Resource type Corpus Size 8 MByte Languages English (eng), Estonian (ekk), Finnish (fin), German (deu), Spanish (esp), French (fra), Hungarian (hun), Italian (ita), Latvian (lav), Dutch (nld), Polish (pol), Portuguese (por), Romanian (ron), Slovak (slk), Slovenian (slv), Swedish (swe) Production status Newly created-finished Resource usage Anaphora, Coreference License OpenSource Conditions of use Freely Available Description This is a multilingual corpus of English 'it' pronouns annotated with their translations in the Europarl languages. The English 'it' pronouns have also been automatically labeled as 'event'. Download from https://github.com/sharidloaiciga/multilingual-event-pronouns Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.12.pdf Edition LREC 2020 -
Name Fake News Filipino Dataset Resource type Corpus Size 3206 entries Languages English (eng), Filipino (fil) Production status Newly created-finished Resource usage Document Classification, Text categorisation License GNU-GPL v3.0 Conditions of use Freely Available Description The first curated fake news detection dataset in Filipino. The dataset has 3206 samples with a perfect 50/50 split between real and fake news articles. Fake news articles are from sites tagged as fake by VeraFiles (an independent news organization) and the National Union of Journalists in the Philippines (NUJP). Real news articles were sourced from mainstream news sites in the Philippines. Data collection was performed across five months, as content and websites that spread fake news are usually taken down or go offline quickly, which makes text-based fake news data in Filipino scarce. Download from https://github.com/jcblaisecruz02/Tagalog-fake-news Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.316.pdf Edition LREC 2020 -
Name Fake.Br Corpus Resource type Corpus Size 22.4 MByte Languages Portuguese (por) Production status Existing-used Resource usage Evaluation/Validation License <Not Specified> Conditions of use Freely Available, Attribution Required Description The Fake.Br Corpus is composed of true and fake news articles that were manually aligned, focusing only on Brazilian Portuguese. There are 7,200 news articles, exactly 3,600 true and 3,600 fake. All of them are in plain text format, with each one in a different file. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/939_res_1.zip [23.45 Mb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.176.pdf Edition LREC 2020 -
Name Fakeddit Resource type Corpus Size 1063106 entries Languages English Production status Newly created-finished Resource usage Evaluation/Validation License <Not Specified> Conditions of use Freely Available Description We present Fakeddit, a novel multimodal dataset consisting of over 1 million samples from multiple categories of fake news. After being processed through several stages of review, the samples are labeled according to 2-way, 3-way, and 6-way classification categories through distant supervision. Download from https://github.com/entitize/fakeddit Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.755.pdf Edition LREC 2020 -
Name FalseFriendsLexicon Resource type Lexicon Size 900 KByte Languages French (fra), Italian (ita), Portuguese (por), Romanian (ron), Spanish (spa), English (eng) Production status Newly created-finished Resource usage Lexicon Creation/Annotation License CC-BY 4.0 Conditions of use Freely Available Description This lexicon of false friends pairs in 6 languages, includes for each language pair automatically detected pairs of false friends along with a score of falseness. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/180_res_1.zip [628 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.367.pdf Edition LREC 2020 -
Name FandomCorpora Resource type Corpus Size 1627 Summaries Languages English (eng), German (deu) Production status Newly created-finished Resource usage Summarisation License MIT License Conditions of use Freely Available Description Three sample corpora created with the FandomCorpus Construction Framework. Download from https://datamanagementlab.github.io/fandomCorpus/data.html Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.827.pdf Edition LREC 2020 -
Name FAPESP-PBSMT-NMT Resource type Corpus Size 310 KByte Languages Brazilian Portuguese, English (eng) Production status Newly created-finished Resource usage Machine Learning License CC BY-NC-SA 4.0 Conditions of use Freely Available Description This resource is a collection of 300 parallel sentences written in Brazilian Portuguese and English annotated with machine translation errors. There are two sets of 300 sentences: one output by a Phrase-Based Statistical Machine Translation system and another one output by a Neural Machine Translation system. Details about the annotation process can be found in the paper (Caseli; Inácio, LREC 2020). Download from https://github.com/LALIC-UFSCar/FAPESP-PBSMT-NMT Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.446.pdf Edition LREC 2020 -
Name FastText.cc Resource type Grammar/Language Model Size NULL NULL Languages Portuguese (por) Production status Existing-used Resource usage Textual Entailment and Paraphrasing License <Not Specified> Conditions of use Freely Available Description FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to even fit on mobile devices. Download from https://fasttext.cc/ Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.669.pdf Edition LREC 2020 -
Name FEVER Dev partition delexicalized Resource type Corpus Size 10 MByte Languages English (eng) Production status Newly created-finished Resource usage Textual Entailment and Paraphrasing License <Not Specified> Conditions of use Freely Available Description <Not Specified> Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1319_res_5.txt [9.84 Mb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.850.pdf Edition LREC 2020 -
Name Fin-SoMe Resource type Corpus Size 10000 sentences Languages English (eng) Production status Newly created-finished Resource usage Corpus Creation/Annotation License CC BY-NC-SA 4.0 Conditions of use From Owner Description Both front-desk and middle-desk experts were hired from a bank treasury. The front-desk expert, working in the treasury marketing unit, was assumed to be a risk-neutral individual (RN), and the middle-desk expert, from the risk management department, was assumed to be a risk-averse individual (RA). The annotators took five months to complete the labeling of 10,000 tweets. Each tweet is examined from four angles: (1) The market sentiment (bullish/bearish) of the tweet; (2) The presence or absence of reasoning in the tweet supporting the investor's analysis; (3) The writer's sentiment (positive/negative); (4) The gain/loss of the writer's trade. Because writers do not always show market sentiment, writer sentiment, or gain/loss in their tweets, annotators assigned a "None" label to tweets without the related narratives. Download from http://nlg.csie.ntu.edu.tw/nlpresource/FinSoMe Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.749.pdf Edition LREC 2020 -
Name FlauBERT Resource type Grammar/Language Model Size 552.6 MByte Languages French (fra) Production status Newly created-finished Resource usage Language Modelling License OpenSource Conditions of use Freely Available Description Pre-trained language model for French. Download from https://github.com/getalp/Flaubert Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.302.pdf Edition LREC 2020 -
Name FloDusTA Resource type Corpus Size 9000 tweets Languages Arabic (ara) Production status Newly created-finished Resource usage Document Classification, Text categorisation License <Not Specified> Conditions of use Freely Available Description Flood, Dust Storm, Traffic Accident Saudi Event dataset Download from https://github.com/BatoolHamawi/FloDusTA Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.174.pdf Edition LREC 2020 -
Name fnc Resource type Corpus Size 42 MByte Languages English (eng) Production status Newly created-finished Resource usage Textual Entailment and Paraphrasing License <Not Specified> Conditions of use Freely Available Description A comprehensive collection of all the URLs is also presented in the paper referenced herewith. Download from https://osf.io/szdkn/?view_only=4845641a80624ac493ca14df34e68e8c Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.850.pdf Edition LREC 2020 -
Name FNC delexicalized training partition Resource type Corpus Size 95 MByte Languages English (eng) Production status Newly created-finished Resource usage Textual Entailment and Paraphrasing License <Not Specified> Conditions of use Freely Available Description <Not Specified> Download from https://osf.io/szdkn/?view_only=4845641a80624ac493ca14df34e68e8c Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.850.pdf Edition LREC 2020 -
Name FNC dev partition delexicalized Resource type Corpus Size 55 MByte Languages English (eng) Production status Newly created-finished Resource usage Textual Entailment and Paraphrasing License <Not Specified> Conditions of use Freely Available Description <Not Specified> Download from https://osf.io/szdkn/?view_only=4845641a80624ac493ca14df34e68e8c Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.850.pdf Edition LREC 2020 -
Name ForwardQuestions Resource type Corpus Size 38480 questions Languages English Production status Complete Resource usage Question Generation License CC-BY 3.0 Unported Conditions of use <Not Specified> Description The ForwardQuestions data set is made of human-generated questions related to knowledge triples. This data set results from the conversion and merger of the existing SimpleDBPediaQA and SimpleQuestionsWikidata data sets, including the mapping of predicates from DBPedia to Wikidata, and the selection of 'forward' questions as opposed to 'backward' ones. The new data set can be used to generate novel questions given an unseen Wikidata triple. Download from https://github.com/johannamelly/ForwardQuestions Referring paper http://www.lrec-conf.org/proceedings/lrec2020/workshops/ISA16/pdf/2020.isa-1.7.pdf Edition ISA16 2020 -
Name GeBioCorpus_v2 Resource type Corpus Size 2000 sentences Languages Catalan (cat), English (eng), Spanish (spa) Production status Newly created-finished Resource usage Machine Translation, SpeechToSpeech Translation License CC-BY-NC 4.0 Conditions of use Freely Available Description Corpus of 2,000 sentences in English, Spanish and Catalan (gender-balanced), which has been post-edited by native speakers to become a high-quality dataset for machine translation evaluation. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/325_res_1.zip [389 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.502.pdf Edition LREC 2020 -
Name GeoWAC Resource type Corpus Size 45 billion words Languages ara, aze, bel, bul, cat, ces, dan, deu, ell, eng, est, fas, fin, fra, gle, hbs, hin, hun, ind, isl, ita, jpn, kat, kaz, kor, lav, lit, mkd, mon, nld, nor, pol, por, ron, rus, slk, slv, spa, sqi, swe, tam, tgl, tur, ukr, urd, uzb, vie, zho Production status Newly created-finished Resource usage Corpus Creation/Annotation License GNU GPL v2.0 Conditions of use Freely Available Description Gigaword geo-referenced web corpora for 50 language (varieties) that are explicitly balanced for geographic and demographic representation. Download from https://www.earthlings.io/corpus_download.html Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.308.pdf Edition LREC 2020 -
Name German-English parallel data by the Presidency of the Council of the EU held by Austria in 2006 Resource type Corpus Size 4973 Translation Units Languages English, German Production status In use Resource usage Machine Translation License Open Under-PSI Conditions of use Freely Available Description News/media (press releases, interviews and Common Foreign and Security Policy statements) in German and English by the EU-Presidency, Austria, 2006. German-English parallel data containing news and statements (press releases, interviews and Common Foreign and Security Policy statements) unidirectional from German to English by the Presidency of the Council of the EU held by Austria in 2006. Aligned with HunAlign, a language-independent sentence aligner (Varga et al., 2005) and manually evaluated by two evaluators. Format: .tmx. Character encoding: UTF-8. Download from https://www.elrc-share.eu/repository/browse/german-english-parallel-data-by-the-presidency-of-the-council-of-the-eu-held-by-austria-in-2006/e38b283eac3e11e8b7d400155d0267062180d233a0fd4e84b8dffb9b25cc1775/ Referring paper http://www.lrec-conf.org/proceedings/lrec2020/workshops/LT4Gov/pdf/2020.lt4gov-1.5.pdf Edition LT4Gov 2020 -
Name GeWiki Resource type Corpus Size 240000 entries Languages German (deu) Production status Newly created-finished Resource usage Summarisation License OpenSource Conditions of use Freely Available Description A summarization corpus extracted from Wikipedia articles. The corpus is split into training (220'000 texts), evaluation (10'000 texts) and test (10'000 texts) sets. For each text (text_id.src) there is a corresponding reference summary (text_id.tgt). Download from https://github.com/domfr/GeWiki Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.821.pdf Edition LREC 2020 -
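The GeWiki description pairs each text (`text_id.src`) with a reference summary (`text_id.tgt`). A minimal sketch of matching the two by file stem follows; the flat directory layout, the UTF-8 encoding and the `load_pairs` and `demo` names are assumptions for illustration, and the repository may organise its files differently.

```python
import tempfile
from pathlib import Path


def load_pairs(directory):
    """Map text_id -> (source text, reference summary) for matching .src/.tgt files."""
    directory = Path(directory)
    pairs = {}
    for src_path in sorted(directory.glob("*.src")):
        tgt_path = src_path.with_suffix(".tgt")
        # Skip sources without a corresponding summary file.
        if tgt_path.exists():
            pairs[src_path.stem] = (
                src_path.read_text(encoding="utf-8"),
                tgt_path.read_text(encoding="utf-8"),
            )
    return pairs


def demo():
    """Build a throwaway directory with invented files and load it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "0001.src").write_text("Artikeltext eins.", encoding="utf-8")
        (root / "0001.tgt").write_text("Zusammenfassung eins.", encoding="utf-8")
        (root / "0002.src").write_text("Artikeltext zwei.", encoding="utf-8")  # no summary
        return load_pairs(root)
```

Keying the result by `text_id` keeps the source and summary aligned regardless of the order in which the files are listed on disk.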
Name GitHub Typo Corpus Resource type Corpus Size 353055 entries Languages Chinese (zho), English (eng), Japanese (jpn), Others Production status Newly created-finished Resource usage Spelling Correction, Grammatical Error Correction License <Not Specified> Conditions of use Freely Available Description A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors. Download from https://github.com/mhagiwara/github-typo-corpus Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.835.pdf Edition LREC 2020 -
Name Glossary "Austrian Administration. German - English" Resource type Lexical Conceptual Resource General Information Size 673 entries Languages English, German (AT) Production status In use Resource usage Machine Translation License Use and non-commercial distribution permitted if resource is unchanged and source stated Conditions of use Freely Available Description The terminological resource entitled Fachglossar Österreichische Verwaltung. Deutsch - Englisch covers terminology in the public administration domain in German and English. It contains terminology from the areas of Austrian public law, legislation and executive authorities. The terminology is standardised by an informal working group of translators employed with the Austrian federal ministries. Download from https://www.sprachressourcen.at/verwaltungsglossar/ Referring paper http://www.lrec-conf.org/proceedings/lrec2020/workshops/LT4Gov/pdf/2020.lt4gov-1.5.pdf Edition LT4Gov 2020 -
Name Glossary terms in German related to Austria and the topics of the trio presidency Resource type Lexical Conceptual Resource General Information Size 71000 terms Languages German Production status <Not Specified> Resource usage Machine Translation License CC-BY-SA 3.0 Conditions of use Freely Available Description German monolingual terminology and named entities. The dataset was created by University of Vienna by crawling, extracting and compiling content from Wikipedia. Terms crawled from de.wikipedia.org on the Austrian and EU-presidency related topics of the EU Council Presidency 2018: agricultural policy, digitalisation, development policy, European Union, Single Market, trade, climate policy, migration, sustainability, Austria, politics, security policy, environment, Western Balkans, economic policy. Download from https://www.elrc-share.eu/repository/browse/terms-in-german-related-to-austria-and-the-topics-of-the-trio-presidency/b82781c4ab9e11e8b7d400155d026706f61ef02809fb4748944b1af1b434f0a9/ Referring paper http://www.lrec-conf.org/proceedings/lrec2020/workshops/LT4Gov/pdf/2020.lt4gov-1.5.pdf Edition LT4Gov 2020 -
Name Gold Standard for Cognate Pairs in English-Dutch and French-Dutch Resource type Lexicon Size 51220 entries Languages Dutch, English, French Production status Newly created-finished Resource usage Evaluation/Validation License <Not Specified> Conditions of use From Owner Description This resource contains a context-independent gold standard for English-Dutch and French-Dutch cognate detection. To this end, automatic word alignment was applied on the Dutch Parallel Corpus, and all term equivalents with a Normalized Levenshtein distance smaller than 0.5 were extracted. This resulted in a list with 28,503 English-Dutch candidate cognate pairs, and 22,715 French-Dutch candidate cognate pairs, which were subsequently manually labeled according to the guidelines established in Labat et al. 2019. The following labels were annotated: (1) Cognate: words which have a similar form and meaning in all contexts, (2) Partial cognate: words which have a similar form, but only share the same meaning in some contexts, (3) False friend: words which have a similar form but a different meaning, (4) Proper name: proper nouns (e.g. persons, companies, cities, countries, etc.) and their derivations, (5) Error: word alignment errors and compound nouns of which one part is a cognate but the other part is missing in one of the languages, and (6) No standard: words that do not occur in the dictionary of that particular language. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/367_res_1.zip [1.65 Mb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.504.pdf Edition LREC 2020 -
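The candidate extraction step mentioned for the cognate gold standard (keeping aligned term pairs with a Normalized Levenshtein distance below 0.5) might look like the sketch below. The description does not say how the distance was normalised; dividing by the length of the longer word is an assumption here, as are the lower-casing and the example word pairs.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Assumed normalisation: edit distance divided by the longer word's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def candidate_cognates(pairs, threshold=0.5):
    """Keep aligned word pairs whose normalised distance is strictly below the threshold."""
    return [
        (source, target)
        for source, target in pairs
        if normalized_levenshtein(source.lower(), target.lower()) < threshold
    ]


# Invented English-Dutch examples, not taken from the resource.
aligned = [("night", "nacht"), ("dog", "hond"), ("water", "water")]
candidates = candidate_cognates(aligned)
```

Pairs that pass such a filter are only candidates; in the resource itself they were subsequently labelled by hand with the six categories listed in the description.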
Name Greek POS Tagged Tweets Resource type Corpus Size 1.9 MByte Languages Modern Greek (1453-) (ell) Production status Newly created-finished Resource usage Corpus Creation/Annotation License <Not Specified> Conditions of use From Owner Description <Not Specified> Download from https://hilab.di.ionio.gr/index.php/en/datasets/ Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.476.pdf Edition LREC 2020 -
Name Habibi Corpus Resource type Corpus Size 378646 MByte Languages Arabic (ara), Arabic Dialects, Egyptian, Maghrebi, Iraqi, Gulf, Levantine, Sudanese Production status Newly created-finished Resource usage Corpus Creation/Annotation License <Not Specified> Conditions of use Freely Available Description Habibi is the first freely available corpus of Arabic song lyrics. The corpus comprises more than 30,000 Arabic songs from 18 different Arab countries. Each song in the corpus comes with a song title along with the singer's full name, country of origin and dialect. The corpus also contains information about the song's writer and composer. Each song has a singer, song title, and lyrics. The corpus is made of 30,071 Arabic songs with their lyrics split into 527,870 sentences (song verses). The corpus contains information about 1,765 singers, 3,789 song writers and 2,463 composers. The singers span 18 Arab countries: Egypt, Saudi, Lebanon, Iraq, Sudan, Kuwait, Syria, UAE, Morocco, Tunisia, Yemen, Jordan, Algeria, Qatar, Bahrain, Oman, Palestine and Libya. The lyrics cover 6 dialects: Egyptian, Gulf, Levantine, Iraqi, Sudanese and Maghrebi (North African). The corpus in CSV file format is available through ELRA's Share Your LR's initiative (size 10,856 MB). The complete corpus in txt and csv format as well as Habibi's word embeddings are available on http://ucrel-web.lancaster.ac.uk/habibi/. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/450_res_1.zip [11.12 Mb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.165.pdf Edition LREC 2020 -
Name Hallig-Wartburg Ontology Resource type Ontology Size 6.8 Mbyte Languages <Not Specified> Production status Newly created-finished Resource usage Language Modelling License CC0 Conditions of use Freely Available Description This ontology is based on the so-called Hallig-Wartburg: Rudolf Hallig and Walther von Wartburg, Begriffssystem als Grundlage für die Lexikographie / Système raisonné des concepts pour servir de base à la lexicographie, Berlin (Akademie-Verlag), [1952] 1963. Hallig-Wartburg is a conceptual scheme used in historical linguistic resources. Download from https://lod.academy/site/vocabs/hallig-wartburg Referring paper http://www.lrec-conf.org/proceedings/lrec2020/workshops/LDL2020/pdf/2020.ldl-1.1.pdf Edition LDL 2020 -
Name HatEval Resource type Corpus Size 19600 entries Languages English, Spanish Production status Existing-used Resource usage Evaluation/Validation License <Not Specified> Conditions of use Freely Available Description Hate speech detection on Twitter, focused on two specific targets, immigrants and women, from a multilingual perspective (Spanish and English). Download from https://competitions.codalab.org/competitions/19935 Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.760.pdf Edition LREC 2020 -
Name Hindi Courteous Dataset Resource type Corpus Size 52.3 MByte Languages Hindi Production status Newly created-finished Resource usage Natural Language Generation License <Not Specified> Conditions of use From Owner Description The conversations between various customers and the trained customer care agents of different companies on their Twitter handles were used for building the dataset. The Twitter data for Hindi was mined from Twitter for different companies. The dataset comprises conversations with their ids followed by the utterance id. Two sets of responses are available in the dataset: the courteous response and the corresponding generic response. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/969_res_1.zip [13.08 Mb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.514.pdf Edition LREC 2020 -
Name Hindi-Discourse-Modes Resource type Corpus Size 1 MByte Languages Hindi (hin) Production status Newly created-finished Resource usage Discourse License MIT License Conditions of use Freely Available Description A new corpus consisting of sentences from Hindi short stories annotated for five different discourse modes: argumentative, narrative, descriptive, dialogic and informative. Download from https://github.com/midas-research/hindi-discourse Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.149.pdf Edition LREC 2020 -
Name Hindi_Event Resource type Corpus Size 13.6 MByte Languages Hindi Production status Newly created-finished Resource usage Information Extraction, Information Retrieval License <Not Specified> Conditions of use From Owner Description This is a benchmark dataset in Hindi language for Event Extraction. All the annotated documents are XML documents in which event triggers, argument triggers, event type information, argument role information and event-argument linking information are tagged. The dataset contains 1709 XML documents. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1292_res_1.zip [3.29 Mb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.273.pdf Edition LREC 2020 -
Name HSE Corpus for Discovering Safety in Construction Resource type Corpus Size 600 sentences Languages English (eng) Production status Newly created-finished Resource usage Named Entity Recognition License Open Government Licence 3.0 / CC-BY-SA 4.0 International Licence Conditions of use Freely Available Description The "HSE Corpus for Discovering Safety in Construction" is a semantically annotated corpus for safety in construction, consisting of 600 sentences from RIDDOR workplace accident reports. The corpus is annotated with six types of named entities relevant to the construction safety domain. The annotation of the corpus was split between four different annotators; 400 of the sentences have been annotated by at least two of the annotators. The average inter-annotator agreement rate is 0.79 F-Score. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/426_res_1.zip [1.05 Mb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.245.pdf Edition LREC 2020 -
Name IGC-Parl Resource type Corpus Size 218889307 tokens Languages Icelandic Production status 1st version published Resource usage <Not Specified> License CC-BY 4.0 Conditions of use <Not Specified> Description The 2019 version of the corpus includes speeches from 1911-2019. It comprises 404 thousand speeches and just under 219 million words. The corpus has been automatically part-of-speech tagged and lemmatised. It is annotated with extensive metadata about the speeches, speakers and political parties, including speech topic, whether the speaker is in the government coalition or opposition, age and gender of speaker at the time of delivery and references to sound and video recordings where available. The corpus is distributed in TEI-conformant format. Download from http://hdl.handle.net/20.500.12537/14 Referring paper http://www.lrec-conf.org/proceedings/lrec2020/workshops/ParlaCLARIN2/pdf/2020.parlaclarin-1.3.pdf Edition ParlaCLARIN2 2020 -
Name IKAT Dataset (German) Resource type Corpus Size <Not Specified> Languages German (deu) Production status Existing-used Resource usage Corpus Creation/Annotation License <Not Specified> Conditions of use Freely Available, Attribution Required Description This dataset contains annotations about implied information in argumentative texts. The annotations are performed on sentence pairs from the Microtext corpus (the original German version, Peldszus/Stede 2015), that stand in an argumentative relation according to the argumentation graph. There are 464 such sentence pairs in the 112 texts in the corpus, i.e., approx. 4 pairs per microtext. Download from https://www.cl.uni-heidelberg.de/english/research/downloads/resource_pages/IKAT/IKAT_data.shtml Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.282.pdf Edition LREC 2020 -
Name Imperative Low-level Complete Image Edit Requests Resource type Corpus Size 2540 sentences Languages English Production status Newly created-finished Resource usage Corpus Creation/Annotation License <Not Specified> Conditions of use From Owner Description 2540 image edit requests with edit type adjust. Each request is annotated with (1) referring expression, (2) attribute, (3) value. (1) referring expression refers to an object in an image (not accompanied). (2) attribute belongs to one of the five categories (a) brightness, (b) contrast, (c) hue, (d) saturation, (e) lightness. (3) value is an integer between -100 and 100. Dataset is randomly split into train/dev/test of 2056/243/241 respectively. There is also an "abstract" version where (2) attribute and (3) value are replaced with tokens <abstract> and <value>. Download from https://github.com/tzuhsial/ImageEditingWithDialogue/tree/master/imadial-nlu/data/ILLC-IER Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.51.pdf Edition LREC 2020 -
Name imSituVQA Resource type Corpus Size 52 MByte Languages English (eng) Production status Newly created-finished Resource usage Question Answering License <Not Specified> Conditions of use Freely Available Description imSituVQA is a corpus for Visual Question Answering annotated with semantic frame information, based on the imSitu dataset. The images can be downloaded from http://imsitu.org/download/. Question-answer pairs were created from the image annotations of the imSitu dataset. Each question-answer pair is annotated with frame semantic information. Download from https://github.com/givenbysun/imSituVQA/blob/master/data/imSituVQA.json Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.678.pdf Edition LREC 2020 -
Name Indian-Language-Dataset Resource type Corpus Size 461 MByte Languages Bengali, Malayalam, Tamil, Telugu, Urdu (urd) Production status Existing-updated Resource usage Machine Translation, SpeechToSpeech Translation License <Not Specified> Conditions of use Freely Available Description We obtained the data from different resources such as EnTamV2.0 (Ramasamy et al., 2012), Opus (Tiedemann, 2012) and UMC005 (Jawaid and Zeman, 2011). The sentences come from the news, cinema, Bible and movie-subtitle domains. We combined and preprocessed the data of Tamil, Malayalam, Telugu, Bengali, and Urdu. After preprocessing (as described below) and cleaning, we split the dataset into train, test, and validation accordingly. Download from https://github.com/himanshudce/Indian-Language-Dataset Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.444.pdf Edition LREC 2020 -
Name INEL Dolgan Corpus 1.0 Resource type Corpus Size <Not Specified> Languages Dolgan, English, Russian Production status Newly created-finished Resource usage Corpus Creation/Annotation License CC-BY-NC-SA 4.0 Conditions of use Freely Available Description The INEL Dolgan corpus has been created within the long-term INEL project ("Grammatical Descriptions, Corpora and Language Technology for Indigenous Northern Eurasian Languages"), 2016-2033. The corpus makes possible typologically aware corpus-based grammatical research on the Dolgan language and expands the documentation of the lesser described indigenous languages of Northern Eurasia. Dolgan is an endangered Siberian (North-Eastern) Turkic language, which is spoken on the Taymyr peninsula and in adjacent areas in Northern Siberia. The number of speakers is approx. 1,000. Whereas in many settlements, including Dudinka, Dolgan is rarely spoken in everyday life, there are three remote villages (Syndassko, Popigay and Novorybnoe) where Dolgan is still very present and frequently spoken. There is no final agreement about a possible dialectal division of Dolgan. Mostly, two or three varieties of Dolgan are mentioned: Upper (Western) Dolgan, Lower (Eastern) Dolgan and Anabar Dolgan. Upper Dolgan is spoken in the settlements of Volochanka, Ust`-Avam, Katyryk, Xeta, Novaya, Kresty. Lower Dolgan is spoken in the villages Zhdanixa, Novorybnoe, Syndassko and Popigay and seems to be the more vital variety of Dolgan. The settlement of Xatanga is said to form the border between the two dialects. Anabar Dolgan is spoken in the Anabar ulus of the Sakha republic (Yakutia) and is possibly best regarded as a transitory variety of Dolgan and Yakut. Formerly, the territory around Dudinka and Norilsk belonged to the Upper Dolgan area. 
Nowadays, however, the traditional living places of Dolgans do not exist anymore; the Dolgans living in Dudinka and Norilsk come originally from the whole Taymyr peninsula, and thus from different dialectal regions. The INEL Dolgan corpus contains both Upper and Lower Dolgan material. For several reasons Anabar Dolgan is not included. The INEL Dolgan corpus is composed of texts from different sources: 1. Published folklore texts from an edited volume ("Fol'klor Dolgan", P.E. Efremov 2000), 2. Transcripts of recordings obtained from the Taymyr House of Folk Art (TDNT) in Dudinka (1970s-2000s), 3. Transcripts from the collection of Dr. Eugénie Stapert recorded on several fieldwork trips in 2007-2010, 4. Transcripts of recordings made on a fieldwork trip in 2017. The first group as well as parts of the third group were already transcribed and translated, the rest of the recordings was transcribed and translated within the INEL project. Each text in the corpus is provided with morphological glossing, translation into English, Russian and German, as well as annotation of Russian borrowings. Some texts also have annotations for syntactic functions, semantic roles and information structure/information status. The corpus has been produced in the context of the joint research funding of the German Federal Government and Federal States in the Academies' Programme, with funding from the Federal Ministry of Education and Research and the Free and Hanseatic City of Hamburg. The Academies' Programme is coordinated by the Union of the German Academies of Sciences and Humanities. The corpus could not have been produced without the support of the Taymyr House of Folk Art (TDNT). On the one hand, the TDNT provided valuable audio material (see above). On the other hand, the TDNT helped in all kinds of organizational tasks. 
Moreover, the sound materials from the TDNT were transcribed and translated by the following native speakers of Dolgan: Nina Semyonovna Kudryakova (TDNT Head of Department of folklore and ethnography), who also worked as editor for transcriptions and translations by other consultants, Svetlana Semyonovna Kudryakova, Egor Kudryakov, Adeya Evdokimovna Eske, Aleksandra Tuprina, Illarion Tuprin. In summer 2017 (06.07.2017 - 20.08.2017), Chris Lasse Däbritz and Dr. Eugénie Stapert had the opportunity to conduct fieldwork with Dolgans in the town of Dudinka. Thanks to the organizational work of the TDNT, it was possible to work many hours each day with very motivated consultants. The fieldwork had two major goals: Firstly, most uncertainties in the analysis of the folklore texts from (Efremov 2000) could be resolved. Secondly, a lot of material obtained from the TDNT could be transcribed, translated and analyzed. Moreover, roughly half an hour of new material was recorded, namely an experiment on Social Cognition. The following local consultants were involved in the fieldwork: Nina Semyonovna Kudryakova, Anna Alekseevna Barbolina, Vera Polikarpovna Bettu, Galina Sidorovna Chuprina, Adeya Evdokimovna Eske, Stepanida Il`inichna Kudryakova, Yuliya Kupchik, Polina Prokop`evna Uodaj. The web-based search interface uses the Tsakonian Corpus platform developed by Dr. Timofey Arkhangelskiy, Humboldt Research Fellow at IFUU, Hamburg University. Download from http://hdl.handle.net/11022/0000-0007-CAE7-1 Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.322.pdf Edition LREC 2020 -
Name INEL Dolgan Corpus 1.0 Resource type Corpus Size 72912 words Languages Dolgan (dlg), English (eng), German (deu), Russian (rus) Production status Existing-used Resource usage Knowledge Discovery/Representation License CC BY-NC-SA 4.0 Conditions of use Freely Available Description Dolgan is an endangered Siberian (North-Eastern) Turkic language, which is spoken on the Taymyr peninsula and in adjacent areas in Northern Siberia. The number of speakers is approx. 1,000. Whereas in many settlements, including Dudinka, Dolgan is rarely spoken in everyday life, there are three remote villages (Syndassko, Popigay and Novorybnoe) where Dolgan is still very present and frequently spoken. The INEL Dolgan corpus has been created within the long-term INEL project ("Grammatical Descriptions, Corpora and Language Technology for Indigenous Northern Eurasian Languages"), 2016-2033. The corpus makes possible typologically aware corpus-based grammatical research on the Dolgan language and expands the documentation of the lesser described indigenous languages of Northern Eurasia. Download from http://hdl.handle.net/11022/0000-0007-D832-D Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.354.pdf Edition LREC 2020 -
Name INEL Kamas Corpus 0.1 Resource type Corpus Size 48293 words Languages English (eng), German (deu), Kamas (xas), Russian (rus) Production status Existing-used Resource usage Knowledge Discovery/Representation License CC BY-NC-SA 4.0 Conditions of use Freely Available Description Kamas is an extinct Samoyedic language (Uralic family). The INEL Kamas corpus comprises folklore texts collected by Kai Donner in 1912-1914, before the language shift, and transcribed audio recordings of the last speaker, Klavdiya Plotnikova, made between 1964 and 1970. Each text in the corpus is provided with morphological glossing, translation into English, Russian and German, and annotation of borrowings. Some texts also have annotations for syntactic structure, semantic roles and information status. Download from http://hdl.handle.net/11022/0000-0007-CF46-2 Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.354.pdf Edition LREC 2020 -
Name INEL Selkup Corpus 0.1 Resource type Corpus Size 18673 words Languages English (eng), German (deu), Russian (rus), Selkup (sel) Production status Existing-used Resource usage Knowledge Discovery/Representation License CC BY-NC-SA 4.0 Conditions of use Freely Available Description Selkup is an endangered Southern Samoyedic language (Uralic family). The INEL Selkup corpus is composed of texts from the archive of Angelina Ivanovna Kuzmina (1924-2002), who gathered a large amount of material on Selkup in almost all regions where the Selkup people lived in 1962-1977. Most texts in the corpus originate from the handwritten part of the archive; the others come from sound recordings made by A.I. Kuzmina, transcribed and translated within the INEL project. Each text in the corpus is provided with morphological glossing, translation into English, Russian and German, and annotation of borrowings. Some texts also have annotations for syntactic structure, semantic roles and information status. Download from http://hdl.handle.net/11022/0000-0007-CC85-D Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.354.pdf Edition LREC 2020 -
Name JParaCrawl Resource type Corpus Size 8763995 sentences Languages English (eng), Japanese (jpn) Production status Newly created-finished Resource usage Machine Translation, SpeechToSpeech Translation License Proprietary License (Research purposes - No commercial Use) Conditions of use Freely Available Description JParaCrawl is a large freely available English-Japanese parallel corpus created by NTT. Download from http://www.kecl.ntt.co.jp/icl/lirg/jparacrawl/ Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.443.pdf Edition LREC 2020 -
Name Kannada Event Annotated Treebank Resource type Corpus Size 37020 lexemes Languages Kannada Production status <Not Specified> Resource usage For Kannada NLP License <Not Specified> Conditions of use Freely Available Description This is an event annotated Kannada Dependency Treebank. The dataset has been annotated based on the guidelines mentioned in the paper. Download from https://drive.google.com/drive/folders/11ZXpP4mQcDcM91SKHiSNEtWi_mAkXku7 Referring paper http://www.lrec-conf.org/proceedings/lrec2020/workshops/ISA16/pdf/2020.isa-1.10.pdf Edition ISA16 2020 -
Name KORE 50^DYWC Resource type Evaluation Data Size 50 sentences Languages English Production status Existing-used Resource usage Evaluation/Validation License OpenSource Conditions of use Freely Available Description Entity linking evaluation data set with annotations based on DBpedia, YAGO, Wikidata, and Crunchbase. Download from http://people.aifb.kit.edu/mfa/datasets/kore50-lrec2020.zip Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.291.pdf Edition LREC 2020 -
Name Korean FrameNet 1.1 Resource type Corpus Size 40.7 MByte Languages Korean (kor) Production status Newly created-finished Resource usage Semantic Role Labeling License CC BY-NC-SA 2.0 Conditions of use Freely Available Description Korean FrameNet is a lexical database that has rich annotations to represent the meaning of Korean text using semantic frames. Download from https://github.com/machinereading/koreanframenet Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.30.pdf Edition LREC 2020 -
Name Korean Movie Review Emotion Dataset Resource type Corpus Size 11216 KByte Languages Korean (kor) Production status Newly created-finished Resource usage Emotion Recognition/Generation License OpenSource Conditions of use Freely Available Description The Korean Movie Review Emotion (KMRE) dataset was annotated with six types of emotions in the NSMC dataset by following an in-house annotation procedure, which achieves the best performance. The neutral emotion was excluded because there are no neutral reviews in the NSMC. After the construction, the training dataset was split into 80% for training and 20% for development. Specific emotions (i.e. anger, happiness, and sadness) were annotated more frequently in the KMRE dataset. Furthermore, the train, dev, and test sets of the KMRE dataset tend to have similar emotion label distributions. The KMRE dataset has 119,995 sentences for training, 29,999 sentences for development, and 49,997 sentences for testing. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/891_res_1.zip [11.48 Mb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.199.pdf Edition LREC 2020 -
Name kosentences Resource type Corpus Size 31427658 sentences Languages Korean Production status Newly created-finished Resource usage Language Modelling License MIT Licence, GNU Free Documentation License, CC-BY-NC-SA 4.0 Conditions of use Freely Available Description kosentences is a large-scale web corpus of Korean text. It can be used for any task which requires a large amount of unsupervised and (relatively) well-formed text, such as language modeling. Download from https://github.com/cynthia/kosentences/ Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.429.pdf Edition LREC 2020 -
Name LEDGAR Resource type Corpus Size 104990418 tokens Languages English (eng) Production status Newly created-finished Resource usage Document Classification, Text categorisation License MIT License Conditions of use Freely Available Description We present LEDGAR, a multilabel corpus of legal provisions in contracts. The corpus was crawled and scraped from the public domain (SEC filings) and is, to the best of our knowledge, the first freely available corpus of its kind. Since the corpus was constructed semi-automatically, we apply and discuss various approaches to noise removal. Due to the rather large labelset of over 12,000 labels annotated in almost 100,000 provisions in over 60,000 contracts, we believe the corpus to be of interest for research in the field of Legal NLP, (large-scale or extreme) text classification, as well as for legal studies. Download from https://drive.switch.ch/index.php/s/j9S0GRMAbGZKa1A Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.155.pdf Edition LREC 2020 -
Name LibriVoxDeEN Resource type Corpus Size 62 GByte Languages English, German Production status Newly created-finished Resource usage Machine Translation, SpeechToSpeech Translation License CC BY-NC-SA 4.0 Conditions of use Freely Available Description LibriVoxDeEn is a corpus of sentence-aligned triples of German audio, German text, and English translation, based on German audio books. The speech translation data consist of 110 hours of audio material aligned to over 50k parallel sentences. An even larger dataset comprising 547 hours of German speech aligned to German text is available for speech recognition. The audio data is read speech and thus low in disfluencies. The quality of audio and sentence alignments has been checked by a manual evaluation, showing that speech alignment quality is in general very high. The sentence alignment quality is comparable to well-used parallel translation data and can be adjusted by cutoffs on the automatic alignment score. To our knowledge, this corpus is to date the largest resource for German speech recognition and for end-to-end German-to-English speech translation. Download from https://www.cl.uni-heidelberg.de/statnlpgroup/librivoxdeen/ Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.441.pdf Edition LREC 2020 -
Name List of English-French Cognates Resource type Lexicon Size 492 entries Languages English (eng), French (fra) Production status Newly created-finished Resource usage Word Sense Disambiguation License CC-BY 4.0 Conditions of use Freely Available Description This is a new database for computational approaches of synchronized diachronic investigation of language pairs, and subsequent novel findings stemming from the cognates temporal comparison of the two chosen languages. The present study is the first in the literature to use computational approaches and large data to make a cross-language temporal analysis. Taken and computed from google n-grams, a cognate database was constructed with 492 English-French word pairs composed of nouns, adjectives and adverbs. Each pair has between one and four forms in English, and up to ten in French. In English, most words have only one form for adjectives and adverbs, while most nouns have two forms (singular and plural). In French, with masculine and feminine, singular and plural, most nouns and adjectives can be found in four different surface forms. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/584_res_1.txt [76 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.107.pdf Edition LREC 2020 -
Name LK Hadith Corpus Resource type Language Resources/Technologies Infrastructure Size 10266304 tokens Languages Arabic, English Production status Newly created-finished Resource usage Evaluation/Validation License OpenSource Conditions of use Freely Available, Attribution Required Description Well-structured Arabic-English Parallel Corpus of Islamic Hadith with its components (Isnad and Matn) segmented. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/980_res_1.zip [14.54 Mb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.415.pdf Edition LREC 2020 -
Name Machine Reading Comprehension Dataset Resource type Corpus Size <Not Specified> Languages English Production status Newly created-finished Resource usage Question Answering License <Not Specified> Conditions of use From Owner Description The dataset is for building Machine Reading Comprehension models on scholarly articles. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1035_res_1.zip [408 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.675.pdf Edition LREC 2020 -
Name MAGPIE Corpus Resource type Corpus Size 56622 entries Languages English (eng) Production status Newly created-finished Resource usage Machine Learning License CC-BY 4.0 Conditions of use Freely Available Description The MAGPIE Corpus is a large sense-annotated corpus of potentially idiomatic expressions (PIEs), based on the British National Corpus (BNC). Potentially idiomatic expressions are like idiomatic expressions, but the term also covers literal uses of idiomatic expressions, such as 'I leave work at the end of the day.' for the idiom 'at the end of the day'. The corpus contains 56,622 instances, covering 1,756 different idiom types, all of which have crowdsourced meaning labels. Download from https://github.com/hslh/magpie-corpus Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.35.pdf Edition LREC 2020 -
Name Mandarinograd Resource type Corpus Size 154 entries Languages Mandarin Chinese (cmn) Production status Newly created-finished Resource usage Evaluation/Validation License <Not Specified> Conditions of use Freely Available Description Mandarinograd is a collection of 154 Winograd Schemas in Mandarin Chinese, mainly translated and adapted from English. Winograd Schemas represent particularly hard problems in natural language understanding, related to anaphora resolution. They are designed to involve common sense reasoning and to limit the biases and artefacts commonly found in natural language understanding datasets. Download from https://gitlab.com/vanTot/mandarinograd/ Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.3.pdf Edition LREC 2020 -
Name Mapudungun Corpus Resource type Corpus Size 142 hours Languages Mapudungun (arn), Spanish (spa) Production status Existing-updated Resource usage Speech Synthesis, Speech Recognition and Machine Translation License CC-BY-NC-SA 3.0 Conditions of use Freely Available Description 142 hours of conversations in Mapudungun, fully transcribed and translated into Spanish. Training, development, and test splits are provided for speech synthesis, speech recognition, and machine translation. Download from https://github.com/mingjund/mapudungun-corpus Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.350.pdf Edition LREC 2020 -
Name MASRI-HEADSET Corpus Resource type Corpus Size 8 hours Languages Maltese (mlt) Production status Newly created-finished Resource usage Speech Recognition/Understanding License <Not Specified> Conditions of use From Owner. The corpus is free for Research/Academic purposes. This corpus belongs to the University of Malta. All rights reserved. Description The MASRI-HEADSET CORPUS (MHC) is suitable for speech recognition and it has the following characteristics: - The MHC has an exact duration of 8 hours and 6 minutes. It has 3864 audio files. - The MHC has recordings from 25 different speakers: 13 women and 12 men. - Data in MHC is classified by speaker. It means, all the recordings of one single speaker are stored in one single directory. - Data is also classified according to the gender (male/female) of the speakers. - Audio files in the MHC are distributed in a 16khz@16bit mono format. - Every audio file has an ID that is compatible with ASR engines such as Kaldi and CMU-Sphinx. - Transcriptions in MHC are lowercase. No punctuation marks are permitted except dashes (-) and apostrophes (') because they belong to the Maltese orthography. Download from https://www.um.edu.mt/projects/masri/index.html Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.784.pdf Edition LREC 2020 -
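The audio format stated for the MASRI-HEADSET Corpus (16 kHz, 16-bit, mono) can be checked after download with a short script. The sketch below is only an illustration: it assumes the audio is stored as WAV files, which the entry does not state, and the function names `matches_mhc_format` and `write_silence` are hypothetical helpers, not part of the corpus distribution.

```python
import os
import tempfile
import wave


def matches_mhc_format(path):
    """Return True if the WAV file at `path` is 16 kHz, 16-bit, mono.

    Assumption: the corpus audio is in WAV format (not stated in the entry).
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == 16000   # 16 kHz sampling rate
            and wav.getsampwidth() == 2   # 2 bytes per sample = 16 bit
            and wav.getnchannels() == 1   # mono
        )


def write_silence(path, framerate=16000, sampwidth=2, channels=1, n_frames=160):
    """Write a short silent WAV file; used here only to demonstrate the check."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sampwidth)
        wav.setframerate(framerate)
        wav.writeframes(b"\x00" * (sampwidth * channels * n_frames))


# Demonstration on synthetic files rather than real corpus data.
with tempfile.TemporaryDirectory() as tmp:
    ok_file = os.path.join(tmp, "ok.wav")
    other_file = os.path.join(tmp, "other.wav")
    write_silence(ok_file)                    # 16 kHz, 16-bit, mono
    write_silence(other_file, framerate=8000) # wrong sampling rate
    print(matches_mhc_format(ok_file))
    print(matches_mhc_format(other_file))
```

To apply the check to a downloaded copy, one would walk the per-speaker directories and call `matches_mhc_format` on each file.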
Name MEL dataset Resource type Corpus Size 51 MByte Languages English (eng) Production status Newly created-finished Resource usage Information Extraction, Information Retrieval License CC BY-NC-SA 3.0 Conditions of use Freely Available Description This is an evaluation corpus for multimedia entity linking as described in (Adjali et al., 2020). It is made of tweet identifiers separated into 4 files: - kb: knowledge base containing 2657213 ids - mel_dev_ids: development dataset for evaluation (16599 ids) - mel_test_ids: test dataset for evaluation (36521 ids) - mel_train_ids: train dataset for evaluation (35976 ids) More information at https://github.com/OA256864/MEL_Tweets Download from https://drive.google.com/open?id=1kkRpVJpo-U6Gt_r4Ly-ciq4pAY03CoTg Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.528.pdf Edition LREC 2020 -
Name Minho Word Pool Resource type Lexicon Size 700 KByte Languages Portuguese Production status Existing-used Resource usage Acquisition License <Not Specified> Conditions of use Freely Available Description Minho Word Pool (MWP) is a dataset that provides normative values of imageability, concreteness and subjective frequency for 3,800 (European) Portuguese words, three subjective measures, which in spite of being extensively used in research, were still scarce for Portuguese. Data were collected with 2,357 college students who were native speakers of European Portuguese. Participants rated 100 words drawn randomly from the full set in each of the three subjective indices using a web survey procedure (via a URL link). Analyses comparing the MWP ratings with those obtained for the same words from other national and international databases showed that the MWP norms are reliable and valid, thus providing researchers with a useful tool to support research in all neuroscientific areas using verbal stimuli. Download from http://p-pal.di.uminho.pt/about/databases Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.160.pdf Edition LREC 2020 -
Name MNLI DEV partition matched delexicalized Resource type Corpus Size 12 NULL Languages English (eng) Production status Newly created-finished Resource usage Textual Entailment and Paraphrasing License <Not Specified> Conditions of use Freely Available Description <Not Specified> Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1319_res_4.txt [12.92 Mb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.850.pdf Edition LREC 2020 -
Name MNLI DEV partition mis matched delexicalized Resource type Corpus Size 2.2 MByte Languages English (eng) Production status Newly created-finished Resource usage Textual Entailment and Paraphrasing License <Not Specified> Conditions of use Freely Available Description <Not Specified> Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1319_res_7.txt [2.33 Mb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.850.pdf Edition LREC 2020 -
Name MNLI training delexicalized Resource type Corpus Size 492 MByte Languages English (eng) Production status Newly created-finished Resource usage Textual Entailment and Paraphrasing License <Not Specified> Conditions of use Freely Available Description This is the masked version of the datasets FEVER, FNC, MEDNLI, MNLI, delexicalized with the OA-NER technique presented in the paper referenced herewith. Download from https://osf.io/szdkn/?view_only=4845641a80624ac493ca14df34e68e8c Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.850.pdf Edition LREC 2020 -
Name Monsanto Trial Document Resource type Treebank Size 15000 sentences Languages English (eng) Production status Newly created-finished Resource usage Sensitive Information Detection License CC-BY 4.0 Conditions of use Freely Available Description Constituent parse trees over labeled sentences. The corpus consists of publicly released documents from the Monsanto trial. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/219_res_1.zip [4.54 Mb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.158.pdf Edition LREC 2020 -
Name Movie MPAA Information Corpus Resource type Corpus Size 5562 entries Languages English (eng) Production status Existing-updated Resource usage Machine Learning License <Not Specified> Conditions of use From Owner Description The data set of this work is an expansion of the existing movie script dataset (http://ritual.uh.edu/1493-2/) to include the MPAA rating for the movies. Other important features of our dataset are: (1) 50 more movies compared to the original dataset (2) up to 12 most similar movies for each movie in the dataset along with the MPAA rating of those similar movies (3) users' vote for MPAA components (violence, profanity, sex, drug/alcohol, frightening) (4) reason of the MPAA rating published by the MPAA organization (5) poster images for movies. PS: The whole dataset is a 5G file. So, we uploaded the meta-information (similar movies, metadata file, reasons, MPAA component ) directly to this page, and we provided the whole dataset (meta-information + posters + script) through a link. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/548_res_1.zip [1.27 Mb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.166.pdf Edition LREC 2020 -
Name Multi sentenced Algebraic Mathematical word problems in English Resource type Corpus Size <Not Specified> Languages English Production status Newly created-finished Resource usage Machine Learning License <Not Specified> Conditions of use Freely Available Description <Not Specified> Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1220_res_2.txt [234 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.579.pdf Edition LREC 2020 -
Name Multi sentenced simple Mathematical word problems in English Resource type Corpus Size <Not Specified> Languages English Production status Newly created-finished Resource usage Machine Learning License <Not Specified> Conditions of use Freely Available Description <Not Specified> Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1220_res_1.txt [140 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.579.pdf Edition LREC 2020 -
Name Multi sentenced simple Mathematical word problems in Sinhala Resource type Corpus Size <Not Specified> Languages Sinhala Production status Newly created-finished Resource usage Machine Learning License <Not Specified> Conditions of use Freely Available Description <Not Specified> Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1220_res_3.txt [250 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.579.pdf Edition LREC 2020 -
Name Multi sentenced simple Mathematical word problems in Tamil Resource type Corpus Size <Not Specified> Languages Tamil Production status Newly created-finished Resource usage Machine Learning License <Not Specified> Conditions of use Freely Available Description <Not Specified> Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1220_res_4.txt [234 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.579.pdf Edition LREC 2020 -
Name Multi-Party Dialogue Dataset Resource type Corpus Size 6.7 MByte Languages Chinese Production status Newly created-finished Resource usage Dialogue License <Not Specified> Conditions of use Freely Available Description The Multi-Party Dialogue Dataset (MPDD) consists of two files, dialogue.json and metadata.json. The file dialogue.json contains the dialogues. Each dialogue has a unique case index value in the JSON file, and is a list composed of the utterances in speaking order. Every utterance in the list contains the speaker, the content, and the annotated labels shown in the data format. The listener list of an utterance contains all listeners of that utterance with their relation types. The data format of dialogue.json is shown as follows. {case index:[{"speaker": speaker's name,"utterance": utterance, "listener": [{"name": listener's name,"relation": relation type between speaker and listener}, ...],"emotion": speaker's emotion type}, ...]} The metadata is given in metadata.json. The file defines all the emotion and relation types, and the sub-classes of relations under two perspectives, position and field. The data format of metadata.json is shown as follows. {"relation":["parent", ...],"field":{"family":["parent", ...],"school":["teacher", ...],"company":["boss", ...],"others":["couple", ...]},"position":{"superior":["parent",...],"peer":["spouse", ...],"inferior":["child",...]},"emotion":["fear",...]} Download from http://lrec2020.lrec-conf.org/sharedlrs2020/641_res_1.zip [953 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.76.pdf Edition LREC 2020 -
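The dialogue.json layout described in the Multi-Party Dialogue Dataset entry above can be read with a few lines of Python. The following is a minimal sketch, not code shipped with the resource: the sample string, the speaker names, the utterance text, and the helper names `iter_utterances`, `relation_counts`, and `emotion_counts` are all hypothetical, and it assumes that each case index is stored as a JSON string key. Only the field names ("speaker", "utterance", "listener", "name", "relation", "emotion") and the label values used here come from the description.

```python
import json
from collections import Counter

# Hypothetical sample mirroring the described dialogue.json format.
# Speaker names and utterance text are placeholders, not real MPDD data.
SAMPLE = """
{
  "1": [
    {"speaker": "A", "utterance": "first turn",
     "listener": [{"name": "B", "relation": "parent"},
                  {"name": "C", "relation": "spouse"}],
     "emotion": "fear"},
    {"speaker": "B", "utterance": "second turn",
     "listener": [{"name": "A", "relation": "child"}],
     "emotion": "fear"}
  ]
}
"""


def iter_utterances(dialogues):
    """Yield (case_index, turn_number, utterance) in speaking order."""
    for case_index, turns in dialogues.items():
        for turn_number, utterance in enumerate(turns):
            yield case_index, turn_number, utterance


def relation_counts(dialogues):
    """Count speaker-listener relation labels over all utterances."""
    counts = Counter()
    for _, _, utterance in iter_utterances(dialogues):
        for listener in utterance["listener"]:
            counts[listener["relation"]] += 1
    return counts


def emotion_counts(dialogues):
    """Count the emotion label attached to each utterance."""
    return Counter(u["emotion"] for _, _, u in iter_utterances(dialogues))


dialogues = json.loads(SAMPLE)
print(relation_counts(dialogues))
print(emotion_counts(dialogues))
```

To run this against the real file, replace `json.loads(SAMPLE)` with `json.load` on an opened dialogue.json, after checking that the actual file matches the assumptions stated above.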
Name Multi-Sense Dataset (MSD-1030) Resource type Evaluation Data Size 1030 word pairs Languages English (eng) Production status Newly created-finished Resource usage Evaluation/Validation License <Not Specified> Conditions of use Freely Available Description MSD-1030 is an English semantic similarity dataset designed to facilitate more reliable evaluations of sense embeddings. Unlike previous word similarity datasets, most words in MSD-1030 are multi-sense words, which requires sense embedding models to capture multiple meanings instead of only the most dominant sense. In addition, because of the carefully designed word pair selection process, the distribution of the similarity scores is balanced over the entire range. MSD-1030 also does not contain phrases, which are usually skipped in evaluations of word and sense embedding models. MSD-1030 is more suitable for evaluating multi-sense representations and may shed light on room for improvement in related research. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/567_res_1.zip [19 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.711.pdf Edition LREC 2020 -
Name multiCEFRLex Resource type Lexicon Size 41425 entries Languages English (eng), French (fra), Swedish (swe) Production status Newly created-finished Resource usage Language Learning/Grading License <Not Specified> Conditions of use Freely Available Description Lexical resources for foreign language (L2) learning, teaching and research graded following the Common European Framework of Reference for Languages (CEFR). Download from http://pub.cl.uzh.ch/purl/multiCEFRLex Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.43.pdf Edition LREC 2020 -
Name Multilingual Hate Speech Dataset for Fairness Evaluation Resource type Corpus Size 7.4 MByte Languages English (eng), Italian (ita), Polish (pol), Portuguese (por), Spanish (spa) Production status Newly created-finished Resource usage Document Classification, Text categorisation License Apache 2.0 Conditions of use Freely Available Description First corpus for fairness evaluation of document classifiers in the field of hate speech detection. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1174_res_1.zip [7.39 Mb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.180.pdf Edition LREC 2020 -
Name Multilingual Task-Oriented Dialog Data-Basque Resource type Corpus Size 6404 sentences Languages Basque (eus) Production status Newly created-finished Resource usage Dialogue License CC-BY-NC-SA 4.0 Conditions of use Freely Available Description The resource contains manually and automatically generated and annotated utterances for Basque from the Multilingual Task-Oriented Dialog Dataset. 3,418 utterances were manually translated and annotated, and 2,986 were automatically translated from the Spanish ones by a neural MT system and annotated by projecting the slot tags using a word aligner. Download from https://hizkuntzateknologiak.elhuyar.eus/assets/files/fmtodelh.tgz Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.340.pdf Edition LREC 2020 -
Name MultiMWE corpora Resource type Corpus Size 3302268 entries Languages Chinese, English, German Production status Newly created-finished Resource usage Corpus Creation/Annotation License <Not Specified> Conditions of use Freely Available. Free for Research Only Description After filtering, the corpora contain 3,159,226 bilingual MWE pairs for German-English and 143,042 for Chinese-English. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/78_res_1.zip [14.95 Mb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.363.pdf Edition LREC 2020 -
Name Multiword Expressions dataset with type and complexity annotation Resource type Evaluation Data Size 1.3 MByte Languages English (eng) Production status Newly created-finished Resource usage Evaluation/Validation License CC-BY-NC-SA 4.0 Conditions of use Freely Available Description We release a re-annotated version of the Complex Word Identification Shared Task 2018 dataset originally provided by Yimam et al. (2017). The multiword expressions (MWEs) in this data are annotated with MWE type, which helps draw conclusions about the complexity of each MWE type for native and non-native readers. This work demonstrates that incorporating the type of MWE based on this dataset into a lexical complexity assessment system improves its performance. Together with the original complexity labels, this dataset represents a valuable resource for the text simplification community. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1446_res_1.tsv [1.30 Mb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.545.pdf Edition LREC 2020 -
Name MyFixit Dataset Resource type Corpus Size 313 MByte Languages English Production status Newly created-finished Resource usage Information Extraction, Information Retrieval License CC-BY-NC-SA 3.0 Conditions of use Freely Available Description MyFixit is a collection of repair manuals, collected from the iFixit website. There are in total 31,601 repair manuals in 15 device categories. Each step in the manuals of the 'Mac Laptop' category is annotated with the required tool, disassembled parts, and the removal verbs (1,497 manuals with 36,659 steps). Download from https://github.com/rub-ksv/MyFixit-Dataset Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.260.pdf Edition LREC 2020 -
Name Norm It Resource type Corpus Size 12822 words Languages Italian Production status Existing-updated Resource usage Lexical Normalization License <Not Specified> Conditions of use Freely Available Description <Not Specified> Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1079_res_1.tgz [69 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.769.pdf Edition LREC 2020 -
Name North Sámi active learning morphological segmentation annotations Resource type Evaluation Data Size 796 types Languages Northern Sami (sme) Production status Existing-used Resource usage Morphological Analysis License CC-BY 4.0 Conditions of use Freely Available Description Morphological segmentation data set, for unsupervised and weakly supervised learning. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/872_res_4.tgz [45 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.486.pdf Edition LREC 2020 -
Name NTCIR14-QALab-PoliInfo-FormalRunDataset Resource type Evaluation Data Size 45 MByte Languages Japanese (jpn) Production status Newly created-finished Resource usage Information Extraction, Information Retrieval License <Not Specified> Conditions of use Freely Available Description NTCIR14-QALab-PoliInfo-FormalRunDataset. Download from https://github.com/kmr-y/NTCIR14-QALab-PoliInfo-FormalRunDataset Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.253.pdf Edition LREC 2020 -
Name NZZ Black Letter Ground Truth Resource type Corpus Size 304286 words Languages German Production status Newly created-finished Resource usage Machine Learning License CC-BY-NC 4.0 Conditions of use Freely Available Description The Neue Zürcher Zeitung (NZZ) was published in black letter from its very first issue in 1780 until 1947. From this time period, one frontpage per year was randomly sampled, resulting in a total of 167 pages. Frontpages were chosen because they typically contain highly relevant material and to avoid sampling pages that contain exclusively advertisements or stock information. During certain periods, the NZZ was published several times a day, and there were supplements, too. Due to incomplete metadata, the sampling included frontpages from supplements. Download from https://github.com/impresso/NZZ-black-letter-ground-truth Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.436.pdf Edition LREC 2020 -
Name Odia Sentiment Dataset Resource type Corpus Size 582 KByte Languages Odia Production status Newly created-finished Resource usage Machine Learning License <Not Specified> Conditions of use From Owner Description This dataset consists of 2045 sentences in Odia annotated with sentiment labels: +1 (positive), 0 (neutral), -1 (negative). The sentences were picked from the news domain, the source being the "Samaja" archives. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/789_res_1.txt [582 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.339.pdf Edition LREC 2020 -
Name OLID Resource type Corpus Size 14100 entries Languages English Production status Existing-used Resource usage Evaluation/Validation License <Not Specified> Conditions of use Freely Available Description OLID contains a collection of annotated tweets using an annotation model that encompasses the following three levels: A: Offensive Language Detection B: Categorization of Offensive Language C: Offensive Language Target Identification OLID was used in the OffensEval: Identifying and Categorizing Offensive Language in Social Media (SemEval 2019 - Task 6) shared task. Download from https://competitions.codalab.org/competitions/20011 Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.760.pdf Edition LREC 2020 -
Name OpenWordNet.PT Resource type Lexicon Size 39000 synsets Languages Portuguese Production status Existing-used Resource usage Acquisition License CC-BY 4.0 Conditions of use Freely Available Description The OpenWordnet-PT (abbreviated as OpenWN-PT or simply OWN-PT) is an open access wordnet for Portuguese. Download from http://openwordnet-pt.org/ Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.160.pdf Edition LREC 2020 -
Name Optimized Twitter Search Phrases Resource type Corpus Size 20000 phrases Languages <Not Specified> Production status Finished Resource usage <Not Specified> License <Not Specified> Conditions of use <Not Specified> Description This resource contains optimized search phrases for the 50 most prevalent languages on Twitter. Each list consists of key phrases that identify the most target language tweets while limiting the number of other-language tweets retrieved. Use a list in combination with the Twitter Streaming API to obtain the language-specific stream of tweets. When collecting Twitter data in this way, please refer to the accompanying paper to know the limits of your collection. Download from https://www.clips.uantwerpen.be/twitter/phrases.zip Referring paper http://www.lrec-conf.org/proceedings/lrec2020/workshops/WAC-II/pdf/2020.wac-1.8.pdf Edition WACXII 2020 -
Name PAN-SemEval-Hyperpartisan-News-Detection-19 Resource type Evaluation Data Size 1000000 documents Languages English Production status Newly created-finished Resource usage Document Classification, Text categorisation License CC-BY 4.0 Conditions of use From Owner Description <Not Specified> Download from https://pan.webis.de/semeval19/semeval19-web/ Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.159.pdf Edition LREC 2020 -
Name Parallel Sentiment Resource type Corpus Size 300 KByte Languages ar,cs,de,en,es,fi,fr,hi,id,it,ja,ko,pl,pt,ru,sv,th,tr,zh Production status Newly created-finished Resource usage Emotion Recognition/Generation License <Not Specified> Conditions of use Freely Available Description Sentence-level sentiment annotations (positive or negative) for 19 languages semi-automatically generated from PUD (parallel corpora in Universal Dependencies). Download from http://lrec2020.lrec-conf.org/sharedlrs2020/288_res_1.zip [13 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.500.pdf Edition LREC 2020 -
Name ParaPat Resource type Corpus Size 68000000 sentences Languages Chinese (zho), English (eng), French (fra), Japanese (jpn), Russian (rus), Korean, German, Spanish Production status Newly created-finished Resource usage Machine Translation, SpeechToSpeech Translation License GNU-GPL v3.0 Conditions of use Freely Available Description ParaPat is a corpus of parallel sentences from patents extracted from Google Patents. Download from https://github.com/soares-f/parapat Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.465.pdf Edition LREC 2020 -
Name PO-EMO Resource type Corpus Size 4191 lines Languages English, German Production status Newly created-finished Resource usage Emotion Recognition/Generation License <Not Specified> Conditions of use Freely Available Description PO-EMO is a corpus of German and English Poetry, with fine-grained multi-label annotation of aesthetic emotions. Download from https://github.com/tnhaider/poetry-emotion Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.205.pdf Edition LREC 2020 -
Name Prague Czech English Dependency Treebank 2.0 Resource type Treebank Size 50000 sentences Languages Czech, English Production status Existing-updated Resource usage <Not Specified> License CC BY-NC-SA 3.0 Conditions of use Freely Available Description The Prague Czech-English Dependency Treebank 2.0 (PCEDT 2.0) is a major update of the Prague Czech-English Dependency Treebank 1.0 (LDC2004T25). It is a manually parsed Czech-English parallel corpus sized over 1.2 million running words in almost 50,000 sentences for each part. Download from http://hdl.handle.net/11858/00-097C-0000-0015-8DAF-4 Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.641.pdf Edition LREC 2020 -
Name Prague Dependency Treebank - Consolidated 1.0 Resource type Treebank Size 4000000 tokens Languages Czech Production status Newly created-finished Resource usage <Not Specified> License <Not Specified> Conditions of use Restricted Access Description The Prague Dependency Treebank - Consolidated (PDT-C) is a consolidated release of the existing PDT-corpora of Czech data with manual annotation at all three PDT-annotation layers (morphological, surface syntax and deep syntax layer). Download from http://hdl.handle.net/11234/1-3185 Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.641.pdf Edition LREC 2020 -
Name Prague Dependency Treebank 3.5 Resource type Treebank Size 50000 sentences Languages Czech Production status Existing-updated Resource usage <Not Specified> License CC BY-NC-SA 4.0 Conditions of use Freely Available Description The Prague Dependency Treebank 3.5 is the 2018 edition of the core Prague Dependency Treebank (PDT). It contains all PDT annotation made at the Institute of Formal and Applied Linguistics under various projects between 1996 and 2018 on the original texts, i.e., all annotation from PDT 1.0, PDT 2.0, PDT 2.5, PDT 3.0, PDiT 1.0 and PDiT 2.0, plus corrections, new structure of basic documentation and new list of authors covering all previous editions. The Prague Dependency Treebank 3.5 (PDT 3.5) contains the same texts as the previous versions since 2.0; there are 49,431 annotated sentences (over 800 thousand nodes) on all layers, from tectogrammatical to words, and additional sentences on the analytical (surface dependency syntax) and morphological layers of annotation (approx. 2 million words in total). Closely linked to the tectogrammatical layer is the annotation of sentence information structure, multiword expressions, coreference, bridging relations and discourse relations. Download from http://hdl.handle.net/11234/1-2621 Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.641.pdf Edition LREC 2020 -
Name PT Lexical-Semantic Relations Resource type Lexicon Size <Not Specified> Languages Portuguese Production status Existing-used Resource usage Acquisition License <Not Specified> Conditions of use Freely Available Description Lexical-Semantic relations acquired from ten computational lexical resources in Portuguese (PAPEL, Dicionário Aberto, Wiktionary.PT, TeP, OpenThesaurus.PT, OpenWordnet-PT, PULO, WordNet.Br, Port4Nooj, ConceptNet) together with the number of resources where they were extracted from. Download from http://ontopt.dei.uc.pt/index.php?sec=download_outros Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.160.pdf Edition LREC 2020 -
Name pwnalign Resource type Evaluation Data Size 1M entries Languages English (eng) Production status Newly created-finished Resource usage Evaluation/Validation License <Not Specified> Conditions of use Freely Available Description A set of 3 gold standards for the evaluation of alignments between WordNet and Wikipedia. Download from https://sourceforge.net/projects/pwnalign/ Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.597.pdf Edition LREC 2020 -
Name Read Speech Corpus (RSC) Resource type Corpus Size 8.3 GByte Languages Romanian (ron) Production status Newly created-finished Resource usage Speech Recognition/Understanding License CC-BY-NC-ND 3.0 Conditions of use From Owner Description Read Speech Corpus (RSC) is the largest publicly available Romanian speech corpus for training and evaluating Automatic Speech Recognition (ASR) systems, comprising 100 hours from 164 speakers. Download from https://speed.pub.ro/downloads/speech-datasets/ Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.814.pdf Edition LREC 2020 -
Name Renmin-OCR-NER Resource type Evaluation Data Size 438249 tokens Languages Chinese (cmn) Production status Newly created-finished Resource usage Named Entity Recognition License <Not Specified> Conditions of use Freely Available Description Newspaper images transcribed with text locations and transcriptions for OCR evaluation and annotated for named entities. Download from https://github.com/hltcoe/cmn-renmin-ocr-ner-dataset Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.570.pdf Edition LREC 2020 -
Name Resources for Academic Writing Resource type Corpus Size 8.6 MByte Languages English (eng) Production status Newly created-finished Resource usage Textual Entailment and Paraphrasing License <Not Specified> Conditions of use Freely Available Description The dataset contains keyphrases and n-grams extracted from the reference corpus, training and test datasets for the informal word identification task, and training and test datasets for the academic paraphrasing task. A README file is included explaining the details. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1020_res_1.zip [8.59 Mb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.722.pdf Edition LREC 2020 -
Name Rigor-Mortis-LREC2020 Resource type Corpus Size 41820 tokens Languages French (fra) Production status Newly created-finished Resource usage Multiword Expressions License CC-BY-NC 4.0 Conditions of use Freely Available Description The resource is an export of the multiword expressions annotated by players of the Rigor-Mortis game. See https://github.com/gwaps4nlp/rigor-mortis/blob/master/LREC_2020/LREC2020_RM.pdf Download from https://github.com/gwaps4nlp/rigor-mortis/tree/master/LREC_2020 Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.541.pdf Edition LREC 2020 -
Name Romanian Named Entity Corpus Resource type Corpus Size 26377 entities Languages Romanian Production status Newly created-finished Resource usage Named Entity Recognition License MIT License Conditions of use Freely Available Description RONEC, the Romanian Named Entity Corpus, contains over 26000 entities in ~5000 annotated sentences, belonging to 16 distinct classes. The sentences have been extracted from a copyright-free newspaper, covering several styles. This corpus represents the first initiative in the Romanian language space specifically targeted at named entity recognition. It is available in BRAT and CoNLL-U Plus formats, and it is free to use and extend at github.com/dumitrescustefan/ronec. Download from https://github.com/dumitrescustefan/ronec Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.546.pdf Edition LREC 2020 -
Name RusLTC Resource type Corpus Size 2.3 million tokens Languages English (eng), Russian (rus) Production status Existing-updated Resource usage Document Classification, Text categorisation License CC-BY-SA 3.0 Conditions of use Freely Available Description Russian Learner Translator Corpus (RusLTC) is a bi-directional parallel corpus, which stores English-Russian translations produced by university students majoring in translation studies in Russia (Russian L1). The English-Russian subcorpus has multiple translations of 402 sources, while the Russian-English part includes 125 source-target sets. The number of translations of a single source averages 8 (min=1, max=67). Many texts are translations of excerpts from larger sources; the median length of translations in the English-to-Russian subcorpus is 380 words. RusLTC is downloadable as a customized TMX-file and a plaintext archive (https://www.rus-ltc.org/static/html/about.html). The TMX contains the sentence-aligned version of the corpus with the automatic alignment errors corrected manually. It also carries 10 metadata fields, including the translator's gender and affiliation, education type and level, grade for the translation, year and conditions of translation (routine/exam; home/classroom) and source text genre. RusLTC includes texts in 10 genres: academic, informational, educational, essay, technical, fiction, educational, speech, letters, advertisement. Detailed and automatically updated statistics are available from https://dev.rus-ltc.org/statistics/. Download from https://www.rus-ltc.org/static/html/about.html Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.505.pdf Edition LREC 2020 -
Name Scholarly Documents Deduplication Dataset Resource type Corpus Size 204 MByte Languages English Production status Newly created-finished Resource usage Document Classification, Text categorisation License <Not Specified> Conditions of use Freely Available Description A labelled dataset of 100K scholarly documents categorised into non-duplicate, near-duplicate and exact-duplicate groups. 50K examples are non-duplicates; 25K examples belong to the exact-duplicate groups and the remaining 25K examples belong to the near-duplicate groups. Download from https://core.ac.uk/documentation/dataset/ Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.113.pdf Edition LREC 2020 -
Name Seed corpus for entity coreference in email conversations Resource type Corpus Size 866 coreference chains Languages English (eng) Production status Newly created-finished Resource usage Anaphora, Coreference License Apache 2.0 Conditions of use Freely Available Description Corpus for the submission entitled: A study on Entity Resolution in Email Conversations. The seed corpus contains 46 email threads comprising 245 email messages. The annotations are provided in CoNLL format. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/899_res_1.tar.xz [83 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.8.pdf Edition LREC 2020 -
Name Semantic verb classes Resource type Evaluation Data Size 17 entries Languages English Production status Newly created-finished Resource usage Evaluation/Validation License CC BY 4.0 Conditions of use Freely Available Description 17 semantic verb classes collected for English via a semantic clustering task, where 825 verbs are placed one by one into circles visible on the computer screen by an annotator, each circle corresponding to a grouping of similar and related verbs. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/444_res_2.txt [6 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.705.pdf Edition LREC 2020 -
Name SemEU-R CMV Corpus Resource type Corpus Size <Not Specified> Languages English (eng) Production status Newly created-finished Resource usage Corpus Creation/Annotation License <Not Specified> Conditions of use From Owner Description <Not Specified> Download from http://katfuji.lab.tuat.ac.jp/nlp_datasets/ Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.143.pdf Edition LREC 2020 -
Name SentiEcon Resource type Lexicon Size 6470 entries Languages English Production status Newly created-finished Resource usage Opinion Mining/Sentiment Analysis License CC BY-NC 3.0 Conditions of use Freely Available for Academic Purposes Description Domain-specific computational lexicon designed for sentiment analysis applications, compiled from our own corpus of online business news. It was created as a plug-in lexicon for the sentiment analysis tool Lingmotif, and thus it follows its data structure requirements and presupposes the availability of a general-language core sentiment lexicon that covers non-specific sentiment-carrying terms and phrases. It contains 6,470 entries, both single and multi-word expressions, each with tags denoting their semantic orientation and intensity. Download from http://tecnolengua.uma.es/sentiecon Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.623.pdf Edition LREC 2020 -
Name Serial Speakers: a Dataset of TV Series Resource type Corpus Size 45 MByte Languages English Production status Newly created-finished Resource usage Corpus Creation/Annotation License CC-BY 4.0 Conditions of use Freely Available Description "Serial Speakers" is a dataset of 161 annotated episodes from three popular American TV serials, "Breaking Bad" (62 annotated episodes), "Game of Thrones" (73), and "House of Cards". "Serial Speakers" is suitable for addressing both high level multimedia retrieval tasks in real world scenarios, and lower level speech processing tasks in challenging conditions. The boundaries, speaker and textual content of every speech turn, along with all scene boundaries, have been manually annotated for the whole set of episodes; the shot boundaries and the recurring shots for the first season of each of the three series; and the interacting speakers for a subset of 10 episodes. Because of copyright restrictions, the textual content is encrypted in the public version of the dataset, but we provide the users with a simple tool for recovering the text from their own subtitle files (https://github.com/bostxavier/Serial-Speakers). Download from https://figshare.com/articles/TV_Series_Corpus/3471839 Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.525.pdf Edition LREC 2020 -
Name SpA-Verb Resource type Evaluation Data Size 29721 entries Languages English Production status Newly created-finished Resource usage Evaluation/Validation License CC BY 4.0 Conditions of use Freely Available Description An English intrinsic evaluation verb similarity dataset containing similarity scores for 29,721 unique verb pairs and 825 target verbs, obtained via a two-phase data collection design: a rough semantic clustering phase, where verbs are placed in broad semantic classes, and a spatial multi-arrangement phase, where fine-grained spatial similarity judgments are elicited on related verbs, clustered together in Phase 1. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/444_res_1.txt [746 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.705.pdf Edition LREC 2020 -
Name SQuAD-es v1.1 Resource type Corpus Size 87595 (context, question, answer) tuples Languages Spanish (spa) Production status Newly created-finished Resource usage Question Answering License MIT License Conditions of use Freely Available Description Automatic Spanish translation of the training set of the Stanford Question Answering Dataset (SQuAD) v1.1 Download from https://github.com/ccasimiro88/TranslateAlignRetrieve/tree/master/SQuAD-es-v1.1 Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.677.pdf Edition LREC 2020 -
Name Statement Classification Dataset, arXMLiv 08.2018 Resource type Corpus Size 10555689 paragraphs Languages English (eng) Production status Newly created-finished Resource usage Document Classification, Text categorisation License SIGMathLing Non-Disclosure-Agreement (research-only use) Conditions of use Access is restricted to SIGMathLing members under the SIGMathLing Non-Disclosure-Agreement, as for most arXiv articles the right of distribution was only given (or assumed) to arXiv itself. Membership is free and granted on the honor system. Description This resource contains 10,555,689 paragraphs with associated statement labels, realized as one paragraph per file, one sentence per line. Each file is placed in a subdirectory named after its annotated class. The statements were extracted from author-annotated environments, where we only selected the first paragraph, immediately following the heading. Headings include both structural sections (e.g. Introduction) and scholarly statement annotations (e.g. Definition, Proof, Remark). A control dataset of the same statements is also included with all mathematical symbolism omitted (nomath), numbering 10,137,007 paragraphs. This math-free resource is smaller, as omitting the formulas results in fewer unique paragraphs. It may be considered a useful benchmark when trying to evaluate the specific impact of mathematical expressions on classification performance. Community feedback is welcome on all of: data quality, representation issues, as well as organization and archival best practices. New versions of this data are planned for joint release with new releases of the arXMLiv corpus. Download from https://sigmathling.kwarc.info/resources/arxmliv-statements-082018/ Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.153.pdf Edition LREC 2020 -
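The on-disk layout described in the Statement Classification Dataset entry above (one paragraph per file, one sentence per line, each file in a subdirectory named after its class) lends itself to a simple directory walk. The sketch below is illustrative only and is not part of the resource: the function names `iter_paragraphs` and `build_demo_tree`, the file names, the lowercase class directory names, the sample sentences, and the UTF-8 encoding are all assumptions made for the demonstration.

```python
from pathlib import Path
import tempfile


def iter_paragraphs(root):
    """Yield (label, sentences) for every paragraph file under root.

    Assumes the layout root/<class name>/<paragraph file>, where each
    file holds one paragraph with one sentence per line.
    """
    class_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for class_dir in class_dirs:
        files = sorted(p for p in class_dir.iterdir() if p.is_file())
        for paragraph_file in files:
            text = paragraph_file.read_text(encoding="utf-8")
            sentences = [ln.strip() for ln in text.splitlines() if ln.strip()]
            yield class_dir.name, sentences


def build_demo_tree(root):
    """Create a tiny hypothetical tree so the sketch runs without the data."""
    samples = {
        "definition": {"0001.txt": "A group is a set with an operation .\n"
                                   "It satisfies three axioms .\n"},
        "proof": {"0001.txt": "We argue by induction .\n"},
    }
    for label, files in samples.items():
        class_dir = Path(root) / label
        class_dir.mkdir(parents=True)
        for name, content in files.items():
            (class_dir / name).write_text(content, encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    build_demo_tree(tmp)
    for label, sentences in iter_paragraphs(tmp):
        print(label, len(sentences))
```

Because the generator yields one paragraph at a time, it does not need to hold the whole collection in memory, which matters for a resource of roughly ten million files.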
Name STEM-ECR Resource type Corpus Size 3.3 MByte Languages English (eng) Production status Newly created-finished Resource usage Information Extraction, Information Retrieval License CC-BY-SA 3.0 Conditions of use Freely Available Description The STEM ECR v1.0 dataset introduces the task of Scientific Entity Extraction, Classification, and Resolution on scholarly publications in STEM (Science, Technology, Engineering, and Medicine) disciplines. It comprises annotated scholarly abstracts from 10 STEM disciplines that were found to be the most prolific ones on a major publishing platform. The annotated data includes: phrase-based scientific entities, and their corresponding disambiguated references in Wikipedia and Wiktionary as applicable. The purpose of the dataset is to provide a benchmark for the evaluation of scientific entity extraction, classification, and resolution tasks in a domain-independent fashion. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/1130_res_1.zip [4.93 Mb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.268.pdf Edition LREC 2020 -
Name SVALex Resource type Lexicon Size 15619 entries Languages Swedish (swe) Production status Existing-used Resource usage Language Learning/Grading License CC BY-NC-SA 4.0 Conditions of use Freely Available Description Lexical resources for foreign language (L2) learning, teaching and research graded following the Common European Framework of Reference for Languages (CEFR) Download from http://pub.cl.uzh.ch/purl/multiCEFRLex Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.43.pdf Edition LREC 2020 -
Name TableBank Resource type Evaluation Data Size 26.6 GByte Languages <Not Specified> Production status Newly created-finished Resource usage Information Extraction, Information Retrieval License CC-BY-NC-ND 4.0 Conditions of use From Owner Description TableBank is a new image-based table detection and recognition dataset built with novel weak supervision from Word and LaTeX documents on the internet; it contains 417K high-quality labeled tables. Download from https://github.com/doc-analysis/TableBank Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.236.pdf Edition LREC 2020 -
Name TamilMemes Resource type Corpus Size <Not Specified> Languages English, Tamil, Tamil-English code mixed Production status <Not Specified> Resource usage <Not Specified> License <Not Specified> Conditions of use <Not Specified> Description TamilMemes folder contains troll and non-troll images in the directory with the same name. Download from https://github.com/sharduls007/TamilMemes Referring paper http://www.lrec-conf.org/proceedings/lrec2020/workshops/WILDRE-5/pdf/2020.wildre-1.2.pdf Edition WILDRE-5 2020 -
Name TaPaCo Resource type Corpus Size 2789631 sentences Languages English, Esperanto, Italian, Russian, Turkish, French, German, Portuguese, Spanish + 88 other languages Production status Newly created-finished Resource usage Textual Entailment and Paraphrasing License CC-BY 2.0 Conditions of use Freely Available Description A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Download from https://doi.org/10.5281/zenodo.3707949 Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.848.pdf Edition LREC 2020 -
Name Taxonomic Word Embeddings - Trained on English WordNet Random Walk Pseudo-Corpora Resource type Corpus Size 71 models Languages English Production status Newly created-finished Resource usage Language Modelling License CC BY-NC-SA 4.0 Conditions of use Freely Available Description This archive contains a collection of computational models called word embeddings. These are vectors that contain numerical representations of words. They have been trained on pseudo-sentences generated artificially from a random walk over the English WordNet taxonomy, and thus reflect taxonomic knowledge about words (rather than contextual knowledge). A separate embedding model was trained for each of the 72 random walk corpora that were generated, and thus 72 different embedding models are made available (4.4GB). For training, an off-the-shelf implementation of pytorch was used and no major parameters were changed, essentially using it 'as is'. Each model has been trained for 30 epochs. As the corpus files differed with regard to the parameters used in their generation, these are also reflected in the models. The parameters are: - size: number of sentences/lines in the training corpus - direction: the direction that the random walk over WordNet was allowed to go while generating sentences (possibilities are up/down/both) - minimal sentence length: the shortest sentence length (in number of words) Download from https://arrow.dit.ie/datas/12/ Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.602.pdf Edition LREC 2020 -
Name The Alice Datasets: fMRI & EEG Observations of Natural Language Comprehension Resource type Corpus Size 75 participants Languages English Production status Newly created-finished Resource usage Corpus Creation/Annotation License CC0 Conditions of use Freely Available Description The Alice Datasets are a set of datasets based on magnetic resonance data and electrophysiological data, collected while participants heard a story in English. Along with the datasets and the text of the story, there exists a variety of different linguistic and computational measures ranging from prosodic predictors to predictors capturing hierarchical syntactic information included in the supplementary materials. These ecologically valid datasets can be easily reused to replicate prior work and to test new hypotheses about natural language comprehension in the brain. Download from https://openneuro.org/datasets/ds002322/versions/1.0.3 Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.15.pdf Edition LREC 2020 -
Name Time-Aware Word Embeddings for Three Lebanese News Archives Resource type Corpus Size 3.7 GByte Languages Arabic Production status Newly created-finished Resource usage Text Mining License CC-BY 4.0 Conditions of use Freely Available Description The resource consists of various Arabic word embeddings trained using three large Lebanese news archives, namely Assafir, Annahar and Alhayat. The archives consist of 609,386 scanned newspaper images spanning a total of 151 years, ranging from 1933 to 2011. To train the word embeddings, Google's Tesseract 4.0 OCR engine was employed to transcribe the scanned news archives, and various archive-level as well as decade-level word embeddings were learnt. Download from https://zenodo.org/record/3538880 Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.580.pdf Edition LREC 2020 -
Name TOROT Resource type Treebank Size 1155577 words Languages Church Slavic (chu), Old East Slavonic (orv), Russian (rus) Production status Existing-updated Resource usage Corpus Creation/Annotation License CC-BY-NC-SA 3.0 Conditions of use Freely Available Description Tromsø Old Russian and Old Church Slavonic Treebank (TOROT) spans from the earliest Old Church Slavonic to modern Russian texts, covering more than a thousand years of continuous language history. It is a member of the larger PROIEL family of treebanks of ancient languages. Download from http://torottreebank.github.io/ Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.646.pdf Edition LREC 2020 -
Name TRANSLIT Resource type Corpus Size 1.6M entries Languages <Not Specified> Production status Newly created-finished Resource usage Machine Translation, SpeechToSpeech Translation License CC0 Conditions of use Freely Available Description A large-scale name transliteration corpus. It merges large transliteration resources such as JRC, Geonames, Amazon transliteration and wiki-lang-all (new) names. The resources were unified and the name variants were merged. Download from https://github.com/fbenites/TRANSLIT Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.399.pdf Edition LREC 2020 -
Name Turku NER corpus Resource type Corpus Size 202116 tokens Languages Finnish (fin) Production status Newly created-finished Resource usage Named Entity Recognition License CC-BY-SA 4.0 Conditions of use Freely Available Description The Turku NER corpus is a manually annotated broad-coverage corpus for Finnish NER, covering over 10,000 annotations for person, organization, location, product and event mentions as well as dates in 754 documents (approx. 200,000 tokens) representing ten different genres of text. The corpus builds on and extends the existing annotations of the Turku Dependency Treebank corpus included in the Universal Dependencies repository. Download from https://turkunlp.org/turku-ner-corpus Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.567.pdf Edition LREC 2020 -
Name TutorialVQA Resource type Corpus Size 4 GByte Languages English (eng) Production status Newly created-finished Resource usage Question Answering License CC-BY-NC 4.0 Conditions of use Freely Available Description The VideoQA corpus contains questions and answers pertaining to Photoshop helpx tutorial videos. The questions were collected on Amazon Mechanical Turk (AMT) and pertain to helpx tutorial video segments. These video segments were manually annotated. In total there are approximately 6,000 questions collected, pertaining to 408 total video segments, and relating to 76 videos. Turkers were shown the video segments and asked to input questions which the video segment can answer. To expand the question set, a separate task asked Turkers to paraphrase the original questions. The videos originate from https://helpx.adobe.com/. Download from https://github.com/acolas1/TutorialVQAData Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.670.pdf Edition LREC 2020 -
Name TV-AfD_Imperative_Corpus Resource type Corpus Size 3 MByte Languages English (eng) Production status Newly created-finished Resource usage Language Modelling License GNU GPL v3.0 Conditions of use Freely Available Description This is a corpus with imperative and non-imperative sentences from the TV show The Big Bang Theory and Wikipedia Articles for Deletion. We used rule-based classification and manual annotation to classify imperative sentences. 1. Data from the TV show: Both raw data and classified data are included in this repository. Each file in the raw data is from one episode, with each line representing one dialogue from that specific episode. In the tab-separated classification data file, each line represents one dialogue. The columns respectively are: data source (in the format of nxmm, with 'n' indicating the season of the show and 'mm' representing the episode), text, imperative classification (1 for imperative and 0 for non-imperative), imperative category, and whether the imperative sentence has affixal negative markers or not. 2. Data from Wikipedia Articles for Deletion: Only classified data for sentences are included from this source. The file is tab-separated with two columns, text and classification (1 for imperative and 0 for non-imperative) respectively. Each line represents one instance of data. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/235_res_1.zip [1.07 Mb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.805.pdf Edition LREC 2020 -
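The five-column, tab-separated file described in the TV-AfD_Imperative_Corpus entry could be read along the following lines. This is only a sketch under assumptions: it supposes there is no header row, that the data source always has exactly the nxmm shape, and that the last two columns can be kept as plain strings, since the entry does not say how they are encoded. The names `Row` and `parse_rows`, the category value "command" and both sample lines are hypothetical and not taken from the corpus.

```python
import csv
import io
import re
from typing import NamedTuple


class Row(NamedTuple):
    season: int
    episode: int
    text: str
    is_imperative: bool
    category: str           # kept as a raw string; value set is not documented here
    affixal_negation: str   # kept as a raw string; encoding is not documented here


# Data source of the form nxmm: season digits, a literal 'x', two episode digits.
SOURCE_RE = re.compile(r"^(\d+)x(\d{2})$")


def parse_rows(handle):
    """Yield one Row per tab-separated line with the five columns described."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for source, text, label, category, negation in reader:
        match = SOURCE_RE.match(source)
        if match is None:
            raise ValueError(f"unexpected data source: {source!r}")
        yield Row(
            season=int(match.group(1)),
            episode=int(match.group(2)),
            text=text,
            is_imperative=(label == "1"),
            category=category,
            affixal_negation=negation,
        )


# Hypothetical sample lines used only to exercise the parser.
sample = io.StringIO(
    "1x05\tPass me the salt.\t1\tcommand\t0\n"
    "2x11\tI like trains.\t0\t\t0\n"
)
rows = list(parse_rows(sample))
```

The two-column Wikipedia Articles for Deletion file would need a separate, simpler reader, because it has no data source column to decode.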
Name Twitter Datasets Resource type Corpus Size 1904 stance-annotated tweets Languages German (deu) Production status Newly created-finished Resource usage Opinion Mining/Sentiment Analysis License CC-BY 4.0 Conditions of use Freely Available Description The German Titling Twitter Corpus consists of 1904 stance-annotated tweets (125KB) collected in June/July 2018 mentioning 24 German politicians with a doctoral degree. The Addendum contains an additional 296 stance-annotated tweets from each month of 2018 mentioning 6 left-leaning and 4 right-leaning politicians with a doctoral degree. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/181_res_1.zip [125 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.606.pdf Edition LREC 2020 -
Name Twitter Users with City Population Size Resource type Corpus Size 2000000 sentences Languages English Production status Newly created-finished Resource usage Document Classification, Text categorisation License <Not Specified> Conditions of use Freely Available Description 2 million Twitter user IDs and tweet IDs for tweets written by each user. Can be used to retrieve original data from the Twitter API subject to Twitter's terms of service. Download from http://lit.eecs.umich.edu/downloads.html Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.771.pdf Edition LREC 2020 -
Name Universal Dependencies Resource type Treebank Size 25 million words Languages Afrikaans, Akkadian, Amharic, Ancient Greek, Arabic, Armenian, Assyrian, Bambara, Basque, Belarusian, Bhojpuri, Breton, Bulgarian, Buryat, Cantonese, Catalan, Chinese, Classical Chinese, Coptic, Croatian, Czech, Danish, Dutch, English, Erzya, Estonian, Faroese, Finnish, French, Galician, German, Gothic, Greek, Hebrew, Hindi, Hindi English, Hungarian, Indonesian, Irish, Italian, Japanese, Karelian, Kazakh, Komi Permyak, Komi Zyrian, Korean, Kurmanji, Latin, Latvian, Lithuanian, Livvi, Maltese, Marathi, Mbya Guarani, Moksha, Naija, North Sami, Norwegian, Old Church Slavonic, Old French, Old Russian, Persian, Polish, Portuguese, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Skolt Sami, Slovak, Slovenian, Spanish, Swedish, Swedish Sign Language, Swiss German, Tagalog, Tamil, Telugu, Thai, Turkish, Ukrainian, Upper Sorbian, Urdu, Uyghur, Vietnamese, Warlpiri, Welsh, Wolof, Yoruba Production status Existing-updated Resource usage Parsing and Tagging License Universal Dependencies 2.6 License Agreement Conditions of use Freely Available Description Universal Dependencies is an initiative to create cross-linguistically consistent morphosyntactic annotation for many languages. The latest release, UD v2.5, contains 157 treebanks in 90 languages. The resources are available from https://lindat.mff.cuni.cz/. Documentation can be found at https://universaldependencies.org. Download from https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3105 Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.497.pdf Edition LREC 2020 -
Name Urban Dictionary Embeddings Resource type Corpus Size 3 GByte Languages English Production status Newly created-finished Resource usage Document Classification, Text categorisation License <Not Specified> Conditions of use Freely Available Description Word embeddings trained on the entirety of the content of Urban Dictionary as of October 16, 2019. A skip-gram architecture with negative sampling was used within the fastText framework. Download from http://smash.inf.ed.ac.uk/ud-embeddings/ Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.586.pdf Edition LREC 2020 -
Name Urdu-News-Augmented-Dataset Resource type Corpus Size 5.04 MByte Languages English, Urdu (urd) Production status Newly created-finished Resource usage Document Classification, Text categorisation License <Not Specified> Conditions of use Freely Available Description This language resource contains a dataset of 900 news articles originally in Urdu, annotated as real or fake. Additionally, it contains 400 news articles as an augmentation dataset generated using the Google Translate MT system from English to Urdu, as well as a number of combinations of these datasets for exploration of the augmentation effect. The original English Fake News dataset is available from https://web.eecs.umich.edu/~mihalcea/downloads.html#FakeNews. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/66_res_1.zip [5.29 Mb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.309.pdf Edition LREC 2020 -
Name WFM mathematical definitions dataset Resource type Corpus Size <Not Specified> Languages English Production status Newly created-finished Resource usage Definition extraction License GNU GPL v3.0 Conditions of use Freely Available Description The WFM dataset (Vanetik et al., 2019) was created by collecting and processing 2352 articles from Wolfram Mathworld (Weisstein and others, 2007). The dataset contains 1793 sentences, of which 811 are definitions and 982 are non-definitions. Sentences were extracted automatically and then manually separated into two categories: definitions and statements (non-definitions). Download from https://github.com/uplink007/FinalProject/tree/master/data/wolfram Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.256.pdf Edition LREC 2020 -
Name WikiFactCheck-English Resource type Corpus Size 124821 entries Languages English (eng) Production status Newly created-finished Resource usage Fact checking License <Not Specified> Conditions of use Freely Available Description WikiFactCheck-English is a dataset of 124k triples consisting of a claim, context and an evidence document extracted from English Wikipedia articles and citations, as well as 34k manually written claims that are refuted by the evidence documents. This is the largest fact checking dataset consisting of real claims and evidence to date; it will allow the development of fact checking systems that can better process claims and evidence in the real world. Download from https://github.com/WikiFactCheck-English Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.849.pdf Edition LREC 2020 -
Name word2word Resource type Lexicon Size 41 GByte Languages af,ar,bg,bn,br,bs,ca,cs,da,de,el,en,eo,es,et,eu,fa,fi,fr,gl,he,hi,hr,hu,hy,id,is,it,ja,ka,kk,ko,lt,lv,mk,ml,ms,nl,no,pl,pt,pt_br,ro,ru,si,sk,sl,sq,sr,sv,ta,te,th,tl,tr,uk,ur,vi,ze_en,ze_zh,zh_cn,zh_tw Production status Newly created-finished Resource usage Machine Translation, SpeechToSpeech Translation License GNU GPL v3.0 Conditions of use Freely Available Description A collection of bilingual lexicons for 3,564 language pairs. Download from https://github.com/Kyubyong/word2word Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.371.pdf Edition LREC 2020 -
Name Yarega Resource type Corpus Size 92848 words Languages Telugu Production status Newly created-finished Resource usage Opinion Mining/Sentiment Analysis License <Not Specified> Conditions of use Freely Available Description The sentences in the dataset are in Telugu script. The annotated data for aspect term identification is in the form of word/tag. Annotated data for aspect polarity classification and aspect term categorisation is in JSON format. Each JSON file contains an array of JSON objects. Each object consists of two properties, "sentence" and "aspectTerms". The property "sentence" has the review sentence. The property "aspectTerms" has an array of JSON objects where each object has the properties "aspectTerm", "start", "end", "polarity" and "category". "aspectTerm" has the aspect term itself. "start" and "end" have the beginning and ending indexes of the aspect term. "polarity" and "category" are the annotated polarity of the aspect term and the annotated category of the aspect term respectively. The dataset contains 5027 sentences and 92848 tokens. The total number of aspect terms in all those sentences is 7130, of which 3521 are of positive polarity, 2480 are of negative polarity and 1129 are of neutral polarity. Download from http://lrec2020.lrec-conf.org/sharedlrs2020/706_res_1.zip [53 Kb] Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.617.pdf Edition LREC 2020 -
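The JSON layout described in the Yarega entry (an array of objects with "sentence" and "aspectTerms", each aspect term carrying "aspectTerm", "start", "end", "polarity" and "category") could be processed roughly as below. This is a sketch under assumptions: the sample record is hypothetical and uses an English placeholder rather than Telugu text, the label strings "positive" and "general" are guesses at how values are spelled, and "end" is assumed to be an exclusive character index, which the entry does not state. The function names `polarity_counts` and `span_matches` are illustrative only.

```python
import json
from collections import Counter

# Hypothetical record following the described layout; not taken from the dataset.
sample = json.loads("""
[
  {
    "sentence": "placeholder review sentence",
    "aspectTerms": [
      {"aspectTerm": "review", "start": 12, "end": 18,
       "polarity": "positive", "category": "general"}
    ]
  }
]
""")


def polarity_counts(records):
    """Count aspect terms per polarity label across all records."""
    counts = Counter()
    for record in records:
        for term in record["aspectTerms"]:
            counts[term["polarity"]] += 1
    return counts


def span_matches(record, term):
    """Check that sentence[start:end] equals the aspect term.

    Assumes character offsets with an exclusive end; if the real files use an
    inclusive end or token offsets, this check needs adjusting.
    """
    return record["sentence"][term["start"]:term["end"]] == term["aspectTerm"]


counts = polarity_counts(sample)
ok = span_matches(sample[0], sample[0]["aspectTerms"][0])
```

Running `polarity_counts` over all files in the archive would be one way to compare against the polarity totals quoted in the entry, provided the label spellings match.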
Name Yorùbá and Twi Corpus Resource type Corpus Size <Not Specified> Languages Twi, Yorùbá Production status Newly created-finished Resource usage Corpus Creation/Annotation License <Not Specified> Conditions of use Freely Available Description This is a collection of some noisy and fully diacritized Yorùbá texts from different sources. It also includes some Twi texts, as well as annotated Yorùbá texts for Named Entity Recognition tasks. Download from https://github.com/ajesujoba/YorubaTwi-Embedding Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.335.pdf Edition LREC 2020 -
Name Yorùbá and Twi FastText Embedding Resource type Corpus Size <Not Specified> Languages Twi, Yorùbá Production status Newly created-finished Resource usage Natural Language Processing Downstream tasks License <Not Specified> Conditions of use Freely Available Description Collection of FastText pre-trained embeddings for Yorùbá and Twi which can be used for downstream tasks in Natural Language Processing. Download from https://drive.google.com/drive/folders/1jkwLBkxJhnfVvf1yd7PyZw0nY8aNYaNN?usp=sharing Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.335.pdf Edition LREC 2020 -
Name Yorùbá and Twi WordSim-353 dataset Resource type Evaluation Data Size <Not Specified> Languages Twi, Yorùbá Production status Newly created-finished Resource usage Evaluation/Validation License <Not Specified> Conditions of use Freely Available Description This is the collection of the WordSim-353 word pairs dataset for Yorùbá and Twi. All the already existing WordSim-353 word pairs were translated from English to Yorùbá and Twi. Download from https://github.com/ajesujoba/YorubaTwi-Embedding Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.335.pdf Edition LREC 2020 -
Name Zamboanga Chabacano Spelling Errors Ontology Resource type Ontology Size <Not Specified> Languages Chavacano (cbk) Production status Newly created-finished Resource usage Corpus Creation/Annotation License <Not Specified> Conditions of use Freely Available Description A Zamboanga Chabacano spelling errors ontology in Web Ontology Language (OWL) format. Download from https://research.chavacano.org/cwzcc.owl Referring paper http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.327.pdf Edition LREC 2020