Local Language Portal

Lexical Resources

Path Nirvana Sinhala TTS Dataset

A high-quality Sinhala dataset for text-to-speech training, designed specifically for deep learning algorithms.

The dataset can be used to build new Sinhala TTS voices with deep learning algorithms and is available at:
https://github.com/pathnirvana/sinhala-tts-dataset

LANGUAGE TECHNOLOGY RESEARCH LABORATORY – UCSC

10 Million word contemporary Sinhala text corpus for language research
The UCSC mini corpus contains 10 million Sinhala words collected from Sinhala newspaper articles. There are around 135,000 distinct words in the corpus, which comprises 2,794 text files containing editorials, feature articles, foreign news and sports news.
To download

100K word English, Sinhala parallel corpus
The English-Sinhala parallel corpus is intended for language researchers working on English-Sinhala machine translation. The corpus contains 4,301 English sentences along with their corresponding Sinhala translations.
To download

500k Sinhala tagged corpus
The UCSC tagged corpus contains 500K words, manually tagged by Sinhala linguists using the UCSC Sinhala POS tagset (version 1). Words that do not belong to any defined tag are tagged with a question mark (?).
To download
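
As a rough illustration of how the question-mark convention might be used, the sketch below counts tags and reports how many tokens carry the unknown tag. The `word/TAG` token layout, the separator, and the tag names in the sample are assumptions made for this example only; the actual file format of the corpus is not described here and should be checked against the download.

```python
from collections import Counter


def count_tags(lines, sep="/", unknown="?"):
    """Count POS tags in lines of whitespace-separated word<sep>TAG tokens.

    The word<sep>TAG layout is an assumed format, not the documented one.
    Returns (tag_counts, number_of_tokens_tagged_unknown).
    """
    counts = Counter()
    for line in lines:
        for token in line.split():
            # Split on the last separator so words containing it stay intact.
            word, found, tag = token.rpartition(sep)
            if found and word:
                counts[tag] += 1
    return counts, counts[unknown]


# Hypothetical sample; NN and VB are placeholder tag names.
sample = ["ගස/NN වැටුණි/VB", "abc/? ./."]
tag_counts, unknown_total = count_tags(sample)
print(tag_counts)
print("tokens with no defined tag:", unknown_total)
```

Tokens without a separator are skipped rather than guessed at, so malformed lines do not inflate the counts.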

1300 word Sinhala WordNet for language technology improvement
The UCSC Sinhala WordNet (version 1) contains 1,075 word senses. Each sense includes synsets along with the corresponding English word, the Princeton ID for the synset, the POS category and the gloss.
To download

UCSC Sinhala POS tagset
A part-of-speech tagset for Sinhala (version 1). It defines 28 word class tags, not counting punctuation marks. Each punctuation mark is treated as a separate tag of its own.
To download

List of proper names for language research
A list of Sinhala proper names, including country names, Sinhala personal names, names of Sri Lankan and international cities, names of Sinhala artists, and names of Sri Lankan rivers and reservoirs. There are currently around 20,800 proper name entries.
To download

Named Entity Tagged Corpus
The Sinhala Named Entity Tagged Corpus consists of around 83K words in which person names, location names and organization names have been tagged as named entities.
To download

List of Sinhala Functional Words
A list of 425 Sinhala function words, covering conjunctions, determiners, interjections, particles and postpositions.
To download

Ingiya English-Sinhala dictionary database
The English-Sinhala dictionary database used in the Ingiya English-Sinhala dictionary add-on. This database consists of ≈36,000 English word entries and their corresponding Sinhala meanings.
To download

400K Distinct word list
A list of 400K distinct words extracted from the UCSC Sinhala text corpus.
To download
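
For readers who want to derive a similar list from their own text, here is a minimal sketch of extracting distinct words with frequencies. It assumes simple whitespace tokenization, which is an assumption of this example and not necessarily how the UCSC list was produced; real corpus text would likely also need punctuation handling and normalization.

```python
from collections import Counter


def distinct_words(texts):
    """Return a Counter mapping each whitespace-separated word to its frequency."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts


# Hypothetical in-memory documents standing in for corpus files.
docs = ["අද හොඳ දවසක්", "අද වැසි"]
vocab = distinct_words(docs)
print("distinct words:", len(vocab))
print("most frequent:", vocab.most_common(1))
```

Keeping the frequencies alongside the distinct forms makes it easy to filter out rare tokens, which are often typos in newspaper-derived text.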

Speech corpora for Sinhala speech processing
Female voice corpus

Speech corpus with 3,000 Sinhala utterances spoken by a single female speaker. This corpus was initially designed to build an Automatic Speech Recognition (ASR) system for Sinhala. The utterances were selected to cover the most frequently used words in Sinhala.

Male voice corpus

Speech corpus with 625 Sinhala utterances spoken by a single male speaker. This corpus was initially designed to build a Text to Speech (TTS) system for Sinhala.

2000 voice corpus

Speech corpus with 74,000 Sinhala utterances spoken by various male and female speakers in different age groups. This corpus was initially designed to build a song request application for mobile phones.

Sinhala NEWS Corpus

A speech corpus with 8,000 utterances of recorded Sinhala news from both male and female announcers. This project is still ongoing.
To download