Sjur Moshagen

Also published as: Sjur N. Moshagen, Sjur Nørstebø Moshagen


2023

pdf bib
GiellaLT — a stable infrastructure for Nordic minority languages and beyond
Flammie Pirinen | Sjur Moshagen | Katri Hiovain-Asikainen
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

Long term language technology infrastructures are critical for continued maintenance of language technology based software that is used to support the use of languages in digital world. In Nordic area we have languages ranging from well-resourced national majority languages like Norwegian, Swedish and Finnish as well as minoritised, unresourced and indigenous languages like Sámi languages. We present an infrastructure that has been build in over 20 years time that supports building language technology and tools for most of the Nordic languages as well as many of the languages all over the world, with focus on Sámi and other indigenous, minoritised and unresourced languages. We show that one common infrastructure can be used to build tools from keyboards and spell-checkers to machine translators, grammar checkers and text-to-speech as well as automatic speech recognition.

2022

pdf bib
Unmasking the Myth of Effortless Big Data - Making an Open Source Multi-lingual Infrastructure and Building Language Resources from Scratch
Linda Wiechetek | Katri Hiovain-Asikainen | Inga Lill Sigga Mikkelsen | Sjur Moshagen | Flammie Pirinen | Trond Trosterud | Børre Gaup
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Machine learning (ML) approaches have dominated NLP during the last two decades. From machine translation and speech technology, ML tools are now also in use for spellchecking and grammar checking, with a blurry distinction between the two. We unmask the myth of effortless big data by illuminating the efforts and time that lay behind building a multi-purpose corpus with regard to collecting, mark-up and building from scratch. We also discuss what kind of language technology minority languages actually need, and to what extent the dominating paradigm has been able to deliver these tools. In this context we present our alternative to corpus-based language technology, which is knowledge-based language technology, and we show how this approach can provide language technology solutions for languages being outside the reach of machine learning procedures. We present a stable and mature infrastructure (GiellaLT) containing more than hundred languages and building a number of language technology tools that are useful for language communities.

pdf bib
Building Open-source Speech Technology for Low-resource Minority Languages with SáMi as an Example – Tools, Methods and Experiments
Katri Hiovain-Asikainen | Sjur Moshagen
Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages

This paper presents a work-in-progress report of an open-source speech technology project for indigenous Sami languages. A less detailed description of this work has been presented in a more general paper about the whole GiellaLT language infrastructure, submitted to the LREC 2022 main conference. At this stage, we have designed and collected a text corpus specifically for developing speech technology applications, namely Text-to-speech (TTS) and Automatic speech recognition (ASR) for the Lule and North Sami languages. We have also piloted and experimented with different speech synthesis technologies using a miniature speech corpus as well as developed tools for effective processing of large spoken corpora. Additionally, we discuss effective and mindful use of the speech corpus and also possibilities to use found/archive materials for training an ASR model for these languages.

2019

pdf bib
Is this the end? Two-step tokenization of sentence boundaries
Linda Wiechetek | Sjur Nørstebø Moshagen | Thomas Omma
Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages

pdf bib
Seeing more than whitespace — Tokenisation and disambiguation in a North Sámi grammar checker
Linda Wiechetek | Sjur Nørstebø Moshagen | Kevin Brubeck Unhammer
Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

2018

pdf bib
Modeling Northern Haida Verb Morphology
Jordan Lachler | Lene Antonsen | Trond Trosterud | Sjur Moshagen | Antti Arppe
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib
A Morphological Parser for Odawa
Dustin Bowers | Antti Arppe | Jordan Lachler | Sjur Moshagen | Trond Trosterud
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages

2014

pdf bib
Modeling the Noun Morphology of Plains Cree
Conor Snoek | Dorothy Thunder | Kaidi Lõo | Antti Arppe | Jordan Lachler | Sjur Moshagen | Trond Trosterud
Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages

2013

pdf bib
Building an Open-Source Development Infrastructure for Language Technology Projects
Sjur N. Moshagen | Tommi Pirinen | Trond Trosterud
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

2007

pdf bib
Usage of XSL Stylesheets for the Annotation of the Sámi Language Corpora.
Saara Huhmarniemi | Sjur N. Moshagen | Trond Trosterud
Proceedings of the Linguistic Annotation Workshop

1996

pdf bib
A Sign Expansion Approach to Dynamic, Multi-Purpose Lexicons
Jon Atle Gulla | Sjur Nørstebø Moshagen
COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics