By Kevin Scannell, Saint Louis University
On Monday, August 11th, 9:00
The Crúbadán project began more than 10 years ago as an effort to develop useful computational resources for speakers of indigenous and minority languages: keyboard input methods, spelling and grammar checkers, dictionaries and thesauri. The foundational idea was to crawl the web, create large monolingual text corpora in these languages, and then use techniques from statistical language processing to do the rest. To date, we've created text collections for almost 2000 languages, and have used these in collaboration with native speakers to produce end-user resources like those above for about 150 languages. Along the way, we've had to come to grips with issues that are now well-known to researchers working with web data: copyright difficulties, attribution and data provenance, metadata (including proper language identification), encoding and font issues, dynamically updated content, non-standard orthographies and other types of "noise" only found on the web. I will discuss some of these issues, and also talk about the ways we've tried to "add value" to texts found on the web so that they are (1) easily discoverable, (2) suitable for linguistic research, and (3) usable for a variety of NLP applications.
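One of the statistical techniques relevant to crawled corpora of this kind is character n-gram language identification. As a minimal sketch only (this is the classic Cavnar-Trenkle out-of-place method, not the project's actual implementation, and the tiny training samples and language codes below are made up for illustration):

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Return character n-grams of a text, padded at the edges."""
    padded = f"  {text.lower()}  "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train_profile(samples, n=3, top=300):
    """Build a ranked n-gram profile from training texts."""
    counts = Counter()
    for s in samples:
        counts.update(char_ngrams(s, n))
    return [g for g, _ in counts.most_common(top)]

def out_of_place(profile, text, n=3):
    """Out-of-place distance between a language profile and a text;
    lower means closer. Unknown n-grams get a fixed penalty."""
    doc_profile = train_profile([text], n)
    rank = {g: i for i, g in enumerate(profile)}
    penalty = len(profile)
    return sum(rank.get(g, penalty) for g in doc_profile)

def identify(text, profiles):
    """Pick the language whose profile is closest to the text."""
    return min(profiles, key=lambda lang: out_of_place(profiles[lang], text))

# Toy profiles from a couple of sentences each (real systems use far more data).
profiles = {
    "en": train_profile(["the quick brown fox jumps over the lazy dog",
                         "this is a sample of english text"]),
    "ga": train_profile(["tá an madra ag rith sa pháirc",
                         "is teanga cheilteach í an ghaeilge"]),
}
print(identify("tá an ghaeilge go maith", profiles))  # prints ga
```

With realistic amounts of training text per language, the same scheme scales to hundreds of languages, which is why it is a common baseline for tagging crawled web pages by language.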
By Peter Grasch, Open Speech Initiative
On Wednesday, August 13th, 9:00
In this workshop, you will learn how best to design speech recognition software for any given use case. To this end, the course will teach you the fundamentals of speech recognition, survey the available free-software landscape, and show you how to use those tools to build a state-of-the-art speech recognition system that truly works, no matter the conditions.
On Thursday, August 14th, 9:00
An overview of the main challenges facing minoritized varieties and minority languages, and of how open-source NLP tools can address them. We will use modern Galician-Portuguese as an example.
By Dorothee Beermann, Norwegian University of Science and Technology/Trondheim, and Pavel Mihaylov
On Friday, August 15th, 9:00
As Twitter has it, “#ELAN and #TypeCraft (are) two tools that have been around for a while but are holding up really well” (#dh2012). TypeCraft development started almost 10 years ago in the context of a North-South cooperation between the Norwegian University of Science and Technology and the University of Ghana, Legon. It began as a close cooperation between user communities in Africa and a team of developers in the North, and this close cooperation between linguists and developers has remained one of the main characteristics of TypeCraft development.
TypeCraft is a tool that allows its users to create, retrieve, and exchange Interlinear Glossed Text online, in a setting that also supports collaborative editing and the exchange of information relating to data-oriented linguistic work. Indeed, a very direct approach to drawing on the expertise of language communities in the development of structured linguistic data is at the heart of the TypeCraft application. In our presentation, we will talk about the challenges we faced during TypeCraft development and describe the solutions we found. We will address the following topics: (1) the role of the encodings, keyboards, and fonts needed to handle multilingual data online; (2) “My text is gone!”: the needs of different user communities and their translation into software specifications; (3) annotation of uncertainty: how can we combine standardisation and linguistic exploration to keep alive a tool that supports ongoing linguistic developments?
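The encoding issues under topic (1) often come down to Unicode normalization: the same character can arrive as one precomposed code point or as a base letter plus a combining mark, depending on the keyboard or input method used. A minimal Python sketch of that pitfall (an illustrative example only, assuming nothing about TypeCraft's actual internals):

```python
import unicodedata

# The same visible character, ọ (o with dot below), in two byte forms:
composed = "\u1ECD"       # precomposed: one code point
decomposed = "o\u0323"    # decomposed: o + combining dot below

# Naive string comparison treats them as different tokens,
# so searches and duplicate checks silently fail.
assert composed != decomposed

def normalize_token(token: str) -> str:
    """Normalize input to NFC before storing, indexing, or comparing."""
    return unicodedata.normalize("NFC", token)

# After normalization the two inputs compare equal.
assert normalize_token(composed) == normalize_token(decomposed)
```

Normalizing consistently at the point of entry is one common way to keep data from different keyboards and fonts comparable in a multilingual online database.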
Throughout the presentation we will discuss linguistic as well as technical challenges, and argue that balancing linguistic and technical needs is central to linguistic tool development.