Sum-School-22

The summer school offers four four-day courses covering statistical modelling and corpus annotation. In addition, there will be four courses on special research topics (two days each): sign-language corpora, geo-localized data, syntax annotation, and eye-tracking. There are both theoretical and experimental courses covering the topic from cross-linguistic and cross-modular perspectives. Participants will have the opportunity to present their work as a talk or a poster at a workshop at the end of the first week.

time table

Week 1

time	Mo 19.09	Tu 20.09	We 21.09	Fr 23.09
8:45-9:30	8:45-9:15: registration; 9:15-9:30: introduction			Workshop "Corpus annotation and data analysis" (see program)
9:30-10:30	foundational course 1 (data analysis): Introduction to Inferential Statistics, by Maik Thalmann
10:30-11:00	coffee break
11:00-12:00	foundational course 2 (corpus data): Information- and discourse-structure analysis with questions under discussion, by Arndt Riester & Kordula De Kuthy
12:00-14:00	lunch break
14:00-15:00	foundational courses 1-2, practice sessions. Option 1: foundational course 1 (by M. Thalmann); Option 2: foundational course 2 (by A. Riester & K. De Kuthy)
15:00-15:30	coffee break
15:30-17:00	research topic 1 (sign languages): Using sign language corpora for linguistic research, by Marloes Oomen		research topic 2 (language & space): Language variation and geo-localized data, by Olga Kellert
17:30/18:00		17:30 Guided Tour		18:00 Linguists Tour
20:00/20:30		20:00 Dinner 1		20:30 Dinner 2

Week 2

time	Su 25.09	Mo 26.09	Tu 27.09	Th 29.09

9:30-10:30	excursion: Grimmwelt etc. in Kassel	free	foundational course 3 (data analysis): Introduction to generalized linear modeling for linguists, by Alexandra Lorson and Vinicius Macuch
10:30-11:00			coffee break
11:00-12:00			foundational course 4 (corpus annotation): Automatic methods for corpus-based linguistic research, by Stefanie Dipper
12:00-14:00			lunch break
14:00-15:00			foundational courses 3-4, practice sessions. Option 1: foundational course 3 (by A. Lorson and V. Macuch); Option 2: foundational course 4 (by S. Dipper)
15:00-15:30			coffee break
15:30-17:00			research topic 3 (eye tracking): Eye tracking with silent reading and visual world, by Daniele Panizza	research topic 4 (syntactic annotation): Tagged Corpora, Universal Dependencies Treebanking, and NLP with Deep Learning, by So Miyagawa
19:00				Dinner 3

foundational courses

Foundational Course 1: statistics

Introduction to Inferential Statistics

by Maik Thalmann (University of Göttingen)
19–22 September, lecture 9:30–10:30, practice 13:00-14:00
Course Information and Materials

Abstract
This class will take the t-test and the χ² test as its starting point; these, together with how to compute them in R, also form the assumed background knowledge. From there, we will work our way through other commonplace ways to analyze data from linguistic experiments and from corpus studies alike. Along the way, we will discuss not only how to perform these tests in R but also frequent problems analysts face when deciding which test to run and what kinds of parameters to use. Our central goal will be to equip attendees with the knowledge necessary for Alexandra Lorson and Vinicius Macuch’s class Introduction to generalized linear modeling for linguists in the second week of this summer school.

Requirements and materials
Requirements and materials will be offered at the following repository: https://mkthalmann.github.io/intro-stats/.

Foundational Course 2: corpus analysis

Information- and discourse-structure analysis with questions under discussion

by Arndt Riester (Universität Bielefeld) & Kordula De Kuthy (University of Tübingen)
19–22 September, lecture 11:00–12:00, practice 13:00-14:00
Course Information and Materials

Abstract
This class is an introduction to the QUD-tree framework (Riester, Brunetti and De Kuthy 2018). Determining the so-called *questions under discussion* of a discourse is a means to analyze both the information structure (i.e. focus-background divide) of the discourse segments and the overall topical organization of the discourse itself. QUD trees are, therefore, a means to represent discourse structure, different from but compatible with rhetorical relations. The class is also devoted to the practical tasks of annotating information-structural categories in texts from different languages and of distinguishing at-issue from non-at-issue segments.

Requirements/Preparation
In the class we will make use of the QUDA tool, which can be freely downloaded at: https://github.com/MMLangner/QUDA. Participants are kindly asked to install this tool on their laptops prior to the class, which should be possible based on the installation guidelines. The process is a bit cumbersome, and some people might experience technical problems. In this case, do not worry: we will address technical problems during the practice session. But please give it a try!

Materials
Additional materials will be offered at the following webpage: https://intro-qud.github.io/

Foundational Course 3: statistics

Introduction to generalized linear modeling for linguists

by Alexandra Lorson (University of Birmingham) and Vinicius Macuch (University of Birmingham)
27–30 September, lecture 9:30–10:30, practice 13:00-14:00
Course Information and Materials

Abstract
Welcome to the linear modelling course for the Göttingen Summer School!
In this workshop, you will learn how to analyse linguistic data with R, RStudio, and the tidyverse. We will introduce you to statistical models with an emphasis on the (generalized) linear model framework, including mixed models. More specifically, we will be covering topics such as interpreting interactions, fitting logistic and Poisson regression models, random intercepts and random slopes, as well as convergence issues. Importantly, you will also learn how to do data analysis in a way that is open and reproducible.

Requirements/Preparation
In this course we will be using R and RStudio, which you will have to install prior to the start of the course. Even if you are an experienced R user, it may still be good to re-install R, RStudio, and the specified R packages to make sure that we're all working with the most up-to-date versions.
If you are new to R, it would be worth checking out this free DataCamp course:
www.datacamp.com/courses/free-introduction-to-r
For keen beans, the following video will help you with installing R and RStudio and introduce you to the world of R:
https://www.youtube.com/watch?v=lVKMsaWju8w

Materials
We will provide you with more information on how to install R and RStudio in due time here:
https://osf.io/4cpyz/
This will also be the place where you can find the course materials (data files, resources etc.).

Foundational Course 4: corpus analysis

Automatic methods for corpus-based linguistic research

by Stefanie Dipper (Ruhr-University Bochum)
27–30 September, lecture 11:00–12:00, practice 13:00-14:00
Course information and Materials

Corpus linguistics aims at investigating linguistic research questions by means of digital corpora. An important step in corpus creation is corpus annotation, i.e. enriching the corpus with linguistic information such as parts of speech (e.g. nouns) or syntactic categories (e.g. NPs). Annotations thus allow us to search for complex phenomena, such as constituent order in main vs. subordinate clauses. This course is about creating annotations as well as using them in corpus searches, with a focus on tools that support (semi-)automatic annotation. Furthermore, selected aspects of the theoretical sessions are put into practice.

Materials

research topics

Research topic 1: sign languages

Using sign language corpora for linguistic research

by Marloes Oomen (University of Amsterdam)
19–20 September, 15:30–17:00
Course Information and Materials

Abstract
Recent years have seen a boom in corpus-based linguistic research on sign languages, following the creation of multiple sign language corpora. These corpora are invaluable for the documentation of sign languages as well as continued linguistic research on their grammatical structure and degree of intra-linguistic variation. At the same time, given the relatively small size of all existing sign language corpora, the time-consuming annotation process, and the present scarcity of automatic annotation tools, using sign language corpora for linguistic research involves different challenges than working with (large) (majority) spoken language corpora. We will look at examples from the literature to explore what types of research questions are ideally tackled by corpus-based studies on sign languages. You will then get some hands-on experience working with data from the German Sign Language Corpus. Using ELAN Linguistic Annotator, you will create and analyze your own annotations in a small subset of the data in this corpus to answer a basic research question. Familiarity with sign languages and sign language linguistics is not required.

Requirements/Preparation
In this course, we will be using ELAN Linguistic Annotator, which you will have to install before the course starts. The latest version can be downloaded here: https://archive.mpi.nl/tla/elan. While you’re at it, feel free to peruse the full manual and/or how-to guide under the ‘documentation’ tab.
We will be working with data from the German Sign Language Corpus (DGS Corpus), available at https://www.sign-lang.uni-hamburg.de/meinedgs/ling/start_en.html. I recommend reading the general information on the homepage prior to the start of the course.

Materials
Some 50 hours of material from the DGS Corpus, including transcription files, are freely available on the DGS Corpus website. A selection of these data will be used during the course and will be made available to you in due time.

Research topic 2: language and space

Language variation and geo-localized data

by Olga Kellert (University of Göttingen)
21–22 September, 15:30-17:00
Course Information and Materials

Abstract

This class will give an introduction to geolinguistics, a field at the intersection of linguistics and geography. We will look into social media text messages that are associated with location information. This location information can be expressed by an exact address, e.g. Humboldtallee 19, Göttingen. We will first look at language distribution in multilingual countries to trace language borders, e.g. where French is spoken in Belgium and where Dutch is spoken in the same country. We will do the same with multilingual cities such as the city of New York and see what languages are spoken (the most) in Chinatown and other parts of Manhattan, The Bronx, and Queens. We will then look at the distribution of regional language varieties or dialects and trace the border between the dialectal words Semmel ‘bread roll’ and Brötchen ‘bread roll’ in German. Finally, we will visualise the distribution of loanwords from indigenous languages in South America (e.g. pucho ‘cigarette’, cancha ‘pitch’, palta ‘avocado’, etc.). Students will learn the techniques required to visualise language and dialect distribution on geographic maps.

Materials and recommendations

In this particular course, we are going to learn how to work with geolocation associated with natural language data from social media platforms like Twitter.

I recommend getting familiar with geolocation information on Twitter:

https://www.tweetbinder.com/blog/twitter-geolocation-map/
https://developer.twitter.com/en/developer-terms/geo-guidelines
the introduction of the article at https://par.nsf.gov/servlets/purl/10106302

For application of geolocation in linguistics, I recommend reading at least Mocanu et al. 2013:

Bland, Justin & Terrel A. Morgan, 2020, ‘Geographic variation of voseo on Spanish Twitter’. In Guillermo Lorenzo (ed.), Issues in Hispanic and Lusophone Linguistics 27, 7–38. John Benjamins.
Gonçalves, Bruno & David Sánchez, 2014, ‘Crowdsourcing dialect characterization through Twitter’, PloS ONE 9: e112074.
Grieve, Jack; Chris Montgomery; Andrea Nini; Akira Murakami & Diansheng Guo, 2019, ‘Mapping lexical dialect variation in British English Using Twitter’, Front. Artif. Intell. 2(11). doi: 10.3389/frai.2019.00011.
Lansley G, Longley PA, 2016, ‘The geography of Twitter topics in London.’ Comput Environ Urban Syst. 58:85–96.
Leemann, Adrian; Marie-José Kolly; Ross Purves; David Britain & Elvira Glaser, 2016, ‘Crowdsourcing Language Change with Smartphone Applications’, PLoS ONE 11(1): e0143060. doi: 10.1371/journal.pone.0143060.
Levy Abitbol, Jacob; Márton Karsai; Jean-Philippe Magué; Jean-Pierre Chevrot & Eric Fleury, 2018, ‘Socioeconomic dependencies of linguistic patterns in Twitter: a multivariate analysis’, Proceedings of the 2018 World Wide Web Conference WWW’18, 1125–1134.
Mocanu, Delia; Baronchelli, Andrea; Perra, Nicola; Gonçalves, Bruno; Zhang, Qian & Vespignani, Alessandro, 2013, The Twitter of Babel: Mapping World Languages through Microblogging Platforms. PLoS ONE 8(4): e61981. DOI: 10.1371/journal.pone.0061981
Wieling M, Nerbonne J, Baayen RH, 2011, Quantitative Social Dialectology: Explaining Linguistic Variation Geographically and Socially. PLoS ONE 6(9): e23613. https://doi.org/10.1371/journal.pone.0023613

For the maps and software packages we will use for geolocation analysis, I recommend studying:

OpenStreetMap. OpenStreetMap License; 2017. Available from: http://wiki.openstreetmap.org/wiki/Open_Database_License.
Met Office. 2010–2015. Cartopy: a cartographic Python library with a Matplotlib interface. Online: https://scitools.org.uk/cartopy/ (last access: 24.01.2022).
https://developer.twitter.com/en/docs/tutorials/filtering-tweets-by-location

Research topic 3: eye tracking

An overview of eye tracking experiments with silent reading and the visual world paradigm

by Daniele Panizza (University of Göttingen)
27–28 September, 15:30-17:00
Course information and Materials

In this course we will give a brief overview of the core dynamics of the eye tracking methodology as applied to psycholinguistic investigations. By presenting some experimental works employing silent reading during eye tracking (first part) and the visual world paradigm (second part), we will discuss the building blocks of an eye tracking study: how it is implemented, how the results can be analysed, what hypotheses it is geared to test, and what the main motivations behind choosing this technique are.

Eye tracking, Day 1: reading

Eye tracking, Day 2: visual world

Research topic 4: syntactic annotation

Introduction to Computational Corpus Linguistics: Making Tagged Corpora, Universal Dependencies Treebanking, and Natural Language Processing with Deep Learning via BERT

by So Miyagawa (National Institute for Japanese Language and Linguistics)
29–30 September, 15:30-17:00
Course information and Materials

This course introduces participants to the world of computational corpus linguistics. In particular, it focuses on cutting-edge technologies and frameworks in this field, such as Universal Dependencies and BERT (or RoBERTa, T5, etc.). Participants will build and analyze corpora on their own computers, so they need to bring laptops. This course includes many hands-on sub-sessions in groups. The groups will present their results at the end of the course.
Participants will acquire knowledge of the history of computational corpus linguistics and of current trends in computational corpus linguistics and related fields such as natural language processing. Students will become accustomed to using and making linguistically tagged corpora, Universal Dependencies treebanks, and natural language processing tools such as Transformer and BERT models.

Slides at So's homepage

workshop

A workshop on "Corpus annotation and data analysis" will take place on Friday 23.09 and Saturday 24.09, hosting talks by invited speakers and talks/poster presentations by the participants of the summer school.

Download program and abstracts

social events

Guided Tour through Göttingen on Tu 20.09, 17:30-19:30
(free of charge)

Meeting point, 17:30: in front of the tourist information at the Gänseliesel, Markt 8,
Route in Googlemaps

Dinner on Tu 20.09, 20:00: Villa Cuba
(at your own expense)

Zindelstr. 2 | 37073 Göttingen
https://www.villacuba.de/
Route in Googlemaps

Linguists Tour through Göttingen on Fr 23.09, 18:00
(free of charge)

Meeting point in front of the observatory

Dinner on Fr 23.09, 20:30: Gamie
(at your own expense)

Weender Str. 29 | 37073 Göttingen
https://gamie-restaurant.de/
Route in Googlemaps

Excursion to the Grimmwelt, etc., Kassel, on Su 25.09
(at your own expense)

Further details will be announced during the summer school.

Dinner on Th 29.09, 19:00: Le Feu
(at your own expense)

Weender Landstraße 23 | 37073 Göttingen
https://www.lefeu.de/le-feu-flammkuchen-goettingen/
Route in Googlemaps

announcements

Communication

You may also use the e-mail address of the summer school (expired) during the event.
The summer school desk is open every day during the morning break, 10:30-11:00.
Talk to Nermin Gürkan; she coordinates our group and will pass your request on to the right person.

Covid-19 regulations

Current covid-19 regulations at the University of Göttingen: level 0

"...masks are optional in buildings and at other events. From 13 June 2022, there is just a recommendation to wear a mask at teaching events and committee meetings. Until the end of the semester, we still recommend wearing a mask as well as getting tested regularly at Campus Covid Screen (CCS)."

Beyond the summer school

Student Life in Göttingen (University website)
City of Göttingen Event Calendar: festivals, events, music, cinema, theater, galleries, museums, nightlife...
