COMPUTATIONAL LINGUISTICS

Enrollment year

2018/2019

Academic year

2020/2021

Regulations

DM270

Academic discipline

L-LIN/01 (GLOTTOLOGY AND LINGUISTICS)

Department

DEPARTMENT OF HUMANITIES

Course

HUMANITIES

Curriculum

COMMON TRACK

Year of study

3°

Period

1st semester (28/09/2020 - 23/12/2020)

ECTS

6

Lesson hours

36 lesson hours

Language

Italian

Activity type

ORAL TEST

Teacher

JEZEK ELISABETTA (course holder) - 6 ECTS

Prerequisites

Familiarity with basic notions of general linguistics, which will be reviewed in class at the beginning of the course.

Learning outcomes

The automatic analysis of texts is today essential for research purposes in the social sciences and the humanities and for applications of various kinds, from automatic translation, to opinion mining, to the construction of conversational agents.
The course introduces the fundamental concepts, methodologies and tools of computational linguistics and natural language processing, providing students with skills to automatically or semi-automatically analyze textual data of various kinds (literary, historical, scientific, socio-political, journalistic). It also provides the methodological basis of the linguistic annotation of texts for supervised machine learning.

Course contents

The course is an updated and comprehensive introduction to the fundamentals of computational linguistics and natural language processing.

It will cover the following topics:

- Linguistic fundamentals for computational analysis of texts
- Basics of statistics
- Natural language processing
- Machine learning methods
- Data annotation for machine learning
- Tasks in natural language processing, with an in-depth look at the automatic recognition of proper names (Named Entity Recognition), the automatic identification of temporal information and event types (Temporal Information Extraction and Event Detection), and the automatic detection of opinions in texts (Opinion Mining, Stance Detection and Sentiment Analysis).

Two lectures with an interdisciplinary approach will focus on the topic of "Machine Learning for the Social Sciences and the Humanities" and will be held in English.

The course will include a lab component with assignments on the automatic analysis of texts. Students will learn to use the command-line interface to manipulate texts and to use a selection of automatic tools to extract linguistic information. In particular, the following tools will be introduced: UDPipe, Tint, a script for lexicon-based Sentiment Analysis, and the basic functions of the Natural Language Toolkit (NLTK). NLTK requires basic programming skills in Python, which students will need to acquire quickly during the first week of class.
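As a rough illustration of what lexicon-based sentiment analysis involves, the sketch below scores a text by summing the polarities of the words it contains. It is not the script used in the lab: the tiny `LEXICON` and the function names `tokenize`, `sentiment_score` and `label` are hypothetical and chosen only for this example, and only the Python standard library is used.

```python
import re

# Toy polarity lexicon (hypothetical; real lexicons contain thousands of entries).
LEXICON = {"good": 1, "excellent": 2, "bad": -1, "terrible": -2}


def tokenize(text):
    """Lowercase the text and return its sequences of letters as tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def sentiment_score(text):
    """Sum the polarity of every token found in the lexicon (unknown words count as 0)."""
    return sum(LEXICON.get(token, 0) for token in tokenize(text))


def label(text):
    """Map the numeric score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for sentence in ["An excellent film", "The plot was good but the ending was terrible"]:
        print(sentence, "->", label(sentence))
```

A sketch of this kind ignores negation, intensifiers and word sense, which is one reason the course also covers annotation and machine learning methods.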

Teaching methods

Face-to-face interactive lectures.
Slides.
Lab with group activities.

Recommended or required readings

Readings:

Jezek, Elisabetta 2016. The Lexicon: An Introduction, Oxford, Oxford University Press. Ch 1 "Basic Notions".

Jurafsky, Dan, and James H. Martin. 2018. Speech and language processing. Ed. 3. URL: https://web.stanford.edu/~jurafsky/slp3/17.pdf - Ch 6, "Vector Semantics".

Jurafsky, Dan, and James H. Martin. 2018. Speech and language processing. Ed. 3. URL: https://web.stanford.edu/~jurafsky/slp3/17.pdf - Ch 17.1 "Named Entity Recognition".

Liu, B. 2012. Sentiment analysis and opinion mining. Synthesis lectures on human language technologies, 5(1), 1-167. Ch 1 "Sentiment Analysis: A Fascinating Problem" and Ch 2 "The Problem of Sentiment Analysis". Available through Linkup, University of Pavia, https://www.morganclaypool.com/action/ssostart?redirectUri=%2F.

Lu, X., 2014. Computational methods for corpus annotation and analysis. Dordrecht, Springer. Ch 2 "Text Processing with the Command Line Interface".

Straka, Milan, Jan Hajic, and Jana Straková. 2016. UDPipe: Trainable Pipeline for Processing CoNLL-U Files Performing Tokenization, Morphological Analysis, POS Tagging and Parsing. In Proceedings of LREC 2016. URL: http://ufal.mff.cuni.cz/~straka/papers/2016-lrec_udpipe.pdf

Pustejovsky, James, José M. Castano, Robert Ingria, Roser Sauri, Robert J. Gaizauskas, Andrea Setzer, Graham Katz, and Dragomir R. Radev. 2003. "TimeML: Robust specification of event and temporal expressions in text." New directions in question answering 3: 28-34. URL: https://www.aaai.org/Papers/Symposia/Spring/2003/SS-03-07/SS03-07-005.pdf

Pustejovsky, J. and A. Stubbs. 2012. Natural Language Annotation for Machine Learning. O'Reilly Media. Ch 7 "Training: Machine Learning".

Natural Language ToolKit. URL: https://towardsdatascience.com/introduction-to-natural-language-processing-for-text-df845750fb63

Optional:

Hürriyetoğlu, A., Zavarella, V., Tanev, H., Yörük, E., Safaya, A. and Mutlu, O., 2020. Automated extraction of socio-political events from news (AESPEN): Workshop and shared task report. Proceedings of AESPEN 2020, Language Resources and Evaluation Conference (LREC 2020), Marseille, 11–16 May 2020, p. 1-6. https://www.aclweb.org/anthology/2020.aespen-1.1.pdf

Additional readings will be introduced in class and made available on the KIRO platform online.

Assessment methods

Final oral exam covering material from the entire course.
Final assignment (5 pages, including references but excluding tables and figures) reporting the results of an in-depth investigation, carried out with the tools introduced in class, of a linguistic phenomenon (morphological, syntactic, semantic, lexical or discourse) or of a historical, cultural or social phenomenon studied through linguistic analysis. The topic must be agreed in advance during office hours. The text must be sent to jezek@unipv.it 7 days before the exam.

Further information

Material for the course is available on the KIRO platform (access with personal username and password).
