GeoEDdA: A Gold Standard Dataset for Geo-semantic Annotation of Diderot & d’Alembert’s Encyclopédie

March 24, 2024

Talk, Second International Workshop on Geographic Information Extraction from Texts (GeoExT), Glasgow, Scotland, UK

Authors: Ludovic Moncla, Denis Vigier, Katherine McDonough
Abstract: This paper describes the methodology for creating GeoEDdA, a gold standard dataset of geo-semantic annotations from entries in Diderot and d’Alembert’s eighteenth-century Encyclopédie. Aiming to explore spatial information beyond toponyms identified with the commonly used Named Entity Recognition (NER) task, we test the newer span categorization task as an approach for retrieving complex references to places, generic spatial terms, other entities, and relations. We test an active learning method, using the Prodigy web-based tool to iteratively train a machine learning span categorization model. The resulting dataset includes labeled spans from 2,200 paragraphs. As a preliminary experiment, a custom spaCy spancat model demonstrates strong overall performance, achieving an F-score of 86.42%. Evaluations for each span category reveal strengths in recognizing spatial entities and persons (including nominal entities, named entities and nested entities).
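As a rough illustration of how an F-score such as the one reported above can be computed for span categorization, the sketch below scores predicted spans against gold spans by exact match on (start, end, label). The label names and the example spans are hypothetical, and exact-match scoring is an assumption; the paper's own evaluation setup may differ.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations (token offsets). Unlike classic NER, span
# categorization allows nested spans such as (0, 3) and (1, 3).
gold = {(0, 3, "Spatial"), (1, 3, "Spatial"), (5, 6, "Person")}
pred = {(0, 3, "Spatial"), (1, 3, "Spatial"), (7, 8, "Person")}

precision, recall, f1 = span_prf(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Representing spans as a set keeps nested and overlapping spans as separate items, which is the main difference from token-level NER scoring.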

PhD jury member

January 24, 2024

PhD jury member, INRIA, Sophia-Antipolis, France

Jury member for Lucie Cadorel’s Ph.D. Defense at INRIA Sophia-Antipolis, France.

7th ACM SIGSPATIAL International Workshop on Geospatial Humanities

November 13, 2023

Workshop organization, 7th ACM SIGSPATIAL International Workshop on Geospatial Humanities, Hamburg, Germany

Following the success of previous editions, this workshop is concerned with the use of geographic information systems and other spatial technologies in humanities research, placing an emphasis on new methodologies that leverage these technical developments. The standard tools from geographic information systems, as well as more advanced methods such as text- and image-based geographical analysis or spatial simulation, can all benefit from innovative approaches leveraging machine learning, parallel and/or distributed computation, semantic technologies, etc. on humanities sources like archival manuscripts, maps, encyclopedias, newspapers, correspondence collections and more. These kinds of documents pose new challenges for identifying and analyzing spatial information. The workshop aims to bring together researchers and practitioners from different sub-fields of computer science and the geographical information sciences interested in the application of spatial methods and technology to the humanities to discuss how to address these issues in ways that generate new knowledge in multiple disciplines. Participants will demonstrate their contributions and explore how modern GIS and other technologies can inform, and be inspired by, the digital humanities.
Organized by Ludovic Moncla, Bruno Martins, Katherine McDonough, and Xuke Hu

Perdido: Python library for geoparsing and geocoding French texts

April 02, 2023

Talk, First International Workshop on Geographic Information Extraction from Texts (GeoExT), Dublin, Ireland

Authors: Ludovic Moncla, Mauro Gaio
Abstract: This paper introduces the Perdido Python library for geoparsing and geocoding French texts. The architecture of the Perdido Geoparser, which includes three layers: back-office, API, and Python library, is outlined. We also provide details on the methods used in the development of the processing chain and the various tasks covered, such as named entity recognition and classification (NERC), and toponym resolution. Lastly, we showcase the different features of the Python library and explain how to use it. The library is built as an overlay using API services, enabling users to manipulate, visualize, and export the results of geoparsing and geocoding. A Jupyter notebook is also provided to demonstrate all the functionalities implemented in the library.

6th ACM SIGSPATIAL International Workshop on Geospatial Humanities

November 02, 2022

Workshop organization, 6th ACM SIGSPATIAL International Workshop on Geospatial Humanities, Seattle, WA, USA

Following the success of previous editions, this workshop is concerned with the use of geographic information systems and other spatial technologies in humanities research, placing an emphasis on new methodologies that leverage these technical developments. The standard tools from geographic information systems, as well as more advanced methods such as text- and image-based geographical analysis or spatial simulation, can all benefit from innovative approaches leveraging machine learning, parallel and/or distributed computation, semantic technologies, etc. on humanities sources like archival manuscripts, maps, encyclopedias, newspapers, correspondence collections and more. These kinds of documents pose new challenges for identifying and analyzing spatial information. The workshop aims to bring together researchers and practitioners from different sub-fields of computer science and the geographical information sciences interested in the application of spatial methods and technology to the humanities to discuss how to address these issues in ways that generate new knowledge in multiple disciplines. Participants will demonstrate their contributions and explore how modern GIS and other technologies can inform, and be inspired by, the digital humanities.
Organized by Ludovic Moncla, Bruno Martins, and Katherine McDonough

ANF TDM CNRS 2022 training course

October 05, 2022

Talk, CNRS, Paris, France

Workshop on Python libraries and web services for named entity recognition and toponym resolution, organized as part of the CNRS ANF TDM 2022 training course (document exploration and information extraction).

The training materials are available here: https://gitlab.liris.cnrs.fr/lmoncla/tutoriel-anf-tdm-2022-python-geoparsing

Overview:
This workshop presents the use of Python libraries (namely NLTK, spaCy and Stanza) and web services (namely PERDIDO) for extracting named entities from texts. We focus in particular on detecting place names and locating them on a map. We highlight how easy these tools are to use, as well as their limitations.
Program:

  • Introduction to and comparison of different NER tools: Python libraries (NLTK, spaCy and Stanza) and web services (Perdido)
  • Selecting tools according to the corpus (nature of the texts, choice of language, etc.)
  • Experiments carried out on two use cases: hiking descriptions and encyclopedia articles
  • Online notebook (Google Colab) for developing easy-to-use, intuitive application prototypes in Python

Tutorial - Natural Language Processing (NLP) for historical texts

June 23, 2022

Tutorial, online

Materials for the SunoikisisDC Summer 2022 Course on Natural Language Processing (NLP) for historical texts (Session 9)

Tutorial: https://github.com/ludovicmoncla/SunoikisisDC-Summer2022-Session9
YouTube link: https://youtu.be/7NK2KyP2BYs

In this tutorial, we demonstrate how to use a custom version of the Perdido geoparser Python library developed in the framework of the GEODE project. We will use texts from Diderot and d’Alembert’s Encyclopédie as a case study for querying a corpus and wrangling geoparsed data. We will also compare Perdido’s NER annotations (i.e. its output) to the results of other well-known Python NER libraries (spaCy and Stanza).
Organized by Ludovic Moncla and Katherine McDonough

Seminar at the ICAR laboratory (ENS Lyon)

January 17, 2022

Talk, online

ICAR laboratory seminar on “Combining qualitative and quantitative approaches for the detection and classification of named entities in Diderot and d’Alembert’s Encyclopédie (1751-1772)”

Seminar at the ERIC laboratory (Lyon)

November 16, 2020

Talk, online

ERIC laboratory seminar on NLP and machine learning applied to geoparsing and the geo-semantic analysis of texts.

Workshop GAST 2020

January 18, 2020

Workshop organization, Conférence Extraction et Gestion des Connaissances (EGC) 2020, Brussels, Belgium

The sixth workshop on the Management and Analysis of Spatial and Temporal Data (Gestion et Analyse des données Spatiales et Temporelles, GAST) will be held at EGC 2020. Building on the GAST working group, the workshop aims to bring together researchers from academia and industry who are interested in the issues raised by taking temporal or spatial information, whether quantitative or qualitative, into account in their data management and analysis processes (methods and applications for extracting, managing, representing, analyzing, and visualizing information).

Adapting and integrating existing open source projects

January 09, 2020

Talk, University of Nevada, Reno, NV, USA

I led the session on ‘Adapting and integrating existing open source projects’.

Workshop: Ethical Visualization in the Age of Big Data. Contemporary Cultural Implications of Pre-Twentieth-Century French Texts. A workshop to seek interdisciplinary expert perspectives on ethically and visually representing the historical place of misrepresented peoples and locales.

13th Workshop on Geographic Information Retrieval (GIR)

December 08, 2019

Workshop organization, 13th Workshop on Geographic Information Retrieval (GIR), Lyon, France

The 13th Workshop on Geographic Information Retrieval will be held in Lyon, France, on 28-29 November 2019. This workshop will address all aspects of Geographic Information Retrieval, including but not limited to methods to retrieve and analyse geo-spatial textual content, identify geographic scope, and rank documents or other resources by relevance, from both unstructured and partially structured collections.
Organized by Ross Purves, Chris Jones, Ludovic Moncla and Mauro Gaio

GeoDISCO: Encyclopedic Geographical Discourse in France from the Enlightenment to Wikipedia

December 08, 2019

Talk, 13th Workshop on Geographic Information Retrieval (GIR), Lyon, France

Authors: Denis Vigier, Thierry Joliveau, Ludovic Moncla, Katherine McDonough, and Alice Brenon
Abstract: The GeoDISCO project aims at studying the major changes in encyclopedic geographical discourse in France between 1751 (when the first volume of the Encyclopédie ou dictionnaire raisonné des sciences, des arts et des métiers, by Diderot and D’Alembert, was published) and today (Wikipedia-France, 2018). Using linguistic and GIS methods to investigate patterns in geographical content will help us understand why authors deployed language in such ways that use place as a scaffold for ideas and practices. The spatial history of French encyclopedias is a foundation for asking broader questions about the relationship between early modern geographical information and digital geographical resources.

Workshop: 13th Workshop on Geographic Information Retrieval (GIR)
Organized by Ross Purves, Chris Jones, Ludovic Moncla and Mauro Gaio

Spatial Entity Matching with GeoAlign (demo paper)

November 07, 2019

Talk, 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Chicago, IL, USA

Authors: Nelly Barret, Fabien Duchateau, Franck Favetta, and Ludovic Moncla
Abstract: Points of interest (POI) are central in many applications such as tourism, itinerary search, and crisis management. Cartographic providers usually represent these POI with a spatial entity. However, the description of these entities may significantly vary from one provider to another (e.g., missing properties, outdated information, conflicting values). Spatial entity matching (or record linkage) aims at detecting correspondences between entities referring to the same POI. Most existing approaches have a fixed function for combining similarity measures, thus limiting customization. Besides, evaluating the matching quality is a difficult task since a ground truth dataset cannot be built for all entities and providers. In this paper, we describe GeoAlign, an application that allows fine-grained tuning for spatial entity matching. A merging step is also provided using different strategies. Finally, we propose to estimate the quality of correspondences based on the differences between combination functions and to visualize this estimation in GeoAlign.
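To make the idea of a tunable combination function concrete, here is a minimal sketch, not GeoAlign's actual implementation. It combines a name similarity and a distance similarity with user-supplied weights. The entity fields, the 200 m cutoff, the weighted-average rule and the sample coordinates are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from math import asin, cos, radians, sin, sqrt


def name_similarity(a, b):
    """String similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    earth_radius_m = 6371000.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    h = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(h))


def distance_similarity(e1, e2, max_m=200.0):
    """1.0 for co-located entities, decreasing linearly to 0.0 at max_m metres."""
    d = haversine_m(e1["lat"], e1["lon"], e2["lat"], e2["lon"])
    return max(0.0, 1.0 - d / max_m)


def combined_score(e1, e2, weights):
    """Weighted average of the individual similarity measures."""
    total = sum(weights.values())
    name = name_similarity(e1["name"], e2["name"])
    dist = distance_similarity(e1, e2)
    return (weights["name"] * name + weights["distance"] * dist) / total


# Two hypothetical descriptions of the same POI from different providers.
a = {"name": "Musée des Beaux-Arts", "lat": 45.7670, "lon": 4.8336}
b = {"name": "Musee des Beaux Arts de Lyon", "lat": 45.7672, "lon": 4.8338}

print(combined_score(a, b, {"name": 1.0, "distance": 1.0}))
print(combined_score(a, b, {"name": 1.0, "distance": 3.0}))
```

Comparing the scores obtained under different weightings is one simple way to see how sensitive a correspondence is to the choice of combination function.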

3rd ACM SIGSPATIAL International Workshop on Geospatial Humanities

November 05, 2019

Workshop organization, 3rd ACM SIGSPATIAL International Workshop on Geospatial Humanities, Chicago, IL, USA

Following the success of previous editions in 2017 and 2018, this workshop is concerned with the use of geographic information systems and other spatial technologies in humanities research, placing a strong emphasis on new methodologies that leverage these technical developments (e.g., the standard tools from geographic information systems, as well as more advanced methods such as text-based geographical analysis or spatial simulation, can all benefit from innovative approaches leveraging machine learning, parallel and/or distributed computation, semantic technologies, etc.). The workshop aims to bring together researchers and practitioners from different sub-fields of computer science and the geographical information sciences, interested in the application of spatial methods and technology to the humanities, to discuss progress in the field. Participants will explore and demonstrate the contributions to knowledge that modern GIS technologies can enable within and beyond the digital humanities.
Organized by Bruno Martins, Ludovic Moncla and Patricia Murrieta-Flores

Toponym Disambiguation in Historical Documents Using Network Analysis of Qualitative Relationships

November 05, 2019

Talk, 3rd ACM SIGSPATIAL International Workshop on Geospatial Humanities, Chicago, IL, USA

Authors: Ludovic Moncla, Katherine McDonough, Denis Vigier, Thierry Joliveau, and Alice Brenon
Abstract: In this paper we use network analysis to identify qualitative “neighbors” for toponyms in an eighteenth-century French encyclopedia; the method could apply to any entry-based text with annotated toponyms. This method draws on relations in a corpus of articles, which improves disambiguation at a later stage with an external resource. We suggest the network as an alternative to geospatial representation, a useful proxy when no historical gazetteer exists for the source material’s period. Our first experiments have shown that this approach goes beyond a simple text analysis and is able to find relations between toponyms that are not co-occurring in the same documents. Network relations are also usefully compared with disambiguated toponyms to evaluate geographical coverage, and the ways that geographical discourse is expressed, in historical texts.
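The claim that a network can relate toponyms that never co-occur in the same document can be illustrated with a small sketch. It assumes a plain co-occurrence graph (an edge links two toponyms annotated in the same entry) and looks for two-hop neighbors. The graph construction and the sample entries are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(articles):
    """Link every pair of toponyms annotated in the same article."""
    graph = defaultdict(set)
    for toponyms in articles.values():
        for a, b in combinations(sorted(set(toponyms)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def indirect_neighbors(graph, toponym):
    """Toponyms reachable in two hops that never co-occur with `toponym`."""
    direct = set(graph.get(toponym, ()))
    two_hop = set()
    for neighbor in direct:
        two_hop |= graph[neighbor]
    return two_hop - direct - {toponym}


# Hypothetical entries with their annotated toponyms.
articles = {
    "LYON": ["Lyon", "Rhône", "France"],
    "VIENNE": ["Vienne", "Rhône", "Dauphiné"],
}

graph = build_graph(articles)
print(sorted(indirect_neighbors(graph, "Lyon")))
```

Here "Lyon" and "Vienne" never appear in the same entry, yet the shared neighbor "Rhône" connects them, which is the kind of qualitative context that could help choose between gazetteer candidates at a later stage.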
Organized by Bruno Martins, Ludovic Moncla and Patricia Murrieta-Flores

Towards the geoparsing and geocoding of environmental narratives

April 10, 2019

Talk, Environmental Narratives Workshop, Stels, Switzerland

Abstract: In this talk I briefly describe some of our previous and current work on geographic information retrieval. Then, I introduce some first results that show how our work can be applied to English narratives, and particularly how it can be used for geoparsing and geocoding environmental narratives.
Organized by Ross Purves, Olga Koblet, and Ben Adams

Cartographier les odonymes de Paris citées dans les romans du XIXème siècle

November 06, 2018

Talk, Atelier Humanités Numériques Spatialisées, SAGEO 2018, Montpellier, France

Authors: Ludovic Moncla, Mauro Gaio, Thierry Joliveau
Abstract: In this article, we address two gaps in NLP research: working with historical French and working with complex textual structures moving beyond running text or lists of place names. Our methodology is based on the evaluation of the results of two spatial named entity recognition tools in the context of early modern document analysis structured as dictionaries.
Organized by Carmen Brando, Francesca Frontini, and Mathieu Roche.

Expérimentation de méthodes d’extraction d’informations géographiques pour les documents historiques.

November 06, 2018

Talk, Atelier Humanités Numériques Spatialisées, SAGEO 2018, Montpellier, France

Authors: Katherine McDonough, Ludovic Moncla, and Matje van de Camp
Abstract: In this article, we address two gaps in NLP research: working with historical French and working with complex textual structures moving beyond running text or lists of place names. Our methodology is based on the evaluation of the results of two spatial named entity recognition tools in the context of early modern document analysis structured as dictionaries.
Organized by Carmen Brando, Francesca Frontini, and Mathieu Roche.

Automated geoparsing of Paris street names in 19th century novels.

November 07, 2017

Talk, 1st ACM SIGSPATIAL International Workshop on Geospatial Humanities, Redondo Beach, CA, USA

Authors: Ludovic Moncla, Mauro Gaio, Thierry Joliveau, and Yves-François Le Lay
Abstract: Our project involves building a platform able to retrieve, map and analyze the occurrences of place names in fictional novels published between 1800 and 1914 and whose action occurs wholly or partly in Paris. We describe a proof of concept using queries made via the TXM textual analysis platform for the extraction of street names. Then, we propose a fully automatic process using the named entity recognition (NER) components of the PERDIDO platform. This paper describes some encouraging initial results obtained by combining NLP approaches (NER methods) with textometric tools for the automated geoparsing of street names.
Organized by Bruno Martins and Patricia Murrieta-Flores

Extended Named Entity Recognition Using Finite-State Transducers: An Application To Place Names.

November 07, 2017

Talk, 9th International Conference on Advanced Geographic Information Systems, Applications, and Services, Nice, France

Authors: Mauro Gaio, Ludovic Moncla
Abstract: Textual geographical information is frequently organized around spatial named entities. Such entities have intrinsic ambiguities, and Named Entity Recognition and Classification methods should be improved in order to handle this problem. This article describes a knowledge-based method implementing a full process with the aim of annotating the spatial information in textual documents more precisely. This gain in accuracy guarantees a better analysis of the spatial information and a better disambiguation of places. The backbone of our proposal is a construction grammar and a cascade of finite-state transducers. The evaluation shows that the introduced concept of hierarchical overlapping is very helpful for detecting a local context associated with Named Entities.

Pluridisciplinary aspects of NLP and GIS: an application to itinerary reconstruction

September 20, 2017

Poster, RDA 10th Plenary Meeting, Montréal, Canada

Authors: Ludovic Moncla
Abstract: One of the main challenges of this work is to connect text with geographic space and to provide a map-based representation of itineraries described in textual documents. The main objectives are:

  • data mining for Geographic Information Retrieval (GIR),
  • toponym resolution and disambiguation,
  • extracting and retrieving displacements from textual documents.

Geocoding for texts with fine-grain toponyms

November 05, 2014

Talk, 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Dallas, TX, USA

Authors: Ludovic Moncla, Walter Renteria-Agualimpia, Javier Nogueras-Iso, and Mauro Gaio
Abstract: Geoparsing and geocoding are two essential middleware services to facilitate final user applications such as location-aware searching or different types of location-based services. The objective of this work is to propose a method for establishing a processing chain to support the geoparsing and geocoding of text documents describing events strongly linked with space and with a frequent use of fine-grain toponyms. The geoparsing part is a Natural Language Processing approach which combines the use of part of speech and syntactico-semantic combined patterns (cascade of transducers). However, the real novelty of this work lies in the geocoding method. The geocoding algorithm is unsupervised and takes advantage of clustering techniques to provide a solution for disambiguating the toponyms found in gazetteers, and at the same time estimating the spatial footprint of those other fine-grain toponyms not found in gazetteers. The feasibility of the proposal has been tested with a corpus of hiking descriptions in French, Spanish and Italian.
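The intuition behind spatial disambiguation can be sketched as follows. This is a simplified stand-in, not the paper's algorithm: instead of a clustering technique, each ambiguous toponym simply keeps the gazetteer candidate that lies closest to the candidates of the other toponyms in the same text, and the bounding box of the retained points serves as a rough footprint for a toponym missing from the gazetteer. The toponym names and planar coordinates (in km) are hypothetical.

```python
from math import hypot


def nearest_distance(point, options):
    """Distance from `point` to the closest candidate in `options`."""
    return min(hypot(point[0] - q[0], point[1] - q[1]) for q in options)


def disambiguate(candidates):
    """Keep, for each toponym, the candidate most compact with the others."""
    chosen = {}
    for name, options in candidates.items():
        others = [opts for other, opts in candidates.items() if other != name]
        chosen[name] = min(
            options,
            key=lambda p: sum(nearest_distance(p, o) for o in others),
        )
    return chosen


def bounding_box(points):
    """(min_x, min_y, max_x, max_y) of the retained locations."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical gazetteer candidates for toponyms from one hiking description.
candidates = {
    "Col du Lac": [(10, 10), (200, 50)],
    "Refuge": [(12, 11)],
    "Pic": [(11, 13), (300, 300)],
}

chosen = disambiguate(candidates)
footprint = bounding_box(list(chosen.values()))
print(chosen, footprint)
```

The assumption that toponyms mentioned together tend to lie close together suits hiking descriptions, and a proper clustering step would be more robust to outliers than this nearest-candidate heuristic.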

Automatic itinerary reconstruction from texts

September 25, 2014

Talk, 8th International Conference on Geographic Information Science (GIScience 2014), Vienna, Austria

Authors: Ludovic Moncla, Mauro Gaio, and Sébastien Mustière
Abstract: This paper proposes an approach for the reconstruction of itineraries extracted from narrative texts. This approach is divided into two main tasks. The first extracts geographical information with natural language processing. Its outputs are annotations of so-called expanded entities and expressions of displacement or perception from hiking descriptions. In order to reconstruct a plausible footprint of an itinerary described in the text, the second task uses the outputs of the first task to compute a minimum spanning tree.
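As a minimal sketch of the minimum spanning tree step, the code below runs Prim's algorithm over a handful of located places. It assumes planar coordinates and plain Euclidean edge weights; the paper's actual graph construction and weighting, which draw on the extracted displacement and perception expressions, are not reproduced here. Place names and coordinates are hypothetical.

```python
from math import hypot


def minimum_spanning_tree(points):
    """Prim's algorithm; returns a list of (from, to, length) edges."""
    names = list(points)
    if not names:
        return []
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        # Cheapest edge that connects the current tree to a new place.
        length, a, b = min(
            (hypot(points[a][0] - points[b][0], points[a][1] - points[b][1]), a, b)
            for a in in_tree
            for b in names
            if b not in in_tree
        )
        edges.append((a, b, length))
        in_tree.add(b)
    return edges


# Hypothetical places from a hiking description (planar coordinates, km).
points = {
    "village": (0, 0),
    "lake": (3, 0),
    "pass": (3, 4),
    "summit": (6, 4),
}

edges = minimum_spanning_tree(points)
total_length = sum(length for _, _, length in edges)
print(edges, total_length)
```

The tree connects every place with the smallest total length, which gives a plausible skeleton for the route even when the text does not state the order of every leg.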

Topographic subtyping of place named entities: a linguistic approach

May 15, 2013

Talk, 16th AGILE conference on Geographic Information Science, Leuven, Belgium

Authors: Van Tien Nguyen, Mauro Gaio, and Ludovic Moncla
Abstract: The aim of this work is to find sub-types for Place Named Entities, from the analysis of relations between Place Names and a nominal group within a specific phrasal context. The proposed method combines the use of specific intra-sentential lexico-syntactic relations and external resources like gazetteers, thesauri, or ontologies. It relies on expanded spatial named entities recognition transcribed into a symbolic representation expressed in terms of semantic features. This symbolic representation will then be associated with a geo-coded representation, depending on the available resources. Our method is completely implemented and has been tested on a corpus of travelogues.