Data Pipeline Globalvat

Discussion David Francesco 9.2.2024

Way 1

factoid import

  1. Definition of features (a list exists)

  2. Clarify/validate proposals & features with researcher

  3. Prepare Excel files based on the defined features, to be filled in by Jacopo

  4. Produce data for a span of 10 years, or for 1 year in every 5? (who does this?)

  5. Excel import and linking of places, persons and institutions in Geovistory (ongoing)

  6. Create factoid mappings in Toolbox

  7. Automatically produce audiences in Toolbox

  8. Create in Toolbox the identified places and the most important identified persons & institutions

  9. Configure Kafka Streams to produce a graph from the factoid mapping

  10. Develop a web component to display mash-up information from the factoid mapping and factual information

  11. Prepare a webpage for Globalvat for public access (possibly with interactive maps)

 

 

Way 2

full import

  1. finish Excel transcriptions

  2. clean errors in data sheets (add participations where needed)

    1. define standards to do so

  3. define key entities of key classes

  4. decide which key entities to be identified

    1. decide which persons, groups, places have to be identified

    2. create them in Geovistory to produce an identifier; in doing so, check whether they already exist in the Geovistory database

  5. prepare data sheets for import

    1. interlink all identified places, people and groups with their GV identifiers

  6. prepare import tables clearly identifying for each column the corresponding class

 

Discussion 23.02.

present: Jacopo, David, Francesco

 

API to query the entities in GV, e.g. with OpenRefine

 

 

produce audiences with the following information

udienzeCol = [
    "Date audience",
    "Jour semaine",
    "Identifiant audience\ncalculé",
    "Heure",
    "Modalité de renseignement heure (Liste fixe)",
    "Durée maximale (minutes)",
    "Modalité de renseignement durée (Liste fixe)",
    "Type audience\nselon la source",
    "Type audience (catégorie recherche)",
    "Lieu",
    "Folio dans le fascicule",
]
partCol = [
    "Identifiant audience\ncalculé",
    "Personne reçue (comme indiquée)",
    "Qualité personne (comme indiquée)",
    "Groupe reçu\n(comme indiqué)",
    "Qualité groupe\n(comme indiquée)",
    "Mention d’accompagnement\n(comme indiquée)",
    "Accompagnement",
    "Détails",
]

see also deepnote here: https://deepnote.com/workspace/jacopo-cossu-f66a-69f0c44c-de05-429f-8c28-79b5017ff8b9/project/GLOBALVAT-526f87ec-0b51-4baf-b4c9-20a43afa690a/notebook/Tabelle-34cc4b750ae744c0a2f201ba691c6cec

Procedure:

  1. prepare two tables in sql logic

    1. 1 table on audience

    2. 1 on participations

      1. if the participant is not identified: propositional object

        1. prepare a “participant description”

      2. if the participant is identified: preparing this table includes disambiguation at the level of groups and persons

      3. for this:

        1. option 1

          1. prepare an aggregated list of all persons from project data using group by function in OpenRefine

          2. check with GV API - does it already exist?

            1. if yes, add Identifier

            2. if no: add definition for these persons

        2. option 2

          1. prepare an aggregated list of all persons from project data using group by function in OpenRefine

          2. check in the GV Toolbox whether the person exists; if not, create it manually

          3. copy the ID into the “participation” table

  2. send two tables to KL in sql logic
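The participation part of the procedure above could be sketched roughly as follows. This is only an illustration, not the notebook's code: the function name `build_participations`, the mapping `known_ids` and the output field names are assumptions; only the three column names come from the sheet.

```python
from typing import Optional

# Column names as they appear in the transcription sheet.
AUDIENCE_ID = "Identifiant audience\ncalculé"
PERSON = "Personne reçue (comme indiquée)"
GROUP = "Groupe reçu\n(comme indiqué)"


def build_participations(rows: list[dict], known_ids: dict[str, str]) -> list[dict]:
    """Turn sheet rows into participation rows.

    known_ids maps a person or group name to its GV identifier
    (hypothetical values). Names missing from known_ids are kept as
    free-text participant descriptions.
    """
    participations = []
    for row in rows:
        # Simplification: use the person if given, otherwise the group.
        name = (row.get(PERSON) or row.get(GROUP) or "").strip()
        if not name:
            continue  # nothing recorded for this row
        gv_id: Optional[str] = known_ids.get(name)
        participations.append({
            "audience_id": row[AUDIENCE_ID],
            "gv_id": gv_id,  # set only for identified participants
            "participant_description": None if gv_id else name,
        })
    return participations
```

An exact-match lookup is the simplest possible rule; the real rule for "identified" still has to be agreed on.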

reflections:

  • the first important step is to define a “rule” to differentiate between participations that are identified (with persons or groups) and participations with non-identified “participant descriptions”

  • maybe have more tables

    • audiences

    • participant (with foreign key to “table person” and to “audience table”)

    • identified persons
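A three-table layout of this kind might look like the sketch below, written with Python's built-in `sqlite3` module. All table and column names are assumptions chosen for illustration, not an agreed schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    gv_identifier TEXT UNIQUE,      -- identifier of the entity in GV
    label TEXT NOT NULL
);
CREATE TABLE audience (
    audience_id TEXT PRIMARY KEY,   -- computed audience identifier
    audience_date TEXT,
    place TEXT
);
CREATE TABLE participant (
    participant_id INTEGER PRIMARY KEY,
    audience_id TEXT NOT NULL REFERENCES audience(audience_id),
    person_id INTEGER REFERENCES person(person_id),  -- NULL if not identified
    participant_description TEXT                     -- text as in the source
);
"""


def create_database(path: str = ":memory:") -> sqlite3.Connection:
    """Create the three tables and return the open connection."""
    conn = sqlite3.connect(path)
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```

With the foreign keys enforced, a participant row cannot point to an audience or person that does not exist, which catches linking errors at import time.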

on the level of identified persons :

  • option 1 or 2 can be chosen once the aggregated list under 1.a and 2.a is ready

  • Option 2 might be suitable for the “cardinals”, but not for thousands of persons
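As a rough sketch of option 1 outside OpenRefine: aggregate the names, then ask a lookup for each one. The `lookup` callable stands in for the GV API query, whose actual interface is not described in these notes; it and both function names are hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable, Optional


def aggregate_names(names: Iterable[str]) -> Counter:
    """Count how often each (whitespace-normalised) name occurs."""
    return Counter(" ".join(n.split()) for n in names if n and n.strip())


def match_names(
    counts: Counter,
    lookup: Callable[[str], Optional[str]],
) -> tuple[dict[str, str], list[str]]:
    """Split names into those with an identifier and those needing a definition.

    `lookup` is a placeholder for the GV query: it returns an identifier,
    or None if the person does not exist yet.
    """
    identified: dict[str, str] = {}
    to_define: list[str] = []
    # Most frequent names first, so the important ones are handled early.
    for name, _count in counts.most_common():
        gv_id = lookup(name)
        if gv_id is not None:
            identified[name] = gv_id
        else:
            to_define.append(name)
    return identified, to_define
```

The length of `to_define` gives a quick estimate of how much manual work option 2 would mean.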

 

next steps:

  • have a discussion with Laura on

    • the “content” of audiences & participations (see above)

    • the characteristics of the “participant description” that are worth keeping

      • number of persons

      • nationality

  • produce code for 3 tables, including a table on persons (the current code produces 2 tables)

    • udienze

    • partecipazioni

    • participant description

  • test code on “march 1939 table”

    • produce the 3 tables

    • test to identify persons, identify groups, identify participation descriptions

  • have a meeting to discuss together next steps on the 6th of March at 13h30

 

6.3.2024 Meeting

Vincent, Francesco

Discussion in Rome with Laura

  • attributes of audience selected & discussed

udienzeCol = [
    "Folio dans le fascicule",
    "Date audience",
    "Jour semaine",
    "Identifiant audience\ncalculé",
    "Heure",
    "Modalité de renseignement heure (Liste fixe)",
    "Durée maximale (minutes)",
    "Modalité de renseignement durée (Liste fixe)",
    "Type audience\nselon la source",
    "Type audience (catégorie recherche)",
    "Lieu",
    "Détails",
    "Recommandation",
]
partCol = [
    "Identifiant audience\ncalculé",
    "Personne reçue (comme indiquée)",
    "Qualité personne (comme indiquée)",
    "Groupe reçu\n(comme indiqué)",
    "Qualité groupe\n(comme indiquée)",
    "Mention d’accompagnement\n(comme indiquée)",
    "Accompagnement",
]
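How two such column lists could be used to split the wide sheet is sketched below. It assumes that each sheet row describes one participation and repeats the audience attributes; `split_rows` and its parameter names are hypothetical and not taken from the notebook.

```python
def split_rows(rows, audience_cols, participation_cols, key):
    """Return (audiences, participations) projected from wide sheet rows.

    Audiences are de-duplicated on `key`; every row yields one participation.
    """
    audiences = {}
    participations = []
    for row in rows:
        audience_id = row[key]
        # Keep only the first occurrence of each audience.
        if audience_id not in audiences:
            audiences[audience_id] = {c: row.get(c) for c in audience_cols}
        participations.append({c: row.get(c) for c in participation_cols})
    return list(audiences.values()), participations
```

It would be called with `udienzeCol`, `partCol` and the computed audience identifier column as `key`, which is the column both lists share.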

two kinds of participants

  • identified participants (persons or groups)

  • unidentified participants (sets of persons)

 

aim:

  • rich data on famous participants

  • basic information on everyone else

 

threshold

  • identified vs unidentified

 

Challenge with Mentions:

  • how to identify them

  • table for all mentions

 

What is the goal of the webpage?

  • website that can be searched to some degree

    • audiences

    • who participates

      • some identified high-level persons/groups

        • cardinals, bishops

      • participant descriptions, with associated attributes:

        • type (religious, other),

        • number,

        • nationality,

        • gender,

      • → allow for analysis of some particular aspects

 

 

  • mentions/participant description table (a set of persons or a single person)

    • string with everything as per source (Cardinal y, accompanied by 3 Jesuit brothers)

    • link to Jesuits

    • number “3”

    • male/female
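One row of such a mention table could be modelled as below. The class and field names are assumptions, and the helper only picks up numbers written in digits; number words in the source language would need separate handling.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Mention:
    source_text: str             # full string as written in the source
    group_link: Optional[str]    # identifier of a linked group, if any
    number: Optional[int]        # number of persons, if stated
    gender: Optional[str]        # "male", "female", or None if unknown


def first_number(text: str) -> Optional[int]:
    """Return the first integer written in digits in the text, if any."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None
```

For the example string above, `first_number` picks out the 3, while the link to the Jesuits and the gender would still be filled in by hand or by a separate rule.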

 

  • table mention

image-20240306-125920.png
image-20240306-141156.png

 

also discussed:

  • Special audiences

  • accompaniment


 

 

Jacopo's notebook can create three tables:

(using the Google API)

  • audiences:

  • participants

  • mentions

 

possibilities:

  • mentions with similarities
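One possible way to surface similar mentions is plain string similarity from Python's standard `difflib` module. The function name and the 0.85 threshold are arbitrary assumptions; comparing all pairs is only practical for a few thousand distinct strings.

```python
from difflib import SequenceMatcher


def similar_pairs(mentions: list[str], threshold: float = 0.85) -> list[tuple[str, str, float]]:
    """Return pairs of distinct mention strings whose similarity ratio
    is at least `threshold` (case-insensitive comparison)."""
    unique = sorted({m.strip() for m in mentions if m.strip()})
    pairs = []
    for i, a in enumerate(unique):
        for b in unique[i + 1:]:
            ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if ratio >= threshold:
                pairs.append((a, b, ratio))
    return pairs
```

The resulting pairs are candidates for review only; whether two similar strings mean the same participant remains a research decision.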

 

in case it is difficult:

  • fewer participants and richer descriptions