Data Pipeline Globalvat
Discussion David Francesco 9.2.2024
Way 1
factoid import
Definition of features (a list exists)
Clarify/validate proposals & features with researcher
Prepare excel files based on defined features to be filled by Jacopo
Produce data for the span of 10 years or 1 year every 5 years? (who does this?)
Excel import and linking of places, persons and institutions in Geovistory (ongoing)
Create factoid mappings in toolbox
Automatically produce audiences in Toolbox
Produce in Toolbox identified places, identified most important persons & institutions
Config of kafka-streams to produce graph out of factoid mapping
Develop webcomponent to display mash-up information from factoid mapping and factual information
Prepare webpage for Globalvat for public access (possibly with interactive maps)
Way 2
full import
finish excel transcriptions
clean errors in data sheets (add participations where needed)
define standards to do so
define key entities of key classes
decide which key entities to be identified
decide which persons, groups, places have to be identified
create them in Geovistory to produce an identifier; while doing so, check whether they already exist in the Geovistory database
prepare data sheets for import
interlink all identified places, people and groups with GV identifier
prepare import tables clearly identifying for each column the corresponding class
Discussion 23.02.
present: Jacopo, David, Francesco
API to query the entities in GV, e.g. with OpenRefine
produce audiences with the following information
"Date audience",
"Jour semaine",
"Identifiant audience\ncalculé",
"Heure",
"Modalité de renseignement heure (Liste fixe)",
"Durée maximale (minutes)",
"Modalité de renseignement durée (Liste fixe)",
"Type audience\nselon la source",
"Type audience (catégorie recherche)",
"Lieu",
"Folio dans le fascicule"
partCol = [
"Identifiant audience\ncalculé",
"Personne reçue (comme indiquée)",
"Qualité personne (comme indiquée)",
"Groupe reçu\n(comme indiqué)",
"Qualité groupe\n(comme indiquée)",
"Mention d’accompagnement\n(comme indiquée)",
"Accompagnement",
"Détails"
]
see also deepnote here: https://deepnote.com/workspace/jacopo-cossu-f66a-69f0c44c-de05-429f-8c28-79b5017ff8b9/project/GLOBALVAT-526f87ec-0b51-4baf-b4c9-20a43afa690a/notebook/Tabelle-34cc4b750ae744c0a2f201ba691c6cec
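The two column lists above suggest splitting each sheet row into an audience record and a participation record that share the calculated audience identifier. A minimal sketch of that split, assuming the source sheet has one row per participant with the audience columns repeated on each row (this is an assumption, not confirmed by the notes). The names `udienze_cols`, `part_cols` and `split_rows` are hypothetical, the column lists are abbreviated, and the sample values are invented placeholders:

```python
AUDIENCE_KEY = "Identifiant audience\ncalculé"

# Abbreviated column lists; the full lists are the ones given in these notes.
udienze_cols = [AUDIENCE_KEY, "Date audience", "Heure", "Lieu"]
part_cols = [
    AUDIENCE_KEY,
    "Personne reçue (comme indiquée)",
    "Groupe reçu\n(comme indiqué)",
]


def split_rows(rows, audience_cols, participation_cols, key=AUDIENCE_KEY):
    """Split sheet rows into one audience table and one participation table.

    Assumes one row per participant; the first row seen for an audience
    identifier supplies the audience attributes.
    """
    audiences = {}
    participations = []
    for row in rows:
        audiences.setdefault(
            row[key], {col: row.get(col, "") for col in audience_cols}
        )
        participations.append({col: row.get(col, "") for col in participation_cols})
    return list(audiences.values()), participations


# Invented placeholder rows, not project data.
sample_rows = [
    {
        AUDIENCE_KEY: "A-1",
        "Date audience": "1939-03-01",
        "Heure": "10:00",
        "Lieu": "Place P",
        "Personne reçue (comme indiquée)": "Cardinal X",
        "Groupe reçu\n(comme indiqué)": "",
    },
    {
        AUDIENCE_KEY: "A-1",
        "Date audience": "1939-03-01",
        "Heure": "10:00",
        "Lieu": "Place P",
        "Personne reçue (comme indiquée)": "",
        "Groupe reçu\n(comme indiqué)": "Group G",
    },
]

audiences, participations = split_rows(sample_rows, udienze_cols, part_cols)
```

The audience identifier is kept in both tables so that it can serve as the foreign key from participations to audiences.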
Procedure:
prepare two tables in sql logic
1 table on audience
1 on participations
if participant is non-identified: propositional object
prepare a “participant description”
if participant is identified: preparation of this table includes disambiguation on the level of groups and persons
for this:
option 1
prepare an aggregated list of all persons from project data using group by function in OpenRefine
check with GV API - does it already exist?
if yes, add Identifier
if no: add definition for these persons
option 2
prepare an aggregated list of all persons from project data using group by function in OpenRefine
check in the GV Toolbox whether the person exists; if not, create it manually
copy the ID into the “participation” table
send two tables to KL in sql logic
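The aggregation and lookup in option 1 could also be prototyped outside OpenRefine. A minimal sketch, assuming the participation rows are available as dictionaries and that the identifiers already known in GV have been exported into a plain mapping. The `known_ids` mapping stands in for the GV API lookup, whose actual interface is not described in these notes; the function names, the `gv:0001` identifier format and the sample names are all hypothetical:

```python
from collections import Counter

PERSON_COLUMN = "Personne reçue (comme indiquée)"


def aggregate_persons(participations, column=PERSON_COLUMN):
    """Count how often each distinct person string occurs (a simple group by)."""
    names = (row.get(column, "").strip() for row in participations)
    return Counter(name for name in names if name)


def attach_identifiers(counts, known_ids):
    """Separate names that already have an identifier from names still to define."""
    identified = {name: known_ids[name] for name in counts if name in known_ids}
    to_define = sorted(name for name in counts if name not in known_ids)
    return identified, to_define


# Invented placeholder rows, not project data.
rows = [
    {PERSON_COLUMN: "Cardinal X"},
    {PERSON_COLUMN: "Cardinal X "},  # trailing space is stripped before grouping
    {PERSON_COLUMN: "Bishop Y"},
    {PERSON_COLUMN: ""},  # row without a named person is ignored
]
known_ids = {"Cardinal X": "gv:0001"}  # hypothetical stand-in for the GV lookup

counts = aggregate_persons(rows)
identified, to_define = attach_identifiers(counts, known_ids)
```

Names in `identified` receive their identifier in the participation table; names in `to_define` still need a definition (option 1) or manual creation in the Toolbox (option 2).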
reflections:
first important step is to define a “rule” to differentiate between participations that are identified (with persons or groups) and participations with non-identified “participant descriptions”
maybe have more tables
audiences
participant (with foreign key to “table person” and to “audience table”)
identified persons
on the level of identified persons :
option 1 or 2 can be chosen once the aggregated list under 1.a and 2.a is ready
Option 2 might be suitable for the “cardinals”, but not for thousands of persons
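The table layout considered in these reflections can be sketched with SQLite from the Python standard library. This is only an illustration under stated assumptions: all table and column names are hypothetical, the `CHECK` constraint is one possible way to express the identified/non-identified “rule”, and the inserted values are invented placeholders:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript(
    """
    CREATE TABLE audience (
        audience_id TEXT PRIMARY KEY,
        audience_date TEXT
    );
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        gv_identifier TEXT,
        label TEXT NOT NULL
    );
    CREATE TABLE participant (
        participant_id INTEGER PRIMARY KEY,
        audience_id TEXT NOT NULL REFERENCES audience(audience_id),
        person_id INTEGER REFERENCES person(person_id),
        description TEXT,
        -- a participant is either identified or carries a description
        CHECK (person_id IS NOT NULL OR description IS NOT NULL)
    );
    """
)

# Invented placeholder data, not project data.
conn.execute("INSERT INTO audience VALUES ('A-1', '1939-03-01')")
conn.execute("INSERT INTO person VALUES (1, 'gv:0001', 'Cardinal X')")
conn.execute(
    "INSERT INTO participant (audience_id, person_id) VALUES ('A-1', 1)"
)
conn.execute(
    "INSERT INTO participant (audience_id, description) "
    "VALUES ('A-1', 'group of 3 pilgrims')"
)

status_rows = conn.execute(
    "SELECT participant_id, "
    "CASE WHEN person_id IS NULL THEN 'unidentified' ELSE 'identified' END "
    "FROM participant ORDER BY participant_id"
).fetchall()
```

With this layout the participant table carries both foreign keys, and a missing `person_id` marks a participation that only has a participant description.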
next steps:
have a discussion with Laura on
the “content” of audiences & participations (see above)
the characteristics of “participation description” that is relevant to be kept
number of persons
nationality
produce code for 3 tables, including a table on persons (the current code produces 2 tables)
udienze
partecipazioni
participant description
test code on “march 1939 table”
produce the 3 tables
test to identify persons, identify groups, identify participation descriptions
have a meeting to discuss together next steps on the 6th of March at 13h30
6.3.2024 Meeting
Vincent, Francesco
Discussion in Rome with Laura
attributes of audience selected & discussed
udienzeCol = [
"Folio dans le fascicule",
"Date audience",
"Jour semaine",
"Identifiant audience\ncalculé",
"Heure",
"Modalité de renseignement heure (Liste fixe)",
"Durée maximale (minutes)",
"Modalité de renseignement durée (Liste fixe)",
"Type audience\nselon la source",
"Type audience (catégorie recherche)",
"Lieu",
"Détails",
"Recommandation",
]
partCol = [
"Identifiant audience\ncalculé",
"Personne reçue (comme indiquée)",
"Qualité personne (comme indiquée)",
"Groupe reçu\n(comme indiqué)",
"Qualité groupe\n(comme indiquée)",
"Mention d’accompagnement\n(comme indiquée)",
"Accompagnement",
]
two kinds of participants
identified participants (persons or groups)
un-identified participants: (sets of persons)
aim:
rich data on famous persons
some information on everyone else
threshold
identified vs unidentified
Challenge with Mentions:
how to identify them
table for all mentions
What is goal of webpage?
website that can be searched to some degree
audiences
who participates
some identified high-level persons/groups
cardinals, bishops
participant descriptions, with associated attributes:
with type (religious, other),
number,
nationality,
gender,
→ allow for analysis of some particular aspects
mentions/participant description table (set or a person)
string with everything as per source (e.g. Cardinal Y, accompanied by 3 Jesuit brothers)
link to jesuits
number “3”
male/female
table mention
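To make the mention table concrete, the attributes listed above can be derived from the source string with simple rules. A rough sketch only: the keyword lists, field names and the function `describe_mention` are hypothetical, and real mentions in the source language would need their own vocabulary and manual checking:

```python
import re

# Hypothetical keyword tables; they would have to be built from the real data.
GROUP_KEYWORDS = {"jesuit": "Jesuits"}
MALE_TERMS = ("brothers", "fathers")
FEMALE_TERMS = ("sisters", "nuns")


def describe_mention(text):
    """Derive number, linked groups and gender from a mention string.

    The original string is kept unchanged so nothing from the source is lost.
    """
    lowered = text.lower()
    number_match = re.search(r"\b(\d+)\b", lowered)
    linked_groups = [
        label for keyword, label in GROUP_KEYWORDS.items() if keyword in lowered
    ]
    if any(term in lowered for term in MALE_TERMS):
        gender = "male"
    elif any(term in lowered for term in FEMALE_TERMS):
        gender = "female"
    else:
        gender = None
    return {
        "source_string": text,
        "number": int(number_match.group(1)) if number_match else None,
        "linked_groups": linked_groups,
        "gender": gender,
    }


mention = describe_mention("Cardinal Y, accompanied by 3 Jesuit brothers")
```

Fields that cannot be derived stay `None` or empty, so they can be completed by hand later.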
also discussed:
Special audiences
accompaniment
Jacopo's notebook can create three tables:
(using google API)
audiences:
participants
mentions
possibilities:
mentions with similarities
in case it is difficult:
fewer participants and richer descriptions
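One way to explore mentions with similarities is fuzzy string matching from the Python standard library. A minimal sketch, assuming mentions are plain strings; the threshold of 0.85 is an arbitrary starting value, the greedy comparison against the first member of each cluster is a simplification, and the sample strings are invented:

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.85):
    """Return True when two mention strings are close enough to be grouped."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def cluster_mentions(mentions, threshold=0.85):
    """Greedily group mentions; each one joins the first cluster it resembles."""
    clusters = []
    for mention in mentions:
        for cluster in clusters:
            if similar(mention, cluster[0], threshold):
                cluster.append(mention)
                break
        else:
            # no existing cluster is similar enough: start a new one
            clusters.append([mention])
    return clusters


# Invented placeholder strings, not project data.
clusters = cluster_mentions(
    ["3 Jesuit brothers", "3 jesuit brother", "group of pilgrims"]
)
```

The resulting clusters are only candidates for a shared participant description and would still need review by the researchers.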