Motivation

The master’s course “Building and Mining Knowledge Graphs” required us to work on an individual project. I decided to challenge myself by converting the MusicBrainz dataset into a knowledge graph and then linking it to a few per-country metrics. This meant deploying MusicBrainz’ PostgreSQL database with Docker and exploring three different conversion approaches. The linking part turned out to be rather easy, so I plan to connect the MusicBrainz dataset to my Spotify listening history in a follow-up project.

The code is on GitHub: https://github.com/arthurhaas/um_project_knowledge_graph

Steps

The figure below illustrates my process of gathering the three datasets, converting them into triples (i.e. creating a knowledge graph), linking all three, and storing them in GraphDB for querying with SPARQL.

Project setup

Technical setup of the project from data ingest to querying GraphDB.

/1 Gather data

As you see in the above illustration, I used three datasets. All of these datasets contain the country as a common entity, which will be used for linking later.

  1. MusicBrainz is a large collection of music data about artists, their albums, songs, etc.
  2. The Happiness Report indicates the population’s happiness.
  3. The report of the GDP per capita shows the economic status of a country.

Downloading the three datasets and making them usable was the first step of the project. While the Happiness Report and the GDP data together amount to around 0.5 MB, the MusicBrainz dataset totals roughly 50 GB. I therefore had to choose which part of the MusicBrainz data to use and decided on the core data for the knowledge graph.

To make the MusicBrainz dataset usable, a setup involving Docker, PostgreSQL, and mbdata was required. MusicBrainz recommends using mbdata instead of their own MusicBrainz Server system when the web server is not needed; since I only need the dataset, mbdata was sufficient. I set up an Ubuntu Docker base image and followed the tutorial in mbdata’s GitHub repository. The tutorial does not list every command to execute, so a fair amount of engineering research was needed to get the database running. The details are not covered here but can be found in the corresponding Dockerfiles and readmes.
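
Once the import has finished, a quick sanity check against the database confirms that the core tables are populated. Below is a minimal sketch using psycopg2; the connection details mirror the database source used later in the RML mappings, but treat them as assumptions for your own setup.

# Sanity check: count a few core MusicBrainz tables after the mbdata import.
# Connection details are assumptions matching the RML database source shown later.
import psycopg2

conn = psycopg2.connect(
    host="0.0.0.0",
    port=5432,
    dbname="musicbrainz",
    user="musicbrainz",
    password="it_works",
)

with conn.cursor() as cur:
    for table in ("artist", "release", "release_group", "area"):
        cur.execute(f'SELECT count(*) FROM "{table}";')
        print(table, cur.fetchone()[0])

conn.close()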

/2 Conversion

Before converting the datasets, it is wise to conceptualise the final knowledge graph. The next figure shows this concept.

Knowledge Graph Conceptualisation

Concept of the knowledge graph, used to plan entity IRIs, relations, and literals.


Conversion method 1: YARRRML

To convert the Happiness Report CSV file into Turtle syntax, I used YARRRML, a YAML-based syntax for RML rules.

# column: Score
hasHappinessScore:
  sources:
    - ["../../data/happiness_report/2019_preprocessed.csv~csv"]
  s: project_ontology:hasHappinessScore
  po:
    - [a, owl:DatatypeProperty]
    - [rdfs:label, 'Happiness score', en~lang]
    - [rdfs:comment, 'The world happiness score of a country, see
                      https://www.wikidata.org/wiki/Q56258795.
                      Part of the World Happiness Report, see
                      https://www.wikidata.org/wiki/Q17090395.', en~lang]
    - [rdfs:domain, schema:AdministrativeArea~iri]
    - [rdfs:range, xsd:float~iri]
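
The YARRRML rules still need to be translated to RML and executed by an RML processor; once the processor has written its Turtle output, a quick check with rdflib shows whether the property definitions came out as intended. A minimal sketch, where the output file name is my own assumption:

# Sanity check of the Turtle generated from the YARRRML rules above.
# The output file name "happiness_report.ttl" is an assumption.
from rdflib import Graph
from rdflib.namespace import RDFS

g = Graph()
g.parse("happiness_report.ttl", format="turtle")

print(len(g), "triples")
for s, o in g.subject_objects(RDFS.label):
    print(s, "->", o)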

Conversion method 2: Python with rdflib

To convert the GDP data, I decided to experiment with the Python library rdflib, which is very flexible because the conversion is written in plain Python. pandas is used to load the CSV file and iterate over it row by row. The tricky part was binding a namespace so that it appears as a correctly written prefix in the Turtle output.

Source table:

Country Name | Country Code | Indicator Name               | 2019
Germany      | DEU          | GDP per capita (current US$) | 46445.2491012508

Python code:

# Create triples
for index, row in df.iterrows():

    # country IRI built from the ISO-3 country code
    country = URIRef(geonames[row['Country Code'] + "/"])

    g.add((country, RDF.type, schema.Country))
    g.add((country, RDFS.label, Literal(row['Country Name'], lang='en')))
    g.add((country, wikidata_predicate.P2132, Literal(row['2019'], datatype=XSD.float)))

Resulting Turtle syntax:

<https://www.geonames.org/countries/DEU/> a schema:Country ;
    rdfs:label "Germany"@en ;
    wikidata_predicate:P2132 "46445.25"^^xsd:float .
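
The prefixes in this output only show up because the namespaces are bound to the graph before serialising, which was the tricky part mentioned above. A minimal sketch of that binding; the exact Wikidata namespace IRI and the output file name are my assumptions:

# Bind namespaces so they appear as prefixes in the serialised Turtle.
# The wikidata_predicate IRI and the file name are assumptions.
from rdflib import Graph, Namespace

g = Graph()

schema = Namespace("https://schema.org/")
geonames = Namespace("https://www.geonames.org/countries/")
wikidata_predicate = Namespace("http://www.wikidata.org/prop/direct/")

g.bind("schema", schema)
g.bind("wikidata_predicate", wikidata_predicate)

# ... triples are added as shown above ...

g.serialize(destination="gdp.ttl", format="turtle")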

Conversion method 3: plain RML on a PostgreSQL database

The most challenging dataset was MusicBrainz, due to its complexity of more than 200 tables. I decided to use plain RML. To get a working connection to the PostgreSQL server I had to dig into the RML test cases, because the documentation is sparse on this point. In addition, I split the RML mapping into separate files for entities such as country, release, release group, and artist, to avoid one large monolithic job.

A small extract of the mapping from PostgreSQL to Turtle for artists, using a query:

<#LogicalSource>
  rml:source <#DB_source>;
    rr:sqlVersion rr:SQL2008;
    rml:query
    """
    -- Musicbrainz Query 4
    -- artists
    select
      rg.gid as release_group_gid,
      a.gid as gid,
      a.name as name,
      at2.name as type,
      g2.name as gender,
      a.comment as comment
    from release_group rg
    left join artist_credit_name acn on acn.artist_credit = rg.artist_credit
    left join artist a on a.id = acn.artist 
    left join artist_type at2 on at2.id = a."type"
    left join gender g2 on g2.id = a.gender 
    where
      -- only release_groups with releases in year 2019
      rg.id in (
        select r.release_group
        from "release" r
        where id in (select "release" from release_country rc where rc.date_year = 2019 group by "release")
      );
    """;
    rml:referenceFormulation ql:CSV

.


<TriplesMap_Artists>
  a rr:TriplesMap;

  rml:logicalSource <#LogicalSource>;

  rr:subjectMap [
      rr:template "https://musicbrainz.org/artist/{gid}";
      rr:class music:MusicArtist;
    ];

    rr:predicateObjectMap [
      rr:predicate rdfs:label ;
      rr:objectMap [
        rml:reference "name";
        rr:datatype xsd:string
      ]
    ];

    rr:predicateObjectMap [
      rr:predicate music:artist_type ;
      rr:objectMap [
        rr:template "https://w3id.org/um/ken4256/project/mb/artistType/{type}";
        rr:termType rr:IRI
      ]
    ];

.


<#DB_source> a d2rq:Database;
  d2rq:jdbcDSN "jdbc:postgresql://0.0.0.0:5432/musicbrainz";
  d2rq:jdbcDriver "org.postgresql.Driver";
  d2rq:username "musicbrainz";
  d2rq:password "it_works" .
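
With the mapping split per entity, each file can be executed separately against the database, which keeps the individual jobs small. A minimal sketch of driving the RMLMapper jar from Python; the mapping file names and the jar location are assumptions:

# Run each entity-specific RML mapping with the RMLMapper jar.
# Mapping file names and the jar path are assumptions.
import subprocess

mappings = ["country.rml.ttl", "artist.rml.ttl", "release.rml.ttl", "release_group.rml.ttl"]

for mapping in mappings:
    output = mapping.replace(".rml.ttl", ".ttl")
    subprocess.run(
        ["java", "-jar", "rmlmapper.jar", "-m", mapping, "-o", output],
        check=True,
    )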

/3 Linking

For the link between the GDP data and MusicBrainz, a mapping file from GeoNames was used, because the link only requires translating between an ISO-2 and an ISO-3 country code, which is straightforward (see the sketch below). For the link between the Happiness Report and MusicBrainz, on the other hand, LIMES was applied.
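
A minimal sketch of the code-based GDP link; the GeoNames file layout, the MusicBrainz country IRI pattern, and the use of owl:sameAs are my assumptions for illustration:

# Link GDP country IRIs (ISO-3) to MusicBrainz country IRIs (ISO-2)
# via a GeoNames code mapping. File name, column names, the MusicBrainz
# IRI pattern and the owl:sameAs predicate are assumptions.
import pandas as pd
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import OWL

codes = pd.read_csv("geonames_country_codes.csv")  # columns: iso2, iso3

geonames = Namespace("https://www.geonames.org/countries/")
links = Graph()

for _, row in codes.iterrows():
    gdp_country = geonames[row["iso3"] + "/"]
    mb_country = URIRef("https://w3id.org/um/ken4256/project/mb/country/" + row["iso2"])
    links.add((gdp_country, OWL.sameAs, mb_country))

links.serialize(destination="links_gdp_musicbrainz.ttl", format="turtle")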

/4 Query using SPARQL on GraphDB

After conversion and linking, the data was loaded into GraphDB. The title illustration shows an overview of some entities and relations as they are displayed in GraphDB’s web app.

As an example, the following SPARQL query returns the total number of releases by country in descending order.

# Query - Country with most releases
PREFIX music: <http://purl.org/ontology/mo/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX project_onto: <https://w3id.org/um/ken4256/project/mb/ontology/>
PREFIX schema: <https://schema.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

select ?country_name (count(distinct ?release) as ?release_count)
where { 
    ?release a music:Release;
             schema:addressCountry ?country.
    ?country a schema:Country;
             foaf:name ?country_name.
}
group by ?country_name
order by desc(?release_count)
limit 100
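
The same query can also be run programmatically against GraphDB’s SPARQL endpoint, for example with SPARQLWrapper; the repository name in the endpoint URL is my assumption:

# Run the releases-per-country query against GraphDB's SPARQL endpoint.
# The repository name "musicbrainz_kg" is an assumption.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://localhost:7200/repositories/musicbrainz_kg")
endpoint.setReturnFormat(JSON)
endpoint.setQuery("""
PREFIX music: <http://purl.org/ontology/mo/>
PREFIX schema: <https://schema.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?country_name (COUNT(DISTINCT ?release) AS ?release_count)
WHERE {
    ?release a music:Release ;
             schema:addressCountry ?country .
    ?country a schema:Country ;
             foaf:name ?country_name .
}
GROUP BY ?country_name
ORDER BY DESC(?release_count)
LIMIT 10
""")

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["country_name"]["value"], binding["release_count"]["value"])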

Next up

Expanding the linking step from simple metrics by country to my Spotify listening history is the next step of this side-project. I collect my listening history by “scrobbling” to https://www.last.fm.

Further Spotify APIs I am interested in trying out: