Music Recommendation with Spark

Aug 9, 2022

MSD (Million Song Dataset)

The Million Song Dataset is a freely available collection of audio features and metadata for a million contemporary popular music tracks.

The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The dataset does not include any audio, only the derived features.

Data Processing

Get familiar with HDF5 and its related libraries
Use Apache Avro with the Snappy codec to compress the data
Visualize the dataset to build an intuitive understanding of it

Drill and SQL

Range of dates covered by the songs
The hottest and shortest song with the highest energy and the lowest tempo
Album with the most tracks
Band that recorded the longest song
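
As a rough illustration of the kind of queries listed above, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Drill. The flattened songs table, its column names (title, artist_name, release, year, duration), the sample rows, and the convention that year 0 means "unknown" are all assumptions made for this example, not the project's actual schema.

```python
import sqlite3

# Hypothetical flattened "songs" table; column names are assumptions for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE songs (title TEXT, artist_name TEXT, release TEXT, "
    "year INTEGER, duration REAL)"
)
rows = [
    ("Song A", "Band X", "Album 1", 1999, 210.0),
    ("Song B", "Band X", "Album 1", 1999, 185.5),
    ("Song C", "Band Y", "Album 2", 2005, 612.3),
    ("Song D", "Band Z", "Album 3", 0, 240.0),  # year 0 stands for "unknown" here
]
conn.executemany("INSERT INTO songs VALUES (?, ?, ?, ?, ?)", rows)

# Range of dates covered by the songs (ignoring the placeholder year 0).
year_range = conn.execute(
    "SELECT MIN(year), MAX(year) FROM songs WHERE year > 0"
).fetchone()

# Album with the most tracks.
top_album = conn.execute(
    "SELECT release, COUNT(*) AS n FROM songs "
    "GROUP BY release ORDER BY n DESC LIMIT 1"
).fetchone()

# Band that recorded the longest song.
longest = conn.execute(
    "SELECT artist_name, title, duration FROM songs "
    "ORDER BY duration DESC LIMIT 1"
).fetchone()

print(year_range, top_album, longest)
```

The same SELECT statements should carry over to Drill with little change, although the table reference would point at the stored files rather than an in-memory table.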

BFS with Spark

Use the number of hops (k-hop distance) to define similarity between artists
Relationships in the graph: (artist, song, similar_artists)
Input: adjacency list (artist, similar_artists)
Output: similar artists within k hops

Algorithm

Compared with BFS in MapReduce, Spark lets the mapper and reducer functions be written as lambda functions.
Mark the visited nodes and the most recently visited (MRV) nodes.
From the MRV nodes, visit their neighbours in parallel.
Mark those neighbours as the new MRV nodes.
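
The steps above can be sketched in plain Python, without Spark, to show the frontier logic on its own. The function name k_hop_similar and the toy adjacency list are hypothetical, and the sequential loop only stands in for the expansion that Spark would run in parallel.

```python
def k_hop_similar(adjacency, source, k):
    """Return {artist: hop distance} for artists within k hops of source."""
    distance = {source: 0}  # visited nodes and their hop distance
    mrv = {source}          # most recently visited (MRV) nodes
    for hop in range(1, k + 1):
        # Expand every MRV node; in Spark this expansion is what runs in parallel.
        neighbours = {n for artist in mrv for n in adjacency.get(artist, [])}
        # Only neighbours that were not visited before become the new MRV set.
        mrv = neighbours.difference(distance)
        if not mrv:
            break
        for artist in mrv:
            distance[artist] = hop
    del distance[source]  # the source artist is not its own recommendation
    return distance


# Toy adjacency list (artist -> similar_artists), made up for illustration.
adjacency = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}
result = k_hop_similar(adjacency, "A", 2)
print(result)
```

Keeping the visited map separate from the MRV set means each artist is assigned a distance only once, at the first hop that reaches it.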

Implementation

Use a local variable to record the MRV nodes and broadcast it to the executors.
Use an RDD to record how nodes are visited, with each record of the form (artist, distance).
For the BFS part, v1 runs in about 6 s, while v2 runs in about 1 s on 2 GB of data.
However, when exporting the results to local files, v1 is much faster.
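
To make the (artist, distance) record layout more concrete, here is a minimal sketch in plain Python that imitates a map step and a reduce step with lambda functions. It is not PySpark code and not the project's v1 or v2. The name bfs_records, the use of infinity for unvisited artists, and the toy graph are assumptions made for illustration.

```python
from functools import reduce
from itertools import chain

INF = float("inf")  # assumed marker for "not visited yet"


def bfs_records(adjacency, source, k):
    # One record per artist, (artist, distance), mirroring the record layout above.
    records = [(a, 0 if a == source else INF) for a in adjacency]
    mrv = {source}  # driver-side variable; in Spark it would be broadcast
    for hop in range(1, k + 1):
        # Mapper: each MRV artist emits (neighbour, hop) for its neighbours.
        mapper = lambda rec: (
            [(n, hop) for n in adjacency.get(rec[0], [])] if rec[0] in mrv else []
        )
        emitted = list(chain.from_iterable(map(mapper, records)))
        # Reducer: keep the smallest distance seen for each artist.
        reducer = lambda acc, rec: {**acc, rec[0]: min(acc.get(rec[0], INF), rec[1])}
        merged = reduce(reducer, chain(records, emitted), {})
        # Artists whose distance equals the current hop were reached just now.
        mrv = {a for a, d in merged.items() if d == hop}
        records = list(merged.items())
        if not mrv:
            break
    return dict(records)


# Toy adjacency list (artist -> similar_artists), made up for illustration.
adjacency = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}
distances = bfs_records(adjacency, "A", 2)
print(distances)
```

In a Spark version the mapper would likely become a flatMap and the reducer a reduceByKey with min, but that mapping is an assumption about how the project was written.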

Diverse Recommendation

Use PageRank to analyze the sub-graph around one artist.
PageRank is a link analysis algorithm that measures the relative importance of each node within a set.
Markov chain model: a random walk model used to detect potential interests.
Implement the graph visualization with NetworkX and show potentially related artists to the user.
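
For readers unfamiliar with PageRank, the following is a minimal power-iteration sketch in plain Python. It is not the project's implementation and does not use NetworkX. The damping factor of 0.85, the fixed iteration count, and the toy sub-graph are assumptions chosen for illustration, and every neighbour is assumed to also appear as a key in the graph.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {node: [outgoing neighbours]}."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes (no out-links) is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        # Each node passes its rank in equal shares along its out-links.
        for v in nodes:
            for w in graph[v]:
                new_rank[w] += damping * rank[v] / len(graph[v])
        rank = new_rank
    return rank


# Toy sub-graph around one artist (artist -> similar artists), made up for illustration.
sub_graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(sub_graph)
for artist in sorted(ranks, key=ranks.get, reverse=True):
    print(artist, round(ranks[artist], 3))
```

The scores can be read as the long-run visit frequencies of a random walk that follows a similarity link with probability 0.85 and otherwise jumps to a random artist, which is the Markov chain view mentioned above.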

Conclusion

BFS in Spark is much faster than BFS in MapReduce.
Spark takes full advantage of memory in BFS by keeping the working data in memory across iterations.
From the k-hop graph, we can generally get the relationship between similar
artists.
The system can recommend the songs from those similar artists to the user.

More related experience with big data tools is shown in Methods_and_Tools_for_Big_Data
