extrapolating similar/filtered data
One of the biggest problems we face with Audiomatch is data purity. Many songs are labeled with different variations of the same artist name ("The Beatles" vs. "Beatles"). Recently I turned on case sensitivity for artists, which exacerbates the problem significantly (to Audiomatch, "oasis," "Oasis," and "OASIS" are all different artists).
When we first launched Audiomatch, Neeraj wrote a filtering algorithm that essentially looked at the long tail of obscure artist names (we assume the most common occurrence of an artist name is the "correct" version) and consolidated the ones Audiomatch thought were the same artist. This works great, but the results are fully automated, so there will be problems. Any person could tell you that BRITTANY SPEERS and Britney Spears are meant to be the same thing, but the system would have difficulty figuring that out.
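For the curious, here's a minimal sketch of what that kind of consolidation might look like (my illustration in Python, not Neeraj's actual algorithm). Note it only collapses case variants, which is exactly why misspellings like BRITTANY SPEERS slip through:

```python
from collections import Counter, defaultdict

def consolidate_artists(raw_names):
    """Map each raw artist-name variant to the most common spelling in its group."""
    groups = defaultdict(Counter)
    for name in raw_names:
        # Normalize just enough to group obvious variants together.
        key = name.casefold().strip()   # "Oasis", "oasis", "OASIS" -> "oasis"
        groups[key][name] += 1
    canonical = {}
    for variants in groups.values():
        best, _ = variants.most_common(1)[0]  # most frequent spelling wins
        for variant in variants:
            canonical[variant] = best
    return canonical

plays = ["Oasis", "oasis", "Oasis", "OASIS", "The Beatles", "Beatles"]
print(consolidate_artists(plays))
# {'Oasis': 'Oasis', 'oasis': 'Oasis', 'OASIS': 'Oasis',
#  'The Beatles': 'The Beatles', 'Beatles': 'Beatles'}  <- misspellings survive
```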
The second step we're taking to clean up the data is offering users the ability to fix their own entries (this feature isn't live on Audiomatch yet). Currently, the "My Music" page looks something like this:
The idea here is that people will be able to edit these values inline, updating the artist/album/song name via Ajax. Audiomatch will store these changes; if the artist name differs, Audiomatch will remember it. The very fact that a human changed the value will carry significant weight when Audiomatch goes back through and refilters artist names.
For example, if a song is listed as "metric - come back, baby" and the user realizes this is wrong and changes it to "Metric - Combat, Baby," Audiomatch will remember this, and whenever the former is played, the latter information will be displayed. If enough people say "metric" is really "Metric," Audiomatch can determine with confidence that metric is actually Metric.
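A rough sketch of how that voting could work (the threshold of five users is my made-up number, not anything Audiomatch has actually settled on):

```python
from collections import Counter

CONFIDENCE_THRESHOLD = 5  # hypothetical: users who must agree before we trust a fix

# (wrong value, corrected value) -> number of users who made that exact edit
corrections = Counter()

def record_correction(original, corrected):
    corrections[(original, corrected)] += 1

def display_value(value):
    """Show the community-corrected value once enough users agree on it."""
    candidates = [(fixed, n) for (raw, fixed), n in corrections.items() if raw == value]
    if not candidates:
        return value
    best, votes = max(candidates, key=lambda c: c[1])
    return best if votes >= CONFIDENCE_THRESHOLD else value

for _ in range(5):
    record_correction("metric - come back, baby", "Metric - Combat, Baby")
print(display_value("metric - come back, baby"))  # -> Metric - Combat, Baby
```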
This can be expanded even further - an interesting idea Neeraj had was to turn the plugin itself into a retagging tool: if Audiomatch knew for certain that the song playing had incorrect ID3 tags (the metadata that stores the artist/song/album), it would automagically retag the file for you. (Obviously, the confidence threshold would be a user option, and a log would exist so you could undo any changes.)
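As a sketch, the retagging step itself could be quite small. Here I'm using the mutagen library purely as an example (the plugin's actual tag-writing code is Neeraj's department, and the file name is made up); the old values get logged first so any change can be undone:

```python
from mutagen.easyid3 import EasyID3

def retag(path, artist, title, album, undo_log):
    """Rewrite a file's ID3 tags, logging the old values so the change can be undone."""
    tags = EasyID3(path)
    undo_log.append((path, {k: tags.get(k) for k in ("artist", "title", "album")}))
    tags["artist"] = artist
    tags["title"] = title
    tags["album"] = album
    tags.save()

undo_log = []
retag("come_back_baby.mp3", "Metric", "Combat, Baby",
      "Old World Underground, Where Are You Now?", undo_log)
```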
. . .
Audiomatch is, at heart, a music suggestion engine. We're still building the platform for this (the problem is that in order to extrapolate any useful automated data, we need a HUGE dataset), but eventually we want to take into account the keywords users set as their favorite artists, the playlists they create (more on this future feature later), and our hypothesis (which I'll explain shortly) to suggest musical artists to users. I'm not even talking about obvious stuff like "if you like John Mayer, listen to some Jimmy Eat World" type suggestions, but stuff that's really avant-garde. The idea is that we can look at new artists, figure out who they're similar to, and let users track new artists - not just get suggestions about older ones.
In any case, our general hypothesis is that on a large enough dataset, similar artists will be played in sequence more often than dissimilar artists. Obviously there will be people (like me) who listen to Beethoven and then load up some Tupac Shakur, but there will be a TON more people who listen to Britney Spears and then Justin Timberlake. Neeraj has actually written an algorithm for this; I saw its rough output during the last iteration of Audiomatch and was quite impressed.
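To make the hypothesis concrete, a naive version (emphatically not Neeraj's algorithm, just my back-of-the-envelope Python) would simply count back-to-back plays across everyone's listening histories:

```python
from collections import Counter

def coplay_counts(histories):
    """Count how often each pair of artists is played back to back, across all users."""
    pairs = Counter()
    for plays in histories:              # one ordered play history per user
        for a, b in zip(plays, plays[1:]):
            if a != b:
                pairs[frozenset((a, b))] += 1
    return pairs

histories = [
    ["Britney Spears", "Justin Timberlake", "Britney Spears"],
    ["Britney Spears", "Justin Timberlake"],
    ["Beethoven", "Tupac Shakur"],       # outliers like me wash out at scale
]
counts = coplay_counts(histories)
print(counts[frozenset(("Britney Spears", "Justin Timberlake"))])  # -> 3
```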
So, that's it for now regarding Audiomatch. If you want to try it out, feel free to ask me for an invite code!