June 1, 2003
audiomatch
this weekend was devoted to audiomatch development. basically i decided to start on the filtering portion of audiomatch (seeing as how we have 200,000 songs in our database across 12,000 artists ... it seemed like a good time to start).
but if you want to hear something amusing, i drove a car for the first time today. i got my first "lesson" in driving from my dad. i learned how to park, make a three-point turn, and uhh... drive in general, i guess. man, i sound like a 15-year-old.
i just have to remember the golden rule to driving: red means go (ask yush what this means).
if you want to read some computer crap about audiomatch, read below. it's pretty technical.
first, random PHP thoughts:
man, execution time limitations suck. you realize how hard it is to write optimization scripts with php when they keep timing out cause of max_execution_time?! grrr. i tried to work around it with a meta refresh (and the Refresh http header), but it wouldn't work for some reason.
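(for posterity, the kind of workaround i was going for: a quick sketch, assuming safe mode is off so set_time_limit() actually does something, and with a made-up optimize_chunk() doing the real work.)

<?php
// lift the max_execution_time cap for this request (no-op under safe mode)
set_time_limit(0);

// ...or process the table in small batches and chain requests together,
// so no single request ever hits the limit
$offset = isset($_GET['offset']) ? (int) $_GET['offset'] : 0;
$done = optimize_chunk($offset, 500);   // hypothetical: clean 500 rows

if (!$done) {
    // bounce the browser right back here for the next chunk
    header('Refresh: 1; url=optimize.php?offset=' . ($offset + 500));
}
?>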
but i have to say that either my server has an assload of ram or something... cause the data optimizations were pretty fast (lots of eregi() calls and some word-similarity calculations)... awesome!
audiomatch technical explanation of what i did:
the problem with audiomatch is that the data recorded comes from the id3 tags in the mp3s, and most of these are faulty. what i worked on this weekend, in essence, was 'cleaning up' the artist names and 'linking' artists together (e.g. mislabelled artists get linked in the database to the right artist).
basically there were a few tiers:
tier 1 was the general 'cleaning up' of the artists: stripping non-alphanumeric characters and converting everything to lowercase for regexp simplicity ... i also threw out all entries with URLs in their artist name (hate that promotion crap) as well as one-character artist names.
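in code terms, the cleanup pass looked something like this (a sketch, not the actual audiomatch code; clean_artist() is a made-up helper):

<?php
// tier 1: normalize an artist name, or return false if it's unverifiable
function clean_artist($name) {
    $name = strtolower(trim($name));
    // toss the promo crap: anything with a URL in the artist field
    if (preg_match('/(www\.|http:|\.com|\.net)/', $name)) {
        return false;
    }
    // strip non-alphanumeric characters (keeping spaces so words survive)
    $name = preg_replace('/[^a-z0-9 ]/', '', $name);
    $name = trim(preg_replace('/ +/', ' ', $name));
    // one-character names are useless
    if (strlen($name) <= 1) {
        return false;
    }
    return $name;
}
?>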
the next level entailed comparing the refined artists against themselves. there's a mathematical way of figuring out how close two words are and outputting it as a percentage; any two artists that matched at greater than 80%, OR where one name contained the other (e.g. "britney spears" and "britney spears oops i did it again" would not match on percentage alone, but would be caught by the containment check), got flagged. to avoid an O(n^2) run on a database of 11,000, i created a quick index on the DB by the first letter of the artist name (after removing common leading words like "the" and "a"). i verified really quickly that this bucketing was pretty good, and it seemed about right: it essentially splits the 11,000-artist table into buckets roughly 1/28th that size.
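in php, the comparison boils down to similar_text() plus a containment check (again just a sketch; artists_match() and bucket_key() are made-up names):

<?php
// tier 2: flag two cleaned artist names as a likely match
function artists_match($a, $b) {
    // similar_text() fills $pct with the percentage similarity
    similar_text($a, $b, $pct);
    if ($pct > 80) {
        return true;
    }
    // catch "britney spears" vs "britney spears oops i did it again":
    // one name contained inside the other
    return (strpos($a, $b) !== false || strpos($b, $a) !== false);
}

// bucket by first letter (after dropping leading "the " / "a ") so we
// compare within buckets instead of O(n^2) over all 11,000 artists
function bucket_key($name) {
    $name = preg_replace('/^(the |a )/', '', $name);
    return substr($name, 0, 1);
}
?>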
from a total of 12,000 artists, the first tier knocked out about 1000 artists as "unverifiable." after the second tier, 2000 artists remained for 'matching.' the other 8000 are actually still ... sitting there, but they are single matches and are most likely crap (if they weren't single matches, the 2nd tier would have caught them).
so basically my automated optimization flagged 5/6 of the data as crap. at first this caused great headaches for me (HOW CAN I EVER DO DATA ANALYSIS WHEN 5/6 OF THE DATA SUCKS?!), until i realized that this crap data may take up 5/6 of the artist DB but may only represent a small percentage of the actual song data (the script checks for existing artists whenever you submit a new one, so it doesn't create extraneous data). to make this more clear, follow a bit more below.
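(roughly what that submit-time check looks like; the table and helper names here are guesses, not the real schema:)

<?php
// on submit: reuse an existing artist row instead of creating a new one
function artist_id_for($raw_name) {
    $name = clean_artist($raw_name);      // tier-1 cleanup from above
    if ($name === false) {
        return false;                     // unverifiable, skip it
    }
    $res = mysql_query("SELECT id FROM artists WHERE name = '"
                       . mysql_escape_string($name) . "'");
    if ($row = mysql_fetch_assoc($res)) {
        return $row['id'];                // seen this artist before
    }
    mysql_query("INSERT INTO artists (name) VALUES ('"
                . mysql_escape_string($name) . "')");
    return mysql_insert_id();
}
?>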
tier three involves human interaction: basically you go through and say "these are the same," "these aren't the same," "this is the right one," etc. i ran through about 1000 of the 3000 ... and then wrote a quick script to output the results directly from the db. (you can see it here. please do not reload it too often; it's dynamically generated, so it's kinda hellish on the DB.) scroll to the bottom and you'll see the total number of entries with verifiable artists that the existing optimization script leaves behind (btw, on that page, the bolded artists are the ones verified by the script as real artists, and the entries underneath each bolded one are the artists the script links to that real artist).
manually editing 1/3 of 1/6 of the artist data (so 1/18 of it) catches 1/5 of all the actual song data. my spirits were immediately buoyed. if i finish the rest, i should cover more than half the data, which would be pretty amazing despite the crap that gets sent to the database.
soooooooooooo. i still need to work a little bit more on the optimization. i may just manually edit in the existing 2000 entries to see what the final result is.
trial and error!
edit on 6/3:
i am a tardmuffin. i just added a little functionality that counts the number of times each entry shows up; this is to remove any entries that have only one or two songs (which really is nothing in a database of 200,000+ songs). i'm noticing that the ones i'm working on now are a lot more pure!
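(the counting bit is basically just a grouped query; sketch below with guessed table/column names:)

<?php
// only bother hand-verifying artists with more than a couple of songs
$res = mysql_query("SELECT artist, COUNT(*) AS num_songs
                      FROM songs
                     GROUP BY artist
                    HAVING num_songs > 2");
while ($row = mysql_fetch_assoc($res)) {
    // these entries are "pure" enough to be worth editing by hand
    print $row['artist'] . " (" . $row['num_songs'] . " songs)\n";
}
?>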
yes!