It’s been a while since I’ve posted here. The semester is in full swing, and things at the newspaper have been keeping me busy. A few weeks ago I started reading about document fingerprinting for plagiarism detection, and I’ve made some progress.

I was hoping to create an online lyrics aggregator, along with software to build a lyrics database. Basically, server software would search Google for “Better than Ezra Closer Lyrics”, visit the top pages, and use plagiarism detection algorithms to “find” the lyrics on each page. The basic assumption was that the sections of HTML containing the lyrics would be nearly identical from page to page. Once that region of each page was isolated, the data could be refined further by identifying phrases and punctuation that were inconsistent across the results.

I hate going to sites that offer incorrect, damaged lyrics. Look up the lyrics for any rap song and you’ll see the problem: every site is slightly different. Using plagiarism detection and document fingerprinting, it would be possible to find the “average” of these versions, hopefully arriving at a more accurate one.
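
To make the idea a little more concrete, here is a minimal Python sketch of the two steps described above: fingerprinting text so that near-identical passages can be matched, and voting on an “average” version. It is only an illustration under stated assumptions, not the actual software. The function names (`fingerprint`, `similarity`, `consensus`) are made up for this example, the fingerprint is a plain set of hashed word k-grams rather than any particular published algorithm, and the voting step assumes the lyric versions have already been extracted and line-aligned, which is the hard part in practice.

```python
import re
import zlib
from collections import Counter


def normalize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def fingerprint(text, k=3):
    """Return the set of hashed word k-grams (shingles) for a text.

    zlib.crc32 is used because it gives the same value on every run,
    unlike the built-in hash() for strings.
    """
    words = normalize(text)
    return {
        zlib.crc32(" ".join(words[i:i + k]).encode("utf-8"))
        for i in range(len(words) - k + 1)
    }


def similarity(a, b, k=3):
    """Jaccard similarity of two fingerprints, between 0.0 and 1.0."""
    fa, fb = fingerprint(a, k), fingerprint(b, k)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)


def consensus(versions):
    """Pick the most common variant of each line.

    Assumption: every version has the same lines in the same order,
    so line N of one version corresponds to line N of the others.
    """
    result = []
    for lines in zip(*(v.splitlines() for v in versions)):
        result.append(Counter(lines).most_common(1)[0][0])
    return "\n".join(result)
```

In this sketch, blocks of text from different pages that score a high `similarity` against each other would be the candidates for “the lyrics”, and `consensus` would then settle the spots where the sites disagree. A real version would need proper alignment (for example with `difflib` from the standard library) before any voting could happen.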