It’s been a while since I’ve posted here. The semester is in full swing, and things at the newspaper have been keeping me busy. A few weeks ago I started reading about document fingerprinting for plagiarism detection, and I’ve made some progress.I was hoping to create an online lyrics aggregator and software to build a lyrics database. Basically, server software would search google for “Better than Ezra Closer Lyrics”, visit the top pages, and use plagiarism detection algorithms to “find” the lyrics on the pages. The basic assumption was that sections of the HTML containing lyrics would be nearly identical. Taking this area of each page, the data could be further refined by identifying phrases and punctuation that was inconsistent across the results. I hate going to sites that offer incorrect, damaged lyrics. Lookup lyrics for any rap song and you’ll see the problem. Every site is slightly different. Using plagiarism and document fingerprinting, it would be possible to find the “average” of these versions – hopefully coming to a more accurate version.Since the document fingerprinting / plagiarism detection algorithms were central to the idea, I started looking into them first. An hour of Google searching turned up some interesting results. Namely, I found this paper:

 Winnowing: a Document Fingerprinting Algorithm   

“Among digital data, documents are the easiest to copy and remove any signatures or fingerprints embedded, which make the pirating the hardest to detect. Anyone can just retype a document or copy a part of it. Document fingerprinting is concerned with accurately identifying and copying, including small partial copies, within large sets of documents.” [From the Abstract]

 It does a great job of explaining the idea behind document fingerprinting, the different approaches, and their usefulness in different contexts. So be sure to check it out.I wrote several MATLAB functions to test the algorithms in the paper. After taking Image Processing last semester, I’m pretty comfortable implementing stuff like this. So here’s what it boils down to:

%@author: Ben Gotow
%@project: Document Fingerprinting
%@date: 2/28/08

%@Feel free to reproduce this code in any way.

%This function takes a character array as an argument. This implementation
%generates kgrams of a constant size (13) and then converts them to hashes.
%The hashes are in turn compressed into "windows" of (5) hashes. The
%compression approach may be changed. This version returns the lowest value
%from the hashes in the window. This method was dicussed in
%'Winnowing, a Document Fingerprinting Algorithm'.

function f = generateFingerprints(chars)

    kgram_size = 13;
    s = 1;

    [t char_count] = size(chars)
    k_count = char_count - kgram_size + 1;

    %parse into k-grams of ascii values
    kgrams = zeros(k_count,kgram_size);

    for s = 1:k_count
        kgrams(s,:) = chars(s:(s+kgram_size-1));
    end


    %convert k-grams into hashes
    b = 6;

    s = 1;
    khash = 0;
    for i = 1:kgram_size
        khash = khash + kgrams(s,i)*b^(kgram_size-i);
    end

    for s = 2:k_count
        khash = (khash - kgrams(s-1,1)*b^(kgram_size-1) )*b + kgrams(s,kgram_size);
        khashes(s) = khash;
    end

    %create windows of hashes of length W

    w_size = 5;

    %find minimums in windows. in case of repeated minimum values, choose the
    %rightmost one. Also, store the global location of the min hash.

    prev_min_position = -1;
    f_count = 0;

    for position = 1:(k_count - (w_size-1))
        w = khashes(position:(position+w_size-1));
        m = min(w);
        if (m ~= 0)
            for i = w_size:-1:1
                if (w(i) == m)

                    %we have the minimum value. is it the same as the one in the
                    %previous window? check if the position is the same

                    if (prev_min_position ~= position)
                       prev_min_position = position;

                       f_count = f_count + 1;
                       f(f_count,1) = m;
                       f(f_count,2) = position;
                    end

                    break
                end
            end
        end
    end

end