Copied and pasted code is usually bad

But it can be hard to find, especially in a large project, so we wrote a utility, CPD (the Copy/Paste Detector), to find it for us. The first version used a variant of Michael Wise's Greedy String Tiling algorithm (our variant is described here). Brian Ewins then rewrote it completely using the Burrows-Wheeler transform, or at least the first part of it.
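To give a feel for why the first part of a Burrows-Wheeler transform helps here, the sketch below sorts the suffixes of a token stream so that identical runs end up next to each other, then compares neighbours. This is only an illustration under the assumption that "the first part" means the sorting step. It is not CPD's code: the class name SuffixSortDuplicates, its methods, and the sample token stream are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only (hypothetical names); not CPD's implementation. */
class SuffixSortDuplicates {

    /**
     * Returns {firstStart, secondStart, length} for each pair of suffixes
     * that are adjacent after sorting and share at least minimumTokenCount
     * leading tokens.
     */
    static List<int[]> find(String[] tokens, int minimumTokenCount) {
        Integer[] suffixes = new Integer[tokens.length];
        for (int i = 0; i < suffixes.length; i++) {
            suffixes[i] = i;
        }
        // Sorting suffix start positions puts similar token runs side by side.
        Arrays.sort(suffixes, (a, b) -> compareSuffixes(tokens, a, b));

        List<int[]> matches = new ArrayList<>();
        for (int i = 1; i < suffixes.length; i++) {
            int a = suffixes[i - 1];
            int b = suffixes[i];
            int length = commonPrefix(tokens, a, b);
            if (length >= minimumTokenCount) {
                matches.add(new int[] {Math.min(a, b), Math.max(a, b), length});
            }
        }
        return matches;
    }

    /** Lexicographic comparison of two suffixes, token by token. */
    static int compareSuffixes(String[] tokens, int a, int b) {
        while (a < tokens.length && b < tokens.length) {
            int c = tokens[a].compareTo(tokens[b]);
            if (c != 0) {
                return c;
            }
            a++;
            b++;
        }
        // The suffix that ran out of tokens first is shorter and sorts first.
        return Integer.compare(b, a);
    }

    /** Number of leading tokens the suffixes at a and b have in common. */
    static int commonPrefix(String[] tokens, int a, int b) {
        int n = 0;
        while (a + n < tokens.length && b + n < tokens.length
                && tokens[a + n].equals(tokens[b + n])) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Made-up token stream with one repeated run of four tokens.
        String[] tokens = "a b c d x a b c d y".split(" ");
        for (int[] m : find(tokens, 4)) {
            System.out.println("tokens " + m[0] + " and " + m[1]
                    + " start a shared run of " + m[2] + " tokens");
        }
    }
}
```

The sketch only compares neighbouring suffixes and reports nested or overlapping runs separately, so a real tool would still need to merge and filter its matches. The minimumTokenCount parameter is named after the Ant attribute purely for readability.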

Here's a screenshot of CPD after running on the JDK java.lang package.

Note that CPD works with Java, C, C++, and PHP code.

If you have Java Web Start, you can run CPD by clicking here.

Here are the duplicates CPD found in the JDK 1.4 source code.

Here are the duplicates CPD found in the APACHE_2_0_BRANCH branch of Apache (just the httpd-2.0/server/ directory).

Andy Glover wrote an Ant task for CPD; here's how to use it:


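<target name="cpd">
    <!-- Register the CPD task; the PMD jar must be on Ant's classpath. -->
    <taskdef name="cpd" classname="net.sourceforge.pmd.cpd.CPDTask" />
    <!-- Report duplicated runs of at least 100 tokens to a text file. -->
    <cpd minimumTokenCount="100" outputFile="/home/tom/cpd.txt">
        <!-- Scan every Java source file under this directory. -->
        <fileset dir="/home/tom/tmp/ant">
            <include name="**/*.java"/>
        </fileset>
    </cpd>
</target>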

You can also get verbose output from this task by running Ant with the -v flag, e.g., ant -v -f mybuildfile.xml cpd.

There's also a JavaSpaces version that splits the CPD work across a farm of machines. I usually post news about it here, and the releases are here. That project is pretty much dead, though, since Brian's rewrite is fast enough to run on a single machine.

Suggestions? Comments? Post them here. Thanks!