The original version of this program was retrieved from
http://www.cs.su.oz.au/~scilect/sherlock and then ported to Plan 9
by me (see ../README). As a result of the port, this version handles
Unicode properly.

See ./diff for diff(1) output between the original and Plan 9 versions.

Manual page forthcoming...

The following are notes that were included in the source of the
original version of this program:

 * This program takes filenames given on the command line,
 * and reads those files into memory, then compares them
 * all pairwise to find those which are most similar.
 *
 * It uses a digital signature generation scheme to randomly
 * discard information, thus allowing a better match.
 * Essentially it hashes up N adjacent 'words' of input,
 * and semi-randomly throws away many of the hashed
 * values so that it becomes hard to hide the plagiarised text.

--

FUNCTIONS:

char *read_word(Biobuf *bin, int *length, char *ignore, char *punct)

 * read_word: read a 'word' from the input, ignoring leading characters
 * which are inside the 'ignore' string, and stopping if one of
 * the 'ignore' or 'punct' characters is found.
 * Uses memory allocation to avoid buffer overflow problems.

--

 * Let f1 == filesize(file1) == A+B
 * and f2 == filesize(file2) == A+C
 * where A is the similar section and B or C are dissimilar
 *
 * Similarity = 100 * A / (f1 + f2 - A)
 *            = 100 * A / (A+B + A+C - A)
 *            = 100 * A / (A+B+C)
 *
 * Thus if A==B==C==n the similarity will be 33% (one third)
 * This is desirable since we are finding the ratio of similarities
 * as a fraction of (similarities+dissimilarities).
 *
 * The other way of doing things would be to find the ratio of
 * the sum of similarities as a fraction of total file size:
 * Similarity = 100 * (A+A) / (A+B + A+C)
 * This produces higher percentages and more false matches.
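
--

EXAMPLE:

The two similarity formulas in the notes above can be compared with a
small sketch in C. This is an illustration only and is not taken from
the sherlock source: the function names similarity and similarity_alt
are hypothetical, the arguments are assumed to be sizes measured in the
same unit for both files, and integer division rounding down is an
assumption made here.

```c
#include <assert.h>

/*
 * Hypothetical helper, not part of the sherlock source.
 * f1, f2: sizes of the two files (f1 == A+B, f2 == A+C)
 * a:      size of the section the files share (A)
 * Returns 100 * A / (f1 + f2 - A), rounded down.
 */
int
similarity(long f1, long f2, long a)
{
	assert(a >= 0 && a <= f1 && a <= f2);
	if(f1 + f2 - a == 0)
		return 0;	/* two empty files: nothing to compare */
	return (int)(100 * a / (f1 + f2 - a));
}

/*
 * The alternative described in the notes:
 * 100 * (A+A) / (f1 + f2), rounded down.
 */
int
similarity_alt(long f1, long f2, long a)
{
	assert(a >= 0 && a <= f1 && a <= f2);
	if(f1 + f2 == 0)
		return 0;	/* two empty files: nothing to compare */
	return (int)(100 * 2 * a / (f1 + f2));
}
```

With A == B == C == 10, so that f1 == f2 == 20, similarity(20, 20, 10)
gives 33 and similarity_alt(20, 20, 10) gives 50, which shows how the
second formula reports a higher percentage for the same overlap.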