com.darwinsys.diff.Diff Maven / Gradle / Ivy
Ian Darwin's assorted Java stuff,
assembled as an API.
package com.darwinsys.diff;
import java.util.ArrayList;
import java.util.Hashtable;
/**
* BSD-licensed Java implementation of "An O(ND) Difference Algorithm
* and its Variations" by Eugene Myers, published in
* Algorithmica Vol. 1 No. 2, 1986, p 251.
*
* C# version written by Matthias Hertel, http://www.mathertel.de
* Matthias Hertel's version ported to Java by Ian Darwin, http://www.darwinsys.com/
* Comments below this line are from Hertel's original.
* ----------------------------------------------------
*
* There are many C, Java, and Lisp implementations publicly available, but they all seem to come
* from the same source (diffutils), which is under the (unfree) GNU General Public License
* and cannot be reused as source code for a commercial application.
* There are very old C implementations that use other (worse) algorithms.
* Microsoft also published source code of a diff tool (windiff) that uses some tree data.
* Also, a direct transfer from a C source to C# is not easy because there is a lot of pointer
* arithmetic in the typical C solutions and I need a managed solution.
* These are the reasons why I implemented the original published algorithm from scratch and
* made it available without the GNU license limitations.
* I do not need a high-performance diff tool because it is used only sometimes.
* I will do some performance tweaking when needed.
*
* The algorithm itself compares 2 arrays of numbers, so when comparing 2 text documents
* each line is converted into a (hash) number. See diffText().
*
* Some changes to the original algorithm:
* The original algorithm was described using a recursive approach and comparing zero-indexed arrays.
* Extracting sub-arrays and rejoining them is very performance- and memory-intensive, so the same
* (read-only) data arrays are passed around together with their lower and upper bounds.
* This makes the LCS and SMS functions more complicated.
* I added some code to the LCS function to get a fast response on sub-arrays that are identical,
* completely deleted or inserted.
*
* The result of a comparison is stored in 2 arrays that flag the modified (deleted or inserted)
* lines in the 2 data arrays. These bits are then analysed to produce an array of Item objects.
*
* Further possible optimizations:
* (first rule: don't do it; second: don't do it yet)
* The arrays DataA and DataB are passed as parameters, but are never changed after creation,
* so they could be members of the class to avoid the parameter overhead.
* In SMS there is a lot of boundary arithmetic in the for-D and for-k loops that could be done by
* incrementing and decrementing local variables.
* The DownVector and UpVector arrays are created and destroyed each time SMS gets called.
* It is possible to reuse them by making them members of the class.
* See TODO: hints.
*
* diff.cs: A port of the algorithm to C#
* Copyright (c) by Matthias Hertel, http://www.mathertel.de
* This work is licensed under a BSD style license. See http://www.mathertel.de/License.aspx
*
* Hertel's version Changes:
* 2002.09.20 There was a "hang" in some situations.
* Now I understand a little bit more of the SMS algorithm.
* There were overlapping boxes that were analyzed partially differently.
* One return-point is enough.
* An assertion was added in CreateDiffs when in debug mode that counts the number of equal (unmodified) lines in both arrays.
* They must be identical.
*
* 2003.02.07 Out of bounds error in the Up/Down vector arrays in some situations.
* The two vectors are now accessed using different offsets that are adjusted using the start k-line.
* A test case was added.
*
* 2006.03.05 Some documentation and a direct Diff entry point.
*
* 2006.03.08 Refactored the API to static methods on the Diff class to make usage simpler.
* 2006.03.10 using the standard Debug class for self-test now.
* compile with: csc /target:exe /out:diffTest.exe /d:DEBUG /d:TRACE /d:SELFTEST Diff.cs
* 2007.01.06 license agreement changed to a BSD style license.
* 2007.06.03 added the Optimize method.
* 2007.09.23 UpVector and DownVector optimization by Jan Stoklasa.
* 2008.05.31 Adjusted the testing code that failed because of the Optimize method (not a bug in the diff algorithm).
* 2008.10.08 Fixing a test case and adding a new test case.
*/
public class Diff {
/**details of one difference. */
public static class Item {
/**Start Line number in Data A. */
public int startA;
/**Start Line number in Data B. */
public int startB;
/** Number of lines deleted from Data A. */
public int deletedA;
/** Number of lines inserted in Data B. */
public int insertedB;
}
/**
* Shortest Middle Snake Return Data
*/
private static class SMSRD {
int x, y;
// int u, v; // 2002.09.20: no need for 2 points
}
/**
* Find the difference in 2 texts, comparing by text lines.
* @param TextA A-version of the text (usually the old one)
* @param TextB B-version of the text (usually the new one)
* @return an array of Items that describe the differences.
*/
public static Item[] diffText(String TextA, String TextB) {
return (diffText(TextA, TextB, false, false, false));
}
/**
* Find the difference in 2 text documents, comparing by text lines.
* The algorithm itself compares 2 arrays of numbers, so when comparing 2 text documents
* each line is converted into a (hash) number. This hash value is computed by storing all
* text lines in a common hashtable, so that duplicates can be found there, and generating a
* new number each time a new text line is inserted.
* @param TextA A-version of the text (usually the old one)
* @param TextB B-version of the text (usually the new one)
* @param trimSpace When set to true, all leading and trailing whitespace characters are stripped out before the comparison is done.
* @param ignoreSpace When set to true, each run of whitespace characters is converted to a single space character before the comparison is done.
* @param ignoreCase When set to true, all characters are converted to their lowercase equivalents before the comparison is done.
* @return an array of Items that describe the differences.
*/
public static Item[] diffText(String TextA, String TextB, boolean trimSpace, boolean ignoreSpace, boolean ignoreCase) {
// prepare the input-text and convert to comparable numbers.
Hashtable<String, Integer> h = new Hashtable<>(TextA.length() + TextB.length());
// The A-Version of the data (original data) to be compared.
DiffData DataA = new DiffData(DiffCodes(TextA, h, trimSpace, ignoreSpace, ignoreCase));
// The B-Version of the data (modified data) to be compared.
DiffData DataB = new DiffData(DiffCodes(TextB, h, trimSpace, ignoreSpace, ignoreCase));
h = null; // free up hashtable memory (maybe)
int MAX = DataA.Length + DataB.Length + 1;
// vector for the (0,0) to (x,y) search
int[] DownVector = new int[2 * MAX + 2];
// vector for the (u,v) to (N,M) search
int[] UpVector = new int[2 * MAX + 2];
LCS(DataA, 0, DataA.Length, DataB, 0, DataB.Length, DownVector, UpVector);
Optimize(DataA);
Optimize(DataB);
return CreateDiffs(DataA, DataB);
} // DiffText
/**
* If a sequence of modified lines starts with a line that has the same content
* as the line immediately following the changes, the difference sequence is shifted so that
* the following line, and not the starting line, is marked as modified.
* This leads to more readable diff sequences when comparing text files.
* @param data A Diff data buffer containing the identified changes.
*/
private static void Optimize(DiffData data) {
int startPos, endPos;
startPos = 0;
while (startPos < data.Length) {
while ((startPos < data.Length) && (data.modified[startPos] == false))
startPos++;
endPos = startPos;
while ((endPos < data.Length) && (data.modified[endPos] == true))
endPos++;
if ((endPos < data.Length) && (data.data[startPos] == data.data[endPos])) {
data.modified[startPos] = false;
data.modified[endPos] = true;
} else {
startPos = endPos;
} // if
} // while
} // Optimize
/**
* Find the difference in 2 arrays of integers.
* @param arrayA A-version of the numbers (usually the old one)
* @param arrayB B-version of the numbers (usually the new one)
* @return an array of Items that describe the differences.
*/
public Item[] DiffInt(int[] arrayA, int[] arrayB) {
// The A-Version of the data (original data) to be compared.
DiffData dataA = new DiffData(arrayA);
// The B-Version of the data (modified data) to be compared.
DiffData dataB = new DiffData(arrayB);
int MAX = dataA.Length + dataB.Length + 1;
// vector for the (0,0) to (x,y) search
int[] downVector = new int[2 * MAX + 2];
// vector for the (u,v) to (N,M) search
int[] upVector = new int[2 * MAX + 2];
LCS(dataA, 0, dataA.Length, dataB, 0, dataB.Length, downVector, upVector);
return CreateDiffs(dataA, dataB);
}
/**
* This function converts all text lines of the text into unique numbers for every unique text line,
* so further work can operate on simple numbers only.
* @param aText the input text
* @param h This externally initialized hashtable stores every text line seen so far, mapped to its code.
* @param trimSpace ignore leading and trailing space characters
* @param ignoreSpace treat each run of whitespace characters as a single space
* @param ignoreCase compare lines case-insensitively (lines are lowercased first)
* @return an array of integers.
*/
private static int[] DiffCodes(String aText, Hashtable<String, Integer> h, boolean trimSpace, boolean ignoreSpace, boolean ignoreCase) {
// get all codes of the text
String[] Lines;
int[] Codes;
int lastUsedCode = h.size();
Integer aCode;
String s;
// strip off all cr, only use lf as textline separator.
aText = aText.replace("\r", "");
Lines = aText.split("\n");
Codes = new int[Lines.length];
for (int i = 0; i < Lines.length; ++i) {
s = Lines[i];
if (trimSpace)
s = s.trim();
if (ignoreSpace) {
s = s.replaceAll("\\s+", " "); // TODO: optimize IF NEEDED: faster blank removal.
}
if (ignoreCase)
s = s.toLowerCase();
aCode = h.get(s);
if (aCode == null) {
lastUsedCode++;
h.put(s, lastUsedCode);
Codes[i] = lastUsedCode;
} else {
Codes[i] = aCode;
} // if
} // for
return (Codes);
} // DiffCodes
/**
* This is the algorithm to find the Shortest Middle Snake (SMS).
* @param DataA sequence A
* @param LowerA lower bound of the actual range in DataA
* @param UpperA upper bound of the actual range in DataA (exclusive)
* @param DataB sequence B
* @param LowerB lower bound of the actual range in DataB
* @param UpperB upper bound of the actual range in DataB (exclusive)
* @param DownVector a vector for the (0,0) to (x,y) search. Passed as a parameter for speed reasons.
* @param UpVector a vector for the (u,v) to (N,M) search. Passed as a parameter for speed reasons.
* @return an SMSRD record containing x,y (the u,v point is no longer returned)
*/
private static SMSRD SMS(DiffData DataA, int LowerA, int UpperA, DiffData DataB, int LowerB, int UpperB,
int[] DownVector, int[] UpVector) {
SMSRD ret = new SMSRD();
int MAX = DataA.Length + DataB.Length + 1;
int DownK = LowerA - LowerB; // the k-line to start the forward search
int UpK = UpperA - UpperB; // the k-line to start the reverse search
int Delta = (UpperA - LowerA) - (UpperB - LowerB);
boolean oddDelta = (Delta & 1) != 0;
// The vectors in the publication accept negative indexes. The vectors implemented here are 0-based
// and are accessed using a specific offset: UpOffset for UpVector and DownOffset for DownVector.
int DownOffset = MAX - DownK;
int UpOffset = MAX - UpK;
int MaxD = ((UpperA - LowerA + UpperB - LowerB) / 2) + 1;
// System.out.println(2, "SMS", String.format("Search the box: A[{0}-{1}] to B[{2}-{3}]", LowerA, UpperA, LowerB, UpperB));
// init vectors
DownVector[DownOffset + DownK + 1] = LowerA;
UpVector[UpOffset + UpK - 1] = UpperA;
for (int D = 0; D <= MaxD; D++) {
// Extend the forward path.
for (int k = DownK - D; k <= DownK + D; k += 2) {
// System.out.println(0, "SMS", "extend forward path " + k.ToString());
// find the only or better starting point
int x, y;
if (k == DownK - D) {
x = DownVector[DownOffset + k + 1]; // down
} else {
x = DownVector[DownOffset + k - 1] + 1; // a step to the right
if ((k < DownK + D) && (DownVector[DownOffset + k + 1] >= x))
x = DownVector[DownOffset + k + 1]; // down
}
y = x - k;
// find the end of the furthest reaching forward D-path in diagonal k.
while ((x < UpperA) && (y < UpperB) && (DataA.data[x] == DataB.data[y])) {
x++; y++;
}
DownVector[DownOffset + k] = x;
// overlap ?
if (oddDelta && (UpK - D < k) && (k < UpK + D)) {
if (UpVector[UpOffset + k] <= DownVector[DownOffset + k]) {
ret.x = DownVector[DownOffset + k];
ret.y = DownVector[DownOffset + k] - k;
// ret.u = UpVector[UpOffset + k]; // 2002.09.20: no need for 2 points
// ret.v = UpVector[UpOffset + k] - k;
return (ret);
} // if
} // if
} // for k
// Extend the reverse path.
for (int k = UpK - D; k <= UpK + D; k += 2) {
// System.out.println(0, "SMS", "extend reverse path " + k.ToString());
// find the only or better starting point
int x, y;
if (k == UpK + D) {
x = UpVector[UpOffset + k - 1]; // up
} else {
x = UpVector[UpOffset + k + 1] - 1; // left
if ((k > UpK - D) && (UpVector[UpOffset + k - 1] < x))
x = UpVector[UpOffset + k - 1]; // up
} // if
y = x - k;
while ((x > LowerA) && (y > LowerB) && (DataA.data[x - 1] == DataB.data[y - 1])) {
x--; y--; // diagonal
}
UpVector[UpOffset + k] = x;
// overlap ?
if (!oddDelta && (DownK - D <= k) && (k <= DownK + D)) {
if (UpVector[UpOffset + k] <= DownVector[DownOffset + k]) {
ret.x = DownVector[DownOffset + k];
ret.y = DownVector[DownOffset + k] - k;
// ret.u = UpVector[UpOffset + k]; // 2002.09.20: no need for 2 points
// ret.v = UpVector[UpOffset + k] - k;
return (ret);
} // if
} // if
} // for k
} // for D
throw new IllegalStateException("the algorithm should never come here.");
} // SMS
/**
* This is the divide-and-conquer implementation of the longest common-subsequence (LCS)
* algorithm.
* The published algorithm passes recursively parts of the A and B sequences.
* To avoid copying these arrays the lower and upper bounds are passed while the sequences stay constant.
* @param DataA sequence A
* @param LowerA lower bound of the actual range in DataA
* @param UpperA upper bound of the actual range in DataA (exclusive)
* @param DataB sequence B
* @param LowerB lower bound of the actual range in DataB
* @param UpperB upper bound of the actual range in DataB (exclusive)
* @param DownVector a vector for the (0,0) to (x,y) search. Passed as a parameter for speed reasons.
* @param UpVector a vector for the (u,v) to (N,M) search. Passed as a parameter for speed reasons.
*/
private static void LCS(DiffData DataA, int LowerA, int UpperA, DiffData DataB, int LowerB, int UpperB, int[] DownVector, int[] UpVector) {
// System.out.println(2, "LCS", String.format("Analyse the box: A[{0}-{1}] to B[{2}-{3}]", LowerA, UpperA, LowerB, UpperB));
// Fast walkthrough equal lines at the start
while (LowerA < UpperA && LowerB < UpperB && DataA.data[LowerA] == DataB.data[LowerB]) {
LowerA++; LowerB++;
}
// Fast walkthrough equal lines at the end
while (LowerA < UpperA && LowerB < UpperB && DataA.data[UpperA - 1] == DataB.data[UpperB - 1]) {
--UpperA; --UpperB;
}
if (LowerA == UpperA) {
// mark as inserted lines.
while (LowerB < UpperB)
DataB.modified[LowerB++] = true;
} else if (LowerB == UpperB) {
// mark as deleted lines.
while (LowerA < UpperA)
DataA.modified[LowerA++] = true;
} else {
// Find the middle snake and length of an optimal path for A and B
SMSRD smsrd = SMS(DataA, LowerA, UpperA, DataB, LowerB, UpperB, DownVector, UpVector);
// System.out.println(2, "MiddleSnakeData", String.format("{0},{1}", smsrd.x, smsrd.y));
// The path is from LowerX to (x,y) and (x,y) to UpperX
LCS(DataA, LowerA, smsrd.x, DataB, LowerB, smsrd.y, DownVector, UpVector);
LCS(DataA, smsrd.x, UpperA, DataB, smsrd.y, UpperB, DownVector, UpVector); // 2002.09.20: no need for 2 points
}
} // LCS()
/**
* Scan the tables of which lines are inserted and deleted,
* producing an edit script in forward order.
* @return an array of Items that describe the differences.
*/
private static Item[] CreateDiffs(DiffData DataA, DiffData DataB) {
ArrayList<Item> a = new ArrayList<>();
Item aItem;
Item[] result;
int startA, startB;
int lineA, lineB;
lineA = 0;
lineB = 0;
while (lineA < DataA.Length || lineB < DataB.Length) {
if ((lineA < DataA.Length) && (!DataA.modified[lineA])
&& (lineB < DataB.Length) && (!DataB.modified[lineB])) {
// equal lines
lineA++;
lineB++;
} else {
// maybe deleted and/or inserted lines
startA = lineA;
startB = lineB;
while (lineA < DataA.Length && (lineB >= DataB.Length || DataA.modified[lineA]))
// while (LineA < DataA.Length && DataA.modified[LineA])
lineA++;
while (lineB < DataB.Length && (lineA >= DataA.Length || DataB.modified[lineB]))
// while (LineB < DataB.Length && DataB.modified[LineB])
lineB++;
if ((startA < lineA) || (startB < lineB)) {
// store a new difference-item
aItem = new Item();
aItem.startA = startA;
aItem.startB = startB;
aItem.deletedA = lineA - startA;
aItem.insertedB = lineB - startB;
a.add(aItem);
} // if
} // if
} // while
result = a.toArray(new Item[a.size()]);
return (result);
}
/** Data on one input file being compared.
*/
static class DiffData
{
/**Number of elements (lines). */
private int Length;
/**Buffer of numbers that will be compared. */
private int[] data;
/**
* Array of booleans that flag for modified data.
* This is the result of the diff.
* A true entry means deleted in the first data (A) or inserted in the second data (B).
*/
private boolean[] modified;
/**
* Initialize the Diff-Data buffer.
* @param initData reference to the buffer
*/
protected DiffData(int[] initData) {
data = initData;
Length = initData.length;
modified = new boolean[Length + 2];
} // DiffData
} // class DiffData
} // class Diff
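
The line-to-number step described in the diffText and DiffCodes comments can be illustrated on its own. The following is a minimal stand-alone sketch, not part of com.darwinsys.diff: the class name `LineCodesSketch` and the method `codes` are hypothetical, and a `HashMap` is assumed in place of the `Hashtable` used above. It shows why both texts must share one table: equal lines in A and B then receive the same code, so the diff core only has to compare integers.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-alone sketch; not part of com.darwinsys.diff. */
class LineCodesSketch {

    /**
     * Assigns each distinct line a small integer, reusing codes already
     * present in the shared table so equal lines in both texts match.
     */
    static int[] codes(String text, Map<String, Integer> table) {
        // Drop CRs, then split on LF, mirroring the approach in DiffCodes.
        String[] lines = text.replace("\r", "").split("\n");
        int[] result = new int[lines.length];
        for (int i = 0; i < lines.length; i++) {
            Integer code = table.get(lines[i]);
            if (code == null) {
                code = table.size() + 1; // next unused code
                table.put(lines[i], code);
            }
            result[i] = code;
        }
        return result;
    }

    public static void main(String[] args) {
        // One table shared by both texts (assumption: order A, then B).
        Map<String, Integer> shared = new HashMap<>();
        int[] a = codes("alpha\nbeta\nalpha", shared);
        int[] b = codes("beta\r\ngamma", shared);
        System.out.println(Arrays.toString(a)); // [1, 2, 1]
        System.out.println(Arrays.toString(b)); // [2, 3]
    }
}
```

Here "beta" gets code 2 in both arrays, so the comparison sees it as a common element, while "gamma" receives a fresh code because it has not been seen before.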