
====
Copyright 2019 Systems Research Group, University of St Andrews:
This file is part of the module utilities.
utilities is free software: you can redistribute it and/or modify it under the terms of the GNU General Public
License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later
version.
utilities is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied
warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with utilities. If not, see
<http://www.gnu.org/licenses/>.
====
From "Scaling Up All Pairs Similarity Search"
Roberto J. Bayardo, Google, Inc.
Yiming Ma, U. California, Irvine
Ramakrishnan Srikant, Google, Inc.
Appears in Proc. of the 16th Int’l Conf. on World Wide Web, Banff, Alberta, Canada, 131-140, 2007.
(Revised May 21, 2007. This version contains some minor clarifications not present in the original.)
Naive version of all pairs:
The result is a set of document pairs, each with its similarity, for every pair whose similarity is at least the threshold t.
ALL-PAIRS( SetofVectors V, threshold t ) { // scans the dataset and incrementally builds the inverted lists
    O ← ∅ // set of (x, y, similarity) triples for the matching pairs
    Map I = I_1, I_2, ..., I_m ← ∅ // from feature index to set of (doc, weight) pairs
    for each vector x ∈ V do {
        O ← O ∪ FIND-MATCHES(x, I, t)
        for each i s.t. x[i] > 0 do {
            I_i ← I_i ∪ {(x, x[i])}
        }
    }
    return O
}
FIND-MATCHES( vector x, Map I, threshold t ) { // scans the inverted lists to accumulate scores; returns a set of doc pairs and their similarity
    A ← empty map from vector id to weight // the weight accumulation map
    M ← ∅
    for each i s.t. x[i] > 0 do // for each feature in x
        for each (y, y[i]) ∈ I_i do // for each document y with that feature
            A[y] ← A[y] + x[i] ⋅ y[i] // add this feature's contribution to the dot product of x and y (the cosine similarity for unit-length vectors)
    for each document y with non-zero weight in A do
        if A[y] ≥ t then // add the two docs and their similarity if at or above threshold
            M ← M ∪ {(x, y, A[y])}
    return M
}
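The naive procedure above can be sketched in Python as follows. This is only an illustration, not the module's implementation: the function name all_pairs_naive and the representation of each vector as a dict from feature index to weight are assumptions made for the example, and vectors are assumed to be unit-normalised so that the dot product equals the cosine similarity.

```python
def all_pairs_naive(vectors, t):
    """Return (x_id, y_id, similarity) triples with similarity >= t.

    vectors: list of dicts {feature index: positive weight} (assumed layout).
    """
    index = {}      # feature index -> list of (vector id, weight): the inverted lists
    matches = []    # plays the role of O
    for x_id, x in enumerate(vectors):
        # FIND-MATCHES: accumulate dot products against vectors indexed so far.
        acc = {}    # plays the role of A
        for i, w in x.items():
            for y_id, y_w in index.get(i, ()):
                acc[y_id] = acc.get(y_id, 0.0) + w * y_w
        matches.extend((x_id, y_id, s) for y_id, s in acc.items() if s >= t)
        # Index every feature of x only after matching, so each pair is reported once.
        for i, w in x.items():
            index.setdefault(i, []).append((x_id, w))
    return matches
```

For example, with vectors [{0: 1.0}, {0: 0.6, 1: 0.8}, {1: 1.0}] and t = 0.7, only the second and third vectors match, with similarity 0.8.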
/-----------------
ALL-PAIRS-1( SetofVectors V, threshold t ) { // scans the dataset and incrementally builds the inverted lists
    // Reorder the dimensions 1...m such that dimensions with the most non-zero entries in V appear first.
    // Denote the max. of x[i] over all x ∈ V as maxweight_i(V).
    O ← ∅ // set of (x, y, similarity) triples for the matching pairs
    Map I = I_1, I_2, ..., I_m ← ∅ // from feature index to set of (doc, weight) pairs
    for each vector x ∈ V do {
        O ← O ∪ FIND-MATCHES-1(x, I, t)
        b ← 0
        for each i s.t. x[i] > 0 in increasing order of i do {
            b ← b + maxweight_i(V) ⋅ x[i]
            if b ≥ t then {
                I_i ← I_i ∪ {(x, x[i])}
                x[i] ← 0 // the features left non-zero form x', the unindexed portion of x
            }
        }
    }
    return O
}
FIND-MATCHES-1( vector x, Map I, threshold t ) { // scans the inverted lists to accumulate scores; returns a set of doc pairs and their similarity
    A ← empty map from vector id to weight
    M ← ∅
    for each i s.t. x[i] > 0 do
        for each (y, y[i]) ∈ I_i do
            A[y] ← A[y] + x[i] ⋅ y[i]
    for each y with non-zero weight in A do {
        // Recall that y' is the unindexed portion of y
        s ← A[y] + dot(x, y')
        if s ≥ t then
            M ← M ∪ {(x, y, s)}
    }
    return M
}
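A minimal Python sketch of the partial-indexing idea in ALL-PAIRS-1, under the same assumptions as the pseudocode (non-negative weights, dot product as the similarity). The name all_pairs_1, the dict-based vector layout and the tie-breaking rule for the dimension order are choices made for this example only. Rather than zeroing entries of x in place, the sketch stores the unindexed portion x' separately.

```python
def all_pairs_1(vectors, t):
    """Partial indexing: a feature is indexed only once the bound b reaches t."""
    maxw = {}    # maxweight_i(V)
    count = {}   # number of non-zero entries per dimension
    for x in vectors:
        for i, w in x.items():
            maxw[i] = max(maxw.get(i, 0.0), w)
            count[i] = count.get(i, 0) + 1
    # Dimension order: most non-zero entries first (ties broken by feature index).
    order = sorted(count, key=lambda i: (-count[i], i))
    rank = {i: r for r, i in enumerate(order)}

    index = {}        # feature index -> list of (vector id, weight)
    unindexed = []    # unindexed[k] is x' for vector k
    matches = []
    for x_id, x in enumerate(vectors):
        # FIND-MATCHES-1: accumulate over the indexed features only.
        acc = {}
        for i, w in x.items():
            for y_id, y_w in index.get(i, ()):
                acc[y_id] = acc.get(y_id, 0.0) + w * y_w
        for y_id, a in acc.items():
            y_rest = unindexed[y_id]
            # Complete the score with the unindexed portion y'.
            s = a + sum(w * y_rest.get(i, 0.0) for i, w in x.items())
            if s >= t:
                matches.append((x_id, y_id, s))
        # Index x: features seen while b < t stay unindexed.
        b = 0.0
        rest = {}
        for i in sorted(x, key=rank.get):
            b += maxw[i] * x[i]
            if b >= t:
                index.setdefault(i, []).append((x_id, x[i]))
            else:
                rest[i] = x[i]
        unindexed.append(rest)
    return matches
```

Skipping the prefix is safe because b bounds the dot product of x' with any other vector: a vector that shares only unindexed features with x cannot reach the threshold.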
/-----------------
Figure 3a of the paper refines the All-Pairs procedure to exploit a particular dataset sort order, which further reduces the number of indexed features. Vectors are processed in decreasing order of maximum weight, so any vector x processed after a vector y has a maximum weight no greater than that of y.
ALL-PAIRS-2( V, t ) {
    // Reorder the dimensions 1...m such that dimensions with
    // the most non-zero entries in V appear first.
    maxweight_i(V) = max( x[i] over all x ∈ V )
    maxweight(x) = max( x[i] for i = 1...m )
    Sort V in decreasing order of maxweight(x)
    O ← ∅ // set of (x, y, similarity) triples for the matching pairs
    Map I = I_1, I_2, ..., I_m ← ∅
    for each vector x ∈ V do {
        O ← O ∪ FIND-MATCHES-2(x, I_1, ..., I_m, t)
        b ← 0
        for each i s.t. x[i] > 0 in increasing order of i do {
            b ← b + min( maxweight_i(V), maxweight(x) ) ⋅ x[i]
            if b ≥ t then {
                I_i ← I_i ∪ {(x, x[i])}
                x[i] ← 0
            }
        }
    }
    return O
}
FIND-MATCHES-2( x, I_1, ..., I_m, t ) {
    A ← empty map from vector id to weight
    M ← ∅
    remscore ← sum( x[i] ⋅ maxweight_i(V) for i = 1...m )
    minsize ← t / maxweight(x)
    for each i s.t. x[i] > 0 in decreasing order of i do {
        // Iteratively remove (y, w) from the front of I_i while |y| < minsize.
        for each (y, y[i]) ∈ I_i do {
            if A[y] ≠ 0 or remscore ≥ t then
                A[y] ← A[y] + x[i] ⋅ y[i]
        }
        remscore ← remscore - x[i] ⋅ maxweight_i(V)
    }
    for each y with non-zero weight in A do {
        // y' is the unindexed portion of y; |v| is the number of non-zero features of v
        if A[y] + min( |y'|, |x| ) ⋅ maxweight(x) ⋅ maxweight(y') ≥ t then {
            s ← A[y] + dot(x, y')
            if s ≥ t then
                M ← M ∪ {(x, y, s)}
        }
    }
    return M
}
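The refinements in ALL-PAIRS-2 and FIND-MATCHES-2 can be sketched in Python as below. This is an illustrative reading of the pseudocode, not a definitive implementation: the name all_pairs_2 and the dict-based vector layout are assumptions, vectors are assumed non-empty with weights in (0, 1] (for example unit-normalised), and the size filter is simplified to skipping short candidates instead of removing them from the front of the inverted lists.

```python
def all_pairs_2(vectors, t):
    maxw_dim, count = {}, {}                         # maxweight_i(V), entries per dimension
    for x in vectors:
        for i, w in x.items():
            maxw_dim[i] = max(maxw_dim.get(i, 0.0), w)
            count[i] = count.get(i, 0) + 1
    order = sorted(count, key=lambda i: (-count[i], i))
    rank = {i: r for r, i in enumerate(order)}
    maxw_vec = [max(x.values()) for x in vectors]    # maxweight(x)
    # Process vectors in decreasing order of maxweight(x).
    ids = sorted(range(len(vectors)), key=lambda k: -maxw_vec[k])

    index, unindexed, matches = {}, {}, []
    for x_id in ids:
        x = vectors[x_id]
        feats = sorted(x, key=rank.get)              # increasing dimension order
        acc = {}
        remscore = sum(x[i] * maxw_dim[i] for i in feats)
        minsize = t / maxw_vec[x_id]
        for i in reversed(feats):                    # decreasing dimension order
            for y_id, y_w in index.get(i, ()):
                if len(vectors[y_id]) < minsize:     # simplified size filter
                    continue
                # Admit a new candidate only while the remaining score can still reach t.
                if y_id in acc or remscore >= t:
                    acc[y_id] = acc.get(y_id, 0.0) + x[i] * y_w
            remscore -= x[i] * maxw_dim[i]
        for y_id, a in acc.items():
            rest = unindexed[y_id]                   # y'
            bound = min(len(rest), len(x)) * maxw_vec[x_id] * max(rest.values(), default=0.0)
            if a + bound >= t:                       # cheap upper bound before the exact dot product
                s = a + sum(w * rest.get(i, 0.0) for i, w in x.items())
                if s >= t:
                    matches.append((x_id, y_id, s))
        b, rest = 0.0, {}
        for i in feats:
            b += min(maxw_dim[i], maxw_vec[x_id]) * x[i]
            if b >= t:
                index.setdefault(i, []).append((x_id, x[i]))
            else:
                rest[i] = x[i]
        unindexed[x_id] = rest
    return matches
```

The tighter bound min(maxweight_i(V), maxweight(x)) is valid only because of the sort order: every vector processed after x has a maximum weight no greater than maxweight(x), so none of its weights can exceed that value.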