Yifei Kong

May 30, 2017

Jaccard coefficient(杰拉德距离)

the jaccard index is a simple measure of how similiar two sets are. it's simply the ratio of the size of the intersection of the sets and the size of the union of the sets.

eg.

if J(A,B) is jaccard index between sets A and B
and A = {1,2,3}, B = {2,3,4}, C = {4,5,6},
then J(A,B) = 2/4 = 0.5,
and J(A,C) = 0/6 = 0,
and J(B,C) = 1/5 = 0.2
so the most "similiar" sets are A and B and the least similiar are A and C
(note also J(A,A) = J(B,B) = J(C,C) = 1)

we have to consider what the values(jaccard coefficient) would be in real world. how are the distributed.

Jeccard coeffi is not transtive, so, for a set of n sentences, we have to calculate O(n2) times. There is a way to do it in O(n) time

map N-grams to a set of numbers, then use bit to express whether the N-gram exists in one word/sententce.

A = mat -> {ma, at} \                              / 101 \    
                    --> {ma(2), ca(1), at(0)} ->         --> &(001), |(111)
B = cat -> {ca, at} /                              \ 011 /

J(A, B) = and(A, B)/or(A, B)

let X = card(A ^ B), let U = card(A U B) let J(A, B) = X / (X + U)