What notion of similarity is TLSH preserving?

I'm having a tough time evaluating whether TLSH is appropriate for my application. I expect TLSH distance to correlate with some other semantically meaningful notion of document similarity, such as Hamming or Levenshtein distance, but some simple experiments with random strings indicate that this is not the case. Namely, I found that for pairs taken from 1000 randomly chosen strings of 200 characters each, the Pearson correlation coefficient between TLSH distances (i.e., `tlsh.diff`) and either (bitwise) Hamming or (char-wise) Levenshtein distances ranged from approximately 0.14 to 0.18. Meanwhile, the Hamming and Levenshtein distances have a Pearson correlation of approximately 0.72.

Is there a conventional distance or similarity metric that TLSH correlates with?

Edit: I forgot to remove the dependent (duplicate) distance values that come from swapping the order of arguments when calculating the correlations. Removing them reduced the TLSH correlations to near zero and the Hamming/Levenshtein correlation to approximately 0.18.
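For anyone who wants to reproduce this, here is a minimal sketch of the corrected experiment. It is not my exact script: the helper names (`hamming_bits`, `levenshtein`, `pearson`, `random_strings`, `pairwise_distances`) are made up for illustration, the alphabet is an assumption, and the TLSH part assumes the `py-tlsh` bindings are installed and expose `tlsh.hash` and `tlsh.diff`. Each unordered pair is visited once via `itertools.combinations`, so swapped-argument duplicates never enter the correlation.

```python
import itertools
import math
import random
import string

try:
    import tlsh  # optional; assumes the py-tlsh bindings are installed
except ImportError:
    tlsh = None


def hamming_bits(a: bytes, b: bytes) -> int:
    """Bitwise Hamming distance between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have equal length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Char-wise Levenshtein distance (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient; NaN if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx and vy else float("nan")


def random_strings(n: int, length: int, seed: int = 0):
    """n random alphanumeric strings (the alphabet is an assumption)."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits
    return ["".join(rng.choices(alphabet, k=length)) for _ in range(n)]


def pairwise_distances(strings):
    """Distances over unordered pairs only, so (a, b) and (b, a) are not both counted."""
    digests = [tlsh.hash(s.encode()) for s in strings] if tlsh else None
    ham, lev, tl = [], [], []
    for i, j in itertools.combinations(range(len(strings)), 2):
        ham.append(hamming_bits(strings[i].encode(), strings[j].encode()))
        lev.append(levenshtein(strings[i], strings[j]))
        if digests is not None:
            tl.append(tlsh.diff(digests[i], digests[j]))
    return ham, lev, tl


if __name__ == "__main__":
    # Kept small: the pure-Python Levenshtein is slow for 1000 strings.
    ham, lev, tl = pairwise_distances(random_strings(n=12, length=200))
    print("Hamming vs Levenshtein:", pearson(ham, lev))
    if tl:
        print("TLSH vs Hamming:", pearson(tl, ham))
        print("TLSH vs Levenshtein:", pearson(tl, lev))
```

Running this at the original scale (1000 strings) would need a faster Levenshtein implementation; the pure-Python version here is only meant to make the setup explicit.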

Edit 2: It also occurs to me that perhaps the correlation depends upon the magnitude of the distances, i.e., they are correlated for small distances but not for large. The random generation method above was biased towards large distances; I'll try to investigate this thread when I have time.
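One way to probe the small-distance regime is to build pairs at controlled distances instead of drawing both strings at random. The sketch below is only an illustration under stated assumptions: `mutate` and `graded_pairs` are hypothetical helpers, substitutions are the only edit type, and the TLSH calls again assume `py-tlsh` provides `tlsh.hash` and `tlsh.diff`. Substituting exactly `k` positions gives a char-wise Hamming distance of exactly `k` and a Levenshtein distance of at most `k`.

```python
import random
import string

try:
    import tlsh  # optional; assumes the py-tlsh bindings are installed
except ImportError:
    tlsh = None

ALPHABET = string.ascii_letters + string.digits  # assumed alphabet


def mutate(s: str, k: int, rng: random.Random) -> str:
    """Copy of s with exactly k distinct positions replaced by a different character."""
    chars = list(s)
    for pos in rng.sample(range(len(chars)), k):
        chars[pos] = rng.choice([c for c in ALPHABET if c != chars[pos]])
    return "".join(chars)


def graded_pairs(length=200, ks=(1, 2, 5, 10, 20, 50, 100), per_k=20, seed=0):
    """Yield (k, base, mutated) triples covering small and large distances."""
    rng = random.Random(seed)
    for k in ks:
        for _ in range(per_k):
            base = "".join(rng.choices(ALPHABET, k=length))
            yield k, base, mutate(base, k, rng)


if __name__ == "__main__":
    for k, a, b in graded_pairs(per_k=3):
        score = None
        if tlsh is not None:
            score = tlsh.diff(tlsh.hash(a.encode()), tlsh.hash(b.encode()))
        print(f"substitutions={k:3d}  tlsh.diff={score}")
```

Plotting `tlsh.diff` against `k` from this output should show whether the relationship holds at small distances and flattens out at large ones.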
