
What notion of similarity is TLSH preserving? #153

@a-gardner1

I'm having a tough time evaluating whether TLSH is appropriate for my application. I expected TLSH to correlate with some other semantically meaningful notion of document similarity, such as Hamming or Levenshtein distance, but some simple experiments with random strings indicate that this is not the case. Specifically, for pairs drawn from 1000 randomly chosen strings of 200 characters each, the Pearson correlation coefficient between TLSH distances (i.e., tlsh.diff) and either (bitwise) Hamming or (char-wise) Levenshtein distances ranged from approximately 0.14 to 0.18. Meanwhile, the Hamming and Levenshtein distances had a Pearson correlation of approximately 0.72.
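
For reference, here is a minimal sketch of the kind of experiment described above, not the exact script that produced these numbers. It uses only the standard library and compares bitwise Hamming with char-wise Levenshtein distance over unordered pairs. The helper names (`hamming_bits`, `levenshtein`, `pearson`, `random_strings`, `distance_correlation`) and the alphanumeric alphabet are assumptions made for illustration. A TLSH distance could presumably be plugged in the same way via `tlsh.hash` and `tlsh.diff` from the py-tlsh package, as noted in the comment.

```python
import itertools
import random
import string


def hamming_bits(a: bytes, b: bytes) -> int:
    """Bitwise Hamming distance between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have equal length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Character-wise Levenshtein distance (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def random_strings(count: int, length: int, seed: int = 0):
    """Generate `count` random alphanumeric strings (alphabet is an assumption)."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits
    return ["".join(rng.choices(alphabet, k=length)) for _ in range(count)]


def distance_correlation(strings, dist_a, dist_b) -> float:
    """Correlate two distance functions over unordered pairs of distinct strings."""
    pairs = list(itertools.combinations(strings, 2))
    return pearson([dist_a(s, t) for s, t in pairs],
                   [dist_b(s, t) for s, t in pairs])


if __name__ == "__main__":
    # Small sample so the pure-Python Levenshtein finishes quickly.
    sample = random_strings(count=20, length=200)
    r = distance_correlation(
        sample,
        lambda s, t: hamming_bits(s.encode(), t.encode()),
        levenshtein,
    )
    print(f"Hamming vs. Levenshtein Pearson r = {r:.3f}")
    # Assumption: with py-tlsh installed, a TLSH distance could be passed as
    #   lambda s, t: tlsh.diff(tlsh.hash(s.encode()), tlsh.hash(t.encode()))
```

Using `itertools.combinations` means each unordered pair is visited once and no string is compared with itself.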

Is there a conventional distance or similarity metric that TLSH correlates with?

Edit: I forgot to remove the dependent distance values that result from swapping the order of arguments when calculating the correlations. Removing them reduced the TLSH correlation to near zero and the Hamming/Levenshtein correlation to approximately 0.18.

Edit 2: It also occurs to me that perhaps the correlation depends upon the magnitude of the distances, i.e., they are correlated for small distances but not for large. The random generation method above was biased towards large distances; I'll try to investigate this thread when I have time.
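
One possible way to probe the small-distance regime is to build pairs at a controlled distance by mutating a base string, rather than drawing both strings independently. The following is only a sketch under that assumption. The names `mutate`, `char_hamming`, and `graded_pairs`, and the alphanumeric alphabet, are hypothetical. The TLSH distance for each pair would then be computed separately (e.g., with `tlsh.diff`) and compared against the known substitution count.

```python
import random
import string

ALPHABET = string.ascii_letters + string.digits  # assumed alphabet


def mutate(text: str, n_subs: int, rng: random.Random) -> str:
    """Return a copy of `text` with exactly `n_subs` positions substituted."""
    chars = list(text)
    for pos in rng.sample(range(len(chars)), n_subs):
        # Always pick a different character so the position really changes.
        chars[pos] = rng.choice([c for c in ALPHABET if c != chars[pos]])
    return "".join(chars)


def char_hamming(a: str, b: str) -> int:
    """Character-wise Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def graded_pairs(length: int, sub_counts, pairs_per_count: int, seed: int = 0):
    """Build (k, base, mutated) triples where mutated differs from base in k chars."""
    rng = random.Random(seed)
    triples = []
    for k in sub_counts:
        for _ in range(pairs_per_count):
            base = "".join(rng.choices(ALPHABET, k=length))
            triples.append((k, base, mutate(base, k, rng)))
    return triples


if __name__ == "__main__":
    for k, base, mutated in graded_pairs(200, [1, 5, 20, 100], pairs_per_count=2):
        print(k, char_hamming(base, mutated))
```

Because substitutions preserve length, the char-wise Hamming distance of each pair equals the requested substitution count, which gives an even spread of small and large distances to correlate against.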
