It’s an interesting question, OP. I wonder whether too much similarity might make a word harder to learn rather than easier, because of the potential for confusion.
There can’t be a categorical cutoff at which a word switches from similar to dissimilar. It’s not as if a distance of 3 means similar and a distance of 4 means dissimilar. But here are some starting points:
- The Damerau–Levenshtein distance is a reasonable metric for how different two strings are: it counts not only insertions, deletions and substitutions of single letters, but also transpositions of adjacent letters.
- You will presumably want to factor syllable count into your metric: two words differ more if their syllable counts differ, which means one of them has additional vowels. One way to get that effect is to make vowels count for more than consonants in the distance metric.
- OTOH, if the two words are related through morphology, e.g. derivational morphology, they belong to the same word family and all bets are off for distance metrics: the two words are related through a morphological rule, not through surface similarity.
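
To make the first two bullets concrete, here is a rough sketch of how the two ideas could be combined. Everything in it is my own assumption rather than an established method: the function names `letter_cost` and `weighted_osa`, the vowel weight of 2.0, the flat cost of 1.0 for a transposition, and the use of the letters a/e/i/o/u as a stand-in for syllable nuclei. It implements the restricted variant of Damerau–Levenshtein (often called optimal string alignment), not the unrestricted one.

```python
# Assumption: orthographic vowels approximate syllable nuclei.
VOWELS = set("aeiou")


def letter_cost(ch, vowel_weight):
    """Cost of inserting or deleting one letter (hypothetical weighting)."""
    return vowel_weight if ch.lower() in VOWELS else 1.0


def weighted_osa(a, b, vowel_weight=2.0):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance
    in which edits involving vowels cost more than edits on consonants."""
    # d[i][j] = cheapest way to turn a[:i] into b[:j]
    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = d[i - 1][0] + letter_cost(a[i - 1], vowel_weight)
    for j in range(1, len(b) + 1):
        d[0][j] = d[0][j - 1] + letter_cost(b[j - 1], vowel_weight)

    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            ca, cb = a[i - 1], b[j - 1]
            cost_a = letter_cost(ca, vowel_weight)
            cost_b = letter_cost(cb, vowel_weight)
            # Substitution is charged at the heavier of the two letters.
            sub = 0.0 if ca == cb else max(cost_a, cost_b)
            d[i][j] = min(
                d[i - 1][j] + cost_a,     # delete ca
                d[i][j - 1] + cost_b,     # insert cb
                d[i - 1][j - 1] + sub,    # match or substitute
            )
            # Transposition of two adjacent letters, flat cost 1.0.
            if i > 1 and j > 1 and ca == b[j - 2] and a[i - 2] == cb:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)
    return d[len(a)][len(b)]
```

With these assumed weights, `weighted_osa("cat", "cart")` gives 1.0 (one extra consonant) while `weighted_osa("cat", "coat")` gives 2.0 (one extra vowel), and setting `vowel_weight=1.0` falls back to the unweighted distance. None of this addresses the morphology point in the last bullet; that would need a separate check before any distance is computed.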