
GitHub - orsinium/textdistance: Compute distance between sequences. 30+ algorith...

 6 years ago
source link: https://github.com/orsinium/textdistance


TextDistance


TextDistance is a Python library for comparing the distance between two or more sequences using many algorithms.

Features:

  • 30+ algorithms
  • Pure Python implementation
  • Simple usage
  • Comparison of more than two sequences
  • Some algorithms have more than one implementation in one class.
  • Optional NumPy usage for maximum speed.

Algorithms

Edit based

| Algorithm | Class | Functions |
|---|---|---|
| Hamming | Hamming | hamming |
| MLIPNS | Mlipns | mlipns |
| Levenshtein | Levenshtein | levenshtein |
| Damerau-Levenshtein | DamerauLevenshtein | damerau_levenshtein |
| Jaro-Winkler | JaroWinkler | jaro_winkler, jaro |
| Strcmp95 | StrCmp95 | strcmp95 |
| Needleman-Wunsch | NeedlemanWunsch | needleman_wunsch |
| Gotoh | Gotoh | gotoh |
| Smith-Waterman | SmithWaterman | smith_waterman |

Token based

| Algorithm | Class | Functions |
|---|---|---|
| Jaccard index | Jaccard | jaccard |
| Sørensen–Dice coefficient | Sorensen | sorensen, sorensen_dice, dice |
| Tversky index | Tversky | tversky |
| Overlap coefficient | Overlap | overlap |
| Tanimoto distance | Tanimoto | tanimoto |
| Cosine similarity | Cosine | cosine |
| Monge-Elkan | MongeElkan | monge_elkan |
| Bag distance | Bag | bag |

Sequence based

| Algorithm | Class | Functions |
|---|---|---|
| Longest common subsequence similarity | LCSSeq | lcsseq |
| Longest common substring similarity | LCSStr | lcsstr |
| Ratcliff-Obershelp similarity | RatcliffObershelp | ratcliff_obershelp |

Compression based

Work in progress. Currently, all algorithms compare two strings as arrays of bits, not character by character.

NCD - normalized compression distance.

Functions:

  1. bz2_ncd
  2. lzma_ncd
  3. arith_ncd
  4. rle_ncd
  5. bwtrle_ncd
  6. zlib_ncd
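
To make the idea behind NCD concrete, here is a minimal sketch using the standard two-sequence formula, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. It is an illustration only, not the library's implementation: the helper names `compressed_size` and `ncd_sketch` are hypothetical, and it works on whole byte strings with `zlib` rather than on arrays of bits.

```python
import zlib


def compressed_size(data: bytes) -> int:
    # Size in bytes of the zlib-compressed input.
    return len(zlib.compress(data))


def ncd_sketch(x: bytes, y: bytes) -> float:
    # Hypothetical helper: standard two-sequence NCD formula.
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


text = b"the quick brown fox jumps over the lazy dog " * 20
noise = bytes(range(256)) * 4

# A sequence compared with itself should score lower (closer)
# than the same sequence compared with unrelated data.
print(ncd_sketch(text, text) < ncd_sketch(text, noise))  # True
```

Because real compressors add fixed overhead, the value for identical inputs is small but usually not exactly 0.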

Phonetic

| Algorithm | Class | Functions |
|---|---|---|
| MRA | MRA | mra |
| Editex | Editex | editex |

Simple

| Algorithm | Class | Functions |
|---|---|---|
| Prefix similarity | Prefix | prefix |
| Postfix similarity | Postfix | postfix |
| Length distance | Length | length |
| Identity similarity | Identity | identity |
| Matrix similarity | Matrix | matrix |

Installation

Stable:

pip install textdistance

Dev:

pip install -e git+https://github.com/orsinium/textdistance.git#egg=textdistance

Usage

All algorithms have 2 interfaces:

  1. Class with algorithm-specific params for customizing.
  2. Class instance with default params for quick and simple usage.

All algorithms have some common methods:

  1. .distance(*sequences) -- calculate distance between sequences.
  2. .similarity(*sequences) -- calculate similarity for sequences.
  3. .maximum(*sequences) -- maximum possible value for distance and similarity. For any sequence: distance + similarity == maximum.
  4. .normalized_distance(*sequences) -- normalized distance between sequences. The return value is a float between 0 and 1, where 0 means equal, and 1 totally different.
  5. .normalized_similarity(*sequences) -- normalized similarity for sequences. The return value is a float between 0 and 1, where 0 means totally different, and 1 equal.
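
The relationship between these five methods can be sketched with a toy Hamming-style class. This is a hypothetical stand-in written for illustration (the name `HammingSketch` is not part of the library, and the real implementation may differ); it only shows how `distance + similarity == maximum` and the normalized values fit together.

```python
class HammingSketch:
    """Toy stand-in for illustration; not the library's implementation."""

    def maximum(self, *sequences):
        # Largest possible value: the length of the longest sequence.
        return max(len(s) for s in sequences)

    def distance(self, *sequences):
        # Count positions where the sequences disagree;
        # positions past the end of a shorter sequence count as mismatches.
        mismatches = 0
        for i in range(self.maximum(*sequences)):
            column = {s[i] if i < len(s) else None for s in sequences}
            if len(column) > 1:
                mismatches += 1
        return mismatches

    def similarity(self, *sequences):
        # By construction, distance + similarity == maximum.
        return self.maximum(*sequences) - self.distance(*sequences)

    def normalized_distance(self, *sequences):
        maximum = self.maximum(*sequences)
        return self.distance(*sequences) / maximum if maximum else 0.0

    def normalized_similarity(self, *sequences):
        return 1.0 - self.normalized_distance(*sequences)


hamming_sketch = HammingSketch()
print(hamming_sketch.distance("test", "text"))               # 1
print(hamming_sketch.similarity("test", "text"))             # 3
print(hamming_sketch.maximum("test", "text"))                # 4
print(hamming_sketch.normalized_distance("test", "text"))    # 0.25
print(hamming_sketch.normalized_similarity("test", "text"))  # 0.75
```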

Most common init arguments:

  1. qval -- q-value for splitting sequences into q-grams. Possible values:
    • 1 (default) -- compare sequences by chars.
    • 2 or more -- transform sequences to q-grams.
    • None -- split sequences by words.
  2. as_set -- for token-based algorithms:
    • True -- t and ttt are equal.
    • False (default) -- t and ttt are different.
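
As a rough sketch of what these two arguments mean, the snippet below splits a string according to `qval` and computes a simple Jaccard-style ratio with and without `as_set`. The helpers `split_sequence` and `jaccard_sketch` are hypothetical names introduced here; the library's internals may tokenize and count differently.

```python
from collections import Counter


def split_sequence(text, qval=1):
    # Hypothetical helper mirroring the qval options described above.
    if qval is None:
        return text.split()  # split by words
    if qval == 1:
        return list(text)    # compare by chars
    # Overlapping q-grams of length qval.
    return [text[i:i + qval] for i in range(len(text) - qval + 1)]


def jaccard_sketch(a, b, qval=1, as_set=False):
    # Hypothetical helper: multiset Jaccard ratio over the tokens.
    tokens_a, tokens_b = split_sequence(a, qval), split_sequence(b, qval)
    if as_set:
        # Ignore repeats: each distinct token counts once.
        tokens_a, tokens_b = set(tokens_a), set(tokens_b)
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    intersection = sum((counts_a & counts_b).values())
    union = sum((counts_a | counts_b).values())
    return intersection / union if union else 1.0


print(split_sequence("test", qval=2))            # ['te', 'es', 'st']
print(split_sequence("hello world", qval=None))  # ['hello', 'world']
print(jaccard_sketch("t", "ttt", as_set=True))   # 1.0
print(jaccard_sketch("t", "ttt", as_set=False) < 1.0)  # True
```

With `as_set=True` the repeated `t` collapses to a single token, so the two inputs become identical; with `as_set=False` the counts differ and the ratio drops below 1.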

Example

For example, Hamming distance:

import textdistance

textdistance.hamming('test', 'text')
# 1

textdistance.hamming.distance('test', 'text')
# 1

textdistance.hamming.similarity('test', 'text')
# 3

textdistance.hamming.normalized_distance('test', 'text')
# 0.25

textdistance.hamming.normalized_similarity('test', 'text')
# 0.75

textdistance.Hamming(qval=2).distance('test', 'text')
# 2

All other algorithms have the same interface.

