Introduction
Spaced Words is a new approach to alignment-free sequence comparison. While most alignment-free algorithms compare the word composition of sequences, Spaced Words uses a pattern of match ("care") and don't-care positions. The occurrence of a spaced word in a sequence is then defined by the characters at the match positions only, while the characters at the don't-care positions are ignored (this was originally inspired by the PatternHunter algorithm for homology search in databases). Instead of comparing the frequencies of contiguous words in the input sequences, our approach compares the frequencies of the spaced words according to the pre-defined pattern. An information-theoretic distance measure is then used to define pairwise distances on the set of input sequences based on their spaced-word frequencies. The original version of our spaced-words approach was published in Boden et al. (2013). In a recent paper, we proposed an extension of this approach (Leimeister et al., 2015). Instead of using a single pattern to define spaced words, the extended approach creates a whole set of patterns, and the program then averages the distances calculated from these patterns.
Systematic test runs on real and simulated sequence sets have shown that, for phylogeny reconstruction, this multiple-spaced-words approach is far superior to the classical alignment-free approach based on contiguous word frequencies.
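The spaced-word counting and pattern averaging described above can be illustrated with a minimal Python sketch. This is not the Spaced Words source code. The function names are hypothetical, the pattern notation ('1' for a match position, '0' for a don't-care position) is an assumption, and the Jensen-Shannon divergence is used here only as one example of an information-theoretic distance; the measure used by the actual program may differ.

```python
from collections import Counter
from math import log2


def spaced_word_counts(seq, pattern):
    """Count spaced words in seq for one pattern.

    pattern is a string of '1' (match) and '0' (don't care); this
    notation is an assumption made for the example.
    """
    match_pos = [i for i, c in enumerate(pattern) if c == "1"]
    counts = Counter()
    # Slide the pattern over the sequence and keep only the characters
    # at the match positions; don't-care positions are ignored.
    for start in range(len(seq) - len(pattern) + 1):
        word = "".join(seq[start + i] for i in match_pos)
        counts[word] += 1
    return counts


def jensen_shannon(counts_a, counts_b):
    """Jensen-Shannon divergence (base 2) between two count tables.

    Stands in for "an information-theoretic distance measure"; both
    tables are assumed to be non-empty.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    jsd = 0.0
    for word in set(counts_a) | set(counts_b):
        p = counts_a[word] / total_a  # Counter returns 0 for missing words
        q = counts_b[word] / total_b
        m = (p + q) / 2
        if p > 0:
            jsd += 0.5 * p * log2(p / m)
        if q > 0:
            jsd += 0.5 * q * log2(q / m)
    return jsd


def multi_pattern_distance(seq_a, seq_b, patterns):
    """Average the per-pattern distances over a set of patterns."""
    dists = [
        jensen_shannon(spaced_word_counts(seq_a, p), spaced_word_counts(seq_b, p))
        for p in patterns
    ]
    return sum(dists) / len(dists)


if __name__ == "__main__":
    print(spaced_word_counts("ACGTACGT", "101"))
    print(multi_pattern_distance("ACGTACGT", "ACGTTCGT", ["101", "1101", "1011"]))
```

With the pattern "101", the window "ACG" yields the spaced word "AG" because the middle character falls on a don't-care position. Averaging over several patterns, as in `multi_pattern_distance`, mirrors the multiple-pattern extension in spirit only.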
Availability
- To make our new approach easily accessible to the scientific community, we have set up a web interface at the Göttingen Bioinformatics Compute Server (GOBICS); the updated version now supports our new distance function.
- In addition, the source code of our approach can be freely downloaded for protein sequences and for DNA sequences with the new distance measure here.