University of Göttingen | Faculty of Biology | Inst. of Microbiology and Genetics | Dep. of Bioinformatics | Spaced Words Submission Form

The Spaced Words approach
to alignment-free sequence comparison

Introduction

Spaced Words is a new approach to alignment-free sequence comparison. While most alignment-free algorithms compare the word-composition of sequences, Spaced Words uses a pattern of care and don't care positions. The occurrence of a spaced word in a sequence is then defined by the characters at the match positions only, while the characters at the don't care positions are ignored (this was originally inspired by the PatternHunter algorithm for homology search in databases). Instead of comparing the frequencies of contiguous words in the input sequences, our new approach compares the frequencies of the spaced words according to the pre-defined pattern. An information-theoretic distance measure is then used to define pairwise distances on the set of input sequences based on their spaced-word frequencies. The original version of our spaced-words approach was published in Boden et al.(2013).

In a recent paper, we proposed an extension of this approach (Leimeister et al., 2015). Instead of using one single pattern to define spaced words, our new approach creates a whole set of patterns, and the program then averages the distances calculated based on these patterns.

Systematic test runs on real and simulated sequence sets have shown that, for phylogeny reconstruction, this multiple-spaced-words approach is far superior to the classical alignment-free approach based on contiguous word frequencies.

Availability

To make our new approach easily accessible to the scientific community, we set up a web interface (updated version which supports our new distance function now) at Göttingen Bioinformatics Compute Server (GOBICS).
In addition, the source code of our approach can be freely downloaded for protein sequences and for DNA sequences with the new distance measurement here.

Both, the web server and the command-line version of our program return a distance matrix for the input sequences.

Usage

Sequence input

Sequences can be uploaded as a (multiple) FASTA sequence file. The size of the sequence file must not exceed 10MB and 500 sequences.
Both DNA and protein sequences are supported. DNA Example:

>Sequence1
ATGATGAGTAGT
>Sequence2
AAATTGTGGTGTGTC
>Sequence3
CGATCATCGTA

For DNA, A,T,G,C,U,N will be accepted. For protein sequences the following letters are allowed: A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y. Both inputs are case-insensitive.

Generating patterns

A set of patterns can be automatically generated by our server. Alternatively, the set of patterns can be defined by the user, or the parameters (weight, length, number of patterns) can be specified by the user, and a set of patterns is then generated by the server. The weight of a pattern is defined as the number of match positions, its length is the total length including match and don't care positions.

If patterns are automatically generated, the optimal weight and length is determined depending on the length of the input sequences. Our server then generates a set of 10 random patterns with this weight and length. If this option is used, the user only needs to upload the set of input sequences.
If the user wants to specify a set of patterns, these patterns can be uploaded as a text file. Here, each line defines one pattern where "1" stands for match position and "0" stands for a don't care position. Note that the first and the last positions in a pattern need to be match positions. A maximum of 100 patterns is allowed per program run. Example:
```
10110001
10111
1001011
11011
11100001
110011
11101
10100101
101011
11000011
```
Note that it is possible for the uploaded patterns to have different weight and length. If the user wants to specify only weight, length and number of the generated patterns, these parameters can be entered through a form on the submission page. Random patterns are then generated by our server using these parameters.

Program Output

The output of the program is a distance matrix in Phylip format. In addition, the output contains the set of patterns that was used for the program run. If the patterns were generated by our server (either with default parameters or with user-defined parameters), the parameters are also returned as part of the program output.

Example

A simple test example is available for DNA sequences here and for protein sequences including DNA here.

Alternative approach to alignment-free sequence comparison

Kmacs is another approach for alignment-free sequence comparison that we recently developed.
The Kmacs website is available here.

Contact

For comments, or if you encounter any technichal issues, please send an email at: lhahn(at)biologie.uni-goettingen.de or chris.leimeister(at)stud.uni-goettingen.de

References

Scientific publications using Spaced Words should cite:

B. Morgenstern, B. Zhu, S. Horwege, C.-A Leimeister (2015)
Estimating evolutionary distances between genomic sequences from spaced-word matches
Algorithms for Molecular Biology 10, 5.
C.-A. Leimeister, M. Boden, S. Horwege, S. Lindner, B. Morgenstern (2014)
Fast alignment-free sequence comparison using spaced-word frequencies
Bioinformatics 30, 1991-1999
S. Horwege, S. Lindner, M. Boden, K. Hatje, M. Kollmar, C.-A. Leimeister, B. Morgenstern (2014)
Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches
Nuc. Acids Research 42, W7-W11