How to use Bowtie to align a list of fasta sequence to a reference genome?
1
0
Entering edit mode
3.1 years ago

Hello, I have a fasta file containing 10 short DNA sequences (K-mers) which are significantly associated with a trait of interest. I want to map these sequences to a reference genome to identify the location and do some downstream analyses. I have used Bowtie v.1.2.2. to map the K-mers to the reference. First, I created the index for the reference genome. Then, I used the following command to get the output:

bowtie -f -a --best --strata reference_genome 1011_seq.fa > 1011_alignment

Here, I used -f to indicate the input file which is 1011_seq.fa that contains the K-mers. With this code, I get the output without any error message. The output looks like below:

Sequence_1  +   21  106906  AACGGAGACATGGGAACTAGAAGGA   IIIIIIIIIIIIIIIIIIIIIIIII   0   8:G>C,17:C>T
Sequence_2  -   5   328003  ATCCCGCTCTTAGCGCATAGCTCGT   IIIIIIIIIIIIIIIIIIIIIIIII   4   16:T>C,22:T>C
Sequence_2  +   3   3482001 ACGAGCTATGCGCTAAGAGCGGGAT   IIIIIIIIIIIIIIIIIIIIIIIII   4   9:A>G,22:A>G
Sequence_2  +   9   1055399 ACGAGCTATGCGCTAAGAGCGGGAT   IIIIIIIIIIIIIIIIIIIIIIIII   4   0:G>A,22:A>G
Sequence_2  +   3   996982  ACGAGCTATGCGCTAAGAGCGGGAT   IIIIIIIIIIIIIIIIIIIIIIIII   4   0:G>A,22:A>G
Sequence_2  -   3   1050015 ATCCCGCTCTTAGCGCATAGCTCGT   IIIIIIIIIIIIIIIIIIIIIIIII   4   9:T>C,22:T>C
Sequence_3  +   21  106907  ACGGAGACATGGGAACTAGAAGGAT   IIIIIIIIIIIIIIIIIIIIIIIII   0   7:G>C,16:C>T
Sequence_4  +   21  106901  AGGTCAACGGAGACATGGGAACTAG   IIIIIIIIIIIIIIIIIIIIIIIII   0   13:G>C,22:C>T
Sequence_5  -   21  106899  CGAGGTCAACGGAGACATGGGAACT   IIIIIIIIIIIIIIIIIIIIIIIII   0   0:C>T,9:G>C
Sequence_6  +   21  106905  CAACGGAGACATGGGAACTAGAAGG   IIIIIIIIIIIIIIIIIIIIIIIII   0   9:G>C,18:C>T
Sequence_7  -   3   3481999 TTACGAGCTATGCGCTAAGAGCGGG   IIIIIIIIIIIIIIIIIIIIIIIII   4   0:A>G,13:A>G
Sequence_7  +   5   328005  CCCGCTCTTAGCGCATAGCTCGTAA   IIIIIIIIIIIIIIIIIIIIIIIII   4   0:T>C,6:T>C

Can someone please tell me if the output is correct or not? I can recognize the strand sign, chromosome number, then the position and sequence. What I do no understand is the 6th and 7th columns. Why there is a continuous III and what does the number 0 and 4 means? Thanks a lot in advance.

Regards Anik

mapping genome alignment fasta genomics • 3.3k views
ADD COMMENT
0
Entering edit mode

If this is really important to you, it may be wise to consider using the most recent version of bowtie. I have v2.4.1 which still may not be the latest, but yours is definitely behind by several big upgrades.

ADD REPLY
1
Entering edit mode

Mensur Dlakic : These are very short input sequences so bowtie v.1.x is the appropriate program to use.

ADD REPLY
0
Entering edit mode

Oh okay. I will definitely check. Thanks for the info. I was just following a paper. Because I normally use Bowtie2.

ADD REPLY
3
Entering edit mode
3.1 years ago
GenoMax 141k

Since you did not save your alignments in SAM format that is the default bowtie output which is described in the manual

Default bowtie output

bowtie outputs one alignment per line. Each line is a collection of 8 fields separated by tabs; from left to right, the fields are:

1. Name of read that aligned.

Note that the [SAM specification] disallows whitespace in the read name. If the read name contains any whitespace characters, Bowtie 2 will truncate the name at the first whitespace character. This is similar to the behavior of other tools.

2. Reference strand aligned to, + for forward strand, - for reverse

3. Name of reference sequence where alignment occurs, or numeric ID if no name was provided

4. 0-based offset into the forward reference strand where leftmost character of the alignment occurs

5. Read sequence (reverse-complemented if orientation is -).

6. ASCII-encoded read qualities (reversed if orientation is -). The encoded quality values are on the Phred scale and the encoding is ASCII-offset by 33 (ASCII char !).

7. If -M was specified and the prescribed ceiling was exceeded for this read, this column contains the value of the ceiling, indicating that at least that many valid alignments were found in addition to the one reported.

Otherwise, this column contains the number of other instances where the same sequence aligned against the same reference characters as were aligned against in the reported alignment. This is not the number of other places the read aligns with the same number of mismatches. The number in this column is generally not a good proxy for that number (e.g., the number in this column may be ‘0’ while the number of other alignments with the same number of mismatches might be large).

8. Comma-separated list of mismatch descriptors. If there are no mismatches in the alignment, this field is empty. A single descriptor has the format offset:reference-base>read-base. The offset is expressed as a 0-based offset from the high-quality (5’) end of the read.

You may want to use -S file.sam to get SAM formatted output.

ADD COMMENT
0
Entering edit mode

Does it add any value if I output as a file.sam format since I am interested in the location of the reads?

ADD REPLY
0
Entering edit mode

If by location of the reads you simply mean the chromosome and start of the read then the default format will work. SAM format will be compatible with other downstream analysis tools like samtools.

ADD REPLY
0
Entering edit mode

Thank you for the answer.

ADD REPLY

Login before adding your answer.

Traffic: 2797 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6