Entering edit mode
8.1 years ago
thefabnab
▴
10
Hey everyone,
I'm having a bit of trouble converting a sam file I generated from a bowtie alignment. I believe bowtie is outputting the same file in a legacy format which samtools isn't recognizing. Here my bowtie alignment:
> bowtie --wrapper basic-0 NC_008253 -1 out1.fq -2 out2.fq -n 3 -X 600 -q --al aligned.map --un unmapped.map -t -l 24 --sam output.sam
and my samtools conversion from sam to bam:
>samtools view -bS output.sam > output.bam
with my sam file looking as follows:
@HD VN:1.0 SO:unsorted
@SQ SN:gi|110640213|ref|NC_008253.1| LN:4938920
@PG ID:Bowtie VN:1.1.0 CL:"bowtie --wrapper basic-0 --wrapper basic-0 NC_008253 -1 out1.fq -2 out2.fq -n 3 -X 600 -q --al aligned.map --un unmapped.map -t -l 24 --sam output.sam"
gi|110640213|ref|NC_008253.1|_97295_97764_1:0:0_5:0:0_4 77 * 0 TACGCGCTTTACAGGAGAATGGGACATGTTAGTTTGGCTGGCCGAACATATGGTCAAATATTATTCCGGC 2222222222222222222222222222222222222222222222222222222222222222222222 XM:i:0
gi|110640213|ref|NC_008253.1|_97295_97764_1:0:0_5:0:0_4 141 * 0 ACAGCGCGGAGGCAACACCCAGTGCAACGACCGACATCCAGAAATACTCCCAACGGGCAACCAGGCCCTG 2222222222222222222222222222222222222222222222222222222222222222222222 XM:i:0
gi|110640213|ref|NC_008253.1|_4776109_4776620_6:0:0_0:0:0_b 77 * TGTTATTATCGGTGTAGGATCATAGTCGGGATCAACTAAACCACTAATATCCTCGTCTGTAGGGCAAGTA 2222222222222222222222222222222222222222222222222222222222222222222222 XM:i:0
gi|110640213|ref|NC_008253.1|_4776109_4776620_6:0:0_0:0:0_b 141 * ATCAACCATTTACCCTGTCTTTATAAATCTAATTGATAACGCAATATACTGGCTTGGGAAAACAACTGGA 2222222222222222222222222222222222222222222222222222222222222222222222 XM:i:0
gi|110640213|ref|NC_008253.1|_3259478_3260036_2:0:0_5:0:0_e 77 * TGAACGACAATCTTACGCTGCCGGGCAATATTGGCTGGATGGATAATGATCCGGTTTGTGATTGTCAGGA 2222222222222222222222222222222222222222222222222222222222222222222222 XM:i:0
gi|110640213|ref|NC_008253.1|_3259478_3260036_2:0:0_5:0:0_e 141 * GTTTTCTCCAGGGCACGCATTTCCTCAACGCTTGAAAATAGCGACTCTTTGCCAAAACAGATCGGATTGT 2222222222222222222222222222222222222222222222222222222222222222222222 XM:i:0
gi|110640213|ref|NC_008253.1|_1336979_1337478_2:0:0_5:0:1_f 77 * GTCAACAAAGCGATGGTAATAAATCAGGAAAATCCTGATTCACCGTTAATAGTTATGTCAGCTGATAACG 2222222222222222222222222222222222222222222222222222222222222222222222 XM:i:0
The error message I'm getting is the following:
[W::sam_read1] parse error at line 4 [main_samview] truncated file.
I'd appreciate any help
Your SAM file does not look like a standard one (see page 4). If you have standard fastq reads as your query then the first column for each line should start with query name (which would be the fastq ID). I assume you are aligning against NC_008253 which would be the third field in a SAM record but it happens to be the first in your example. So something in not right in the way you ran the bowtie job.
Oh yeah, there's not even 11 columns, so the columns >= 11: check is failing, making it think that its reached the end since the header was ok.
To be honest, this is bowtie's fault. It shouldnt be possible, no matter what the parameters, to write out something that isnt a valid sam using the --sam flag.
That's what I was thinking as well. The reference is a fasta reference which I built with the following command:
I think the problem is you added
--wrapper basic-0
to your bowtie command (if the information in the original post is correct). Can you run the command like this?That was the original command I ran, the posted command is from the .sam file I produced but just to be sure I ran exactly what you typed again after rebuilding the reference with bowtie-build and ran the resulting .sam through samtools and got the same error again.
Can you show us a few reads from your sequence files?
What kind of trouble? Are you getting an error? It may be those pipe "|" characters in the sequence names.
Sorry, the error is:
$ head -4 out1.fq
I am going out on a limb here but that looks like simulated fastq data created by breaking E.coli genome into "reads" (if all of your data looks like this). Q-scores for each base has been set to 2 which in sanger fastq sequence format would equate to "not reliable".
So you are trying to align these simulated reads against the same genome from which they were derived?
I think it would be best to go back to the genome fasta file, edit it and change the fasta header to just
>NC_008253
. Re-make the bowtie index followed by mapping.Those fastq headers on simulated data look ugly as well (see the examples from wikipedia page for proper Illumina headers) but let us see if this works first.