I have a SAM file, created by Bowtie2, that contains a lot of reads. I wrote an R script to sort out reads according to different criteria. Then, the remaining reads are supposed to be stored in a sam file or bam file. However, I don't manage to end up with a bam file.
The original SAM file is convertable to BAM with samtools view -bS original.sam > original.bam
without problems.
The sorted SAM file that is created with my R script is not convertable with samtools view -bS sorted.sam > sorted.bam
. My command line returns:
[E::sam_parse1] unrecognized CIGAR operator
[W::sam_read1] parse error at line 10
[main_samview] truncated file.
Line 10 is where the header region in my sorted.sam file ends and the first read column is reached. The cigar is 51M, which is perfectly fine and I don't understand what seems to be the problem here. Somehow, my sorted.sam is principally different from the original.sam.
My script looks basically like this:
header<- read.delim("original.sam"
, header=FALSE
, nrows=10
, fill=TRUE
, col.names=c(1:5)
)
sam_table<-read.delim("original.sam"
, header=FALSE
, skip=10
, sep="\t"
)
#sorting steps#
coming up with a data frame called sorted
>head(sorted)
V1 V2 V3 V4 V5 V6 V7 V8 V9
1 HWI-ST999:188:C49E6ACXX:8:1101:19228:59871 99 4 13390391 44 51M = 13390490 150
2 HWI-ST999:188:C49E6ACXX:8:1101:13179:65266 163 4 13586582 44 51M = 13586642 111
3 HWI-ST999:188:C49E6ACXX:8:1101:6735:68508 163 4 14183889 44 51M = 14183959 121
4 HWI-ST999:188:C49E6ACXX:8:1101:17834:67530 99 4 10229754 44 51M = 10229840 137
5 HWI-ST999:188:C49E6ACXX:8:1101:6694:49801 163 2 17188 1 51M = 17288 151
6 HWI-ST999:188:C49E6ACXX:8:1101:3920:70040 163 2 13532015 44 51M = 13532100 136
V10
1 AAAACTAAAAATTGTTAAGGACATATTACAAGATATCTCCTCTTTTCGCTT
2 GCTGCTACTTTCAGACCCTCCTCCGTTTCAGCTTCTTCCGAATTAACCCAT
3 AACGTTGTGGCGGTTAGGAGATTTAGAGAAAAACGGTCTATATAACAAAAT
4 TTAATTTGTGTTTGATCAAAACACAAATAGAAGTTTAACACACCTTTTTTT
5 CCGAATCAACACATCCGTTCCTTGCACCTTTCCGAAGAGTTGCACCTTTCC
6 CAACTGTTGTGGCAGTGCTTATTGCCACAGTGGCTTTCGCAGCAATCTTCA
V11 V12 V13 V14 V15
1 CCCFFFFFHHHDHJIIJJJJJJJIJIJJJJJIDEHIIJJIJJJJIIIHIIH AS:i:102 XN:i:0 XM:i:0 XO:i:0
2 CCCFFFFFHHHHHJJJJJJJJJJJJGIJJJJJJJJJJIJIHJJJIJGJIIG AS:i:102 XN:i:0 XM:i:0 XO:i:0
3 CCCFFFFFHHHHHCGDGHEGEGHJEGEGEGIJJEGI;CGGHGDFHGGIIJ; AS:i:102 XN:i:0 XM:i:0 XO:i:0
4 BCCFFFFFFFFHHJJJJJJJJJJJIJJGHHJJIHHIIJJIJIJJJJJJJJI AS:i:102 XN:i:0 XM:i:0 XO:i:0
5 CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJIJJJJJIIJJJJJJJJJI AS:i:102 XS:i:102 XN:i:0 XM:i:0
6 CCCFFFFFHHHHHIJHIJJJIJJJJJIJIJIJIJJJIIIIGGFIGIIJJJI AS:i:102 XN:i:0 XM:i:0 XO:i:0
V16 V17 V18 V19 V20 V21
1 XG:i:0 NM:i:0 MD:Z:51 YS:i:102 YT:Z:CP
2 XG:i:0 NM:i:0 MD:Z:51 YS:i:102 YT:Z:CP
3 XG:i:0 NM:i:0 MD:Z:51 YS:i:102 YT:Z:CP
4 XG:i:0 NM:i:0 MD:Z:51 YS:i:102 YT:Z:CP
5 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YS:i:102 YT:Z:CP
6 XG:i:0 NM:i:0 MD:Z:51 YS:i:102 YT:Z:CP
then I save my sorted reads with write.table. first I save the header in a file, then I append the sorted reads with another write.table command.
write.table(header
, file = "sorted.sam"
, sep = "\t"
, row.names = FALSE
, col.names = FALSE
, append = TRUE
, quote= FALSE
)
write.table(sorted
, file = "sorted.sam"
, sep = "\t"
, row.names = FALSE
, col.names = FALSE
, append = TRUE
, quote= FALSE
)
The file sorted.sam looks in TextWrangler (a text editor) in principle just like the original.sam. However, I cannot create a bam file from it, albeit I am able to create a bam file from original.sam
This is what the sorted.sam looks like:
@HD VN:1.0 SO:coordinate
@SQ SN:Pt LN:154478
@SQ SN:Mt LN:366924
@SQ SN:4 LN:18585056
@SQ SN:2 LN:19698289
@SQ SN:3 LN:23459830
@SQ SN:5 LN:26975502
@SQ SN:1 LN:30427671
@PG ID:bowtie2 PN:bowtie2 VN:2.2.6 CL:/software/galaxy/dependency_dir/bowtie2/2.2.6/iuc/package_bowtie_2_2_6/0d9cd7487cc9/bin/bowtie2-align-s --wrapper basic-0 -p 1 -x /data/Reference/Cress/Bowtie2_index/AT10_Bowtie2 --very-sensitive-local -1 /software/galaxy/database/files/003/dataset_3456.dat -2 /software/galaxy/database/files/003/dataset_3457.dat
HWI-ST999:188:C49E6ACXX:8:1101:19228:59871 99 4 13390391 44 51M = 13390490 150 AAAACTAAAAATTGTTAAGGACATATTACAAGATATCTCCTCTTTTCGCTT CCCFFFFFHHHDHJIIJJJJJJJIJIJJJJJIDEHIIJJIJJJJIIIHIIH AS:i:102 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YS:i:102 YT:Z:CP
HWI-ST999:188:C49E6ACXX:8:1101:13179:65266 163 4 13586582 44 51M = 13586642 111 GCTGCTACTTTCAGACCCTCCTCCGTTTCAGCTTCTTCCGAATTAACCCAT CCCFFFFFHHHHHJJJJJJJJJJJJGIJJJJJJJJJJIJIHJJJIJGJIIG AS:i:102 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YS:i:102 YT:Z:CP
HWI-ST999:188:C49E6ACXX:8:1101:6735:68508 163 4 14183889 44 51M = 14183959 121 AACGTTGTGGCGGTTAGGAGATTTAGAGAAAAACGGTCTATATAACAAAAT CCCFFFFFHHHHHCGDGHEGEGHJEGEGEGIJJEGI;CGGHGDFHGGIIJ; AS:i:102 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YS:i:102 YT:Z:CP
HWI-ST999:188:C49E6ACXX:8:1101:17834:67530 99 4 10229754 44 51M = 10229840 137 TTAATTTGTGTTTGATCAAAACACAAATAGAAGTTTAACACACCTTTTTTT BCCFFFFFFFFHHJJJJJJJJJJJIJJGHHJJIHHIIJJIJIJJJJJJJJI AS:i:102 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YS:i:102 YT:Z:CP
HWI-ST999:188:C49E6ACXX:8:1101:6694:49801 163 2 17188 1 51M = 17288 151 CCGAATCAACACATCCGTTCCTTGCACCTTTCCGAAGAGTTGCACCTTTCC CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJIJJJJJIIJJJJJJJJJI AS:i:102 XS:i:102 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YS:i:102 YT:Z:CP
HWI-ST999:188:C49E6ACXX:8:1101:3920:70040 163 2 13532015 44 51M = 13532100 136 CAACTGTTGTGGCAGTGCTTATTGCCACAGTGGCTTTCGCAGCAATCTTCA CCCFFFFFHHHHHIJHIJJJIJJJJJIJIJIJIJJJIIIIGGFIGIIJJJI AS:i:102 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YS:i:102 YT:Z:CP
HWI-ST999:188:C49E6ACXX:8:1101:11701:30305 99 3 4624315 44 51M = 4624471 207 TAGGGAGAAATTCAAATCTTTATGATTCAATATTAACAAACCAGCCTTGAT CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJIJJJJJJJIIJJJJJJJ AS:i:102 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YS:i:102 YT:Z:CP
HWI-ST999:188:C49E6ACXX:8:1101:2991:34132 99 3 3373839 44 51M = 3374021 233 TATGAAACCCTAGAAAAGTACCAGGAAATGAAATGACCAAAATCAGATCAA CCCFFFFFHHHHHIJJJJCHHJJJJJJJJJIJJJJIIJJJJIIIJJJJJJJ AS:i:102 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YS:i:102 YT:Z:CP
HWI-ST999:188:C49E6ACXX:8:1101:4431:46255 163 2 18089013 44 51M = 18089116 154 CAGTGCCACAGTAGGCTTTTGTGGTACCACAATATCCCCACCTGCTGCAGC CCCFFFFFHGHHHIJJJJJJJHJJGHIJIJJIJJJJJJJJJJHGIIJJJJG AS:i:102 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YS:i:102 YT:Z:CP
Where could I look for problems? The Cigar (column 6) is formatted as it is in the original.sam. So this does not seem to be the actual problem.