Question: (Closed) R: Write a SAM file
1
gravatar for michaela_boell
2.0 years ago by
michaela_boell70 wrote:

I have a SAM file, created by Bowtie2, that contains a lot of reads. I wrote an R script to sort out reads according to different criteria. Then, the remaining reads are supposed to be stored in a sam file or bam file. However, I don't manage to end up with a bam file.

The original SAM file is convertable to BAM with samtools view -bS original.sam > original.bam without problems. The sorted SAM file that is created with my R script is not convertable with samtools view -bS sorted.sam > sorted.bam. My command line returns:

[E::sam_parse1] unrecognized CIGAR operator
[W::sam_read1] parse error at line 10
[main_samview] truncated file.

Line 10 is where the header region in my sorted.sam file ends and the first read column is reached. The cigar is 51M, which is perfectly fine and I don't understand what seems to be the problem here. Somehow, my sorted.sam is principally different from the original.sam.

My script looks basically like this:

header<- read.delim("original.sam"
                , header=FALSE
                , nrows=10
                , fill=TRUE
                , col.names=c(1:5)
) 

sam_table<-read.delim("original.sam"
                      , header=FALSE
                      , skip=10
                      , sep="\t"
                      )

#sorting steps#

coming up with a data frame called sorted

>head(sorted)
                                          V1  V2 V3       V4 V5   V6 V7       V8  V9
1 HWI-ST999:188:C49E6ACXX:8:1101:19228:59871  99  4 13390391 44  51M  = 13390490 150
2 HWI-ST999:188:C49E6ACXX:8:1101:13179:65266 163  4 13586582 44  51M  = 13586642 111
3 HWI-ST999:188:C49E6ACXX:8:1101:6735:68508 163  4 14183889 44  51M  = 14183959 121
4 HWI-ST999:188:C49E6ACXX:8:1101:17834:67530  99  4 10229754 44  51M  = 10229840 137
5 HWI-ST999:188:C49E6ACXX:8:1101:6694:49801 163  2    17188  1  51M  =    17288 151
6 HWI-ST999:188:C49E6ACXX:8:1101:3920:70040 163  2 13532015 44  51M  = 13532100 136
                                                   V10
1  AAAACTAAAAATTGTTAAGGACATATTACAAGATATCTCCTCTTTTCGCTT
2  GCTGCTACTTTCAGACCCTCCTCCGTTTCAGCTTCTTCCGAATTAACCCAT
3  AACGTTGTGGCGGTTAGGAGATTTAGAGAAAAACGGTCTATATAACAAAAT
4  TTAATTTGTGTTTGATCAAAACACAAATAGAAGTTTAACACACCTTTTTTT
5  CCGAATCAACACATCCGTTCCTTGCACCTTTCCGAAGAGTTGCACCTTTCC
6  CAACTGTTGTGGCAGTGCTTATTGCCACAGTGGCTTTCGCAGCAATCTTCA
                                                   V11       V12       V13     V14     V15
1  CCCFFFFFHHHDHJIIJJJJJJJIJIJJJJJIDEHIIJJIJJJJIIIHIIH  AS:i:102    XN:i:0  XM:i:0  XO:i:0
2  CCCFFFFFHHHHHJJJJJJJJJJJJGIJJJJJJJJJJIJIHJJJIJGJIIG  AS:i:102    XN:i:0  XM:i:0  XO:i:0
3  CCCFFFFFHHHHHCGDGHEGEGHJEGEGEGIJJEGI;CGGHGDFHGGIIJ;  AS:i:102    XN:i:0  XM:i:0  XO:i:0
4  BCCFFFFFFFFHHJJJJJJJJJJJIJJGHHJJIHHIIJJIJIJJJJJJJJI  AS:i:102    XN:i:0  XM:i:0  XO:i:0
5  CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJIJJJJJIIJJJJJJJJJI  AS:i:102  XS:i:102  XN:i:0  XM:i:0
6  CCCFFFFFHHHHHIJHIJJJIJJJJJIJIJIJIJJJIIIIGGFIGIIJJJI  AS:i:102    XN:i:0  XM:i:0  XO:i:0
      V16     V17      V18       V19       V20      V21
1  XG:i:0  NM:i:0  MD:Z:51  YS:i:102   YT:Z:CP         
2  XG:i:0  NM:i:0  MD:Z:51  YS:i:102   YT:Z:CP         
3  XG:i:0  NM:i:0  MD:Z:51  YS:i:102   YT:Z:CP         
4  XG:i:0  NM:i:0  MD:Z:51  YS:i:102   YT:Z:CP         
5  XO:i:0  XG:i:0   NM:i:0   MD:Z:51  YS:i:102  YT:Z:CP
6  XG:i:0  NM:i:0  MD:Z:51  YS:i:102   YT:Z:CP

then I save my sorted reads with write.table. first I save the header in a file, then I append the sorted reads with another write.table command.

write.table(header
            , file = "sorted.sam"
            , sep = "\t"
            , row.names = FALSE
            , col.names = FALSE
            , append = TRUE
            , quote= FALSE
)
write.table(sorted
            , file = "sorted.sam"
            , sep = "\t"
            , row.names = FALSE
            , col.names = FALSE
            , append = TRUE
            , quote= FALSE
)

The file sorted.sam looks in TextWrangler (a text editor) in principle just like the original.sam. However, I cannot create a bam file from it, albeit I am able to create a bam file from original.sam

This is what the sorted.sam looks like:

@HD VN:1.0  SO:coordinate       
@SQ SN:Pt   LN:154478       
@SQ SN:Mt   LN:366924       
@SQ SN:4    LN:18585056     
@SQ SN:2    LN:19698289     
@SQ SN:3    LN:23459830     
@SQ SN:5    LN:26975502     
@SQ SN:1    LN:30427671     
@PG ID:bowtie2  PN:bowtie2  VN:2.2.6    CL:/software/galaxy/dependency_dir/bowtie2/2.2.6/iuc/package_bowtie_2_2_6/0d9cd7487cc9/bin/bowtie2-align-s --wrapper basic-0 -p 1 -x /data/Reference/Cress/Bowtie2_index/AT10_Bowtie2 --very-sensitive-local -1 /software/galaxy/database/files/003/dataset_3456.dat -2 /software/galaxy/database/files/003/dataset_3457.dat
HWI-ST999:188:C49E6ACXX:8:1101:19228:59871  99  4   13390391    44   51M     =  13390490    150  AAAACTAAAAATTGTTAAGGACATATTACAAGATATCTCCTCTTTTCGCTT     CCCFFFFFHHHDHJIIJJJJJJJIJIJJJJJIDEHIIJJIJJJJIIIHIIH     AS:i:102    XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:51     YS:i:102    YT:Z:CP    
HWI-ST999:188:C49E6ACXX:8:1101:13179:65266  163 4   13586582    44   51M     =  13586642    111  GCTGCTACTTTCAGACCCTCCTCCGTTTCAGCTTCTTCCGAATTAACCCAT     CCCFFFFFHHHHHJJJJJJJJJJJJGIJJJJJJJJJJIJIHJJJIJGJIIG     AS:i:102    XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:51     YS:i:102    YT:Z:CP    
HWI-ST999:188:C49E6ACXX:8:1101:6735:68508   163 4   14183889    44   51M     =  14183959    121  AACGTTGTGGCGGTTAGGAGATTTAGAGAAAAACGGTCTATATAACAAAAT     CCCFFFFFHHHHHCGDGHEGEGHJEGEGEGIJJEGI;CGGHGDFHGGIIJ;     AS:i:102    XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:51     YS:i:102    YT:Z:CP    
HWI-ST999:188:C49E6ACXX:8:1101:17834:67530  99  4   10229754    44   51M     =  10229840    137  TTAATTTGTGTTTGATCAAAACACAAATAGAAGTTTAACACACCTTTTTTT     BCCFFFFFFFFHHJJJJJJJJJJJIJJGHHJJIHHIIJJIJIJJJJJJJJI     AS:i:102    XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:51     YS:i:102    YT:Z:CP    
HWI-ST999:188:C49E6ACXX:8:1101:6694:49801   163 2   17188   1    51M     =  17288   151  CCGAATCAACACATCCGTTCCTTGCACCTTTCCGAAGAGTTGCACCTTTCC     CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJIJJJJJIIJJJJJJJJJI     AS:i:102    XS:i:102    XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:51     YS:i:102    YT:Z:CP
HWI-ST999:188:C49E6ACXX:8:1101:3920:70040   163 2   13532015    44   51M     =  13532100    136  CAACTGTTGTGGCAGTGCTTATTGCCACAGTGGCTTTCGCAGCAATCTTCA     CCCFFFFFHHHHHIJHIJJJIJJJJJIJIJIJIJJJIIIIGGFIGIIJJJI     AS:i:102    XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:51     YS:i:102    YT:Z:CP     
HWI-ST999:188:C49E6ACXX:8:1101:11701:30305  99  3   4624315 44   51M     =  4624471 207  TAGGGAGAAATTCAAATCTTTATGATTCAATATTAACAAACCAGCCTTGAT     CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJIJJJJJJJIIJJJJJJJ     AS:i:102    XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:51     YS:i:102    YT:Z:CP    
HWI-ST999:188:C49E6ACXX:8:1101:2991:34132   99  3   3373839 44   51M     =  3374021 233  TATGAAACCCTAGAAAAGTACCAGGAAATGAAATGACCAAAATCAGATCAA     CCCFFFFFHHHHHIJJJJCHHJJJJJJJJJIJJJJIIJJJJIIIJJJJJJJ     AS:i:102    XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:51     YS:i:102    YT:Z:CP    
HWI-ST999:188:C49E6ACXX:8:1101:4431:46255   163 2   18089013    44   51M     =  18089116    154  CAGTGCCACAGTAGGCTTTTGTGGTACCACAATATCCCCACCTGCTGCAGC     CCCFFFFFHGHHHIJJJJJJJHJJGHIJIJJIJJJJJJJJJJHGIIJJJJG     AS:i:102    XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:51     YS:i:102    YT:Z:CP

Where could I look for problems? The Cigar (column 6) is formatted as it is in the original.sam. So this does not seem to be the actual problem.

sequencing sam samtools bam R • 1.5k views
ADD COMMENTlink modified 2.0 years ago • written 2.0 years ago by michaela_boell70

just curious: why R to sort a SAM ?!

ADD REPLYlink written 2.0 years ago by Pierre Lindenbaum113k
6

Some people just want to watch the world burn. - Alfred Pennyworth

ADD REPLYlink written 2.0 years ago by John12k

how would you do it? R is my "mother tongue", I would need to learn something like Python from scratch. Which I am planning to do... at some point. And is not just sorting, but rather filtering. I filter reads of a certain Qname that are flagged with the flag for "being in a pair" and then for quality criteria. Is there a way to write a SAM file without using e.g. write.table() in R ? Is there some magic to a SAM file that I did not know about, I assume it is a human readable format that can be written by any text editor.

ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by michaela_boell70
5

Please only use R for statistics and plotting, it's completely inappropriate for anything else. For sorting, samtools sort at the command line. For filtering, you're better off with directly using samtools or learning a bit of python and using pysam if the stock samtools doesn't suffice for your needs.

ADD REPLYlink written 2.0 years ago by Devon Ryan85k

How did you get 99 and 163 as flags in column 2? I don't think those are legal values.

ADD REPLYlink written 2.0 years ago by charbo2440
2

Why? For example, 99 = 1+2+32+64, which are all legal flags.

ADD REPLYlink written 2.0 years ago by michaela_boell70

How does line 10 look like? (and maybe include 9 and 11 too...)

ADD REPLYlink written 2.0 years ago by ddiez1.7k

I included it. The line 10 starts with this Qname: HWI-ST999:188:C49E6ACXX:8:1101:19228:59871. It is the first line that is actually a "read" instead of a header.

ADD REPLYlink written 2.0 years ago by michaela_boell70
1

Oh, OK sorry. I missed that. I tried your code above in a subset of a SAM file and it gave me no problems to convert it to BAM. My guess then is that the sorting/filtering is doing something that corrupts the data, but I cannot test it without the actual code- if you want to include it I could give it a try. Alternatively, if you want to do everything in R (as I tend to do myself), you could take a look at the Bioconductor project. In particular the Rsamtools package can do sorting and filtering of SAM/BAM files and many other things. This is particularly convenient if you plan to do you downstream analyses in R.

ADD REPLYlink written 2.0 years ago by ddiez1.7k

Thank you! That helps me narrow it down :) Here is one Skript (I made several following this scheme):

#create character vectors path and sam_in first!

setwd(path)

toskip<-system(paste0("head -n 200 "
                      , sam_in
                      ," | cut -f 1 | grep @")
               , intern=TRUE)
n_toskip<-length(toskip)
header<- read.delim(sam_in
                    , header=FALSE
                    , nrows=n_toskip
                    , fill=TRUE   # is not enough to overcome varying col numbers
                    , col.names=c(1:5)    # or more than 5 depending on the col numbers
)      

sam_table<-read.delim(sam_in 
                      , header=FALSE
                      , skip=n_toskip
                      , sep="\t"
)

flags<-sam_table$V2
mapped.yes<-which(sapply(flags, function(x) as.numeric(intToBits(x)))[3,] !=1)
mappedsam<-sam_table[mapped.yes,]

write.table(header
            , file = paste0("UnmappedFiltered", sam_in)
            , sep = "\t"
            , row.names = FALSE
            , col.names = FALSE
            , append = TRUE
            , quote= FALSE
)
write.table(mappedsam
            , file = paste0("UnmappedFiltered", sam_in)
            , sep = "\t"
            , row.names = FALSE
            , col.names = FALSE
            , append = TRUE
            , quote= FALSE
)
ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by michaela_boell70

One problem here is that I do not think you need to call which(), as.numeric() or sapply(). Also don't know what this [3,] is doing here. Doesn't this just work the same for what you want?: mapped.yes <- intToBits(flags) != 1. In my example SAM file with your code I get empty vector.

ADD REPLYlink written 2.0 years ago by ddiez1.7k
1

I would highly recommend you use Rsamtools for this. Or just samtools. There are several posts at this site showing how to accomplish this. Maybe this is a good one, as it shows how to put unmapped reads into a file, which seems to be one of your goals.

ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by ddiez1.7k

I know that samtools can do this, but I wanted to supply a short script for you to test. I have other filtering steps that are longer. Do you want to see them, too? I don't think that would help because they all result in the same thing: a data.frame that I want to save in sam format.

ADD REPLYlink written 2.0 years ago by michaela_boell70

I already meant to thank you for just testing whether reading and writing a sam file like this should work, this is good to know. I don't know why the script does not work for your sam file. But thank you for trying it! If you want to know: sapply loops through the flags, and for every flag e.g. 99 it applies the function intToBits. This results in a vector of 0s and 1s. If there is a 1 at the third place (this is why [3,]) then the read was unmapped. So any positions in my sam file, that denote a read that was not mapped are excluded by !=1.

ADD REPLYlink written 2.0 years ago by michaela_boell70
1

I see, I was misunderstanding completely what intToBits() was doing. Regarding your comment above, I know you were providing the code for testing. What I wanted to say is that if you use samtools or the R variant Rsamtool you will have more chances to being successful as these are well stablished tools used by many people around the world for precisely what you want to do. But that is just a suggestion!

ADD REPLYlink written 2.0 years ago by ddiez1.7k

The file must have been opened with a bad text editor that wrote spaces into my file without being asked to do so or without mentioning that it did so. So I sabotaged myself.

ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by michaela_boell70
Please log in to add an answer.
The thread is closed. No new answers may be added.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1051 users visited in the last hour