Question: How to prepare a bam file containing reads of same length?
gravatar for mmitra
4.4 years ago by
Los Angeles, United States
mmitra30 wrote:

Hi all,

  I am new to genomic analysis and to this website, so sorry in advance if this question has been answered before.

  I have aligned a fastq file to the human genome using tophat. For the downstream analysis, I need to generate a bam file containing the reads of same length. How could I generate this file from the tophat output file (accepted_reads.bam)?

  Thanks so much!


bam file rna-seq tophat • 2.1k views
ADD COMMENTlink written 4.4 years ago by mmitra30
gravatar for Ashutosh Pandey
4.4 years ago by
Ashutosh Pandey12k wrote:

I would discourage trimming reads in the bam file (post alignment). This is because it wont be as straightforward as trimming the read and the corresponding quality string, but you will also have to update the CIGAR string. I would suggest you to trim your reads to a fixed length before the alignment and realign them. That would be much easier. That being said, GATK ClipReads (see --cyclesToTrim option) may help you but i would still advise you to trim reads and realign. See below for the link to GATK:

PS: I guess this is some alternative splicing detection software that requires bam files with fix length sequences. 

ADD COMMENTlink modified 4.4 years ago • written 4.4 years ago by Ashutosh Pandey12k

Thanks a lot Ashutosh for your thoughts on this and for providing information about GATK. You are right about the alternative splicing software (rMATS). I actually did start with a fastq file where all the reads were of same length. But when I looked at the CIGAR values for bam file generated by tophat, they were not the same. Do you think this should be fine?

ADD REPLYlink written 4.4 years ago by mmitra30

If you started with reads of same length then you should be fine. CIGAR strings may be different for reads with same length. For example, a 60 nt read may be represented by a cigar string of 1) "60M" in case no indel were introduced in the alignments or pure match or mostly matches and a few mismatches OR 2) "50M1I9M" in case there was an extra nucleotide present in the read wrt to the reference genome. So dont worry about it. You should be totally fine. I am moving all comments to answer, please accept them if you think you have got your answers. 

ADD REPLYlink modified 4.4 years ago • written 4.4 years ago by Ashutosh Pandey12k

Sounds good! Thanks again for all the information.

ADD REPLYlink written 4.4 years ago by mmitra30
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 903 users visited in the last hour