How to prepare a bam file containing reads of same length?
1
0
Entering edit mode
6.2 years ago
mmitra ▴ 30

Hi all,

  I am new to genomic analysis and to this website, so sorry in advance if this question has been answered before.

  I have aligned a fastq file to the human genome using tophat. For the downstream analysis, I need to generate a bam file containing the reads of same length. How could I generate this file from the tophat output file (accepted_reads.bam)?

  Thanks so much!

 

RNA-Seq tophat Bam file • 2.7k views
ADD COMMENT
2
Entering edit mode
6.2 years ago

I would discourage trimming reads in the bam file (post alignment). This is because it wont be as straightforward as trimming the read and the corresponding quality string, but you will also have to update the CIGAR string. I would suggest you to trim your reads to a fixed length before the alignment and realign them. That would be much easier. That being said, GATK ClipReads (see --cyclesToTrim option) may help you but i would still advise you to trim reads and realign. See below for the link to GATK: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_readutils_ClipReads.php

PS: I guess this is some alternative splicing detection software that requires bam files with fix length sequences. 

ADD COMMENT
0
Entering edit mode

Thanks a lot Ashutosh for your thoughts on this and for providing information about GATK. You are right about the alternative splicing software (rMATS). I actually did start with a fastq file where all the reads were of same length. But when I looked at the CIGAR values for bam file generated by tophat, they were not the same. Do you think this should be fine?

ADD REPLY
0
Entering edit mode

If you started with reads of same length then you should be fine. CIGAR strings may be different for reads with same length. For example, a 60 nt read may be represented by a cigar string of 1) "60M" in case no indel were introduced in the alignments or pure match or mostly matches and a few mismatches OR 2) "50M1I9M" in case there was an extra nucleotide present in the read wrt to the reference genome. So dont worry about it. You should be totally fine. I am moving all comments to answer, please accept them if you think you have got your answers. 

ADD REPLY
0
Entering edit mode

Sounds good! Thanks again for all the information.

ADD REPLY

Login before adding your answer.

Traffic: 2128 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6