Question: Filtering reads aligned to specific chromosome
gravatar for signauncreal
4.6 years ago by
signauncreal0 wrote:


First of all: Im quite new to this field, so please apologize any unexact formulations or missing informations.

i want to solve the following task: In order to save time, i want to do alignments not to the whole genome, but only to chromsome 21.

My strategy so far consisted of the following steps: Align NA12878 reads to whole genome. Filter .bam file for reads where at least one read-pair mapps to chr 21 (done using awk). Filter .bam file for unmapped reads (using samtools view -f 12). Merge .bam files and to bedtools bamtofastq conversion. The problem that comes here now (and is also described online) is that bamtofastq only converts those reads to fastq which have the read-pair next to it in the .bam file.

Note here: I specifically need all reads to be retained for technical reasons later.

I tried to solve the problems by taking the reads which lack their respective read pair from the original whole genome alignment file (using picard filtersamreads includeReadList) and adding them onto the file described above. This approach reduced the number of reads lacking its read pair, but only by about one third, meaning i still loose a lot of reads.

My questions are: Shouldnt my approach solve the problem? Why are there still reads which lack their mate? Also, does anybody have an idea what i can do? is there a bamtofastq transformation which does not require every read to be paired?

Im really grateful for any input, thanks.

snp sam bam bedtools bamtofastq • 1.9k views
ADD COMMENTlink modified 12 weeks ago by Biostar ♦♦ 20 • written 4.6 years ago by signauncreal0
gravatar for Amitm
4.6 years ago by
Amitm2.0k wrote:

hi, its probably due to chimeric alignments. If you do samtools flagstat on your BAM, you would see that the last two line sate that some number of reads/ mates have chimeric algn (specificially "mate mapped to a different chr"). This is a strategy I use -

samtools view -h -F 12 my.bam

(are you sure you meant -f 12. Small caps f is required flag, which means unmapped reads would be retained. Capital F is filtering flag. So you need F) After that step, I parse with a perl script which is basically -

        @split_samline = split("\t",$samline);
        if($split_samline[6] eq '=')
                print $samline,"\n";

Col. 7 of samfile is RNEXT meaning the chr of the mate. In case of algn. where the mate is NOT on a different chr, the RNEXT value is "="

After that, I am left with bam/sam that has ONLY those reads which are algn AND their mate algnd too, and both algn on the same chr.

ADD COMMENTlink written 4.6 years ago by Amitm2.0k

Thank you for your quick answer:

First of all, isn't it true that what you described can be solved simply by something like this (which is what i did):

samtools view input.bam | awk '{if ($3==YourChromosomeNumber && $7== "=") print $0} > outfile.sam Correct me if im wrong or i am missing your point.

Furthermore, the problem is that i do NOT want to filter reads out. I want my fastq files to have all the the reads, especially those which are chimeric.

ADD REPLYlink written 4.6 years ago by signauncreal0

If i have understood correctly, of reads aligned to whole genome, you want those that aligned to chr21 only. And after this you want to get those specific reads in fastq out. One quick check is that you are using bedtools bamtofastq. It requires query name sorted BAM and if you have a coordinate sorted BAM, there would be problem. Probably you are doing alright. Just query-name sort the BAM file first.

ADD REPLYlink written 4.6 years ago by Amitm2.0k

Im really sorry if im not expressing myself accurately. It is true that i want those file that align to chr21. Those i obtain with the the commands stated above. Also i want all unmapped reads and all split reads that map to chr21 (and somewhere else). The problem is not the extraction of those reads from the bam file, but the conversion of those reads into fastq. I know about the namesorting issue and i have done that.

Essentially what i need is a bamtofastq transformation which does not require a read to have its read-pair adjacent to it in the bam file (as is the case for every split read)

ADD REPLYlink written 4.6 years ago by signauncreal0

Picard SamtoFastq maybe? It doesn't require name sorted, but coordinate-sorted data. See that helps

ADD REPLYlink written 4.6 years ago by Amitm2.0k
gravatar for Chris Miller
4.6 years ago by
Chris Miller21k
Washington University in St. Louis, MO
Chris Miller21k wrote:

Most tools don't handle this well, as you're finding out. What if you name-sort the whole bam, then just use a little perl script (or python, et al) to look through the SAM file and output:

a) your header and b) any pair of reads where at least one of them maps to chr21 (third column).

Then you're guaranteed to have both ends, and can move on to generating a fastq from your new bam in whatever way is most convenient.

ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by Chris Miller21k

You are right, which is why i am trying to do what i described above. I do not want to map reads of the whole genome on chromosome 21.

ADD REPLYlink written 4.6 years ago by signauncreal0

Ah, I see. Apologies for misunderstanding. Editing my above response with some ideas

ADD REPLYlink written 4.6 years ago by Chris Miller21k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1471 users visited in the last hour