Sam Reads Not Aligned To Chromosome
3
1
Entering edit mode
8.2 years ago

Okay, I will be a little more specific.

With the .bai file you can randomly access any position in a BAM file. The goal is to access the region that contains reads unmapped.

samtools view -f 4 input.bam will produce reads unmapped, but if its mate is mapped to a specific chromosome, these reads are "aligned" to a chromosome, i.e. they have a rname and a position.

I want to try and access those reads that do not contain a rname. Yes they will not be aligned (i.e. -f 4) but the set of reads that I want wont include those that have a rname.

FURTHER EXPLAINED

samtools idxstats input.bam

> ...

chrUn_gl000241 ? ? ?

chrUn_gl000242 ? ? ?

chrUn_gl000243 ? ? ?

chrUn_gl000244 ? ? ?

chrUn_gl000245 ? ? ?

chrUn_gl000246 ? ? ?

chrUn_gl000247 ? ? ?

chrUn_gl000248 ? ? ?

chrUn_gl000249 ? ? ?

* ? ? ? <--------------------------------------I WANT THIS

I want to access the last region (i.e. the reads that do not contain a rname).

samtools view input.bam '*' doesn't work. These are always located at the end of a BAM file, so it seems inefficient to run throught a BAM file to just grab these reads (i.e. I want the random access if possible).

Let me know if this makes sense.

THANKS!!!

(Additional comments: In a sorted bam file all the paired reads with both read and its pair unmapped will be moved to the bottom of the bam file. The user asked if there is any fast way to access those reads directly without going through the whole bam file. The samtools command "samtools view -f 12 input.bam" takes too long.)

bam sam chromosome samtools • 4.8k views
2
Entering edit mode
8.2 years ago

samtools view -f 4 input.bam will output unmapped reads.

Generally an aligner doesn't send unmapped reads to the bam file. It produces another fastq files with unmapped reads in it. So the above command will only work with bam files that have unaligned reads in them.

1
Entering edit mode

I don't think it's fair to say an aligner "generally" ignores unmapped reads. For instance, bwa doesn't do what you say.

0
Entering edit mode

You are right. i didnt mean that they ignore it. I meant that you can pretty much control it. For bowtie you can provide --un option and it will make a new set of fastq files containing unmapped reads. I was not clear if he has used BWA.

0
Entering edit mode

All reads present in the FASTQ file are present in the BAM file regardless of the alignment. I'm not 100% sure of the aligner used (think its novoalign) but I hope this information helps.

0
Entering edit mode

I updated Question if this makes more sense

0
Entering edit mode

1) Do you want reads where both reads in a pair are unmapped? 2) Do you want read where the read itself is unmapped and its paired-read can be mapped or unmapped?

0
Entering edit mode

I want to target a region in a BAM file that isn't aligned to anything. So yes, it would be the case that both reads in a pair are unmapped. I was just curious if this is considered a separate region in a BAM file for random access. when you run

samtools idxstats input.bam

it shows a distinct region

1
Entering edit mode

OK i got where is the confusion. As you want reads where both read and its pair are unmapped you thought that in a sorted bam file these reads should be at the very bottom and I think you are correct. For the other case, where one read is mapped and its pair not mapped, these reads will occur together in a sorted bam file somewhere in between depending on the coordinates of the aligned read. So if you want to filter all reads where both read and its pair are unmapped, you should use -f12 option as mentioned by "matted" . You shouldnt worry about where they exist in bam. You can get them using this command.

0
Entering edit mode

Yes, and I understand this. Just thought maybe samtools had some representation where you could randomly access this region in a BAM file (because all of these reads are located at the end so it is kind of time consuming to pull them all via samtools view -f 12)

1
Entering edit mode

I dont think so. You will have to go through the whole file it seems. If it is a big file it may take some time. Atleast I am not aware of any short cut that you can use.

0
Entering edit mode

Thanks for the help...this was my first post and I figured this might be a challenge to explain :)

0
Entering edit mode

You are always welcome. You started with a nice question. :-)

0
Entering edit mode

At least this old htslib supports such queries. Old samtools does not.

1
Entering edit mode
8.2 years ago
matted 7.5k

A read can have an rname and be unmapped, for instance bwa will give positions and chromosomes to reads that span the end of one chromosome and the beginning of the next, but mark it as unaligned. I think you should only consider the unmapped flag, regardless of what's in the rname field.

If you're asking about finding reads whose mates are also unmapped, you can do that easily by also requiring the "mate unmapped" flag (0x08):

samtools view -f 12 in.bam


If you're asking about seeking directly to the unmapped reads in a sorted bam using a bam index, I don't have a good answer. I think samtools doesn't do any indexing magic based on flags, but I could be wrong. A bad answer is that it's not that long to wait (if it's the naive linear algorithm), or you could address it earlier by filtering and splitting the alignment output as it's being created (or sorted). Or for a challenge, you could modify the samtools code yourself to achieve this behavior.

0
Entering edit mode

Its more that I have a program that is very parrallelized. i.e. it is ran on individual chromosomes and then results are combined. I need to be able to run it on the "non chromosome" and include that in the combing step. So I can randomly access each chromosome, and run my program on each. This isn't representative of the entire BAM file because of the non-aligned reads.

0
Entering edit mode

Is there a way to skip regions in a BAM file using samtools?

1
Entering edit mode
8.2 years ago
lomereiter ▴ 470

Time can be saved by starting reading from the largest position presented in the index file, which usually points to some read near the end of the last chromosome. And if the index file contains also optional fields required for samtools idxstats, the exact position of the start of unmapped reads is known.

I have just added that functionality to my tool: https://github.com/lomereiter/sambamba/releases/tag/v0.4.2