Question: Sam Reads Not Aligned To Chromosome
1
gravatar for bic.uwaterloo.ca
6.9 years ago by
bic.uwaterloo.ca10 wrote:

Okay, I will be a little more specific.

With the .bai file you can randomly access any position in a BAM file. The goal is to access the region that contains reads unmapped.

samtools view -f 4 input.bam will produce reads unmapped, but if its mate is mapped to a specific chromosome, these reads are "aligned" to a chromosome, i.e. they have a rname and a position.

I want to try and access those reads that do not contain a rname. Yes they will not be aligned (i.e. -f 4) but the set of reads that I want wont include those that have a rname.

FURTHER EXPLAINED

samtools idxstats input.bam

> ...

chrUn_gl000241 ? ? ?

chrUn_gl000242 ? ? ?

chrUn_gl000243 ? ? ?

chrUn_gl000244 ? ? ?

chrUn_gl000245 ? ? ?

chrUn_gl000246 ? ? ?

chrUn_gl000247 ? ? ?

chrUn_gl000248 ? ? ?

chrUn_gl000249 ? ? ?

* ? ? ? <--------------------------------------I WANT THIS

I want to access the last region (i.e. the reads that do not contain a rname).

samtools view input.bam '*' doesn't work. These are always located at the end of a BAM file, so it seems inefficient to run throught a BAM file to just grab these reads (i.e. I want the random access if possible).

Let me know if this makes sense.

THANKS!!!

(Additional comments: In a sorted bam file all the paired reads with both read and its pair unmapped will be moved to the bottom of the bam file. The user asked if there is any fast way to access those reads directly without going through the whole bam file. The samtools command "samtools view -f 12 input.bam" takes too long.)

bam samtools sam chromosome • 4.2k views
ADD COMMENTlink modified 6.9 years ago by lomereiter460 • written 6.9 years ago by bic.uwaterloo.ca10
2
gravatar for Ashutosh Pandey
6.9 years ago by
Philadelphia
Ashutosh Pandey12k wrote:

samtools view -f 4 input.bam will output unmapped reads.

Generally an aligner doesn't send unmapped reads to the bam file. It produces another fastq files with unmapped reads in it. So the above command will only work with bam files that have unaligned reads in them.

ADD COMMENTlink written 6.9 years ago by Ashutosh Pandey12k
1

I don't think it's fair to say an aligner "generally" ignores unmapped reads. For instance, bwa doesn't do what you say.

ADD REPLYlink written 6.9 years ago by matted7.3k

You are right. i didnt mean that they ignore it. I meant that you can pretty much control it. For bowtie you can provide --un option and it will make a new set of fastq files containing unmapped reads. I was not clear if he has used BWA.

ADD REPLYlink written 6.9 years ago by Ashutosh Pandey12k

All reads present in the FASTQ file are present in the BAM file regardless of the alignment. I'm not 100% sure of the aligner used (think its novoalign) but I hope this information helps.

ADD REPLYlink written 6.9 years ago by bic.uwaterloo.ca10

I updated Question if this makes more sense

ADD REPLYlink written 6.9 years ago by bic.uwaterloo.ca10

Sorry I didnt get your new question. Let me ask you:

1) Do you want reads where both reads in a pair are unmapped? 2) Do you want read where the read itself is unmapped and its paired-read can be mapped or unmapped?

ADD REPLYlink written 6.9 years ago by Ashutosh Pandey12k

I want to target a region in a BAM file that isn't aligned to anything. So yes, it would be the case that both reads in a pair are unmapped. I was just curious if this is considered a separate region in a BAM file for random access. when you run

samtools idxstats input.bam

it shows a distinct region

ADD REPLYlink modified 6.9 years ago • written 6.9 years ago by bic.uwaterloo.ca10
1

OK i got where is the confusion. As you want reads where both read and its pair are unmapped you thought that in a sorted bam file these reads should be at the very bottom and I think you are correct. For the other case, where one read is mapped and its pair not mapped, these reads will occur together in a sorted bam file somewhere in between depending on the coordinates of the aligned read. So if you want to filter all reads where both read and its pair are unmapped, you should use -f12 option as mentioned by "matted" . You shouldnt worry about where they exist in bam. You can get them using this command.

ADD REPLYlink written 6.9 years ago by Ashutosh Pandey12k

Yes, and I understand this. Just thought maybe samtools had some representation where you could randomly access this region in a BAM file (because all of these reads are located at the end so it is kind of time consuming to pull them all via samtools view -f 12)

ADD REPLYlink written 6.9 years ago by bic.uwaterloo.ca10
1

I dont think so. You will have to go through the whole file it seems. If it is a big file it may take some time. Atleast I am not aware of any short cut that you can use.

ADD REPLYlink written 6.9 years ago by Ashutosh Pandey12k

Thanks for the help...this was my first post and I figured this might be a challenge to explain :)

ADD REPLYlink written 6.9 years ago by bic.uwaterloo.ca10

You are always welcome. You started with a nice question. :-)

ADD REPLYlink written 6.9 years ago by Ashutosh Pandey12k

At least this old htslib supports such queries. Old samtools does not.

ADD REPLYlink modified 6.9 years ago • written 6.9 years ago by lh332k
1
gravatar for matted
6.9 years ago by
matted7.3k
Boston, United States
matted7.3k wrote:

I still am not sure what you're asking. Here are three possible answers to what your question might be asking:

A read can have an rname and be unmapped, for instance bwa will give positions and chromosomes to reads that span the end of one chromosome and the beginning of the next, but mark it as unaligned. I think you should only consider the unmapped flag, regardless of what's in the rname field.

If you're asking about finding reads whose mates are also unmapped, you can do that easily by also requiring the "mate unmapped" flag (0x08):

samtools view -f 12 in.bam

If you're asking about seeking directly to the unmapped reads in a sorted bam using a bam index, I don't have a good answer. I think samtools doesn't do any indexing magic based on flags, but I could be wrong. A bad answer is that it's not that long to wait (if it's the naive linear algorithm), or you could address it earlier by filtering and splitting the alignment output as it's being created (or sorted). Or for a challenge, you could modify the samtools code yourself to achieve this behavior.

ADD COMMENTlink modified 6.9 years ago • written 6.9 years ago by matted7.3k

Its more that I have a program that is very parrallelized. i.e. it is ran on individual chromosomes and then results are combined. I need to be able to run it on the "non chromosome" and include that in the combing step. So I can randomly access each chromosome, and run my program on each. This isn't representative of the entire BAM file because of the non-aligned reads.

ADD REPLYlink modified 6.9 years ago • written 6.9 years ago by bic.uwaterloo.ca10

Is there a way to skip regions in a BAM file using samtools?

ADD REPLYlink written 6.9 years ago by bic.uwaterloo.ca10
1
gravatar for lomereiter
6.9 years ago by
lomereiter460
Russian Federation
lomereiter460 wrote:

Time can be saved by starting reading from the largest position presented in the index file, which usually points to some read near the end of the last chromosome. And if the index file contains also optional fields required for samtools idxstats, the exact position of the start of unmapped reads is known.

I have just added that functionality to my tool: https://github.com/lomereiter/sambamba/releases/tag/v0.4.2

ADD COMMENTlink written 6.9 years ago by lomereiter460
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1862 users visited in the last hour