Ngs: Filtering Out Reads That Have A Hit...
4
2
Entering edit mode
11.9 years ago
Nicojo ★ 1.1k

I am going to be working with a lot of Illumina next generation sequencing data, but I will be doing something I suspect is rather unusual: Within the data, there will be a vast majority of sequences that will map fairly well to my reference genome. However, those are not the sequences I am interested in: I want to retrieve all the reads that do not map well to my reference genome.

Update: To clarify, within my dataset I have a vast majority of the reads that will match to my reference genome. However, a small portion of the dataset does (should) not match at all (i.e. I also exclude SNPs, short indels, etc.). I want to reduce my dataset to those reads that do not correspond to the conserved parts of the genome in order to do de novo assembly on them, without the overhead of assembling the whole genome from scratch. End of Update.

  • Is this something commonly done?
  • Are the usual tools capable of doing that for me or will I need to implement something myself?
  • Any suggestions on the best method(s) to proceed?
next-gen • 2.7k views
ADD COMMENT
1
Entering edit mode

Just to be sure, when you bitwise AND with "4", you'll get those reads that are unmapped. If you are interested in reads that do not map WELL to your reference, they could also mean reads that are NOT uniquely mapped. Usually, softwares indicate this with an optional flag. In case of BWA, it would be XT:A: and it would be XT:A:U in case of uniquely mapped and you would be interested in all other values for that. In case of tophat/bowtie, it would be NH:i: optional flag, where NH:i:1 denotes a read with exactly 1 hit. So you would be interested in all other numbers for NH:i:.

ADD REPLY
1
Entering edit mode

Thanks for the complement of information. I'll clarify in the question what I'm looking for!

ADD REPLY
5
Entering edit mode
11.9 years ago

This can be done fairly easily with classical NGS alignment tools like bwa. In bwa, once you have run the whole thing, just look for the reads that were flagged "4" in the .sam file output (i.e. no match).

ADD COMMENT
0
Entering edit mode

Thanks for your quick answer ;)

ADD REPLY
0
Entering edit mode

No problem, glad I could help.

ADD REPLY
1
Entering edit mode
11.9 years ago

Normally for most of the aligners there is some option where you can output the unaligned reads into a different file. Otherwise look for the reads that were flagged "4" as Leonor suggested. In case of paired end reads "5", "13" and some other flags may apply too. (See http://picard.sourceforge.net/explain-flags.html) to select the read pairs or read pair that is unaligned.

ADD COMMENT
1
Entering edit mode

It's Leonor, please.

ADD REPLY
0
Entering edit mode

Edited. Please accept my sincere apology.

ADD REPLY
0
Entering edit mode

Thanks: I'll look for that option to output the unaligned reads!

ADD REPLY
0
Entering edit mode

tophat does it. I am not sure if you have to use the --keep-tmp option though. bwa provides both aligned and unaligned reads in the same file.

ADD REPLY
0
Entering edit mode

And you can always look for 4 and bitwise AND with 4, even if the read is paired.

ADD REPLY
0
Entering edit mode
11.9 years ago

With bwa, you can parse by the NM tag. NM:A:0 means that the read mapped perfectly. NM:A:3, for instnace, would mean three mismatches.

ADD COMMENT

Login before adding your answer.

Traffic: 1796 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6