Filtering NGS Genomic Alignments
13.4 years ago
Travis ★ 2.8k

After aligning paired-end DNA reads to the human genome, what filters should generally be applied to the alignments?

I know that duplicates should be removed because they are likely PCR artifacts (this is possible with SAMtools).

Can anyone outline other important filtering criteria and, where possible, suggest thresholds? I have heard that reads with more than one alignment should be removed, but I am not sure whether that is overkill.
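To make the kinds of filters being asked about concrete, here is a minimal sketch in plain Python that screens SAM records by FLAG bits and mapping quality. The flag values come from the SAM specification; the function names, the MAPQ cutoff of 20, and the proper-pair requirement are illustrative assumptions, not recommended thresholds. How MAPQ is computed depends on the aligner, so any cutoff should be checked against the aligner's documentation.

```python
# Sketch only. FLAG bits are defined by the SAM specification; the default
# MAPQ cutoff and the function names are illustrative assumptions.
FLAG_PROPER_PAIR = 0x2
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_QC_FAIL = 0x200
FLAG_DUPLICATE = 0x400
FLAG_SUPPLEMENTARY = 0x800

# Any record carrying one of these bits is rejected outright.
REJECT_MASK = (FLAG_UNMAPPED | FLAG_SECONDARY | FLAG_QC_FAIL
               | FLAG_DUPLICATE | FLAG_SUPPLEMENTARY)


def keep_alignment(sam_line, min_mapq=20, require_proper_pair=True):
    """Return True if one SAM alignment record passes the filters."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # column 2: bitwise FLAG
    mapq = int(fields[4])  # column 5: mapping quality
    if flag & REJECT_MASK:
        return False
    if require_proper_pair and not flag & FLAG_PROPER_PAIR:
        return False
    return mapq >= min_mapq


def filter_sam(lines, **kwargs):
    """Yield header lines unchanged and only the records that pass."""
    for line in lines:
        if line.startswith("@") or keep_alignment(line, **kwargs):
            yield line


# Made-up records for illustration (not real data).
example = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tIIII",    # proper pair, MAPQ 60
    "r2\t1123\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tIIII",  # duplicate bit set
    "r3\t99\tchr1\t500\t3\t4M\t=\t700\t204\tACGT\tIIII",     # low MAPQ
    "r4\t355\tchr1\t800\t60\t4M\t=\t900\t104\tACGT\tIIII",   # secondary alignment
]
kept = list(filter_sam(example))
```

With SAMtools, the same kind of screen is normally expressed through the `-f`, `-F`, and `-q` options of `samtools view`. Rejecting low-MAPQ and secondary alignments is one common way to drop ambiguously placed reads, but whether that is appropriate depends on the regions being studied.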

If it helps, I plan to align 100 bp paired-end Illumina reads to the human genome using BFAST with the 10 indexes recommended in the publication. My application is targeted sequencing of approximately 1000 genes, followed by SNP/indel analysis to look for association with a given phenotype.

Thanks in advance.

next-gen sequencing alignment filter paired • 4.8k views

The choice of filters depends on the nature of your samples and the hypotheses you are testing. For example, if the regions of interest are repetitive, removing reads with multiple alignments makes no sense. You should frame your question in terms of the biological question rather than as a generic "how do I filter reads?"


The question has been edited to include my application.

13.3 years ago
John St. John ★ 1.2k

You could follow the GATK best practices for this kind of analysis, or check the supplementary material of a paper such as the 1000 Genomes Project pilot.

One thing I would add, which I am not sure either source discusses, is to check for Illumina adapter sequences and trim them from your data. I don't know how much trouble adapter contamination causes for variant callers, but I have seen it sneak into some published genome databases. Some of Illumina's adapter sequences are posted online, but I haven't had any luck finding the multiplexed adapter sequences there. If you write to Illumina, though, they will send you a letter listing all of the current sequences, and it is then up to you to determine which ones could be in your reads and remove them. There are programs that do this, but I think they all work directly on the FASTQ reads rather than on the alignments to the genome.
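As a rough illustration of what such a trimming step does to a FASTQ read, here is a minimal Python sketch. It only handles exact matches of a 3' adapter, either in full anywhere in the read or as a partial prefix at the very end of the read. The function name and the `min_overlap` parameter are hypothetical, and the adapter string is the commonly cited start of the TruSeq adapter, used purely as an example; substitute the sequences that apply to your own library prep.

```python
# Sketch only: exact-match 3' adapter trimming. Real trimmers also tolerate
# mismatches and indels, which this example does not attempt.
EXAMPLE_ADAPTER = "AGATCGGAAGAGC"  # example value; confirm for your own kit


def trim_adapter(seq, qual, adapter, min_overlap=3):
    """Return (seq, qual) with a 3' adapter and everything after it removed."""
    cut = seq.find(adapter)  # full adapter anywhere in the read
    if cut == -1:
        # Otherwise look for the longest adapter prefix ending the read.
        longest = min(len(adapter) - 1, len(seq))
        for overlap in range(longest, min_overlap - 1, -1):
            if seq.endswith(adapter[:overlap]):
                cut = len(seq) - overlap
                break
    if cut == -1:
        return seq, qual  # no adapter found, read is unchanged
    return seq[:cut], qual[:cut]


# Made-up reads for illustration (not real data).
insert = "ACGTACGTAC"
full_read = insert + EXAMPLE_ADAPTER + "TTT"
partial_read = insert + EXAMPLE_ADAPTER[:6]
clean_read = "ACGTACGTACGT"

trimmed_full = trim_adapter(full_read, "I" * len(full_read), EXAMPLE_ADAPTER)
trimmed_partial = trim_adapter(partial_read, "I" * len(partial_read), EXAMPLE_ADAPTER)
untouched = trim_adapter(clean_read, "I" * len(clean_read), EXAMPLE_ADAPTER)
```

A short `min_overlap` will occasionally clip genuine bases that happen to match the start of the adapter, so the value is a trade-off. Dedicated tools such as cutadapt or Trimmomatic handle sequencing errors and paired reads far more carefully than this sketch.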

Here is the 1000 genomes pilot project paper: http://dx.doi.org/doi:10.1038/nature09534

And here are Broad's GATK best practices recommendations: http://www.broadinstitute.org/gsa/wiki/index.php/Best_Practice_Variant_Detection_with_the_GATK_v2
