Question: BWA gives wrong number of input sequences
0
gravatar for agata88
2.2 years ago by
agata88800
Poland
agata88800 wrote:

Hi all!

I was trying to map sequences to genome contigs. To do that I've performed commands:

bwa index ref.fasta
bwa mem ref.fasta input.fasta > aln.sam
samtools view -Sb aln.sam > aln.bam
samtools sort aln.bam > aln_sorted.bam
samtools flagstat aln_sorted.bam > aln_stats.txt

input. fasta include 3158 sequences.

Here is aln_stats.txt output:

 5284 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    2126 + 0 supplementary
    0 + 0 duplicates
    4890 + 0 mapped (92.54% : N/A)
    0 + 0 paired in sequencing
    0 + 0 read1
    0 + 0 read2
    0 + 0 properly paired (N/A : N/A)
    0 + 0 with itself and mate mapped
    0 + 0 singletons (N/A : N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

Why output includs 5284 sequences in total?

I see that 2126 is supplementary, 5284 - 2126 = 3158, which is the number of sequences in input.

Thanks in advance, Best, Agata

bwa • 1.3k views
ADD COMMENTlink modified 2.2 years ago by swbarnes28.9k • written 2.2 years ago by agata88800
2

bwa mem ref.fasta input.fasta > aln.sam
samtools view -Sb aln.sam > aln.bam
samtools sort aln.bam > aln_sorted.bam

Note that you could simplify this, speed things up and avoid intermediate files:

bwa mem ref.fasta input.fasta | samtools sort -o aln_sorted.bam -

(requiring that your samtools is not too old, but you should use up-to-date software anyway)

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by WouterDeCoster44k
1

Hi Agata, these are most likely sequences mapping to more than one location. The BAM doesn't tell you the number of input sequences but the number of alignments against the reference

ADD REPLYlink written 2.2 years ago by Carambakaracho2.2k

Hey, thanks! Do you know how can I find which sequences mapped to different contigs? Best, Agata

ADD REPLYlink written 2.2 years ago by agata88800

Search Biostars for "identify multi-mapped reads". Use an external google search.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by genomax91k
0
gravatar for swbarnes2
2.2 years ago by
swbarnes28.9k
United States
swbarnes28.9k wrote:

samtools flagstat is not a very smart program. It's merely counting lines in your bam. If a read is in there twice, because it aligned to multiple places and all are included, samtools flagstat will count it twice. If you omit reads with the 256 binary flag, that will get rid of the secondary alignments.

ADD COMMENTlink written 2.2 years ago by swbarnes28.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2030 users visited in the last hour