Question: Number of reference sequence contigs in sam/bam header does not match the number in my fasta
marcusanaymik (United States) wrote 2.3 years ago:

I created a masked de novo assembly FASTA and then aligned my FASTQ reads against it. While building the Bowtie 2 reference index for the assembly, I noticed that many sequences were all N's (662,820 to be exact), which I expected from the masking. After alignment I grepped for the @SQ tag in the SAM header, which should tell me how many contigs are in the reference; 2,118,137 were listed there. Finally, I grepped for the total number of sequences in the reference FASTA file, but there were only 2,449,547.

Any reason why these numbers wouldn't add up? Where does the number of contigs in the sam header actually come from?
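
For anyone reproducing these counts: a plain grep for @SQ over a whole SAM file can also match alignment lines, because @, S and Q are all valid base-quality characters, so it is safer to count only inside the header. Below is a minimal Python sketch, assuming an uncompressed SAM file and a plain-text FASTA; the function names are made up for illustration and are not part of any tool.

```python
def count_sq_lines(sam_lines):
    """Count @SQ records in a SAM header, stopping at the first alignment line."""
    count = 0
    for line in sam_lines:
        if not line.startswith("@"):
            break  # header lines all start with '@'; alignments follow
        if line.startswith("@SQ\t"):
            count += 1
    return count


def count_fasta_records(fasta_lines):
    """Count FASTA records by counting lines that start with '>'."""
    return sum(1 for line in fasta_lines if line.startswith(">"))
```

Both functions accept any iterable of lines, so an open file handle works, e.g. `count_sq_lines(open("aln.sam"))`, where `aln.sam` is a placeholder path.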

Tags: bam, sam, contig, bowtie2, fasta
Brian Bushnell (Walnut Creek, USA) wrote 2.3 years ago:

The number of reference sequences in the header comes from the aligner, not directly from your FASTA. Possibly the aligner is ignoring sequences under a certain length, or sequences that are entirely N. I suggest you either skip the masking/filtering or record how many sequences you filtered, and then try again.
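
One way to record what was filtered, as a rough sketch: tally the FASTA records that are entirely N and check which record names are absent from the @SQ lines. This assumes the aligner names each sequence by the header text up to the first whitespace, which is worth verifying against your own header. The function names are hypothetical.

```python
def fasta_records(fasta_lines):
    """Yield (name, length, n_count) for each FASTA record."""
    name, length, n_count = None, 0, 0
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length, n_count
            parts = line[1:].split()
            # Assumption: the name is the header text up to the first whitespace.
            name = parts[0] if parts else ""
            length, n_count = 0, 0
        elif name is not None:
            length += len(line)
            n_count += line.count("N") + line.count("n")
    if name is not None:
        yield name, length, n_count


def sq_names(sam_lines):
    """Collect the SN: values from the @SQ lines of a SAM header."""
    names = set()
    for line in sam_lines:
        if not line.startswith("@"):
            break  # end of header
        if line.startswith("@SQ\t"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names


def summarize(fasta_lines, sam_lines):
    """Tally total, all-N, and missing-from-header FASTA records."""
    in_header = sq_names(sam_lines)
    summary = {"total": 0, "all_n": 0, "missing": 0, "missing_all_n": 0}
    for name, length, n_count in fasta_records(fasta_lines):
        all_n = length > 0 and n_count == length
        missing = name not in in_header
        summary["total"] += 1
        summary["all_n"] += all_n
        summary["missing"] += missing
        summary["missing_all_n"] += all_n and missing
    return summary
```

If `missing` and `missing_all_n` come out equal, the all-N sequences account for the whole gap; if `missing` is larger, something else (for example a length cutoff) is also dropping sequences.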
