Question: get a consensus out of a Sorted bam file
1
gravatar for lait
2.9 years ago by
lait130
lait130 wrote:

I have an SRA file (paired end) with a total number of sequences equal to 4,177,734.

I need to map the reeds to the reference genome.

So what I did was the following:

1-converts SRA to fastq (one for each end)

2- I aligned the paired end reads separately using BWA,

bwa aln ref readset_R1.fq > readset_R1.sai

bwa aln ref readset_R2.fq > readset_R2.sai

3-then combined them

bwa sampe ref readset_R1.sai readset_R2.sai readset_clean_R1.fq readset_clean_R2.fq > readset_ref_bwa.sam

4- then I created the bam sorted file using the following commands

samtools view -S readset_ref_bwa.sam -b -o readset_ref_bwa.bam

samtools sort readset_ref_bwa.bam readset_ref_bwa.sorted.bam

samtools index readset_ref_bwa.sorted.bam

Question 1

if I view the sorted bam file using tview, I would get around 48 reads sorted from longest to shortest. Sice I am new to this domain, is it logical that I got only 48 reads out of the original number of reads that was in the SRA file (4 million +) ?

Question 2

I need to get a consensus out of the sorted bam file. I have read dozens of posts about this but couldnt figure it out.

Many thanks in advance

bwa sam bam ngs • 1.2k views
ADD COMMENTlink written 2.9 years ago by lait130

Q1 : count the number of mapped reads : samtools view -c -F 4 in.bam Q2 : but couldnt figure it out. : what's the problem ?

ADD REPLYlink written 2.9 years ago by Pierre Lindenbaum120k

thanks for Q1. Regarding Q2, how can this be done ? I wasnt lucky using mpileup. Do you know what is the best way to achieve this ?

ADD REPLYlink written 2.9 years ago by lait130

htsbox is marked as experimental by Heng Li (@lh3) but you could try it.

htsbox pileup -f ref.fa -Q20 -q30 -Fs3 sorted.bam

As long as you have the latest samtools and bcftools can you try this?

samtools mpileup -uf ref.fa aln.bam | bcftools call -c | vcfutils.pl vcf2fq > cns.fq
ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by genomax68k

I tried the second option, is it normal to get something like this at the end of the output file ?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~z~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~x~uuuxx~~~~~~a~~~~~~~}~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~{xxxxuuuuror~~~~~~~~~~~~~~~~~~~~~~x~~~~~~~~~~~~~rK?

ADD REPLYlink written 2.9 years ago by lait130

When I try the first option, I get a file that contains just the letter X , like the following:

gi|556550243|ref|NC_022648.1| nnnXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX etc

can anyone figure out the problem please ?

ADD REPLYlink written 2.9 years ago by lait130

Obvious question but let me confirm.

Is everything matching in this case as far as the reference? Same fasta genome used for indexing, alignments and subsequent consensus calling? Also using the genbank headers as is may be a bad idea since those pipes in the fasta headers cause issues with various things.

Edit: Can you provide alignment stats (samtools flagstat your.bam)? You are referring to 48 sequences in your original post but not clear what that reference is (48 reads is all you may be seeing in one terminal screen at a time, there has got to be tons more read aligned).

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by genomax68k

yes you are right, way more than 48 sequences

this is what I got:

4177734 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 secondary

0 + 0 supplementary

0 + 0 duplicates

3841885 + 0 mapped (91.96% : N/A)

4177734 + 0 paired in sequencing

2088867 + 0 read1

2088867 + 0 read2

3804094 + 0 properly paired (91.06% : N/A)

3820194 + 0 with itself and mate mapped

21691 + 0 singletons (0.52% : N/A)

0 + 0 with mate mapped to a different chr

0 + 0 with mate mapped to a different chr (mapQ>=5)

and yes, same fasta genome used.

ADD REPLYlink modified 2.9 years ago by genomax68k • written 2.9 years ago by lait130
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1044 users visited in the last hour