Question: How to generate FASTA sequence of genes and phylogenetic analysis from NGS sequenced reads?
gravatar for bioinforesearchquestions
2.3 years ago by
United States
bioinforesearchquestions280 wrote:


One of our collaborators lab recently isolated virus samples from patients. I received the sequenced reads from MiSeq for those virus samples. Then I investigated the SNVs present in the samples. Now I am interested in creating a phylogenetic tree for Neuraminidase gene with the other NA sequence in the GenBank database. Therefore, I would like to know

1) how to generate FASTA sequence of the NA gene from the MiSeq sequenced reads?

2) What are the factors to be considered while generating a phylogenetic tree for viruses? because viruses mutates at a much faster rate compared to prokaryotes.

3) Any tools recommended for performing the phylogenetic tree.

Thanks in advance.

ADD COMMENTlink modified 2.2 years ago by Biostar ♦♦ 20 • written 2.3 years ago by bioinforesearchquestions280

I investigated the SNVs present in the samples

Do you have the alignment files for these samples already? See this past thread for some inspiration and thoughts: How to extract fasta sequence of a gene from BAM files generated using Miseq sequencing

Edit: It was you who I was talking with 2 years ago :-)

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by genomax80k

Yeah :)! I forgot about it.

I am planning to generate the phylogenetic tree from the NGS reads. Would you recommend any interesting articles and tools to explore?

ADD REPLYlink written 2.3 years ago by bioinforesearchquestions280

Issues mentioned in that last thread still remain valid. If you create a consensus sequence then you may lose interesting variation but making a phylogenetic tree from individual sequences does not make a lot of sense.

You could try to assemble the viral genomes using from BBMap suite to see if are able to use the assemblies to fish out the NA gene and then do trees with those sequences.

ADD REPLYlink written 2.3 years ago by genomax80k

Hi Genomax,

I tried the following steps to get the FASTA sequence from NGS reads.

1) bin/bamtools filter -region gi\|758899355\|ref\|NC_026438.1\|:1..1700 -in Sample2.sorted.bam -out Sample2.NC_026438.sorted.bam

2) bin/bamtools convert -format fasta -in Sample2.NC_026438.sorted.bam -out Sample2.NC_026438.sorted.fasta

3) In the fasta file, I found 35,072 reads matching the above region.

4) bbmap/bbmap_37.02/ in=Sample2.NC_026438.sorted.fasta out=Sample2.NC_026438.sorted.contig

5) I got 5 contigs as the output,






  • Should I merge all these contigs as one scaffold?
ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by bioinforesearchquestions280

If you have the fasta sequences you could try doing an MSA but be aware that the reads could have bad regions (unless you had scanned/trimmed you reads before alignments).

Those contigs from tadpole seem rather small. There should be no need to merge the contigs since all you need to figure out is which of those contigs contain the NA gene.

Is this targeted sequencing by any chance?

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by genomax80k

Yes, I have trimmed them. But I have 35K FASTA sequence for 35,072 reads (mapping to gi|758899355|ref|NC_026438.1|:1..1700). In the past, for bacteria, I used gene sequence from multiple strains and generated phylogenetic tree. In this case, I have 35K FASTA sequence. So I am confused.

NC_026438.1 this gene sequence length is 1700, but contigs are smaller in length. So do I have only partial gene sequence in those contigs?

Yes, this is targeted sequencing.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by bioinforesearchquestions280

Yes, this is targeted sequencing.

This was important information. So those 35K reads represent just one strain you are working with? You could make MSA from those reads to use for the phylogenetic analysis. But then you could have made that consensus from your BAM file by using samtools/bcftools directly (method in the old thread).

It is possible that you may need to tweak analysis (looks like you used defaults) to get the assembly to complete. How many reads were left over after the assembly? But before you do anything else you should align the five contigs to NC_026438 to see how and where they align and if the assembly looks reasonable. Was there any indication in your BAM that the gene was not completely covered (or if that was by design then there is nothing you can do but use what you have).

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by genomax80k

Hi Genomax,

As suggested, I used samtools

samtools mpileup -uf reference.fa Sample2.NC_026433.sorted.bam | bcftools view -cg - | vcf2fq > cns.fq

I have pasted the contents from the cns.fq. Some bases are in upper case and some in lower case and there are n's.

What do they mean?

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by bioinforesearchquestions280

That is not a fastq format file. What do you get with just the first part of mpileup command?

ADD REPLYlink written 2.2 years ago by genomax80k

Sorry, I pasted just the sequence part alone in the previous thread. This is what I got in the cns.fq output file.




~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Z????!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!???????????????????????????? ???????????????????????????????EEEEEEEEEEEEEEEEEEEEEEEEEEH HHHHHHHHHHHHKKKKKKKKKKNNNNNNNNNNNNNNNNNNNNNNN@NNN NNNNNNNNNNNNNNNNKKKKKKKKKKKKKKKKKKKKKNNNNNNNNNNNNN NNNNNNNNNNNNNNNQNNNNNNNHHHHHHHHHHHHHHHHEHHBHHHHHH HHEEEEEEEEEEEEEEEEEEEEEBBBBBBBBBBBBBBBBBBBEEEEEEEEEKK KKKK&KKKKKKKKKKKKKKKKKKKKKKNNNNNNNTQQQQTTZZZ]]]]]]]]ccccccc cccc``ccccccccfflllllllllllllffffffffffffffffffffZZZZ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~

I took the sequence from this step and BLASTed it with the reference gene sequence (

free image host no sign up

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by bioinforesearchquestions280

Looks like you have a gap in your sequence of ~74 bp which is represented by those n.

ADD REPLYlink written 2.2 years ago by genomax80k


I randomly checked it on different samples. In those samples, I am getting N's at the middle of the sequence.

Does it mean, we couldn't sequence that region or it is a deletion?

ADD REPLYlink written 2.2 years ago by bioinforesearchquestions280

If there are absolutely no reads covering that sequence then you could have a deletion. Was the experiment supposed to recover the entire region? Won't it mean that you have a defective NA gene in all these samples?

Keep in mind that samtools by default uses a depth of 250 for mpileup. If you have a ton of reads in this region then you will have to increase that using -d option.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by genomax80k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1345 users visited in the last hour