Question: How to generate FASTA sequence of genes and phylogenetic analysis from NGS sequenced reads?
0
gravatar for bioinforesearchquestions
18 days ago by
United States
bioinforesearchquestions120 wrote:

Hi,

One of our collaborators lab recently isolated virus samples from patients. I received the sequenced reads from MiSeq for those virus samples. Then I investigated the SNVs present in the samples. Now I am interested in creating a phylogenetic tree for Neuraminidase gene with the other NA sequence in the GenBank database. Therefore, I would like to know

1) how to generate FASTA sequence of the NA gene from the MiSeq sequenced reads?

2) What are the factors to be considered while generating a phylogenetic tree for viruses? because viruses mutates at a much faster rate compared to prokaryotes.

3) Any tools recommended for performing the phylogenetic tree.

Thanks in advance.

ADD COMMENTlink modified 18 days ago • written 18 days ago by bioinforesearchquestions120
1

I investigated the SNVs present in the samples

Do you have the alignment files for these samples already? See this past thread for some inspiration and thoughts: How to extract fasta sequence of a gene from BAM files generated using Miseq sequencing

Edit: It was you who I was talking with 2 years ago :-)

ADD REPLYlink modified 18 days ago • written 18 days ago by genomax40k

Yeah :)! I forgot about it.

I am planning to generate the phylogenetic tree from the NGS reads. Would you recommend any interesting articles and tools to explore?

ADD REPLYlink written 18 days ago by bioinforesearchquestions120

Issues mentioned in that last thread still remain valid. If you create a consensus sequence then you may lose interesting variation but making a phylogenetic tree from individual sequences does not make a lot of sense.

You could try to assemble the viral genomes using tadpole.sh from BBMap suite to see if are able to use the assemblies to fish out the NA gene and then do trees with those sequences.

ADD REPLYlink written 18 days ago by genomax40k

Hi Genomax,

I tried the following steps to get the FASTA sequence from NGS reads.

1) bin/bamtools filter -region gi\|758899355\|ref\|NC_026438.1\|:1..1700 -in Sample2.sorted.bam -out Sample2.NC_026438.sorted.bam

2) bin/bamtools convert -format fasta -in Sample2.NC_026438.sorted.bam -out Sample2.NC_026438.sorted.fasta

3) In the fasta file, I found 35,072 reads matching the above region.

4) bbmap/bbmap_37.02/tadpole.sh in=Sample2.NC_026438.sorted.fasta out=Sample2.NC_026438.sorted.contig

5) I got 5 contigs as the output,

contig_1,length=178,cov=8182.4,gc=0.416

contig_2,length=454,cov=4072.5,gc=0.374

contig_3,length=125,cov=191.4,gc=0.368

contig_4,length=443,cov=161.0,gc=0.433

contig_5,length=264,cov=4.8,gc=0.436

  • Should I merge all these contigs as one scaffold?
ADD REPLYlink modified 17 days ago • written 17 days ago by bioinforesearchquestions120

If you have the fasta sequences you could try doing an MSA but be aware that the reads could have bad regions (unless you had scanned/trimmed you reads before alignments).

Those contigs from tadpole seem rather small. There should be no need to merge the contigs since all you need to figure out is which of those contigs contain the NA gene.

Is this targeted sequencing by any chance?

ADD REPLYlink modified 17 days ago • written 17 days ago by genomax40k

Yes, I have trimmed them. But I have 35K FASTA sequence for 35,072 reads (mapping to gi|758899355|ref|NC_026438.1|:1..1700). In the past, for bacteria, I used gene sequence from multiple strains and generated phylogenetic tree. In this case, I have 35K FASTA sequence. So I am confused.

NC_026438.1 this gene sequence length is 1700, but contigs are smaller in length. So do I have only partial gene sequence in those contigs?

Yes, this is targeted sequencing.

ADD REPLYlink modified 17 days ago • written 17 days ago by bioinforesearchquestions120

Yes, this is targeted sequencing.

This was important information. So those 35K reads represent just one strain you are working with? You could make MSA from those reads to use for the phylogenetic analysis. But then you could have made that consensus from your BAM file by using samtools/bcftools directly (method in the old thread).

It is possible that you may need to tweak tadpole.sh analysis (looks like you used defaults) to get the assembly to complete. How many reads were left over after the assembly? But before you do anything else you should align the five contigs to NC_026438 to see how and where they align and if the assembly looks reasonable. Was there any indication in your BAM that the gene was not completely covered (or if that was by design then there is nothing you can do but use what you have).

ADD REPLYlink modified 17 days ago • written 17 days ago by genomax40k

Hi Genomax,

As suggested, I used samtools

samtools mpileup -uf reference.fa Sample2.NC_026433.sorted.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq

I have pasted the contents from the cns.fq. Some bases are in upper case and some in lower case and there are n's.

What do they mean?

@gi|758899355|ref|NC_026433.1|
ATGAAGGCAATACTAGTAGTTCTGCTATATACATTTGCAACCGCAAATGCAGACACATTA
TGTATAGGTTATCATGCGAACAATTCAACAGACACTGTAGACACAGTACTAGAAAAGAAT
GTAACAGTAACACACTCTGTTAACCTTCTAGAAGACAAGCATAACGGGAAACTATGCAAA
CTAAGAGGGGTAGCCCCATTGCATTTGGGTAAATGCAACATTGCTGGCTGGATCCTGGGA
AATCCAGAGTGTGAATCACTCTCCACAGCAAGTTCATGGTCCTACATTGTGGAAACATCT
AGTTCAGACAATGGAACGTGTTACCCAGGAGATTTCATCAATTATGAGGAGCTAAGAGAG
CAATTGAGCTCAGTGTCATCATTTGAAAGGTTTGAGATATTCCCCAAGACAAGTTCATGG
CCCAATCATGACTCGAACAAAGGTGTAACGGCAGCATGTCCTCACGCTGGAGCAAAAAGC
TTCTACAAAAATTTAATATGGCTAGTTAAAAAAGGAAATTCATACCCAAGGctcagtcaa
tcCTACATTAATGATAAAGGGAAAGAAGTCCTCGTGCTGTGGGGCATTCACCATCcatcn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnagaagttcaagccggaaatagcaataagacccaaagtgagggatcaa
gaagggagaatgAACTATTACTGGACACTAGTAGAGCCGGGAGACAAAATAACATTCGAA
GCAACTGGAAATCTAGTGGTACCGRGATATGCATTCACAATGGAAAGAAATGCTGGATCT
GGTATTATCATTTCAGATACACCAGTCCACGATTGCAATACAACTTGTCAGACACCCGAG
GGTGCTATAAACACCAGCCTCCCATTTCAGaatatacatccgatcacaaTTGGAAAATGT
CCAAAGTATGTAAAAAGCACAAAATTGAGACTGGCCACAGGATTGAGGAATGTCCCGTCT
ATTCAATCTAGAGGCCTATTCGGGGCCATTGCCGGCTTCATTGAAGGGGGGTGGACAGGG
ATGGTAGATGGATGGTACGGTTATCACCATCAAAATGAGCAGGGGTCAGGATATGCAGCC
GACCTGAAGAGCACACAAAATGCCATTGACAAGATTACTAACAAAGTAAATTCTGTTATT
GAAAAGATGAATACACAGTTCACAGCAGTGGGTAAGGAGTTCAACCACCTGGAAAAAAGA
ATAGAGAATTTAAATAAAAAAGTTGATGATGGTTTCCTGGACATTTGGACTTACAATGCC
GAACTGTTGGTTCTATTGGAAAATGAAAGAACTTTGGACTACCACGATTCAAATGTGAAG
AACTTGTATGAAAAGGTAAGAAACCAGCTAAAAAACAATGCCAAGGAAATTGGAAACGGC
TGCTTTGAATTTTACCACAAATGCGATAACACGTGCATGGAAAGTGTCAAAAATGGGACT
TATGACTACCCAAAATACTCAGAGGAAGCAAAATTAAACAGAGAAAAAATAGATGGGGTA
AAGCTGGAATCAACAAGGATTTACCAGATTTTGGCGATCTATTCAACTGTCGCCAGTTCA
TTGGTACTGGTAGTCTCCCTGGGGGCAATCAGCTTCTGGATGTGCTCTAATGGGTCTCTA
CAGTGTAGAATATGTATTTAA
ADD REPLYlink modified 10 days ago • written 10 days ago by bioinforesearchquestions120

That is not a fastq format file. What do you get with just the first part of mpileup command?

ADD REPLYlink written 10 days ago by genomax40k

Sorry, I pasted just the sequence part alone in the previous thread. This is what I got in the cns.fq output file.

@gi|758899355|ref|NC_026433.1|

SEQUENCE IS BIGGER SO I DINT PASTE IT AGAIN

+

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Z????!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!???????????????????????????? ???????????????????????????????EEEEEEEEEEEEEEEEEEEEEEEEEEH HHHHHHHHHHHHKKKKKKKKKKNNNNNNNNNNNNNNNNNNNNNNN@NNN NNNNNNNNNNNNNNNNKKKKKKKKKKKKKKKKKKKKKNNNNNNNNNNNNN NNNNNNNNNNNNNNNQNNNNNNNHHHHHHHHHHHHHHHHEHHBHHHHHH HHEEEEEEEEEEEEEEEEEEEEEBBBBBBBBBBBBBBBBBBBEEEEEEEEEKK KKKK&KKKKKKKKKKKKKKKKKKKKKKNNNNNNNTQQQQTTZZZ]]]]]]]]ccccccc cccc``ccccccccfflllllllllllllffffffffffffffffffffZZZZ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~

I took the sequence from this step and BLASTed it with the reference gene sequence (https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Get&RID=5F7TV6FT114).

Blast_output
free image host no sign up

ADD REPLYlink modified 10 days ago • written 10 days ago by bioinforesearchquestions120

Looks like you have a gap in your sequence of ~74 bp which is represented by those n.

ADD REPLYlink written 9 days ago by genomax40k

Hi,

I randomly checked it on different samples. In those samples, I am getting N's at the middle of the sequence.

Does it mean, we couldn't sequence that region or it is a deletion?

ADD REPLYlink written 9 days ago by bioinforesearchquestions120

If there are absolutely no reads covering that sequence then you could have a deletion. Was the experiment supposed to recover the entire region? Won't it mean that you have a defective NA gene in all these samples?

Keep in mind that samtools by default uses a depth of 250 for mpileup. If you have a ton of reads in this region then you will have to increase that using -d option.

ADD REPLYlink modified 9 days ago • written 9 days ago by genomax40k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 617 users visited in the last hour