I have 30 viral samples sequenced using MiSeq. I received forward and reverse fastq files and also bam files which are aligned with the reference virus genome. I have been asked to build phylogenetic tree for 30 sequenced viral genome based on a gene. I believe, I need to select a gene sequence from all the samples and do a multiple sequence alignment and build a phylogenetic tree. But I am struck in fetching the gene sequence from bam files. How do I select one gene sequence for all the 30 viral sequence? Any suggestions? Can samtools do it?
Thanks genomax for your help. I have 30 separate bam files. For example, this is a reference gene of interest for me NC_026434 which is NA Neuraminidase. For this gene, the CDS starts at1 and ends at 1410.
I am looking for 30 fasta sequence file to do multiple sequence alignment. How do I get the fasta sequence mapped to the reference gene for each viral bam file?
Are those bam files aligned or unaligned? If they are aligned what are they aligned against as a reference (hopefully the same genome as the gene of your interest). If they are not aligned or aligned against a non-interesting genome then additional work is going to be needed.
I have aligned against a viral genome which has only 8 gene segments. Currently no genome sequence is available, so I took the fasta sequence of all 8 genes and merged them to a single fasta file. That single merged fasta file has been used for aligning with the sequencing reads(fastq) to generate the bam file (aligned against viral referral genes).
Assuming that you used a multi-fasta file (with 8 sequences) to do the alignments you can easily split the bam for each sample into 8 gene specific bam's How To Split A Bam File By Chromosome. Then use something (BAM/SAM to FASTA conversion) to convert the bam files to fasta.
You will most likely need to sub-sample the reads (I assume you have a ton of coverage) to do the MSA.
Other option is you could try to generate a consensus for each gene from the individual bam's and then use the consensus to do the MSA. It depends on what you finally want to do.
I am currently working with Influenza virus. I don't have chromosome numbers in my case. This is the format of the reference fasta file, I used for alignment.
Chromosome equates to a fasta identifier (it does not strictly have to be a real chromosome) for the splitting.
Though having all those extra characters (bars etc) in your headers may do some interesting thing. Tell us if the split works.
Thanks a lot for ur wonderful links and comments. It is really useful. I have two goals to achieve in my case.
1) Get the gene sequence for all 30 viral samples and build a phylogenetic tree for them and see how is the relationship between them
2) After fetching the gene sequence from sequencing reads, I am planning to perform codon usage analysis across all the 30 samples to investigate any codon bias in them.
Would it make sense to generate a consensus from the bam and then create the tree from the consensus for each sample?
Have you inspected the alignments and are there lots/small number of SNP or other changes?
After variant calling step, I found out 800 snps has been passed the filtering condition of GATK. Out of 800, 250 are non-synonymous and 550 are synonymous mutations.
Is that for one sample/gene? How many reads were there in total?
It is not for one sample or one gene. All samples have reads ranging from 57,000 to 430,000 reads.
I supplied the above mentioned fasta sequence of reference virus and all 30 bam files together to the GATK pipeline to generate the sample VCF file.
Please find the sample VCF file for instance I listed only some vcf records from the total list of 800 vcf records.
Get a consensus sequence for each gene for every sample (based on the alignment) and then go for MSA.
I am getting an error while using the following command to split bam file.
I was afraid that was going to happen because of the
|in your headers. Try to see if escaping the
Otherwise try one of the other options to split (bamtools).
Thanks for your suggestions and letting me know about bamtools.
I have two bam files. one is
sample01.sorted.picard.dedup.bam. Which bam file should I use to fetch the fasta sequence of the gene?
But for the below case, I used
I could see 13,000 reads matching to this gene region NC_026438 (1-2280). Am I correct?
What is the next step, how should I find the starting of the gene and ending of the gene from these reads?
What number do you get with the de-duped file?
Instead of getting the individual reads (which are 13000+ here and would be hard to handle with an MSA program) how about getting the consensus with samtools (look at the examples here: http://www.htslib.org/doc/samtools.html )