Question: From bam paired files to fasta
gravatar for chevivien
3.5 years ago by
chevivien70 wrote:

Hii people,

I have paired-end sequence read of my target species which I have mapped to the reference genome which outputed bam files. From this bam files I wish to extract protein coding genes using homology based approach or de novo.

Question: Is it in order if I convert the bam files(obtained by mapping to reference) to fastq files, merge the paired reads, then convert the fastq files to fasta format? The idea is to get fasta files from the reads for feature prediction purposes. If that is the way to do it how to one get continues fasta sequences , since on following the mentioned step I get something like this:


I expect to get something like this;


I would appreciate your input on how to go about it.

ADD COMMENTlink modified 3.5 years ago by Istvan Albert ♦♦ 85k • written 3.5 years ago by chevivien70
gravatar for Istvan Albert
3.5 years ago by
Istvan Albert ♦♦ 85k
University Park, USA
Istvan Albert ♦♦ 85k wrote:

Manually merging contigs from mapped reads is a complicated task that is conceptually closest to what is called "consensus calling".

Better alternatives might be to

  1. perform a denovo assembly of the reads that will give you long contigs that can be annotated alter
  2. call SNPs (variation) from the mapped reads then use the resulting VCF file and the original reference sequence to re-construct a consensus sequence (contigs) then annotate those

For the first step many assemblers may be used depending on the problem type. The latter step could be done with the GATK tool called FastaAlternateReferenceMaker:

ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by Istvan Albert ♦♦ 85k

Hii Albert,

Thank you for the suggestion. So basically with whole genome data assembled by reference mapping , pulling out protein coding genes or mRNA is complicated assignment? My main goal is to get extract protein coding genes from this data for downstream work before embarking on denovo aspect. Let me think through the alternatives you have provided.

ADD REPLYlink written 3.5 years ago by chevivien70

So basically with whole genome data assembled by reference mapping, pulling out protein coding genes or mRNA is complicated assignment?

Yes, that's extremely complicated :) There's no perfect solution for it. But it seems likely that Istvan's recommendation of FastaAlternateReferenceMaker is the best approach. However, if you are dealing with a microbe or something not closely related to the reference, it might be better to just assemble it and annotate the assembly. The decision is affected by whether it's a eukaryote or prokaryote, and the ploidy. For a polyploid eukaryote, I don't even know of a good way to address the problem without thousands of hours of manual work. But it depends on how important the quality is... if you don't care about phasing and so forth, it's much easier. And if you don't have really long reads (PacBio, Nanopore, 10x), long range phasing is not possible anyway.

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by Brian Bushnell17k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1220 users visited in the last hour