From bam paired files to fasta
1
0
Entering edit mode
6.9 years ago
chevivien ▴ 90

Hii people,

I have paired-end sequence read of my target species which I have mapped to the reference genome which outputed bam files. From this bam files I wish to extract protein coding genes using homology based approach or de novo.

Question: Is it in order if I convert the bam files(obtained by mapping to reference) to fastq files, merge the paired reads, then convert the fastq files to fasta format? The idea is to get fasta files from the reads for feature prediction purposes. If that is the way to do it how to one get continues fasta sequences , since on following the mentioned step I get something like this:

HWI-574/1 TTCTTGGTCCATGTACTGCTGAAGCCCTGGCATGTGAAATGAGTGCAAATGTACAGTAGTTTGAA 55086/1 TAAAAGTTCTTGGTCCATGTACTGTTTCCTTACTGGCATGTGAAATGAGTGCAAATGTACAGTAGTTTGAA HWI-D00466:100:CADMYANXX:7:1111:1943:12062/1 AGTGAAGCAGAAGTGGATATTTTTCTGGAATTCCCTTGCTTTCTCTGTGATCCAAGGGAT 75804/1 CCCTTGGATCACAGAGAAAGATATCCACTTCTGCTTCACTGACTACACTTAAAGCCTTTGACTGTGT 16:15787:83520/1 GAAAGCAAGGGAATTCCAGAAAAATATCCACTTCTGCTTTTGACTGTGTGGATCACAACAAGC

I expect to get something like this;

chr1 TTCTTGGTCCATGTACTGCTGAAGCCCTGGCATGTGAAATGAGTGCAAATGTACAGTAGT TTGAATAAAAGTTCTTGGTCCATGTACTGTTTCCTTACTGGCATGTGAAATGAGTGCAAA TGTACAGTAGTTTGAAAGTGAAGCAGAAGTGGATATTTTTCTGGAATTCCCTTGCTTTCT CTGTGATCCAAGGGATCCCTTGGATCACAGAGAAAGATATCCACTTCTGCTTCACTGACT ACACTTAAAGCCTTTGACTGTGTGAAAGCAAGGGAATTCCAGAAAAATATCCACTTCTGC TTTTGACTGTGTGGATCACAACAAGC...........................................................

I would appreciate your input on how to go about it.

bam protein-coding genes fastq fasta paired-reads • 1.7k views
ADD COMMENT
2
Entering edit mode
6.9 years ago

Manually merging contigs from mapped reads is a complicated task that is conceptually closest to what is called "consensus calling".

Better alternatives might be to

  1. perform a denovo assembly of the reads that will give you long contigs that can be annotated alter
  2. call SNPs (variation) from the mapped reads then use the resulting VCF file and the original reference sequence to re-construct a consensus sequence (contigs) then annotate those

For the first step many assemblers may be used depending on the problem type. The latter step could be done with the GATK tool called FastaAlternateReferenceMaker:

https://software.broadinstitute.org/gatk/gatkdocs/3.6-0/org_broadinstitute_gatk_tools_walkers_fasta_FastaAlternateReferenceMaker.php

ADD COMMENT
0
Entering edit mode

Hii Albert,

Thank you for the suggestion. So basically with whole genome data assembled by reference mapping , pulling out protein coding genes or mRNA is complicated assignment? My main goal is to get extract protein coding genes from this data for downstream work before embarking on denovo aspect. Let me think through the alternatives you have provided.

ADD REPLY
0
Entering edit mode

So basically with whole genome data assembled by reference mapping, pulling out protein coding genes or mRNA is complicated assignment?

Yes, that's extremely complicated :) There's no perfect solution for it. But it seems likely that Istvan's recommendation of FastaAlternateReferenceMaker is the best approach. However, if you are dealing with a microbe or something not closely related to the reference, it might be better to just assemble it and annotate the assembly. The decision is affected by whether it's a eukaryote or prokaryote, and the ploidy. For a polyploid eukaryote, I don't even know of a good way to address the problem without thousands of hours of manual work. But it depends on how important the quality is... if you don't care about phasing and so forth, it's much easier. And if you don't have really long reads (PacBio, Nanopore, 10x), long range phasing is not possible anyway.

ADD REPLY

Login before adding your answer.

Traffic: 1322 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6