Question: Extract consensus dna sequence of multiple genes from multiple bam files in fasta format
0
gravatar for bioinfo8
2.1 years ago by
bioinfo8120
bioinfo8120 wrote:

Hi,

I have multiple bam files (aligned with reference) and bed file of multiple genes (chr, start, end). I tried as follows for one bam file which gave me only reads information in fasta file :

samtools view -b -L genes.bed 1sorted.bam > 1sorted_genes.bam

samtools fasta 1sorted_genes.bam > 1sorted_genes.bam.fasta

I would like to extract the DNA sequence of whole genes (and not only reads) from all my bam files, including the gene name. I know I can add gene name in bed file, but how gene name can be integrated into fasta file. I want to translate DNA sequences to proteins for further downstream work.

Any guidance would be appreciated.

Thanks.

EDIT1: I have paired-end bam files.

bam samtools genes consensus fasta • 1.4k views
ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by bioinfo8120

samtools fasta will not generate a reference fasta. It turns each read from a one line sam format into two line fasta format. samtools.pl should have pileup2fq, that is closer to what you want.

ADD REPLYlink written 2.1 years ago by swbarnes26.5k

Thanks, but I don't need reference fasta but consensus fasta sequence of all reads belonging to a gene which I can translate to protein.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by bioinfo8120
1
gravatar for Sej Modha
2.1 years ago by
Sej Modha4.4k
Glasgow, UK
Sej Modha4.4k wrote:

Please have a look a the answers from this post: How to extract fasta sequence of a gene from BAM files generated using Miseq sequencing

ADD COMMENTlink written 2.1 years ago by Sej Modha4.4k

Thanks, but I have already gone through that, its not helpful, hence posted.

ADD REPLYlink written 2.1 years ago by bioinfo8120

I have tried all the following ways :

1) samtools mpileup -uf ref.fa aln.bam | bcftools call -c | vcfutils.pl vcf2fq > cns.fq 

2) samtools bam2fq 1sorted_genes.bam > genes_sorted.fq

3) seqtk seq -a  genes_sorted.fq >  genes_sorted.fa

My question is similar to Reference Mapping To Consensus Fasta but answers are not helpful.

Kindly guide.

Thanks.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by bioinfo8120

I have tried all the following ways

Just saying that without providing any information on what happened (did you get an error, did not work as expected or some thing else) you have little chance of getting useful help.

ADD REPLYlink written 2.1 years ago by genomax71k

@genomax Ok. Actually, I wrote in the starting of my post that "gave me only reads information in fasta file" and not consensus of all reads belong to a sequence. "I would like to extract the DNA sequence of whole genes (and not only reads) from all my bam files, including the gene name. I know I can add gene name in bed file, but how gene name can be integrated into fasta file. I want to translate DNA sequences to proteins for further downstream work."

I have used Bam2Consensus tool from BamBam : the only tool I find to get direct consensus for a gene, but for most of my genes it shows NNNNN mostly with few nucleotides:

>gene
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGGGCTGGTCTTCATGTCCNNNNNNNNNNN

and for few genes proper nucleotide appears (without N).

Hence, I would like to try other option but did not find any other tools so far.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by bioinfo8120

but for most of my genes it shows NNNNN mostly with few nucleotides

I am assuming you have checked the alignments to make sure you have read coverage across the entire gene/area of interst and you are using appropriate options for bam2consensus (I have not used this tool and it may need a certain number of reads covering a base in order for it to call a consensus). If there are no reads present then consensus tool is likely going to put N's in the area as padding.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by genomax71k

That is what I also thought when I read samtools tview description in many places where N indicates no alignment with reference. But, I was thinking if I can confirm it with some other tools too.

Also, for those genes (although very few) where there are no or very few N's, I can easily translate to protein and do further analysis. But, those genes with so many N's, how should I translate them or what approach I should follow.

ADD REPLYlink written 2.1 years ago by bioinfo8120

But, I was thinking if I can confirm it with some other tools too.

Have you inspected the alignments with a genome viewer like IGV? Are the alignments uniform/sufficiently deep/without too many sequencing errors, to be able to reliably call a consensus. There is no point in trying to call a consensus if you have sparse alignments/no coverage across the entire gene of interest.

If you have N's (and no real reads covering that region) there is not much you can do but to get additional sequencing done to try to get coverage in that area.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by genomax71k

Yes I visualized for few genes and not aligned throughout but I can see variants for few regions inside those genes (I already did variant calling). I also want to mention here that these are exomes (paired-end bam files).

As you suggested, sequencing from more samples would be helpful.

ADD REPLYlink written 2.1 years ago by bioinfo8120
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2297 users visited in the last hour