Question: covert SAM to full length fasta
0
gravatar for marongiu.luigi
2.1 years ago by
Germany, Mannheim, UMM
marongiu.luigi520 wrote:

Hello,

I have a SAM file (BAM actually) that I would like to convert to a fasta. I found different solutions online for instance this or this, but the utput I get is in this form:

>file-1316
TAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCG
GCAGTGCTGTGCTATATATGGGGTAGAGGGGGTGCTTATAGGGTCAAATACGGTCCCTTT
TTCAAAGATTTTAGGGGAAT
>file-354
CGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGC
CCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGGTAGAGGGGGTGCTTATAG
GGTCAAATACGGTCCCTATT
>file-2000
TAACAAGTCGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAAAGTCCAGGAA
AAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGGTAGAGGGGGT
GCTTATAGGGTCAAATACGG
>file-2930
GTTAGTATATTAACAAGTCGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAA
AGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGG
TAGAGGGGGTGCTTATAGGG
>file-1270
CCCCTCTAAATAGCGTCCATGTTAGTATATTAACAAGTCGTATAAATACCTGTTAGTTAG
CTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCA
GTGCTGTGCTATATATGGGG

Is there a way to obtain an aligned fasta? I think the term is consensus fasta, essentially I need all these fragments to be aligned in a single consecutive sequence.

Thank you

conversion sam file fasta • 1.3k views
ADD COMMENTlink written 2.1 years ago by marongiu.luigi520
2

What you are asking for is not a simple conversion. Your sam file (using bam files would be better to save disk usage) contains an alignment: the reads are mapped to where they belong on the reference genome. What you are asking for is a genome assembly in which reads are stitched together to create a genome. You could create a consensus sequence using an alignment though, but that is not a trivial conversion.

ADD REPLYlink written 2.1 years ago by WouterDeCoster44k
1

Are you sure it wouldn't be easier to find the reference file used to make the bam, instead of trying to remake it yourself?

ADD REPLYlink written 2.1 years ago by swbarnes28.6k

what's wrong with the 2nd solution ?

samtools bam2fq input.bam | seqtk seq -A - > output.fa

I mean could be more clear about "aligning these sequences to single seq" ?

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by Santosh Anand5.1k
1

OP wants consensus fasta sequence.

marongiu.luigi : Do you not search the web/biostars before posting a question: https://samtools.github.io/bcftools/howtos/consensus-sequence.html

ADD REPLYlink written 2.1 years ago by genomax89k

I have seen this page, but I dont have a indexed VCF file

ADD REPLYlink written 2.1 years ago by marongiu.luigi520
1

You need to first call variants using the alignment file and then do the consesus. Relevant instructions are linked on the page I included above (https://samtools.github.io/bcftools/howtos/variant-calling.html ).

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by genomax89k

Followinf your lead, I created a bcf with

bcftools mpileup -Ou -f <ref.fa> <alnSorted.bam> | bcftools call -mv -Ob -o calls.bcf

and I index it with

bcftools index calls.bcf

which created a calls.bcf.csi file. I then used the instructions for the consensus sequence but I got:

$ bcftools mpileup -Ou <ref.fa> <alnSorted.bam> | bcftools call -mv -Oz -o calls.bcf
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
Error: mpileup requires the --fasta-ref option by default; use --no-reference to run without a fasta reference

But in Failed to open -: unknown file type, the reference is present in the working directory (ref.fa).

ADD REPLYlink written 2.1 years ago by marongiu.luigi520
1

You forgot the -f flag (which the error message says quite clearly). It's in your top line but not in the second part where you're actually running your commands.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by RamRS30k

I see, now the command is:

$ bcftools mpileup -Ou -f <ref.fa> <alnSorted.bam> | bcftools call -mv -Oz -o calls.bcf.csi 
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files

This generates a file ref.fa.fai that contains:

MT  16569   54  60  61

Essentially the second command is a re-run of the first apart for the option -z.

ADD REPLYlink written 2.1 years ago by marongiu.luigi520

-O just controls output format, so there's almost zero difference between the processes.

ADD REPLYlink written 2.1 years ago by RamRS30k
1

Can you explain to us why you want a FASTA out of a BAM? BAM documents how reads (sequences with base-quality scores) align to a reference genome. BAM does not document in any way the changes one would need to make to the reference genome to get a consensus sequence. VCF files can be used to extract that information. We need to understand why you are particular about BAM files, so we can evaluate your approach.

ADD REPLYlink written 2.1 years ago by RamRS30k

Because a BAM on its own give me little information. I need a sequence that I can design primers upon and remap with Clustal in case I need it. BAM is an alignment, I would like to see where they align that is all.

ADD REPLYlink written 2.1 years ago by marongiu.luigi520
1

I would like to see where they align that is all.

Well that's what bam tells you.

I need a sequence that I can design primers upon

Do variant calling, mask the reference using the identified variants and use that for primers.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by WouterDeCoster44k

OK, so I get that the next step after alignments is variant calling. I'll continue from there then. Thanks.

ADD REPLYlink written 2.1 years ago by marongiu.luigi520

that gives me a pile of fragments like before:

$ samtools bam2fq file.bam | seqtk seq -A - > OUT.fa
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 904 reads
$ head OUT.fa 
>file-1316/2
TAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGGTAGAGGGGGTGCTTATAGGGTCAAATACGGTCCCTTTTTCAAAGATTTTAGGGGAAT
>file-354/2
CGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGGTAGAGGGGGTGCTTATAGGGTCAAATACGGTCCCTATT
>file-2000/2
TAACAAGTCGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGGTAGAGGGGGTGCTTATAGGGTCAAATACGG
>file-2930/2
GTTAGTATATTAACAAGTCGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGGTAGAGGGGGTGCTTATAGGG
>file-1270/2
 CCCCTCTAAATAGCGTCCATGTTAGTATATTAACAAGTCGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGG

If I try

$ samtools mpileup -d8000 -f <ref.fasta> <file.bam> |  bcftools call -c | vcfutils.pl vcf2fq | seqtk seq -aQ64 -q20 -n N > <out.fa>

the file is empty, apart for the initial '>'.

I wanted a single fasta entry with all these fragments aligned together. So, instead of having:

> 1
atgttg
> 2
ttgttc
>3 
tcaagtg

I would like:

> 1
atgttgttcaagtg
ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by marongiu.luigi520
1

We don't know what exactly you are looking for. I can think of:

1) You want to get a fasta of all loci on the genome that has at least one read mapping?

2) You want to get a fasta of "consensus" reads. There is no consensus, since there might be SNPs/variants. You could potentially have two or more versions of the sequence at each polymorphism.

3) You want to perform variant calling/phasing and then output the possible variant fasta?

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by Damian Kao15k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 642 users visited in the last hour