Question: covert SAM to full length fasta
0
gravatar for marongiu.luigi
16 months ago by
Germany, Mannheim, UMM
marongiu.luigi420 wrote:

Hello,

I have a SAM file (BAM actually) that I would like to convert to a fasta. I found different solutions online for instance this or this, but the utput I get is in this form:

>file-1316
TAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCG
GCAGTGCTGTGCTATATATGGGGTAGAGGGGGTGCTTATAGGGTCAAATACGGTCCCTTT
TTCAAAGATTTTAGGGGAAT
>file-354
CGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGC
CCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGGTAGAGGGGGTGCTTATAG
GGTCAAATACGGTCCCTATT
>file-2000
TAACAAGTCGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAAAGTCCAGGAA
AAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGGTAGAGGGGGT
GCTTATAGGGTCAAATACGG
>file-2930
GTTAGTATATTAACAAGTCGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAA
AGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGG
TAGAGGGGGTGCTTATAGGG
>file-1270
CCCCTCTAAATAGCGTCCATGTTAGTATATTAACAAGTCGTATAAATACCTGTTAGTTAG
CTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCA
GTGCTGTGCTATATATGGGG

Is there a way to obtain an aligned fasta? I think the term is consensus fasta, essentially I need all these fragments to be aligned in a single consecutive sequence.

Thank you

conversion sam file fasta • 894 views
ADD COMMENTlink written 16 months ago by marongiu.luigi420
2

What you are asking for is not a simple conversion. Your sam file (using bam files would be better to save disk usage) contains an alignment: the reads are mapped to where they belong on the reference genome. What you are asking for is a genome assembly in which reads are stitched together to create a genome. You could create a consensus sequence using an alignment though, but that is not a trivial conversion.

ADD REPLYlink written 16 months ago by WouterDeCoster42k
1

Are you sure it wouldn't be easier to find the reference file used to make the bam, instead of trying to remake it yourself?

ADD REPLYlink written 16 months ago by swbarnes27.0k

what's wrong with the 2nd solution ?

samtools bam2fq input.bam | seqtk seq -A - > output.fa

I mean could be more clear about "aligning these sequences to single seq" ?

ADD REPLYlink modified 16 months ago • written 16 months ago by Santosh Anand5.0k
1

OP wants consensus fasta sequence.

marongiu.luigi : Do you not search the web/biostars before posting a question: https://samtools.github.io/bcftools/howtos/consensus-sequence.html

ADD REPLYlink written 16 months ago by genomax75k

I have seen this page, but I dont have a indexed VCF file

ADD REPLYlink written 16 months ago by marongiu.luigi420
1

You need to first call variants using the alignment file and then do the consesus. Relevant instructions are linked on the page I included above (https://samtools.github.io/bcftools/howtos/variant-calling.html ).

ADD REPLYlink modified 16 months ago • written 16 months ago by genomax75k

Followinf your lead, I created a bcf with

bcftools mpileup -Ou -f <ref.fa> <alnSorted.bam> | bcftools call -mv -Ob -o calls.bcf

and I index it with

bcftools index calls.bcf

which created a calls.bcf.csi file. I then used the instructions for the consensus sequence but I got:

$ bcftools mpileup -Ou <ref.fa> <alnSorted.bam> | bcftools call -mv -Oz -o calls.bcf
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
Error: mpileup requires the --fasta-ref option by default; use --no-reference to run without a fasta reference

But in Failed to open -: unknown file type, the reference is present in the working directory (ref.fa).

ADD REPLYlink written 15 months ago by marongiu.luigi420
1

You forgot the -f flag (which the error message says quite clearly). It's in your top line but not in the second part where you're actually running your commands.

ADD REPLYlink modified 15 months ago • written 15 months ago by RamRS25k

I see, now the command is:

$ bcftools mpileup -Ou -f <ref.fa> <alnSorted.bam> | bcftools call -mv -Oz -o calls.bcf.csi 
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files

This generates a file ref.fa.fai that contains:

MT  16569   54  60  61

Essentially the second command is a re-run of the first apart for the option -z.

ADD REPLYlink written 15 months ago by marongiu.luigi420

-O just controls output format, so there's almost zero difference between the processes.

ADD REPLYlink written 15 months ago by RamRS25k
1

Can you explain to us why you want a FASTA out of a BAM? BAM documents how reads (sequences with base-quality scores) align to a reference genome. BAM does not document in any way the changes one would need to make to the reference genome to get a consensus sequence. VCF files can be used to extract that information. We need to understand why you are particular about BAM files, so we can evaluate your approach.

ADD REPLYlink written 16 months ago by RamRS25k

Because a BAM on its own give me little information. I need a sequence that I can design primers upon and remap with Clustal in case I need it. BAM is an alignment, I would like to see where they align that is all.

ADD REPLYlink written 15 months ago by marongiu.luigi420
1

I would like to see where they align that is all.

Well that's what bam tells you.

I need a sequence that I can design primers upon

Do variant calling, mask the reference using the identified variants and use that for primers.

ADD REPLYlink modified 15 months ago • written 15 months ago by WouterDeCoster42k

OK, so I get that the next step after alignments is variant calling. I'll continue from there then. Thanks.

ADD REPLYlink written 15 months ago by marongiu.luigi420

that gives me a pile of fragments like before:

$ samtools bam2fq file.bam | seqtk seq -A - > OUT.fa
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 904 reads
$ head OUT.fa 
>file-1316/2
TAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGGTAGAGGGGGTGCTTATAGGGTCAAATACGGTCCCTTTTTCAAAGATTTTAGGGGAAT
>file-354/2
CGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGGTAGAGGGGGTGCTTATAGGGTCAAATACGGTCCCTATT
>file-2000/2
TAACAAGTCGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGGTAGAGGGGGTGCTTATAGGGTCAAATACGG
>file-2930/2
GTTAGTATATTAACAAGTCGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGGTAGAGGGGGTGCTTATAGGG
>file-1270/2
 CCCCTCTAAATAGCGTCCATGTTAGTATATTAACAAGTCGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGG

If I try

$ samtools mpileup -d8000 -f <ref.fasta> <file.bam> |  bcftools call -c | vcfutils.pl vcf2fq | seqtk seq -aQ64 -q20 -n N > <out.fa>

the file is empty, apart for the initial '>'.

I wanted a single fasta entry with all these fragments aligned together. So, instead of having:

> 1
atgttg
> 2
ttgttc
>3 
tcaagtg

I would like:

> 1
atgttgttcaagtg
ADD REPLYlink modified 16 months ago • written 16 months ago by marongiu.luigi420
1

We don't know what exactly you are looking for. I can think of:

1) You want to get a fasta of all loci on the genome that has at least one read mapping?

2) You want to get a fasta of "consensus" reads. There is no consensus, since there might be SNPs/variants. You could potentially have two or more versions of the sequence at each polymorphism.

3) You want to perform variant calling/phasing and then output the possible variant fasta?

ADD REPLYlink modified 16 months ago • written 16 months ago by Damian Kao15k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1967 users visited in the last hour