covert SAM to full length fasta
0
0
Entering edit mode
6.2 years ago

Hello,

I have a SAM file (BAM actually) that I would like to convert to a fasta. I found different solutions online for instance this or this, but the utput I get is in this form:

>file-1316
TAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCG
GCAGTGCTGTGCTATATATGGGGTAGAGGGGGTGCTTATAGGGTCAAATACGGTCCCTTT
TTCAAAGATTTTAGGGGAAT
>file-354
CGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGC
CCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGGTAGAGGGGGTGCTTATAG
GGTCAAATACGGTCCCTATT
>file-2000
TAACAAGTCGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAAAGTCCAGGAA
AAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGGTAGAGGGGGT
GCTTATAGGGTCAAATACGG
>file-2930
GTTAGTATATTAACAAGTCGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAA
AGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGG
TAGAGGGGGTGCTTATAGGG
>file-1270
CCCCTCTAAATAGCGTCCATGTTAGTATATTAACAAGTCGTATAAATACCTGTTAGTTAG
CTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCA
GTGCTGTGCTATATATGGGG

Is there a way to obtain an aligned fasta? I think the term is consensus fasta, essentially I need all these fragments to be aligned in a single consecutive sequence.

Thank you

fasta SAM • 4.2k views
ADD COMMENT
2
Entering edit mode

What you are asking for is not a simple conversion. Your sam file (using bam files would be better to save disk usage) contains an alignment: the reads are mapped to where they belong on the reference genome. What you are asking for is a genome assembly in which reads are stitched together to create a genome. You could create a consensus sequence using an alignment though, but that is not a trivial conversion.

ADD REPLY
1
Entering edit mode

Are you sure it wouldn't be easier to find the reference file used to make the bam, instead of trying to remake it yourself?

ADD REPLY
0
Entering edit mode

what's wrong with the 2nd solution ?

samtools bam2fq input.bam | seqtk seq -A - > output.fa

I mean could be more clear about "aligning these sequences to single seq" ?

ADD REPLY
1
Entering edit mode

OP wants consensus fasta sequence.

marongiu.luigi : Do you not search the web/biostars before posting a question: https://samtools.github.io/bcftools/howtos/consensus-sequence.html

ADD REPLY
0
Entering edit mode

I have seen this page, but I dont have a indexed VCF file

ADD REPLY
1
Entering edit mode

You need to first call variants using the alignment file and then do the consesus. Relevant instructions are linked on the page I included above (https://samtools.github.io/bcftools/howtos/variant-calling.html ).

ADD REPLY
0
Entering edit mode

Followinf your lead, I created a bcf with

bcftools mpileup -Ou -f <ref.fa> <alnSorted.bam> | bcftools call -mv -Ob -o calls.bcf

and I index it with

bcftools index calls.bcf

which created a calls.bcf.csi file. I then used the instructions for the consensus sequence but I got:

$ bcftools mpileup -Ou <ref.fa> <alnSorted.bam> | bcftools call -mv -Oz -o calls.bcf
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
Error: mpileup requires the --fasta-ref option by default; use --no-reference to run without a fasta reference

But in Failed to open -: unknown file type, the reference is present in the working directory (ref.fa).

ADD REPLY
1
Entering edit mode

You forgot the -f flag (which the error message says quite clearly). It's in your top line but not in the second part where you're actually running your commands.

ADD REPLY
0
Entering edit mode

I see, now the command is:

$ bcftools mpileup -Ou -f <ref.fa> <alnSorted.bam> | bcftools call -mv -Oz -o calls.bcf.csi 
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files

This generates a file ref.fa.fai that contains:

MT  16569   54  60  61

Essentially the second command is a re-run of the first apart for the option -z.

ADD REPLY
0
Entering edit mode

-O just controls output format, so there's almost zero difference between the processes.

ADD REPLY
1
Entering edit mode

Can you explain to us why you want a FASTA out of a BAM? BAM documents how reads (sequences with base-quality scores) align to a reference genome. BAM does not document in any way the changes one would need to make to the reference genome to get a consensus sequence. VCF files can be used to extract that information. We need to understand why you are particular about BAM files, so we can evaluate your approach.

ADD REPLY
0
Entering edit mode

Because a BAM on its own give me little information. I need a sequence that I can design primers upon and remap with Clustal in case I need it. BAM is an alignment, I would like to see where they align that is all.

ADD REPLY
1
Entering edit mode

I would like to see where they align that is all.

Well that's what bam tells you.

I need a sequence that I can design primers upon

Do variant calling, mask the reference using the identified variants and use that for primers.

ADD REPLY
0
Entering edit mode

OK, so I get that the next step after alignments is variant calling. I'll continue from there then. Thanks.

ADD REPLY
0
Entering edit mode

that gives me a pile of fragments like before:

$ samtools bam2fq file.bam | seqtk seq -A - > OUT.fa
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 904 reads
$ head OUT.fa 
>file-1316/2
TAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGGTAGAGGGGGTGCTTATAGGGTCAAATACGGTCCCTTTTTCAAAGATTTTAGGGGAAT
>file-354/2
CGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGGTAGAGGGGGTGCTTATAGGGTCAAATACGGTCCCTATT
>file-2000/2
TAACAAGTCGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGGTAGAGGGGGTGCTTATAGGGTCAAATACGG
>file-2930/2
GTTAGTATATTAACAAGTCGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGGTAGAGGGGGTGCTTATAGGG
>file-1270/2
 CCCCTCTAAATAGCGTCCATGTTAGTATATTAACAAGTCGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGG

If I try

$ samtools mpileup -d8000 -f <ref.fasta> <file.bam> |  bcftools call -c | vcfutils.pl vcf2fq | seqtk seq -aQ64 -q20 -n N > <out.fa>

the file is empty, apart for the initial '>'.

I wanted a single fasta entry with all these fragments aligned together. So, instead of having:

> 1
atgttg
> 2
ttgttc
>3 
tcaagtg

I would like:

> 1
atgttgttcaagtg
ADD REPLY
1
Entering edit mode

We don't know what exactly you are looking for. I can think of:

1) You want to get a fasta of all loci on the genome that has at least one read mapping?

2) You want to get a fasta of "consensus" reads. There is no consensus, since there might be SNPs/variants. You could potentially have two or more versions of the sequence at each polymorphism.

3) You want to perform variant calling/phasing and then output the possible variant fasta?

ADD REPLY

Login before adding your answer.

Traffic: 793 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6