Question

covert SAM to full length fasta

0

Entering edit mode

6.2 years ago

marongiu.luigi ▴ 720

Hello,

I have a SAM file (BAM actually) that I would like to convert to a fasta. I found different solutions online for instance this or this, but the utput I get is in this form:

>file-1316
TAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCG
GCAGTGCTGTGCTATATATGGGGTAGAGGGGGTGCTTATAGGGTCAAATACGGTCCCTTT
TTCAAAGATTTTAGGGGAAT
>file-354
CGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGC
CCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGGTAGAGGGGGTGCTTATAG
GGTCAAATACGGTCCCTATT
>file-2000
TAACAAGTCGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAAAGTCCAGGAA
AAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGGTAGAGGGGGT
GCTTATAGGGTCAAATACGG
>file-2930
GTTAGTATATTAACAAGTCGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAA
AGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGG
TAGAGGGGGTGCTTATAGGG
>file-1270
CCCCTCTAAATAGCGTCCATGTTAGTATATTAACAAGTCGTATAAATACCTGTTAGTTAG
CTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCA
GTGCTGTGCTATATATGGGG

Is there a way to obtain an aligned fasta? I think the term is consensus fasta, essentially I need all these fragments to be aligned in a single consecutive sequence.

Thank you

fasta SAM • 4.2k views

ADD COMMENT • link updated 19 months ago by Ram 44k • written 6.2 years ago by marongiu.luigi ▴ 720

2

Entering edit mode

What you are asking for is not a simple conversion. Your sam file (using bam files would be better to save disk usage) contains an alignment: the reads are mapped to where they belong on the reference genome. What you are asking for is a genome assembly in which reads are stitched together to create a genome. You could create a consensus sequence using an alignment though, but that is not a trivial conversion.

ADD REPLY • link 6.2 years ago by WouterDeCoster 47k

1

Entering edit mode

Are you sure it wouldn't be easier to find the reference file used to make the bam, instead of trying to remake it yourself?

ADD REPLY • link 6.2 years ago by swbarnes2 14k

0

Entering edit mode

what's wrong with the 2nd solution ?

samtools bam2fq input.bam | seqtk seq -A - > output.fa

I mean could be more clear about "aligning these sequences to single seq" ?

ADD REPLY • link 6.2 years ago by Santosh Anand 5.8k

1

Entering edit mode

OP wants consensus fasta sequence.

marongiu.luigi : Do you not search the web/biostars before posting a question: https://samtools.github.io/bcftools/howtos/consensus-sequence.html

ADD REPLY • link 6.2 years ago by GenoMax 147k

0

Entering edit mode

I have seen this page, but I dont have a indexed VCF file

ADD REPLY • link 6.2 years ago by marongiu.luigi ▴ 720

1

Entering edit mode

You need to first call variants using the alignment file and then do the consesus. Relevant instructions are linked on the page I included above (https://samtools.github.io/bcftools/howtos/variant-calling.html ).

ADD REPLY • link 6.2 years ago by GenoMax 147k

0

Entering edit mode

Followinf your lead, I created a bcf with

bcftools mpileup -Ou -f <ref.fa> <alnSorted.bam> | bcftools call -mv -Ob -o calls.bcf

and I index it with

bcftools index calls.bcf

which created a calls.bcf.csi file. I then used the instructions for the consensus sequence but I got:

$ bcftools mpileup -Ou <ref.fa> <alnSorted.bam> | bcftools call -mv -Oz -o calls.bcf
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
Error: mpileup requires the --fasta-ref option by default; use --no-reference to run without a fasta reference

But in Failed to open -: unknown file type, the reference is present in the working directory (ref.fa).

ADD REPLY • link 6.2 years ago by marongiu.luigi ▴ 720

1

Entering edit mode

You forgot the -f flag (which the error message says quite clearly). It's in your top line but not in the second part where you're actually running your commands.

ADD REPLY • link 6.2 years ago by Ram 44k

0

Entering edit mode

I see, now the command is:

$ bcftools mpileup -Ou -f <ref.fa> <alnSorted.bam> | bcftools call -mv -Oz -o calls.bcf.csi 
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files

This generates a file ref.fa.fai that contains:

MT  16569   54  60  61

Essentially the second command is a re-run of the first apart for the option -z.

ADD REPLY • link 6.2 years ago by marongiu.luigi ▴ 720

0

Entering edit mode

-O just controls output format, so there's almost zero difference between the processes.

ADD REPLY • link 6.2 years ago by Ram 44k

1

Entering edit mode

Can you explain to us why you want a FASTA out of a BAM? BAM documents how reads (sequences with base-quality scores) align to a reference genome. BAM does not document in any way the changes one would need to make to the reference genome to get a consensus sequence. VCF files can be used to extract that information. We need to understand why you are particular about BAM files, so we can evaluate your approach.

ADD REPLY • link 6.2 years ago by Ram 44k

0

Entering edit mode

Because a BAM on its own give me little information. I need a sequence that I can design primers upon and remap with Clustal in case I need it. BAM is an alignment, I would like to see where they align that is all.

ADD REPLY • link 6.2 years ago by marongiu.luigi ▴ 720

1

Entering edit mode

I would like to see where they align that is all.

Well that's what bam tells you.

I need a sequence that I can design primers upon

Do variant calling, mask the reference using the identified variants and use that for primers.

ADD REPLY • link 6.2 years ago by WouterDeCoster 47k

0

Entering edit mode

OK, so I get that the next step after alignments is variant calling. I'll continue from there then. Thanks.

ADD REPLY • link 6.2 years ago by marongiu.luigi ▴ 720

0

Entering edit mode

that gives me a pile of fragments like before:

$ samtools bam2fq file.bam | seqtk seq -A - > OUT.fa
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 904 reads
$ head OUT.fa 
>file-1316/2
TAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGGTAGAGGGGGTGCTTATAGGGTCAAATACGGTCCCTTTTTCAAAGATTTTAGGGGAAT
>file-354/2
CGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGGTAGAGGGGGTGCTTATAGGGTCAAATACGGTCCCTATT
>file-2000/2
TAACAAGTCGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGGTAGAGGGGGTGCTTATAGGGTCAAATACGG
>file-2930/2
GTTAGTATATTAACAAGTCGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGGTAGAGGGGGTGCTTATAGGG
>file-1270/2
 CCCCTCTAAATAGCGTCCATGTTAGTATATTAACAAGTCGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGG

If I try

$ samtools mpileup -d8000 -f <ref.fasta> <file.bam> |  bcftools call -c | vcfutils.pl vcf2fq | seqtk seq -aQ64 -q20 -n N > <out.fa>

the file is empty, apart for the initial '>'.

I wanted a single fasta entry with all these fragments aligned together. So, instead of having:

> 1
atgttg
> 2
ttgttc
>3 
tcaagtg

I would like:

> 1
atgttgttcaagtg

ADD REPLY • link 6.2 years ago by marongiu.luigi ▴ 720

1

Entering edit mode

We don't know what exactly you are looking for. I can think of:

1) You want to get a fasta of all loci on the genome that has at least one read mapping?

2) You want to get a fasta of "consensus" reads. There is no consensus, since there might be SNPs/variants. You could potentially have two or more versions of the sequence at each polymorphism.

3) You want to perform variant calling/phasing and then output the possible variant fasta?

ADD REPLY • link 6.2 years ago by Damian Kao 16k