Summarise bases from alignment
2
1
Entering edit mode
23 months ago
mpm20 ▴ 10

Say I have an alignment of multiple consensus sequences in multi-fasta format, and a specific site (or sites) I am interested in. I'd like to know the base frequency across all sequences in the alingment at these sites of interest.

Is there a command line approach to help summarise those specific sites? For example, to tell me the number of A/T/G/C bases at position 95?

alignment • 745 views
ADD COMMENT
0
Entering edit mode

Something like bam-readcount tool could work, but you need a BAM as input with the alignment of the reads to your sequence of interest. The output is exactly what you need, frequency of A/T/G/C at the position of interest. There are many other tools as well, I am just suggesting the one I am familiar with :).

ADD REPLY
0
Entering edit mode

Thanks for your reply. This looks like a great solution, but only works for looking a single read set of interest. And I now realise I wasn't specific enough with my orginal post! Apologies, I have edited to give further clarification.

In my use case, I have multiple consensus sequences aligned in a multi-fasta file, so I am just interested in the consensus-level base frequency at particular sites across all the sequences in the aligment

ADD REPLY
0
Entering edit mode
23 months ago
iraun 6.2k

Alright, I think I understood now the question. How are your python skills? Here I have quickly coded a script (that has a lot room for improvement), but that I believe it does what you want? Let me know.

You can run it in the following way:

python biostars9523808.py <multifasta> <position> <length>

The first argument is the multi-fasta file, the second the position you want to interrogate (0-based, therefore, position 0 corresponds to the first nucleotide of the sequences), and the third argument is the length of the sequences in the fasta file (this script assumes that all the sequences have the same length, if not, I think it could work if you specify the length of the longest sequence but I have not tested it).

The input looks like this:

>f1
CAGGTAGCC
>f2
CCGGTCAGA
>f3
AGGGTTTGA
>f4
TTGGTGAGG

And the output, for position 0, like this:

python biostars9523808.py test.fa 0 9
A   1   0.25
C   2   0.50
G   0   0.00
T   1   0.25

The second column is the count, and the third the frequency.

from Bio import SeqIO
import sys

input_file=sys.argv[1]
position=int(sys.argv[2])
sequence_length=int(sys.argv[3])

frequency_list = []
for i in range(sequence_length):
    frequency_list.append({'A': 0, 'C': 0, 'G': 0, 'T': 0})

fasta_sequences = SeqIO.parse(open(input_file),'fasta')
total_sequences=0
for fasta in fasta_sequences:
   total_sequences+=1
   name, sequence = fasta.id, str(fasta.seq)
   for idx, c in enumerate(sequence.strip()):
      frequency_list[idx][c] += 1

for char in ['A', 'C', 'G', 'T']:
    print("{}\t{}\t{}".format(char, frequency_list[position][char], "\t".join(["{:0.2f}".format(frequency_list[position][char]/total_sequences)])))
ADD COMMENT
0
Entering edit mode
23 months ago

convert to VCF ( https://github.com/sanger-pathogens/snp-sites ) , if needed update the AC,AN,AF fields using bcftools +fill-tags, if needed convert to a tabular format using bcftools query

ADD COMMENT

Login before adding your answer.

Traffic: 2420 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6