Flanking sequence retreival
1
0
Entering edit mode
22 months ago

Hi all, I have a genome of size 7190 mb. I am using following python code to extract the flanking sequence nucleotide sequence around a list of variants. My code works for a smaller genome but it takes so long to process and fetch the sequence from a fasta file. Is there any way to speed up the process?

here is my code:

fr = open("cov_40/rep_1/test.vcf", "r")
a = fr.read()
aa = a.split("#CHROM")
aaa = aa[1].split("\n")

seq = ''
header = ''
sequence = {}
new_dict = {}


with open("Agria_reference_sequence_final.fasta") as fs:
    for linee in fs:
        if linee.startswith(">"):
            header = linee.split("\n")[0]
            seq = ''
            # print(header)

        else:
            seq = seq + linee.split("\n")[0]
            sequence.update({header: seq})
            # print(seq)

for line in aaa[1:-1]:
    index1 = line.split('\t')[0]
    index = line.split()
    # print(index[0],index[1],index[2],index[3],index[4] )
    head = index[0] + " " + index[1] + " " + index[2] + " " + index[3] + " " + index[4] + " " + " " + index[5]
    for keys, values in sequence.items():
        if index1 in keys:
            site = int(index[1]) - 1

            start = abs(site - 10)
            end = site + 50
            seqq = values[start:end + 1]
            new_dict.update({head: seqq})

fw = open("per_variants.txt", "a")
for base, basee in new_dict.items():
    fw.write(str(base) + " " + str(basee) + "\n"
Variants fasta sequence extraction • 491 views
ADD COMMENT
0
Entering edit mode

Have you considered bedtools flank?

ADD REPLY
0
Entering edit mode

I guess bedtools flank gives only intervals not complete sequence present around a given variant.

ADD REPLY
1
Entering edit mode

bedtools does it all! see: bedtools getfasta

ADD REPLY
1
Entering edit mode
22 months ago
rbagnall ★ 1.7k

Extract subsequence from indexed reference sequence using samtools faidx.

samtools faidx your.fasta chrX:start-end
ADD COMMENT
0
Entering edit mode

thanks, but I have to pick variant location from vcf file and it contains millions of variants. I am not sure how long will it takes to fetch sequence for each variant position.

ADD REPLY
1
Entering edit mode

Samtools

   time samtools faidx Homo_sapiens_assembly38.fasta chr8:300000-300050
>chr8:300000-300050
GACCACTTGTTCTGCATTTTCTCCATCTTCCTTGTGATTAGAAACCTCAAA

real    0m0.174s
user    0m0.003s
sys 0m0.024s

Bedtools

    time bedtools getfasta -fi Homo_sapiens_assembly38.fasta -bed test.bed
>chr8:300000-300050
ACCACTTGTTCTGCATTTTCTCCATCTTCCTTGTGATTAGAAACCTCAAA

real    0m0.012s
user    0m0.007s
sys 0m0.003s

Bedtools wins!

ADD REPLY

Login before adding your answer.

Traffic: 1226 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6