5.4 years ago
ishmahe16
Hi, I need to produce a FASTA file containing the 1 kb of sequence upstream of each gene on the X chromosome. I am given a GFF file and a genome sequence file in FASTA format. The code I have so far is:
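from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

# Collect the X-chromosome (LGX) records and their IDs from the genome FASTA
fast_lgx_records = []
lgs_record_ids = []
records = list(SeqIO.parse("Midterm.fna", "fasta"))
for record in records:
    if "LGX" in record.description:
        fast_lgx_records.append(record)
        lgs_record_ids.append(record.id)

# Work out the 1 kb upstream window for every gene on LGX.
# GFF coordinates are 1-based and inclusive; the tuples below are
# 0-based, end-exclusive Python slice bounds.
positions_to_read = []
with open("Midterm.gff") as f:
    for line in f:
        if not line.startswith("#"):
            split_line = line.rstrip("\n").split('\t')
            if len(split_line) < 9:
                continue  # skip blank or malformed lines
            seq_id = split_line[0]
            feature_type = split_line[2]
            start = split_line[3]
            end = split_line[4]
            sign = split_line[6]
            if seq_id in lgs_record_ids and feature_type == "gene":
                if sign == "+":
                    start_index = int(start)
                    # clamp at 0 so a gene near the start cannot give a negative index
                    positions_to_read.append((max(0, start_index - 1001), start_index - 1))
                else:
                    end_index = int(end)
                    # upstream of a minus-strand gene lies after its end coordinate;
                    # the slice still has to be reverse-complemented before writing
                    positions_to_read.append((end_index, end_index + 1000))

# write the final sequence to a new fasta file
final_data = []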
My professor's feedback was: "All the headers are identical, so I can't match sequences to their corresponding genes or positions in the genome. There are incorrect sequences, but I can't be sure exactly what the problem is because I can't tell which gene they are supposed to belong to."
Please help.
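For reference, here is a minimal sketch of what I think each output record would need: a unique header per gene and a strand-aware upstream slice. It makes several assumptions. The helper names (`reverse_complement`, `gene_id_from_attributes`, `upstream_record`) are hypothetical and not from my script, the GFF is assumed to be GFF3 with an `ID=` entry in column 9, and the header layout is just one possible choice. It works on plain strings so it runs without Biopython; with Biopython, `Seq.reverse_complement()` could replace the small helper.

```python
# Hypothetical helpers; names and header layout are illustrative assumptions.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a plain DNA string (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def gene_id_from_attributes(attributes):
    """Return the ID=... value from GFF3 column 9, or None if absent."""
    for field in attributes.strip().split(";"):
        key, _, value = field.partition("=")
        if key.strip() == "ID":
            return value.strip()
    return None


def upstream_record(chrom_id, chrom_seq, start, end, strand, gene_id, flank=1000):
    """Return (header, sequence) for the region upstream of one gene.

    start and end are 1-based inclusive GFF coordinates.
    """
    if strand == "+":
        # upstream lies before the gene start; clamp at the chromosome start
        lo = max(0, start - 1 - flank)
        hi = start - 1
        seq = chrom_seq[lo:hi]
    else:
        # upstream lies after the gene end; clamp at the chromosome end
        lo = end
        hi = min(len(chrom_seq), end + flank)
        seq = reverse_complement(chrom_seq[lo:hi])
    # report 1-based inclusive coordinates so the header matches the GFF
    header = f"{gene_id}|{chrom_id}:{lo + 1}-{hi}({strand})"
    return header, seq


# Toy example: a 16 bp "chromosome" and a 4 bp flank
chrom = "ACGTACGTAAGGTTTT"
for gid, s, e, strand in [("gene1", 9, 12, "+"), ("gene2", 5, 8, "-")]:
    header, seq = upstream_record("LGX", chrom, s, e, strand, gid, flank=4)
    print(f">{header}\n{seq}")
```

Putting the gene ID, chromosome, coordinates and strand into each header means every sequence can be traced back to its gene, which should also make any wrong coordinates easier to spot.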