Python: How to find coordinates of short sequences in a FASTA file?
4
0
Entering edit mode
6.4 years ago

I have a list of short sequences that I want to obtain its coordinate or in another word to get its bed file after compare with a fasta file which contains original sequences.

Fasta file:
>PGH2
CGTAGCGGCTGAGTGCGCGGATAGCGCGTA

Short sequence fasta file:
>PGH2
CGGCTGAGT

Is there any ways to obtain its coordinate? Bedtools can't help much.

Desired output:
PGH2 6 14

sequence biopython python fasta bed • 3.7k views
0
Entering edit mode

@allyson1115ar: I have a feeling you are somewhat confused or you are confusing me with your question. The file format that you have shown above is a FASTA format. FASTA format hold sequence information, where string after > explain what sort of sequence that is e.g it can be a gene sequence, a sequenced read sequence, contig sequence. The gene (or any other features such as CDS, UTR's etc)  coordinates, that previously have been annotated usually kept in gff/gtf or bed files.

It is a good practise to give a bit more background information on forums to what you are doing. At least provide species name and what is PGH2, a gene or contig or something else?

If those are the reads you got from sequencing and you are interested in identifying where abouts those read fall on the reference genome than your first step will be read alignment to the reference genome. This will give a bam file with start and end coordinates of where about your reads land in the reference genome.

0
Entering edit mode
6.4 years ago
PoGibas 4.9k

I had the same problem long time ago and Farhat solution was what helped me: A: 7N Motif Search Over The Genome

You can edit script a little or just easily parse output for wanted bed format.

0
Entering edit mode
6.4 years ago
Ram 34k

Either Pgibas's solution, or you can create rigid motif matrices out of your sequences and use MEME's FIMO.

By "rigid", I mean where the probability for one base is 1.0 and the others is 0.0, as opposed to the flexibility offered by most motif position weight matrices.

0
Entering edit mode
6.4 years ago
alec_djinn ▴ 360

It is not a difficult problem to solve, all you need is to use string.find().

The basic algorithm is something like that in python.

fasta = 'CGTAGCGGCTGAGTGCGCGGATAGCGCGTA'
target = 'CGGCTGAGT'

start = fasta.find(target)+1
end = start + len(target) -1

print(start, end)