Question

Python: How to find coordinates of short sequences in a FASTA file?

0

Entering edit mode

10.1 years ago

allyson1115ar ▴ 30

I have a list of short sequences that I want to obtain its coordinate or in another word to get its bed file after compare with a fasta file which contains original sequences.

Fasta file:

>PGH2
CGTAGCGGCTGAGTGCGCGGATAGCGCGTA

Short sequence fasta file:

>PGH2
CGGCTGAGT

Is there any ways to obtain its coordinate? Bedtools can't help much.

Desired output:

PGH2 6 14

fasta bed biopython sequence • 5.6k views

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 10.1 years ago by allyson1115ar ▴ 30

0

Entering edit mode

@allyson1115ar: I have a feeling you are somewhat confused or you are confusing me with your question. The file format that you have shown above is a FASTA format. FASTA format hold sequence information, where string after > explain what sort of sequence that is e.g it can be a gene sequence, a sequenced read sequence, contig sequence. The gene (or any other features such as CDS, UTR's etc) coordinates, that previously have been annotated usually kept in gff/gtf or bed files.

It is a good practice to give a bit more background information on forums to what you are doing. At least provide species name and what is PGH2, a gene or contig or something else?

If those are the reads you got from sequencing and you are interested in identifying whereabouts those read fall on the reference genome than your first step will be read alignment to the reference genome. This will give a bam file with start and end coordinates of where about your reads land in the reference genome.

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 10.1 years ago by Kirill Tsyganov ▴ 370

Ram · Answer 1 · 2015-05-21

0

Entering edit mode

10.1 years ago

PoGibas 5.1k

I had the same problem long time ago and Farhat solution was what helped me: 7N Motif Search Over The Genome

You can edit script a little or just easily parse output for wanted bed format.

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 10.1 years ago by PoGibas 5.1k

score 0 · Answer 2 · 2015-05-21

0

Entering edit mode

10.1 years ago

Ram 45k

Either Pgibas's solution, or you can create rigid motif matrices out of your sequences and use MEME's FIMO.

By "rigid", I mean where the probability for one base is 1.0 and the others is 0.0, as opposed to the flexibility offered by most motif position weight matrices.

ADD COMMENT • link 2.5 years ago by Ram 45k

Ram · Answer 3 · 2015-05-21

0

Entering edit mode

10.1 years ago

alec_djinn ▴ 390

It is not a difficult problem to solve, all you need is to use string.find().

The basic algorithm is something like that in python.

fasta = 'CGTAGCGGCTGAGTGCGCGGATAGCGCGTA'
target = 'CGGCTGAGT'

start = fasta.find(target)+1
end = start + len(target) -1

print(start, end)

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 10.1 years ago by alec_djinn ▴ 390