How to extract sequence from fasta file with specific coordinates?
2
0
Entering edit mode
7.9 years ago
tcf.hcdg ▴ 70

I would like to extract sequence with specific coordinate. I did the blast search and now from this output I need to extract the 100bp upstream and downstream from the subject sequence to whom query sequence matched in blastn search. (which is subject start and subject end). I have gone through the similar problems and their solution and tried them. I tried the following approach from brentp enter link description here and I have the error.

python seq_ext_coordinate.py 
Traceback (most recent call last):
  File "seq_ext_coordinate.py", line 5, in <module>
    EST_030516_lnr.fa = sys.argv[1]
IndexError: list index out of range

Does anybody have the idea what I did wrong?

fasta blast sequence • 3.7k views
ADD COMMENT
0
Entering edit mode
7.9 years ago

fastaFromBed

If you need to get +/- n bp, use bedtools slop

ADD COMMENT
0
Entering edit mode

But I would like to extract 100bp upstream and downstream of blast output. For example in the following example for first case I need to get the subject sequence "gi|34729142|" from 290+100 to 271-100 that is 171-390. While in the second case I need 371-100 to 391+100 that is 271-491.

   query                 subject           %identity  aln_lnt  MM    g.opn qstart   qend   sstart     s_end      evalue    bit_score
    ath-miR156a-5p    gi|34729142|    95.00     20        1        0        1        20        **290     271**       0.010      32.2
    ath-miR159a        gi|34729002|    100.00    21        0        0        1        21        **371     391**       1e-005    42.1
    ath-miR170-3p     gi|34723330|     95.24     21        1        0        1        21       **163      183**       0.003      34.2

In short I need such a program/script that look into the blast output result (sstart and s_end ). If sstart is smaller then s_end then substract 100 from sstart and add 100 in s_end as in second entry in the above example otherwise do other way round substract 100 from s_end and add 100 to s_start as in the first case in above example.

ADD REPLY
0
Entering edit mode

So First I need coordinate that can be supplied to bedtools to get the sequence

ADD REPLY
0
Entering edit mode

I updated the answer.

ADD REPLY
0
Entering edit mode

I tried according to the manual but I have following error:

$bedtools slop -i dataset2.bed -g EST_030516_lnr.fa -b 100

   Less than the req'd two fields were encountered in the genome file (EST_030516_lnr.fa) at line 1.  Exiting.

bed file is as follow: (exmple)

gi|238893|gb|JK494213.1|JK494213     290     271
gi|3472902|gb|JK49708.1|JK497608     371     391
gi|3493330|gb|JK45637.1|JK495637     163     183
gi|343330|gb|JK49637.1|JK495637      163     183
gi|341462|gb|JK49423.1|JK494213      290     270
gi|343330|gb|JK49637.1|JK495637      163     183

EST file is as follow(example)

>gi|2388794|gb|GR252.1|GR222152 CAN022_2006-03-17_1/CAN022_H11_ (marijuana cultivar Skunk #1) glandular trichomes isolated from female inflorescences cDNA, mRNA sequence
GGCCTTTTGTAGCCAATCATTTATTAAGAAAAATGAATATGCTTAACACAAAAGCGGAAAGAATAATAGTAACTTGGTCCCGAGCGTCTACCATTATACCTACATGATCGGGCATACTATTGCTGTTCATACTGAAAGA
>gi|238893|gb|GR2151.1|GR222151 CAN022_2006-03-17_1/ (marijuana cultivar Skunk #1) glandular trichomes isolated from female inflorescences cDNA, mRNA sequence
GGGGGTCTCCATCTCTCTTCTACTGCGTTTCGGAAGGACCTCCATTAAAGCTTTCGCTGGGTGTTTCTCATTTCGCAGTCTGATTTCGATATCTGACTCTTCACTGTTCCATTTCTTTTTTGTAGAATTATATGGAGGAGCAGTCGGCCACTTGGTTAGCTCTAGCTGAAGTGTTGTAGTTGGAAATCTGTCGGATCATATTGGATACCCATATCATAGATTTTGGGACGAAGTT
>gi|2386792|gb|GR22150.1|GR222150 CAN022_2006-03-17_1 (marijuana cultivar Skunk #1) glandular trichomes isolated from female inflorescences  cDNA, mRNA sequence
GGGGAATCATATCTCACTTATCATTGGTTAAAATTATTCAAGTTATAACTAACTACATAAAACAAAGATTGAGAGTTTGTGAGAATGGAAAACAATAAGCTTATTATTGTTGTGGCATTATTGTTATGCCTCATTGGATTAACCAAGCTAGTCGAAGCTGATCATTACGATGATGAAGGTGCTAATCAAAATCCATCTTCACTACTAGGTCAAAAAGGTTTCCTAATACCAAAAAATTTTACCCAAATTACCTAAGCCATCACTACCACCGCTACCGTCATTGCCAAAACTTCCATCTCTGCCACCAATATCCTCACCTCTTTCCCGAAACTCCCATCCTGCACCAAATCACACTCCTTGCCATCACTGCCACCCACTCTCCATTTTACCCAAAAAAACCACTTT
ADD REPLY
0
Entering edit mode

EST_030516_lnr.fa is not the genome fasta. Its the file with chromosome name and length.

ADD REPLY
0
Entering edit mode
7.9 years ago
shaun ▴ 80

The script is expecting the fasta file name as first argument and the position as second argument.

ADD COMMENT
0
Entering edit mode

Could you please explain it little bit. Actually I never worked with python befor this that's why not enough idea

ADD REPLY
0
Entering edit mode

python seq_ext_coordinate.py fasta_file.fa blast_out.txt

fasta_file.fa - fasta file you want to cut the sequences from blast_out.txt - output file of blast (see that format matches the script, I think -outfmt 6)

ADD REPLY
0
Entering edit mode

Yes, It is in outfmt 6, but the format is not .txt rather it is .out (is it really matters) Secondly, while I did the blast in the subject name it is only the gi and gb number while in the sequence file it is detail description for example

gi|3472462|gb|JK49213.1|JK494213 CSATR1JP-T3-004_O05_3DEC2008_017 CSATR1JP  cDNA, mRNA sequence

Is it matters while script extract sequence taking coordinates and subject id from blast output file (only gi and gb no) and search it in the EST file (gi,gb + detail information).

while I am running script I got the following error:

 Traceback (most recent call last):
  File "seq_ext_quardinate.py", line 5, in <module>
    EST_030516_lnr.fa = sys.argv[1]
NameError: name 'EST_030516_lnr' is not defined
ADD REPLY

Login before adding your answer.

Traffic: 2655 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6