Extract some features from a combined gene bank file in bash
1
0
Entering edit mode
5.1 years ago
A ★ 4.0k

Hi,

I have combined gene bank file of polymerase coding genes from all virus family, how I can extract gene IDs and the corresponding end...start position?

http://q07i.imgup.net/Screenshot9673.png

I need to extract start...end position and corresponding gene IDs from this file

thank you

gene sequence genome gene bank NCBI • 2.3k views
0
Entering edit mode

Perhaps one of the many other all.*.tar.gz files @ ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/ would be easier to parse and/or could even give you exactly what you want without any parsing..

0
Entering edit mode

thank you,

http://q07i.imgup.net/Screenshot9673.png

I need to extract start...end position and corresponding gene IDs from this file

1
Entering edit mode

Did you try something out - I mean this is basic string extraction and stackoverflow is overloaded with these questions already. Might help to know what has been tried and model of output is expected

2
Entering edit mode
5.1 years ago
Gungor Budak ▴ 250

You can use BioPython for that. But don't merge GB files, keep them individual and read them one by one using BioPython and extract the information. Check out below example code for NC_005816:

from Bio import SeqIO

for feature in record.features:
if feature.type == 'gene':
print 'Location:', feature.location
print 'Locus tag:', feature.qualifiers['locus_tag']


Output:

Location: [86:1109](+)
Locus tag: ['YP_RS22210']
Location: [1108:1888](+)
Locus tag: ['YP_RS22215']
Location: [2924:3119](+)
Locus tag: ['YP_RS22220']
Location: [4354:4780](+)
Locus tag: ['YP_RS22225']
Location: [4814:5888](-)
Locus tag: ['YP_RS22230']
Location: [6115:6421](+)
Locus tag: ['YP_RS22235']
Location: [6663:7602](+)
Locus tag: ['YP_RS22240']
Location: [7788:8088](-)
Locus tag: ['YP_RS22245']
Location: [8087:8429](-)
Locus tag: ['YP_RS22250']
Location: [9131:9347](-)
Locus tag: ['YP_RS22255']

1
Entering edit mode

I'd recommend OP split her file by a delimiter so she can apply your logic over multiple files. (Unless there is a multi-genbank format?)

EDIT: Looks like tools exist to split multi-gb files. Example: http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/seqretsplit.html

1
Entering edit mode

Thanks, this is helpful. I'm not sure but SeqIO might also read multi-genbank format. But she might also download them again (very small files although I don't know how many she has).

0
Entering edit mode

thank you

5
Entering edit mode

The community is willing to offer guidance, but our time is limited, too. Many of your questions are variations on "I don't know how to do this - help!" - which is okay for the first few questions, but becomes increasingly tiresome with repetition. At some point, you need to invest the time and develop the skills to actually perform some of the analysis on your own.

2
Entering edit mode

He's basically given you the code! Based on the operating system you use, Google "biopython installation os_name" (where os_name is your OS name), and you'll get instructions. It takes 5 minutes to get biopython installed.

0
Entering edit mode

Check out this answer: A: Extract Features From Gbk Via Bash

It uses sed. But I suggest you learn BioPython, it's much more expressive and easy.

0
Entering edit mode

you all right about learning and so on

actually today and tomorrow is my last opportunity to convince this professor to accept me as her student i cant illustrate my situation by this post

3
Entering edit mode

So she wants you to show her your bioinformatic skills and for that purpose you turn to us to solve it. Doesn't sound right to me.

0
Entering edit mode

if i give up i would stuck in master...

2
Entering edit mode

It is great to see that you have been making tireless efforts to learn new things/solve problems for past few months but hopefully your future mentor knows that you are "learning" and are not an expert. If that is not the case it will get you in trouble down the road. It is always good to be upfront about things you know and things you don't.

I hope your assignments so far have been test of "are you able to solve certain problems on your own" as opposed to depth of your knowledge about bioinformatics/informatics.

I wish you all the best!

0
Entering edit mode

thank you

for final decision he invited me to Germany... i reached university yesterday evening and she asked new question, my visa will expired November 6

everything looks like a sad story

time going fast and i should accept this reality that cant be a bioinfomatician in an evening

2
Entering edit mode

i should accept this reality that cant be a bioinfomatician in an evening

That is certainly correct. No way to change that.

This is an odd way of challenging a prospective candidate before accepting them as a student. If she knows you can't program (I assume so, otherwise you would not be asking this question in first place) why did she ask you to do this? Was the idea to propose just pseudo-code (and not actual working code)?

Hang in there since the deadline has not passed as yet. I hope this is a (unfortunately rather cruel) way of making you realize what you need to learn, should you decide to start the PhD program.

0
Entering edit mode

thank you,

she knows I cant program

she asked me to find the exact position of RNA dependent RNA polymerase genes in genomes of single stranded RNA viruses, I have an ultimatum might be until Friday evening to complete my task

3
Entering edit mode

Why did you not use the simple solution that @Harold.smith.tarheel had posted in the other thread. It would require manual work to go through the files but at least you would know what exactly you did. While the code posted in this thread will work you would not be able to answer any questions about it.

0
Entering edit mode

actually

tonight I will stay in an office and solution that @Harold.smith.tarheel had posted as this screen shot I extracted these genes

http://q07i.imgup.net/Screenshot9673.png

and by

cat <(awk -F'\t' '/$3=="gene"/ {print$1"\t"$2}' infile) <(grep -w db_xref infile) | sed -e 's/.*db_xref\tGeneID://' | paste -d " " - - i was going to extract the start ... end position but at noon she told these from all virus families and I should perform only for single stranded RNA viruses ADD REPLY 0 Entering edit mode Do a separate search for accession numbers of ssRNA viruses and then just select those entries from your larger list (or you could re-do @harold's search and restrict it to ssRNA using advanced search options at NCBI?). You will need to grab accession number but if you have the files it should be a quick re-run of the command you already used. ADD REPLY 0 Entering edit mode You should have put in some minimal effort before asking, that is not giving up. This is not the right way to proceed from your masters ADD REPLY 0 Entering edit mode this is her new question from yesterday, I am new to this challenge... already she asked for miRNAseq, exomseq, and many others, how I can be professional in all ADD REPLY 2 Entering edit mode I'm pretty sure people ask for proof of experience - existing code perhaps. Is that the case? Either way, if you have to check with us for basic text processing, your mentor should definitely know this difficulty you have. You've been here for a while now and have progressed a lot - I don't see how you have difficulties with this simple task. Sorry if this is rude, but you have to let go of self pity and face reality - your mentor and you have a HUGE communication gap. And then, seek actual useful help, not band-aid solutions that don't address the core problems. ADD REPLY 3 Entering edit mode Agreed, the mentor is probably trying to evaluate the OP's skill level with this task. I understand that the people on the forum are trying to be helpful but, by writing and troubleshooting the code, you're actually undermining the objective of the exercise. Is it really in the OP's best interest to pretend to possess a set of skills that are lacking? And it's not a question of becoming "a bioinformatician in an evening". The OP has been a member of this community for nearly two years, has more than 1000 posts, and still has trouble performing the most rudimentary tasks. This type of question (where do I find XX database, how do I download YY data, and how do I parse it for ZZ information?) has been asked and answered multiple times (e.g., see this post from the OP 18 months ago). Yet there seems to be little ability to generalize the question and apply any prior knowledge, or to interpret/modify/troubleshoot the code that's offered. I realize that this assessment sounds harsh, but it's not intended as a personal attack. Bioinformatics IS hard! I sympathize that the OP may not have received adequate training in the discipline, and that the mentor's request may seem unreasonable. But those problems will not be solve by the members of this forum, however well-meaning their intentions. ADD REPLY 0 Entering edit mode I definitely agree and realize that it might not be in her best interest. However, it's not our or my responsibility to judge that. If someone needs help, asks for help and people here are willing to help then that's all what's required. I'm not in the position (and don't desire) to train or provide pedagogical support. People have to ask themselves whether their problem is something they are going to fix by themselves, struggle for hours days or a week, or whether they need external help. Every problem you fix on your own makes you a better (bioinformatic) scientist. The next problem will be easier. Given that OP has really a short time to get accepted (for a PhD if I interpret correctly) and that her situation is therefore far from optimal I think she can use a lot of help. A few hours ago she had never ran a piece of python code. But lying to your future supervisor is not a great start, even if the requests are unreasonable. OP will have to realize that she'll have to work extremely hard to make this work. I agree with the rest of your observations. ADD REPLY 1 Entering edit mode @WouterDeCoster, I respect and admire your non-judgmental approach to the problem (and no, I'm not being sarcastic). Yet even you acknowledge the dubious nature of "lying to your future supervisor", and assistance in this case abets that effort. Also, it's difficult to envision a scenario where deception by the OP ends well. But you're right - we're all adults here, and entitled to make our own choices. ADD REPLY 0 Entering edit mode This is not a direct response for @Ram's post but I will put it here so it does not get buried in this long thread. One thing we are forgetting is OP is from a country where most basic things many of us access easily are blocked (e.g. the other day she could not access docker website). I am not sure what exact PhD program she is trying to enroll in but programming is something a student can learn in the first semester or two (now-a-days students come from so diverse backgrounds for interdisciplinary programs that it is almost impossible to have a common yardstick for qualification). I have continually reminded her to be open about her expertise/difficulties she is facing in solving these problems with her prospective mentor. The questions in the last month or so are perhaps out of desperation as she apparently has been asked to do "one more thing" out of the left field. If she starts this placement on a wrong foot then this will not bode well for future. Hopefully she understands that. ADD REPLY 0 Entering edit mode That's a lot to ask from one person, usually there is one person assigned to each of those topics. But what do I know, I too am a student :| ADD REPLY 1 Entering edit mode An ugly one-liner cat <(awk -F'\t' '/$3=="gene"/ {print $1"\t"$2}' infile) <(grep -w db_xref infile) | sed -e 's/.*db_xref\tGeneID://' | paste -d " " - -


Probably the genbank file has no tab-separation, usually a test file is better to debug (one-liners get uglier) -

cat <(grep -w 'gene' infile | grep '^[1-9]*' | sed 's/\s\s+gene//') <(fgrep 'db_xref GeneID:' infile | ) | sed -e 's/.*db_xref GeneID://' | paste -d " " - -

0
Entering edit mode

sorry Rohit this is output of your code

the first column is gene ids but there is no start-end position in the output

https://postimg.org/image/ff5i4cz9d/

I should extract gene id and the correspond position

http://q07i.imgup.net/Screenshot9673.png

0
Entering edit mode

bash: command substitution: line 10: syntax error near unexpected token )'

bash: command substitution: line 10: fgrep 'db_xref GeneID:' feature.txt | )'

biprak4@noroc:~/Desktop$ADD REPLY 0 Entering edit mode Remove the extra "|" character ADD REPLY 0 Entering edit mode biprak4@noroc:~/fereshteh$ cat <(grep -w 'gene' feature.txt | grep '^[1-9]*' | sed 's/\s\s+gene//')

<(fgrep 'db_xref GeneID:' feature.txt | ) | sed -e 's/.*db_xref GeneID://' | paste -d - -

bash: command substitution: line 21: syntax error near unexpected token )'

bash: command substitution: line 21: fgrep 'db_xref GeneID:' feature.txt | )'

biprak4@noroc:~/fereshteh$ADD REPLY 0 Entering edit mode The extra "|" was at the infile cat <(grep -w 'gene' feature.txt | grep '^[1-9]*' | sed 's/\s\s+gene//') <(fgrep 'db_xref GeneID:' feature.txt) | sed -e 's/.*db_xref GeneID://' | paste -d - -  ADD REPLY 0 Entering edit mode I installed biopython biprak4@noroc:~/Desktop$ from Bio import SeqIO

0
Entering edit mode

It's python code, not meant to be executed in your shell

0
Entering edit mode

biprak4@noroc:~$python Python 2.7.9 (default, Jun 29 2016, 13:08:31) [GCC 4.9.2] on linux2 Type "help", "copyright", "credits" or "license" for more information. from Bio.Seq import Seq where I should place my data to be located by biopython ADD REPLY 1 Entering edit mode The piece of code written by Gungor Budak expects the genbank file(s) in the current directory. But I would recommend to save the following code as extractPositions.py (slightly modified) import sys from Bio import SeqIO record = SeqIO.read(sys.argv[1], 'genbank') for feature in record.features: if feature.type == 'gene': print 'Location:', feature.location print 'Locus tag:', feature.qualifiers['locus_tag']  And execute that as python extractPositions.py yourfile.gb (just in your shell, not in python). Since it will print to stdout you can also redirect the output to a file. ADD REPLY 0 Entering edit mode thank you I could figure out how should run this code for one virus (NC_025403.1) biprak4@noroc:~/fereshteh$ python extractPositions.py Achimota.gb

Location: 55:1846

Locus tag: ['OI48_gp1']

Location: 1855:3310

Locus tag: ['OI48_gp2']

Location: 3318:4756

Locus tag: ['OI48_gp3']

Location: 4825:6649

Locus tag: ['OI48_gp4']

Location: 6691:8645

Locus tag: ['OI48_gp5']

Location: 8735:15600

Locus tag: ['OI48_gp6']

biprak4@noroc:~/fereshteh$but I have 1534 NCBI accessions for viruses, how I can perform so for all of them because manually it takes days to download gene bank files one by one and then run this code for each of them http://u75i.imgup.net/Screenshot4fc6.png ADD REPLY 1 Entering edit mode This script can easily be adapted to run on those 1534 files sequentially and summarize the output. ADD REPLY 0 Entering edit mode sorry do you mean that I should download 1534 gene bank files one by one and run the code 1534 times? ADD REPLY 0 Entering edit mode I'm quite sure there should be a better way to download all the files in one go, I don't know how the ftp server is organized, perhaps wildcards can help. I didn't follow very well which files exactly you need :-/ Then we can modify the pythoncode to run on each .gb file in the directory and print the desired output, perhaps with a title or whatever information you would want. When is the deadline? ADD REPLY 0 Entering edit mode thank you, about Friday evening I am leaving university ADD REPLY 0 Entering edit mode this python code gives me whole of the Locus tags in a gene bank file, how I can extract the locus tags if only they encode for RNA dependent RNA polymerases and not all of them?? I changed the code like so import sys from Bio import SeqIO record = SeqIO.read(sys.argv[1], 'genbank') for feature in record.features: if feature.type == 'gene':  if 'RNA dependent RNA polymerase' not in product['gene'][0].lower():  continue print 'Location:', feature.location print 'Locus tag:', feature.qualifiers['locus_tag']  but error telling biprak4@noroc:~/fereshteh$ python extractPositions.py Achimota.gb

File "extractPositions.py", line 7

if 'RNA dependent RNA polymerase' not in gene.qualifiers['gene'][0].lower():

^


IndentationError: expected an indented block

biprak4@noroc:~/fereshteh$python extractPositions.py Achimota.gb File "extractPositions.py", line 7 if 'RNA dependent RNA polymerase' not in product['gene'][0].lower(): ^  IndentationError: expected an indented block biprak4@noroc:~/fereshteh$

0
Entering edit mode

Indentation is important in python, it makes code blocks. The code below should work or give another error ;)

import sys
from Bio import SeqIO
for feature in record.features:
if feature.type == 'gene':
if 'RNA dependent RNA polymerase' not in product['gene'][0].lower():
continue
print 'Location:', feature.location
print 'Locus tag:', feature.qualifiers['locus_tag']


I have my doubts about the use of product here, where does it come from? I don't know where you found that line of code. I like the solution of genomax2, in this comment C: Extract some features from a combined gene bank file in bash

As you can see, we will probably be able to fix your issue before the deadline. But you need to make sure that you understand what you are doing, and nog just running code you found online.

0
Entering edit mode
0
Entering edit mode

You should be able to download all genomes in a single file.

0
Entering edit mode

you know, I can download a gene bank file contains all of viruses genome but I only needs ssRNA viruses then I should first find a way to download gene bank file of ssRNA viruses then modify this python code because this code needs separate gene bank file and also code should only extract the genes if they encode for RNA dependent RNA polymerases

1
Entering edit mode

Do this:

1. Start at this link (these are all ssRNA viruses at NCBI, 1530 of them): https://www.ncbi.nlm.nih.gov/genomes/GenomesGroup.cgi?taxid=439488
2. Top right corner. "Retrieve sequences --> RefSeq Nucleotide".
3. On next page click on "Summary" change to "Genbank or Genbank(full)".
4. Click on "Send to" at top right. Select "file". Then "Create file".
5. That should get you a combined GenBank file with 2088 records as of now.
6. Run your python script (modify it to get the description/title).
7. Grep out the records for "RNA polymerase"
0
Entering edit mode

thank you and excuse me

I performed until step 5

https://postimg.org/image/6wkugz49t/

biprak4@noroc:~/fereshteh$python extractPositions.py ssRNA.gb Traceback (most recent call last): File "extractPositions.py", line 4, in <module> record = SeqIO.read(sys.argv[1], 'genbank')  File "/home/koriege/python2/Bio/SeqIO/__init__.py", line 676, in read raise ValueError("More than one record found in handle")  ValueError: More than one record found in handle biprak4@noroc:~/fereshteh$

0
Entering edit mode

That code is designed to work on one GenBank record. You have 2088.

Can you see if you are able to grep "RAN polymerase"? Then you should be able to find the lines around that that you need (look at the -A -B options for grep).

0
Entering edit mode

thank you friends from heart,

I should back to hostel in my lap top I don't have python to go on in hostel

thanks again

0
Entering edit mode

That's because the python code is written for genbank files containing just 1 record.

You'll likely be able to fix this (not sure) by swapping SeqIO.read with SeqIO.parse which returns an iterator allowing you to loop over all records.

I'm going to karate now, I can look further into this later today and/or tomorrow. Have a look at the biopython cookbook, that might also help ;)

0
Entering edit mode

I think SeqIO.parse worked for all but another error about feature

biprak4@noroc:~/fereshteh$python extractPositions.py ssRNA.gb Traceback (most recent call last): File "extractPositions.py", line 5, in <module> for feature in record.features:  AttributeError: 'generator' object has no attribute 'features' biprak4@noroc:~/fereshteh$

0
Entering edit mode

You have to iterate over the generator object

Try something like the following:

import sys
from Bio import SeqIO
records = SeqIO.parse(sys.argv[1], 'genbank')
for record in records:
for feature in record.features:


And follow this by the rest of your code. So this code loops over each record in all records from the file (I usually use the singular as name when I loop over the plural, that's intuitive). Then for each record it loops over the features.

0
Entering edit mode

strong textthis is the final code I ran

import sys

from Bio import SeqIO

records = SeqIO.parse(sys.argv[1], 'genbank')

for record in records:

for feature in record.features:

print 'Location:', feature.location

print 'Locus tag:', feature.qualifiers['locus_tag']


biprak4@noroc:~/fereshteh$python extractPositions.py ssRNA.gb Location: 0:6719 Locus tag: Traceback (most recent call last): File "extractPositions.py", line 7, in <module> print 'Locus tag:', feature.qualifiers['locus_tag']  KeyError: 'locus_tag' biprak4@noroc:~/fereshteh$

1
Entering edit mode

We will fix this tomorrow morning. Almost there!

1
Entering edit mode

Could you check whether this is what you need?

import sys
from Bio import SeqIO

print("protein_id\tlocation")
for record in SeqIO.parse(sys.argv[1], 'genbank'):
for feature in record.features:
if feature.type == 'CDS' and "RNA-dependent RNA polymerase" in feature.qualifiers['product']:
print(str(feature.qualifiers['protein_id'][0])  + '\t' +  str(feature.location))

0
Entering edit mode

she told I have 1530 NCBI gene bank file from ssRNA viruses while these viruses depend on RNA dependent RNA polymerase but I have only found about 100, therefore might be some one then have another name and should run a more flexible search on this gene bank file...

anyway, thank you that you and @ genomax2 did not leave me alone with such a stressful situation

I am sure she will reject me by 99 percent but I tried what I could during these months she told this a mathematics and computer department and people should script... she told you know some basic in data analysis not bioinformatics and suggested to apply for labs where need someone to analyze their data sets and before that I should learn a programming language

now this is a cloudy morning in Jena and I am thinking how to start from scratch

I never told her lie about my skills I sent my CV for her

however

thank you

0
Entering edit mode

If you ask me she is playing a cruel and inappropriate student selection game, but I wish you all the best in the following days.

0
Entering edit mode

OK my friends, she rejected me

she told she is not never wrong about her decision really I don't know what to say

how it is good that we have tear for such an unexplainable situation...

0
Entering edit mode

It wasn't going to work out anyway - events like these make sense only in retrospect. Time to use your backup options.