How to remove Mitochondrial genes from Human annotation file (.GTF)?
9.4 years ago
M K ▴ 660

Hi All,

I want to remove mitochondrial genes from a human annotation file (.GTF).

grep -v '^ChrM' <your.gtf>
'^ChrM'

Edited, thanks. I was just trying to indicate that grep -v would work in this case.

9.4 years ago

If the GTF is from Ensembl:

grep -v "^MT" genes.gtf > genes_noMT.gtf

I don't recommend doing this, as it would remove lines containing a havana_transcript ID (e.g., OTTHUMT00000002421.3). Also, what about genes containing MT in gene_name?

As you posted before, grep -v '^chrM' should work.


Ensembl GTFs do not have a 'chr' prefix. The sequences are simply named 1, 2, 3 ... MT, X, Y. The '^' anchor means the pattern will not match the havana_transcript ID, because it looks for MT only at the beginning of the line.
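
To illustrate the point about the anchor, here is a minimal sketch on a made-up three-line GTF. The file name toy.gtf and its records are invented for illustration and are not taken from a real Ensembl release. The idea is that only the line whose first column starts with MT should be dropped, while lines that merely contain MT elsewhere (a gene_name such as MTOR, or a havana_transcript ID such as OTTHUMT00000002421.3) should be kept.

```shell
# Toy GTF with Ensembl-style sequence names; all records are invented for illustration.
tmpdir=$(mktemp -d)
printf '1\thavana\ttranscript\t100\t200\t.\t+\t.\tgene_name "MTOR"; havana_transcript "OTTHUMT00000002421.3";\n' > "$tmpdir/toy.gtf"
printf 'MT\tensembl\tgene\t3307\t4262\t.\t+\t.\tgene_name "MT-ND1";\n' >> "$tmpdir/toy.gtf"
printf 'X\tensembl\tgene\t500\t900\t.\t-\t.\tgene_name "MTM1";\n' >> "$tmpdir/toy.gtf"

# '^MT' only matches MT at the very start of a line, i.e. in the sequence-name column.
grep -v '^MT' "$tmpdir/toy.gtf" > "$tmpdir/toy_noMT.gtf"
kept_anchored=$(grep -vc '^MT' "$tmpdir/toy.gtf")

# Without the anchor, every line containing "MT" anywhere is removed.
# grep -c exits non-zero when the count is 0, hence the "|| true".
kept_unanchored=$(grep -vc 'MT' "$tmpdir/toy.gtf" || true)

echo "kept with '^MT': $kept_anchored; kept with 'MT': $kept_unanchored"
```

On this toy input the anchored pattern keeps the two non-mitochondrial lines, whereas the unanchored pattern keeps none, which is the behaviour the concern about havana_transcript IDs would apply to.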


I tried both grep commands on the Ensembl annotation file for GRCh37 release 75 (Homo_sapiens.GRCh37.75.gtf) and counted the lines with wc -l. The original GTF file has 2828317 lines.

When I used grep -v "^MT", the line count decreased, as shown below:

grep -v "^MT" Homo_sapiens.GRCh37.75.gtf > Homo_sapiens.GRCh37.75_noMT.gtf
wc -l Homo_sapiens.GRCh37.75_noMT.gtf
2828173  Homo_sapiens.GRCh37.75_noMT.gtf

With grep -v "^ChrM", however, the line count was the same as in the original file:

grep -v "^ChrM" Homo_sapiens.GRCh37.75.gtf > Homo_sapiens.GRCh37.75_noMT_M.gtf
wc -l Homo_sapiens.GRCh37.75_noMT_M.gtf
2828317 Homo_sapiens.GRCh37.75_noMT_M.gtf

Could anyone explain that?

If you have an Ensembl GTF, use grep -v "^MT". grep matches a pattern, and with -v it removes the lines that contain that pattern, so the right pattern depends on how the mitochondrial sequence is named in your file. Ensembl names it MT, while other sources (UCSC, for example) name it chrM. Since you are using an Ensembl file, the pattern '^ChrM' does not match any lines, so the number of lines stays the same.

Read a tutorial to understand grep. Here is one: http://rous.mit.edu/index.php/Unix_commands_applied_to_bioinformatics#grep
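
Since the right pattern depends on how the file names the mitochondrial sequence, it can help to look at the first column before filtering. The following is a rough sketch, assuming a tab-delimited GTF. The file name annotation.gtf and its contents are placeholders created here only so the commands run as-is. It lists the sequence names with their line counts, then drops records whose first column is exactly MT or chrM using awk, which also avoids matching any other sequence name that merely begins with MT.

```shell
# Placeholder input so the sketch is runnable; replace with your own GTF.
tmpdir=$(mktemp -d)
gtf="$tmpdir/annotation.gtf"
printf '#!genome-build placeholder\n' > "$gtf"
printf '1\tensembl\tgene\t100\t200\t.\t+\t.\tgene_id "G1";\n' >> "$gtf"
printf 'MT\tensembl\tgene\t300\t400\t.\t+\t.\tgene_id "G2";\n' >> "$gtf"
printf 'MT\tensembl\texon\t300\t400\t.\t+\t.\tgene_id "G2";\n' >> "$gtf"

# 1. See which sequence names the file uses (header lines starting with # are skipped).
grep -v '^#' "$gtf" | cut -f1 | sort | uniq -c

# 2. Count the lines the filter is expected to remove.
n_mito=$(awk -F'\t' '$1 == "MT" || $1 == "chrM" {n++} END {print n+0}' "$gtf")

# 3. Keep every line whose first column is neither MT nor chrM (header lines are kept too).
awk -F'\t' '$1 != "MT" && $1 != "chrM"' "$gtf" > "$tmpdir/annotation_noMT.gtf"

# 4. Sanity check: the drop in line count should equal the number of mitochondrial lines.
n_before=$(grep -c '' "$gtf")
n_after=$(grep -c '' "$tmpdir/annotation_noMT.gtf")
echo "removed $((n_before - n_after)) lines, expected $n_mito"
```

The same sanity check applies to the numbers reported above: 2828317 - 2828173 = 144 lines were removed by grep -v "^MT", which should equal the number of lines whose sequence name is MT in that file.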


Thanks a lot Geek and Manu.
