Question

How to extract specific lines from a GTF file?

0

Entering edit mode

2.2 years ago

Shraddha ▴ 90

Hello all,

I'm trying to extract the lines containing a specific list of genes, from a gtf file. My pattern file looks like this:

LOC115705147 
LOC115716814 
LOC115713105

My reference file looks like this (gene IDs in the 9th column next to gene_id):

NC_044371.1     Gnomon  gene    3273    3552    .       -       .       gene_id "LOC115705401"; db_xref "GeneID:115705401"; gbkey "Gene"; gene "LOC115705401"; gene_biotype "lncRNA";
NC_044371.1     Gnomon  exon    3539    3552    .       -       .       gene_id "LOC115705401"; transcript_id "XR_004009451.1"; db_xref "GeneID:115705401"; gbkey "ncRNA"; gene "LOC115705401"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 2 samples with support for all annotated introns"; product "uncharacterized LOC115705401"; exon_number "1";
NC_044371.1     Gnomon  exon    3384    3456    .       -       .       gene_id "LOC115705401"; transcript_id "XR_004009451.1"; db_xref "GeneID:115705401"; gbkey "ncRNA"; gene "LOC115705401"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 2 samples with support for all annotated introns"; product "uncharacterized LOC115705401"; exon_number "2";

If I use grep on a single gene ID, then it gives me a set of results I need. If I try to use it on my list of genes, though, I get an empty file. Here are the things I've tried:

grep -f genes.txt anno.gtf > results.txt
cat genes.txt | xargs -I {} grep -f {} anno.gtf > results.txt
while read -r line; do grep -f $line anno.gtf > results.txt done < genes.txt

I'm struggling to see what the issue could be, since the parts work individually (e.g., when I use the while read loop and put in an echo $line command, it works fine). Does someone know what the issue is/another way to extract these?

Thanks in advance!

command-line grep loop • 2.6k views

ADD COMMENT • link updated 2.2 years ago by Juke34 8.5k • written 2.2 years ago by Shraddha ▴ 90

1

Entering edit mode

Except the problem related to the CR/LF you needed to fix you might prefer to use a tool like agat_sp_filter_feature_from_keep_list.pl from AGAT.
It might avoid problems like in a case you want to catch a gene "LOC115705401" and "LOC115705401.2" would exist too (with grep both would be taken).

ADD REPLY • link 2.2 years ago by Juke34 8.5k

0

Entering edit mode

. If I try to use it on my list of genes, though, I get an empty file.

your genes are not in the gtf.

ADD REPLY • link 2.2 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

They're in the 9th column under gene_id. If I just grep any one of those genes with the reference annotation file, I get all the lines that contain that gene in the 9th column.

ADD REPLY • link 2.2 years ago by Shraddha ▴ 90

0

Entering edit mode

then there is probably something like a space or any invisible character in "genes.txt", or "genes.txt" is a windows file with CRLF end of lines etc...

ADD REPLY • link 2.2 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

show me the output of

grep -F LOC115705147 genes.txt anno.gtf | cat -A

ADD REPLY • link 2.2 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Responding to both comments here:

That seems likely. I got these lists by saving sheets of an excel file as a csv, which I then converted to a txt (not the optimal process, of course, but that's just sort of how it happened). Do you know how I can fix that?
Did you mean grep -F LOC115605147 anno.gtf | cat -A? If so, the results are:

NC_044371.1^IGnomon^Igene^I22449283^I22450610^I.^I+^I.^Igene_id "LOC115705147"; db_xref "GeneID:115705147"; gbkey "Gene"; gene "LOC115705147"; gene_biotype "protein_coding"; $ NC_044371.1^IGnomon^Iexon^I22449283^I22449484^I.^I+^I.^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "mRNA"; gene "LOC115705147"; model_evidence "Supporting evidence includes similarity to: 1 EST, 26 Proteins, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 76 samples with support for all annotated introns"; product "cysteine proteinase inhibitor 12"; exon_number "1"; $ NC_044371.1^IGnomon^Iexon^I22449570^I22449792^I.^I+^I.^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "mRNA"; gene "LOC115705147"; model_evidence "Supporting evidence includes similarity to: 1 EST, 26 Proteins, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 76 samples with support for all annotated introns"; product "cysteine proteinase inhibitor 12"; exon_number "2"; $ NC_044371.1^IGnomon^Iexon^I22449904^I22450049^I.^I+^I.^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "mRNA"; gene "LOC115705147"; model_evidence "Supporting evidence includes similarity to: 1 EST, 26 Proteins, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 76 samples with support for all annotated introns"; product "cysteine proteinase inhibitor 12"; exon_number "3"; $ NC_044371.1^IGnomon^Iexon^I22450118^I22450610^I.^I+^I.^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "mRNA"; gene "LOC115705147"; model_evidence "Supporting evidence includes similarity to: 1 EST, 26 Proteins, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 76 samples with support for all annotated introns"; product "cysteine proteinase inhibitor 12"; exon_number "4"; $ NC_044371.1^IGnomon^ICDS^I22449314^I22449484^I.^I+^I0^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "CDS"; gene "LOC115705147"; product "cysteine proteinase inhibitor 12"; protein_id "XP_030488248.1"; exon_number "1"; $ NC_044371.1^IGnomon^ICDS^I22449570^I22449792^I.^I+^I0^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "CDS"; gene "LOC115705147"; product "cysteine proteinase inhibitor 12"; protein_id "XP_030488248.1"; exon_number "2"; $ NC_044371.1^IGnomon^ICDS^I22449904^I22450049^I.^I+^I2^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "CDS"; gene "LOC115705147"; product "cysteine proteinase inhibitor 12"; protein_id "XP_030488248.1"; exon_number "3"; $ NC_044371.1^IGnomon^ICDS^I22450118^I22450252^I.^I+^I0^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "CDS"; gene "LOC115705147"; product "cysteine proteinase inhibitor 12"; protein_id "XP_030488248.1"; exon_number "4"; $ NC_044371.1^IGnomon^Istart_codon^I22449314^I22449316^I.^I+^I0^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "CDS"; gene "LOC115705147"; product "cysteine proteinase inhibitor 12"; protein_id "XP_030488248.1"; exon_number "1"; $ NC_044371.1^IGnomon^Istop_codon^I22450253^I22450255^I.^I+^I0^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "CDS"; gene "LOC115705147"; product "cysteine proteinase inhibitor 12"; protein_id "XP_030488248.1"; exon_number "4"; $

I assume the ^| there are stand ins for the tab that separates the columns?

ADD REPLY • link 2.2 years ago by Shraddha ▴ 90

0

Entering edit mode

Did you mean grep -F LOC115605147 anno.gtf

no I really meant

grep -F LOC115705147 genes.txt anno.gtf | cat -A

ADD REPLY • link 2.2 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Oh sorry, in that case, it didn't show anything.

ADD REPLY • link 2.2 years ago by Shraddha ▴ 90

score 0 · Answer 1 · 2022-01-19

0

Entering edit mode

2.2 years ago

Pierre Lindenbaum 161k

I got these lists by saving sheets of an excel file as a csv

you're probably having a CR/LF problem. https://en.wikipedia.org/wiki/Newline#Issues_with_different_newline_formats

tr -d '\r' < genes.txt > new.genes.txt