Question

Problem in subsetting a gtf based on ids in another file

0

Entering edit mode

3.8 years ago

newbie ▴ 120

I have a Final.gtf that looks like below:

chr17   StringTie   transcript  48525369    48528212    1000    -   .   gene_id "MSTRG.100005"; transcript_id "MSTRG.100005.1";  class_code "u"; transcript_length "863"; lncRNA_type "LincRNA"; 
chr17   StringTie   exon    48525369    48525461    1000    -   .   gene_id "MSTRG.100005"; transcript_id "MSTRG.100005.1"; exon_number "1";  class_code "u"; transcript_length "863"; lncRNA_type "LincRNA"; 
chr17   StringTie   exon    48527443    48528212    1000    -   .   gene_id "MSTRG.100005"; transcript_id "MSTRG.100005.1"; exon_number "2";  class_code "u"; transcript_length "863"; lncRNA_type "LincRNA";
chr17   StringTie   transcript  48539225    48540614    1000    +   .   gene_id "MSTRG.100008"; transcript_id "MSTRG.100008.1";  class_code "u"; transcript_length "1390"; lncRNA_type "LincRNA"; 
chr17   StringTie   exon    48539225    48540614    1000    +   .   gene_id "MSTRG.100008"; transcript_id "MSTRG.100008.1"; exon_number "1";  class_code "u"; transcript_length "1390"; lncRNA_type "LincRNA"; 
chr17   StringTie   transcript  49197365    49198218    1000    -   .   gene_id "MSTRG.100023"; transcript_id "MSTRG.100023.1";  class_code "u"; transcript_length "854"; lncRNA_type "LincRNA"; 
chr17   StringTie   exon    49197365    49198218    1000    -   .   gene_id "MSTRG.100023"; transcript_id "MSTRG.100023.1"; exon_number "1";  class_code "u"; transcript_length "854"; lncRNA_type "LincRNA";
chr17   StringTie   transcript  49191640    49196359    1000    +   .   gene_id "MSTRG.100024"; transcript_id "MSTRG.100024.1";  class_code "u"; transcript_length "4720"; lncRNA_type "LincRNA"; 
chr17   StringTie   exon    49191640    49196359    1000    +   .   gene_id "MSTRG.100024"; transcript_id "MSTRG.100024.1"; exon_number "1";  class_code "u"; transcript_length "4720"; lncRNA_type "LincRNA"; 
chr17   StringTie   transcript  49198593    49199192    1000    -   .   gene_id "MSTRG.100025"; transcript_id "MSTRG.100025.1";  class_code "u"; transcript_length "600"; lncRNA_type "LincRNA";

I also have a Int.csv file with ids like below: (The original csv file is with 55,000 ids)

"t_id"
"MSTRG.100023.1"
"MSTRG.100031.2"
"MSTRG.100038.1"
"MSTRG.100066.1"
"MSTRG.10013.2"
"MSTRG.10013.3"
"MSTRG.100143.1"
"MSTRG.100147.1"
"MSTRG.100150.1

I am working on cluster (SLURM). I used following command.

grep -f Int.csv Final.gtf > Transcripts_step.gtf

It is taking huge time so, I wrote a script and submitted a job using sbatch. For this I created run.sh script like below:

#!/bin/bash

# Set shell for job
#SBATCH --cpus-per-task=8
#SBATCH --mem=240G
#SBATCH --time=05:59:59

if [ -f ~/.bashrc ] ; then
    . ~/.bashrc
fi

echo "##CMD:>" $*
echo "##" `date`
eval $*
echo "##" `date`

Then I submitted a job like below:

sbatch run.sh "grep -f Int.csv Final.gtf > Transcripts_step.gtf"

The job had run more than 2 hours and the job got killed.

##CMD:> grep -f Int.csv Final.gtf > Transcripts_step.gtf

/var/lib/slurm/slurmd/job13500184/slurm_script: line 21: 21411 Killed                  
grep -f Int.csv Final.gtf > Transcripts_step.gtf

slurmstepd: error: Detected 1 oom-kill event(s) in step 13500184.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

What could be the problem and how to resolve this?

gtf slurm grep shell bash • 1.1k views

ADD COMMENT • link 3.8 years ago by newbie ▴ 120

0

Entering edit mode

Even with 240G of RAM the job was killed. Well ...

You may want to take a look at AGAT toolkit and see if any of the filter options would work here. You may also want to try a fixed string (-F) search with grep to speed things up.

ADD REPLY • link 3.8 years ago by GenoMax 142k

0

Entering edit mode

you mean grep -F Int.csv Final.gtf > Transcripts_step.gtf ?

ADD REPLY • link 3.8 years ago by newbie ▴ 120

0

Entering edit mode

In general, grep -f does not scale well. For tasks similar to this, here's what I had done previously:

Modify the GTF file to add two more columns. You can use awk for this. One column is just a serial number (if you are using awk you can use NR for this) and the second column in your case would be the MSTRG identifier copied from column 9 of GTF.
Sort the "table" on identifier column
Use join command to join the ID file and the modified GTF file
Sort the output from step 3 on the line number column added in step 1 (this is so that the lines in GTF remain in the same order they were in the starting file) and remove that column from the output to get a valid GTF file.

ADD REPLY • link 3.8 years ago by vkkodali_ncbi ★ 3.7k