Question: Problem in subsetting a gtf based on ids in another file
0
gravatar for newbie
24 days ago by
newbie70
newbie70 wrote:

I have a Final.gtf that looks like below:

chr17   StringTie   transcript  48525369    48528212    1000    -   .   gene_id "MSTRG.100005"; transcript_id "MSTRG.100005.1";  class_code "u"; transcript_length "863"; lncRNA_type "LincRNA"; 
chr17   StringTie   exon    48525369    48525461    1000    -   .   gene_id "MSTRG.100005"; transcript_id "MSTRG.100005.1"; exon_number "1";  class_code "u"; transcript_length "863"; lncRNA_type "LincRNA"; 
chr17   StringTie   exon    48527443    48528212    1000    -   .   gene_id "MSTRG.100005"; transcript_id "MSTRG.100005.1"; exon_number "2";  class_code "u"; transcript_length "863"; lncRNA_type "LincRNA";
chr17   StringTie   transcript  48539225    48540614    1000    +   .   gene_id "MSTRG.100008"; transcript_id "MSTRG.100008.1";  class_code "u"; transcript_length "1390"; lncRNA_type "LincRNA"; 
chr17   StringTie   exon    48539225    48540614    1000    +   .   gene_id "MSTRG.100008"; transcript_id "MSTRG.100008.1"; exon_number "1";  class_code "u"; transcript_length "1390"; lncRNA_type "LincRNA"; 
chr17   StringTie   transcript  49197365    49198218    1000    -   .   gene_id "MSTRG.100023"; transcript_id "MSTRG.100023.1";  class_code "u"; transcript_length "854"; lncRNA_type "LincRNA"; 
chr17   StringTie   exon    49197365    49198218    1000    -   .   gene_id "MSTRG.100023"; transcript_id "MSTRG.100023.1"; exon_number "1";  class_code "u"; transcript_length "854"; lncRNA_type "LincRNA";
chr17   StringTie   transcript  49191640    49196359    1000    +   .   gene_id "MSTRG.100024"; transcript_id "MSTRG.100024.1";  class_code "u"; transcript_length "4720"; lncRNA_type "LincRNA"; 
chr17   StringTie   exon    49191640    49196359    1000    +   .   gene_id "MSTRG.100024"; transcript_id "MSTRG.100024.1"; exon_number "1";  class_code "u"; transcript_length "4720"; lncRNA_type "LincRNA"; 
chr17   StringTie   transcript  49198593    49199192    1000    -   .   gene_id "MSTRG.100025"; transcript_id "MSTRG.100025.1";  class_code "u"; transcript_length "600"; lncRNA_type "LincRNA";

I also have a Int.csv file with ids like below: (The original csv file is with 55,000 ids)

"t_id"
"MSTRG.100023.1"
"MSTRG.100031.2"
"MSTRG.100038.1"
"MSTRG.100066.1"
"MSTRG.10013.2"
"MSTRG.10013.3"
"MSTRG.100143.1"
"MSTRG.100147.1"
"MSTRG.100150.1

I am working on cluster (SLURM). I used following command.

grep -f Int.csv Final.gtf > Transcripts_step.gtf

It is taking huge time so, I wrote a script and submitted a job using sbatch. For this I created run.sh script like below:

#!/bin/bash

# Set shell for job
#SBATCH --cpus-per-task=8
#SBATCH --mem=240G
#SBATCH --time=05:59:59

if [ -f ~/.bashrc ] ; then
    . ~/.bashrc
fi

echo "##CMD:>" $*
echo "##" `date`
eval $*
echo "##" `date`

Then I submitted a job like below:

sbatch run.sh "grep -f Int.csv Final.gtf > Transcripts_step.gtf"

The job had run more than 2 hours and the job got killed.

##CMD:> grep -f Int.csv Final.gtf > Transcripts_step.gtf

/var/lib/slurm/slurmd/job13500184/slurm_script: line 21: 21411 Killed                  
grep -f Int.csv Final.gtf > Transcripts_step.gtf

slurmstepd: error: Detected 1 oom-kill event(s) in step 13500184.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

What could be the problem and how to resolve this?

bash shell slurm grep gtf • 83 views
ADD COMMENTlink written 24 days ago by newbie70

Even with 240G of RAM the job was killed. Well ...

You may want to take a look at AGAT toolkit and see if any of the filter options would work here. You may also want to try a fixed string (-F) search with grep to speed things up.

ADD REPLYlink modified 24 days ago • written 24 days ago by genomax87k

you mean grep -F Int.csv Final.gtf > Transcripts_step.gtf ?

ADD REPLYlink written 24 days ago by newbie70

In general, grep -f does not scale well. For tasks similar to this, here's what I had done previously:

  1. Modify the GTF file to add two more columns. You can use awk for this. One column is just a serial number (if you are using awk you can use NR for this) and the second column in your case would be the MSTRG identifier copied from column 9 of GTF.
  2. Sort the "table" on identifier column
  3. Use join command to join the ID file and the modified GTF file
  4. Sort the output from step 3 on the line number column added in step 1 (this is so that the lines in GTF remain in the same order they were in the starting file) and remove that column from the output to get a valid GTF file.
ADD REPLYlink modified 24 days ago • written 24 days ago by vkkodali2.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 689 users visited in the last hour