Hello All,
As I am using SNPiR (A pipeline for the detection of SNPs in RNA seq data) for identification of RNA-seq variants.
So during the step of filtering varients, I am using perl script to remove mismatches at 5' read ends with input of raw variants. My input raw_variants.txt file contains 46597831 total no of lines.
So the problem is this script is taking too long even after running for 5 days on 12 cores , it did't got finished.
It would be really helpful to know any alternative way to this step ?
Any help! Thanks a lot.
Post code and some lines from the variant file if you are seeking help to improve efficiency. Otherwise it might be helpful to split the raw_variants.txt file into multiple smaller files using the unix split command and running the perl script on the smaller files in parallel on a computing cluster.
@Vivek Thanks a lot for replying!
Here is the code :
!/usr/bin/perl
use lib qw(../);
use strict; use warnings; use diagnostics; use Getopt::Long; use SNPiR::config;
use File::Spec;
use File::Basename;
use Data::Dumper;
use File::Basename;
<h6>#</h6> <h6>#</h6>$Revision: $
Authors: Robert Piskol ( piskol@stanford.edu ), Gokul Ramaswami ( gokulr@stanford.edu )
Last modification $Author: piskol $
perl script that runs converts reads mapping to splice reference back to genomic coordinates
my ($INPUTFILE,$OUTPUTFILE,$BAMFILE,$ILLUMINAQUALS,$HELP); my $QUALOFFSET = 33; my $MINBASEQUAL = 25; parse_command_line();
sub parse_command_line { my $help;
}
my $minmismatch = 1;
open (my $INPUT , "<", $INPUTFILE) or die "error opening inputfile: $!\n"; open (my $OUTPUT, ">", $OUTPUTFILE) or die "error opening outputfile: $!\n"; open (my $OUTPUT_failed, ">", $OUTPUTFILE."_failed") or die "error opening outputfile: $!\n";
while (<$INPUT>) { chomp; my @fields = split; my $TEMPNAME = join '', $OUTPUTFILE,'_tmp'; my ($chr, $position) = ($fields[0], $fields[1]); my $bamposition = join '', $chr,':',$position,'-',$position; system("$SAMTOOLSEXE view $BAMFILE $bamposition > $TEMPNAME"); my $editnuc = $fields[4]; my ($newcov, $newmismatch) = (0,0); my $basequalFail; my $readPosFail;
} close $INPUT;
close $OUTPUT; close $OUTPUT_failed;
sub usage(){ print<<eof;< p="">
EOF
exit 1; }
Here are the few lines from Variants file :
chr1 14555 15, C . 0 chr1 14556 15, T . 0 chr1 14557 15, C . 0 chr1 14558 14, C . 0 chr1 14559 14, T . 0 chr1 14560 14, T . 0 chr1 14561 14, G . 0 chr1 14562 14, A . 0 chr1 14563 14, A . 0 chr1 14564 14, G . 0
Thanks!
Can you please reformat the code, rather than the code it would be important to see the proper formatting of the variant file.
chr1 14556 15, T . 0 chr1 14557 15, C . 0 chr1 14558 14, C . 0 chr1 14559 14, T . 0 chr1 14560 14, T . 0 chr1 14561 14, G . 0 chr1 14562 14, A . 0 chr1 14563 14, A . 0 chr1 14564 14, G . 0
Sorry about the format. Its auto correcting by itself.
It is a bit annoying way to understand both your code and the variant format file. I would still suggest you to format both your code and variant output file. It does not seem to me however in vcf format, else I could have suggested
vcflib
orvcftools
or Pierre's blog for the same.Actually regarding the variant file, in this pipeline for the first step is the unix script which converts vcf file into text file. So after that text file along with the bam file is the input for above perl script like
filter_mismatch_first6bp.pl -infile raw_variants.txt -outfile raw_variants.rmhex.txt -bamfile realignedBam.bam
Do you intend to use only this pipeline for your analysis or you can also take a look at the below answer using GATK for calling variants from RNA-Seq. The most important part for you is to understand what is happening at each step.
Actually I am more focusing towards identification of RNA editing sites. That is the reason I am using this pipeline.
Yes you can use the GATK pipeline for the same as well. You can take a look at this paper which uses RNA-Seq data for finding RNA-editing sites by calling variants with GATK, since it is a bit old you can always follow the latest pipeline to call the variants and work a way round as mentioned in the paper to find the RNA-editing sites. If you need help with a pipeline I can redirect to my github profile for the same where I have written an automated pipeline for variant calling with RNA-Seq data and then you can tweek it to your purpose and follow the mentioned publication to cater to your needs.
I have already finished the variant calling part. Fo it I have used from the above paper itself . Actually they are all the same author which they published the SNPiR. Right now I am stuck at the filtering part of variant Which I have mentioned in my question.