Error in identification of genomic variants from RNA-seq.
1
0
Entering edit mode
8.5 years ago
Ram ▴ 190

Hello All,

As I am using SNPiR (A pipeline for the detection of SNPs in RNA seq data) for identification of RNA-seq variants.

So during the step of filtering varients, I am using perl script to remove mismatches at 5' read ends with input of raw variants. My input raw_variants.txt file contains 46597831 total no of lines.

So the problem is this script is taking too long even after running for 5 days on 12 cores , it did't got finished.

It would be really helpful to know any alternative way to this step ?

Any help! Thanks a lot.

RNA-Seq SNP • 3.3k views
ADD COMMENT
0
Entering edit mode

Post code and some lines from the variant file if you are seeking help to improve efficiency. Otherwise it might be helpful to split the raw_variants.txt file into multiple smaller files using the unix split command and running the perl script on the smaller files in parallel on a computing cluster.

ADD REPLY
0
Entering edit mode

@Vivek Thanks a lot for replying!

Here is the code :

!/usr/bin/perl

use lib qw(../);

use strict; use warnings; use diagnostics; use Getopt::Long; use SNPiR::config;

use File::Spec;

use File::Basename;

use Data::Dumper;

use File::Basename;

<h6>#</h6> <h6>#</h6>

$Revision: $

Authors: Robert Piskol ( piskol@stanford.edu ), Gokul Ramaswami ( gokulr@stanford.edu )

Last modification $Author: piskol $

perl script that runs converts reads mapping to splice reference back to genomic coordinates

my ($INPUTFILE,$OUTPUTFILE,$BAMFILE,$ILLUMINAQUALS,$HELP); my $QUALOFFSET = 33; my $MINBASEQUAL = 25; parse_command_line();

sub parse_command_line { my $help;

usage() if (scalar @ARGV==0);

&GetOptions(
    "infile=s" => \$INPUTFILE,
    "outfile=s" => \$OUTPUTFILE,
    "bamfile=s" => \$BAMFILE,
    "minbasequal=s" => \$MINBASEQUAL,
    "illumina" => \$ILLUMINAQUALS,
    "help" => \$HELP
    );

usage() if ($help);

if(!$INPUTFILE || !$OUTPUTFILE || !$BAMFILE){
    usage();
}
if($ILLUMINAQUALS){
    $QUALOFFSET = 64;
}

}

my $minmismatch = 1;

open (my $INPUT , "<", $INPUTFILE) or die "error opening inputfile: $!\n"; open (my $OUTPUT, ">", $OUTPUTFILE) or die "error opening outputfile: $!\n"; open (my $OUTPUT_failed, ">", $OUTPUTFILE."_failed") or die "error opening outputfile: $!\n";

while (<$INPUT>) { chomp; my @fields = split; my $TEMPNAME = join '', $OUTPUTFILE,'_tmp'; my ($chr, $position) = ($fields[0], $fields[1]); my $bamposition = join '', $chr,':',$position,'-',$position; system("$SAMTOOLSEXE view $BAMFILE $bamposition > $TEMPNAME"); my $editnuc = $fields[4]; my ($newcov, $newmismatch) = (0,0); my $basequalFail; my $readPosFail;

open(my $TMPFILE, "<", $TEMPNAME);
while (<$TMPFILE>) {
    #print $_;
    chomp;
    my @bamfields = split;
    my ($alignment, $readstart, $cigar, $sequence, $qualities) = ($bamfields[1], $bamfields[3], $bamfields[5], $bamfields[9], $bamfields[10]);
    my @sequencebases = split(//,$sequence);
    my @qualscores = split(//,$qualities);
    my ($currentpos, $readpos) = ($readstart,1);
    my $base_readpos;
    my @cigarnums = split(/[MIDNSHP]/, $cigar);
    my @cigarletters = split(/[0-9]+/,$cigar);
    shift @cigarletters;

    for (my $i = 0; $i < @cigarnums; $i++) {
        if ($cigarletters[$i] =~ m/[ISH]/) {
            $readpos = $readpos + $cigarnums[$i];
        }
        elsif ($cigarletters[$i] =~ m/[DN]/) {
            $currentpos = $currentpos + $cigarnums[$i];
        }
        elsif ($cigarletters[$i] =~ m/M/) {
            for (my $j = 0; $j < $cigarnums[$i]; $j++) {
                $base_readpos = $readpos if ($currentpos == $position);
                $currentpos++;
                $readpos++; 
            }   
        }
    }
    next unless $base_readpos;
    my $revstrand = 0;
    $revstrand = 1 if ($alignment & 16);
    if (($revstrand == 0 && $base_readpos > 6) || ($revstrand == 1 && $base_readpos < $readpos - 5)) {
        if (ord($qualscores[$base_readpos-1]) >= $MINBASEQUAL+$QUALOFFSET) {
            $newcov++;
            $newmismatch++ if ($sequencebases[$base_readpos-1] eq $editnuc);    
        }
        else{
            #print "$bamposition failing quality ".ord($qualscores[$base_readpos-1])."\n";
            $basequalFail=1;
        }
    }
    else{
        $readPosFail=1;
        #print "$bamposition failing position\n";
    }
}
system("rm $TEMPNAME");
if ($newmismatch >= $minmismatch) {
    my $varfreq = sprintf("%.3f", $newmismatch/$newcov);
    print $OUTPUT "$fields[0]\t$fields[1]\t$newcov,$newmismatch\t$fields[3]\t$fields[4]\t$varfreq";
    for (my $i = 6; $i < @fields; $i++) {
        print $OUTPUT "\t$fields[$i]";
    }
    print $OUTPUT "\n";
}
if ($newmismatch < $minmismatch) {
    my $explain;
    $explain .= ' low basequal;' if($basequalFail);
    $explain .= ' mismatch at readend;' if($readPosFail);
    print $OUTPUT_failed "$fields[0]\t$fields[1]\t$newcov,$newmismatch\t$fields[3]\t$fields[4]\tNAN\treason:$explain";
    for (my $i = 6; $i < @fields; $i++) {
        print $OUTPUT_failed "\t$fields[$i]";
    }
    print $OUTPUT_failed "\n";
}

} close $INPUT;
close $OUTPUT; close $OUTPUT_failed;

sub usage(){ print<<eof;< p="">

EOF

exit 1; }

Here are the few lines from Variants file :

chr1 14555 15, C . 0 chr1 14556 15, T . 0 chr1 14557 15, C . 0 chr1 14558 14, C . 0 chr1 14559 14, T . 0 chr1 14560 14, T . 0 chr1 14561 14, G . 0 chr1 14562 14, A . 0 chr1 14563 14, A . 0 chr1 14564 14, G . 0

Thanks!

ADD REPLY
0
Entering edit mode

Can you please reformat the code, rather than the code it would be important to see the proper formatting of the variant file.

ADD REPLY
0
Entering edit mode
chr1    14555   15,     C       .       0

chr1 14556 15, T . 0 chr1 14557 15, C . 0 chr1 14558 14, C . 0 chr1 14559 14, T . 0 chr1 14560 14, T . 0 chr1 14561 14, G . 0 chr1 14562 14, A . 0 chr1 14563 14, A . 0 chr1 14564 14, G . 0

ADD REPLY
0
Entering edit mode

Sorry about the format. Its auto correcting by itself.

ADD REPLY
0
Entering edit mode

It is a bit annoying way to understand both your code and the variant format file. I would still suggest you to format both your code and variant output file. It does not seem to me however in vcf format, else I could have suggested vcflib or vcftools or Pierre's blog for the same.

ADD REPLY
0
Entering edit mode

Actually regarding the variant file, in this pipeline for the first step is the unix script which converts vcf file into text file. So after that text file along with the bam file is the input for above perl script like

filter_mismatch_first6bp.pl -infile raw_variants.txt -outfile raw_variants.rmhex.txt -bamfile realignedBam.bam

ADD REPLY
0
Entering edit mode

Do you intend to use only this pipeline for your analysis or you can also take a look at the below answer using GATK for calling variants from RNA-Seq. The most important part for you is to understand what is happening at each step.

ADD REPLY
0
Entering edit mode

Actually I am more focusing towards identification of RNA editing sites. That is the reason I am using this pipeline.

ADD REPLY
0
Entering edit mode

Yes you can use the GATK pipeline for the same as well. You can take a look at this paper which uses RNA-Seq data for finding RNA-editing sites by calling variants with GATK, since it is a bit old you can always follow the latest pipeline to call the variants and work a way round as mentioned in the paper to find the RNA-editing sites. If you need help with a pipeline I can redirect to my github profile for the same where I have written an automated pipeline for variant calling with RNA-Seq data and then you can tweek it to your purpose and follow the mentioned publication to cater to your needs.

ADD REPLY
0
Entering edit mode

I have already finished the variant calling part. Fo it I have used from the above paper itself . Actually they are all the same author which they published the SNPiR. Right now I am stuck at the filtering part of variant Which I have mentioned in my question.

ADD REPLY
0
Entering edit mode
8.5 years ago
ivivek_ngs ★ 5.2k

Have you taken a look at the GATK RNA-Seq best practise pipleline for that purpose? This is quite easy to use as well. Atleast for one sample with 30M reads for RNA-Seq in hg19 the entire variant calling pipeline should finish within 24 hours depening upon the allocated memory. It is something quite in use . I have never worked with SNPir so cannot comment on that. But you can take a look at it. I have worked with variant calling with GATK from RNA-Seq and it seems pretty straight forward.

ADD COMMENT

Login before adding your answer.

Traffic: 1260 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6