Question: Filtering out incorrect index combinations in fastq file
0
gravatar for craigdj91
6 days ago by
craigdj910
craigdj910 wrote:

I have a combined paired-end FASTQ file where each sequence begins with a 14 bp unique index, allowing me to group reads together that come from the same original template molecule--very similar concept to sample barcoding where reads are grouped together based on what sample they belong to; however, these are randomly-generated indexes, so I don't know the sequences, just that they are the first 14 of every read. The issue is that sometimes the R1 index hops to a different R2 index (again, very similar problem occurs in sample barcoding).

@Sequence1:Read1
CTGGCGCTCAGTATGATCAGGCGTCTGTCGTGCTCGCCTT
+
CCCCCCCCCCFFGGGGGGGGGGGGGGGHHGGGGGHGGGGG
@Sequence1:Read2
TCGTCGTATACAGAGATCAGGCGTCTGTCGTCGAGCCGCG
+
CCCCCCCFFFFFGGGGGGGGGGGGGGGHHGGGGEGGGGGG

I want to filter out aberrant combinations by looking at each R1 index individually and finding what combinations of R2 indexes are present. Then, I would take the combination with the most reads as "truth"

A--B has 100 reads
A--C has 6 reads
A--D has 11 reads

In the above example, I would take A--B as the "true" combination, and I can eliminate (or export to "trash" file) all of the false combinations. Then I want to look at the R2 indexes and see if those correspond to what I found in the R1 indexes.

B--A has 100 reads
B--X has 2 reads
B--Y has 7 reads

Therefore, I've confirmed that A--B is the true combination. The tricky part about this is ensuring that quality scores and sequence headers are carried through, and that the fastq paired-end format (R1 followed by R2 of same header) is followed.

The current code (below) is taking days to run (on smaller 200 MB files). I'm a biologist with limited bioinformatics (perl) experience, so if anyone has some insights, it would be much appreciated!

    my $inputfile = $ARGV[0];

##  Input and Output File handles
#####################################################################################
open (IN1, $inputfile                          ||die "Cannot open $inputfile $!\n");
open (OUT1, ">Filtered_$inputfile.fastq"       ||die "Cannot open selected out file      $!\n");
open (OUT2, ">Discarded_$inputfile.fastq"      ||die "Cannot open selected out file      $!\n");
while ($in1=<IN1>) {push (@InArray,$in1);}   #Loads R1 file into @R1 array
$MinimumFamilySize = 0;                      #Minimum family for forward R1UMI. Set to 0 for it to do nothing


##  Loads the Header, Quality score, and Reads into an array
#####################################################################################
$i=0;                                   #Sets counter to 0
while ($i<scalar(@InArray)){            #Goes through, line by line, the Read1 file

    $f = substr($InArray[$i],0,-1);     #Grabs read header and deletes the new line character
    push (@R1Headers,$f);               #Pushes the read header into an array

    push (@R1Reads  ,$InArray[$i+1]);   #Pushes the actual read into an array
    push (@R1UMI, substr($InArray[$i+1],0,14));

    chomp($InArray[$i+3]);              #Removes the new line character from the quality string
    push (@R1Quality,$InArray[$i+3]);   #Pushes the quality string into an array

    $f = substr($InArray[$i+4],0,-1);   #Grabs read header and deletes the new line character
    push (@R2Headers,$f);               #Pushes the read header into an array

    push (@R2Reads  ,$InArray[$i+5]);   #Pushes the actual read into an array
    push (@R2UMI, substr($InArray[$i+5],0,14));


    chomp($InArray[$i+7]);              #Removes the new line character from the quality string
    push (@R2Quality,$InArray[$i+7]);   #Pushes the quality string into an array
    $i=$i+8;                            #Increments counter by 4 to get to next read sequences
}

##Removes duplicate UMIs and UMIs that dont have large enough families
#####################################################################################
$counts{$_}++ for @R1UMI;                               #Creates a hash of R1UMI counts
while (($string1,$value1) = each(%counts)){

$value1 = $counts{$string1};
if($value1 >= $MinimumFamilySize){push (@R1UMIcomp,$string1);}
}
undef %counts;

## Pairs the forward UMI with its most common reverse UMI
#####################################################################################
$i=$j=0;
while ($i<scalar(@R1UMIcomp)){
    $j=0;

    while ($j<scalar(@R1UMI)){   
        if ($R1UMIcomp[$i] == $R1UMI[$j])
            {push(@TempArray,$R2UMI[$j]);} $j++;}

    $counts{$_}++ for @TempArray;                               
    while (($string1,$value1) = each(%counts)){
    $value1 = $counts{$string1};
    }

    @Occurences = values %counts;
    $max = max(@Occurences);
    %TempHash = reverse %counts;
    $winner = $TempHash{$max};

    %MasterList =();
    $MasterList{$R1UMIcomp[$i]} = $winner;

    $i++;
    undef %counts;
    undef %TempHash;
}

$i=$j=0;
while ($i<scalar(@R1UMI)){
    if ($R2UMI[$i] =~ $MasterList{$R1UMI[$i]}){

        $NewRead1 = $R2UMI[$i].''.$R1Reads[$i];
        $NewRead2 = substr($R2Reads[$i],14);

        $R2UMIQ = substr($R2Quality[$i],0,14);
        $NewR1Q = $R2UMIQ.''.$R1Quality[$i];
        $NewR2Q = substr($R2Quality[$i],14);

        print OUT1 "$R1Headers[$i]\n";
        print OUT1 "$NewRead1";
        print OUT1 "+\n";
        print OUT1 "$NewR1Q\n";
        print OUT1 "$R2Headers[$i]\n";
        print OUT1 "$NewRead2";
        print OUT1 "+\n";
        print OUT1 "$NewR2Q\n";
    }
    else{

        $NewRead1 = $R2UMI[$i].''.$R1Reads[$i];
        $NewRead2 = substr($R2Reads[$i],14);

        $R2UMIQ = substr($R2Quality[$i],0,14);
        $NewR1Q = $R2UMIQ.''.$R1Quality[$i];
        $NewR2Q = substr($R2Quality[$i],14);

        print OUT2 "$R1Headers[$i]\n";
        print OUT2 "$NewRead1";
        print OUT2 "+\n";
        print OUT2 "$NewR1Q\n";
        print OUT2 "$R2Headers[$i]\n";
        print OUT2 "$NewRead2";
        print OUT2 "+\n";
        print OUT2 "$NewR2Q\n";

    }

    $i++;
}
return -1;} #For Sub button
ADD COMMENTlink modified 6 days ago • written 6 days ago by craigdj910
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1675 users visited in the last hour