Use AWK to interrogate one file through another rather than for and while loops
0
0
Entering edit mode
5.0 years ago
s1667153 • 0

Hello,

I have two files, 1 containing sequencing reads in a SAM like format, and the other contains the exon positions for HG38 and I am trying to use the positions of the exons in the second file to say whether a read in my first file is exonic or not. For ease Im a only using reads for chromosome1 and exons located on chromosome1. The desired output is an annotated copy of FileA which contains the reads that are located in exons, and those that are not, alternatively (as you can see in the code) if the echo prior to the break is removed then a filtered EXON only reads file is generated.

The INPUT files look like this

FileA: 6 columns with read position given in column 4 ($4), tab delimited

NAME FLAG CHR POS MAPQ CIGAR

FileB: 4 columns with exon start position and end position given in col2 and col3 ($2, $3) respectively. tab delimited

CHR START END GENE

I originally tried to use a while loop within a while loop to ask if the position of the read on each line of fileA was within the boundary of an exon ($2 $3) in fileB. There is a break in the statement so that the loop stops when the exon START position is greater than the read position (both files are ordered by bp position) so there is no point continuing with the loop for this read.

This is the code:

while IFS= read READ
   do      
      echo "$READ"
      position=$(echo $READ | awk '{print $4}')
          while IFS= read BED
               do
                   echo "$BED"
                   x=$(echo $BED | awk '{print $2}')
                   y=$(echo $BED | awk '{print $3}')
                   if (($position < "$x"))
                       then
                            echo "$READ" | awk '{print $0"\t"}' >> FileA_exon
                            break 
                   else 
                        if (($position >= "$x" && $position <= "$y"));
                            then 
                                 echo "$READ" | awk '{print $0"\t EXON"}' >> FileA_exon
                        fi      
                    fi 
              done < FileB
    done < FileA

FileA_exon: 7 columns as such

NAME FLAG CHR POS MAPQ CIGAR STATUS

It has been brought to my attention however that whilst this works, it could be improved upon with some colleagues suggesting that a script could be written that doesnt require any loops, nested or otherwise, and for without the need to repeatedly open FileB for each line of FileA. Perhaps in AWK... but I cannot think of a way to use information split over two files to get my desired output...

I'd really appreciate some advice and explanations where possible.

Thanks

AWK forloop exons if else • 970 views
ADD COMMENT
1
Entering edit mode

bedtools: convert your sam to bed with bamtobed and then call "bedtools interesect" with exon.bed

ADD REPLY

Login before adding your answer.

Traffic: 2047 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6