Question: Use AWK to interrogate one file through another rather than for and while loops
0
gravatar for s1667153
15 months ago by
s16671530
Institute of Genetics and Molecular Medicine, Edinburgh (UK)
s16671530 wrote:

Hello,

I have two files, 1 containing sequencing reads in a SAM like format, and the other contains the exon positions for HG38 and I am trying to use the positions of the exons in the second file to say whether a read in my first file is exonic or not. For ease Im a only using reads for chromosome1 and exons located on chromosome1. The desired output is an annotated copy of FileA which contains the reads that are located in exons, and those that are not, alternatively (as you can see in the code) if the echo prior to the break is removed then a filtered EXON only reads file is generated.

The INPUT files look like this

FileA: 6 columns with read position given in column 4 ($4), tab delimited

NAME FLAG CHR POS MAPQ CIGAR

FileB: 4 columns with exon start position and end position given in col2 and col3 ($2, $3) respectively. tab delimited

CHR START END GENE

I originally tried to use a while loop within a while loop to ask if the position of the read on each line of fileA was within the boundary of an exon ($2 $3) in fileB. There is a break in the statement so that the loop stops when the exon START position is greater than the read position (both files are ordered by bp position) so there is no point continuing with the loop for this read.

This is the code:

while IFS= read READ
   do      
      echo "$READ"
      position=$(echo $READ | awk '{print $4}')
          while IFS= read BED
               do
                   echo "$BED"
                   x=$(echo $BED | awk '{print $2}')
                   y=$(echo $BED | awk '{print $3}')
                   if (($position < "$x"))
                       then
                            echo "$READ" | awk '{print $0"\t"}' >> FileA_exon
                            break 
                   else 
                        if (($position >= "$x" && $position <= "$y"));
                            then 
                                 echo "$READ" | awk '{print $0"\t EXON"}' >> FileA_exon
                        fi      
                    fi 
              done < FileB
    done < FileA

FileA_exon: 7 columns as such

NAME FLAG CHR POS MAPQ CIGAR STATUS

It has been brought to my attention however that whilst this works, it could be improved upon with some colleagues suggesting that a script could be written that doesnt require any loops, nested or otherwise, and for without the need to repeatedly open FileB for each line of FileA. Perhaps in AWK... but I cannot think of a way to use information split over two files to get my desired output...

I'd really appreciate some advice and explanations where possible.

Thanks

exons if awk else forloop • 347 views
ADD COMMENTlink written 15 months ago by s16671530
1

bedtools: convert your sam to bed with bamtobed and then call "bedtools interesect" with exon.bed

ADD REPLYlink written 15 months ago by Pierre Lindenbaum129k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1954 users visited in the last hour