Question

Matching Query ID and Subject ID in a new file

0

Entering edit mode

4.0 years ago

jamie.pike ▴ 80

Hi,

I am trying to write a script that takes a list of genomes, performs BLASTN, creates a file containing the sequence of the BLASTN hit, and creates a file with the query ID, subject ID, location in the subject, and the sequence.

I have been pretty successful, the only issue is correctly matching up the query ID to BLASTN hit in the final file:

Here's the script:

#!/bin/bash
 #Insert the name of the query sequence
    Q=Example_query.fasta
    #insert name of file containing a list of genomes
    gL=genomeList

    #if Query has spaces in names, replace with "_":
    sed 's/ /_/g' ${Q} > Q

    for i in $(cat ${gL}); do

      #make BLASTdb, create an index with samtools, and perfrom BLASTN with percentage perc_identity set to 90%
      makeblastdb -in ${i} -dbtype nucl; samtools faidx ${i}; cut -f 1,2 ${i}.fai > ${i}.chrom.sizes;
      blastn -db ${i} -query ${Q} -outfmt 6 -perc_identity 90.00 -out ${Q}_against_${i}.blast;

      #using the BLASTN hits, create fastas using the subject from the hit.  The -b flag is set to 0 so just the hit will be printed, if -b is changed, it will print that number of bases either side of the hit.
      cut -f 1,2 ${Q}_against_${i}.blast | awk '{print $2,"\t",$1}' > query_hit_name_${i}.txt;
      cut -f 2,9,10 ${Q}_against_${i}.blast |awk '{if ($2>$3)print $1,$3,$2,".",".","-";else print $1,$2,$3,".",".","+";}' OFS='\t' |awk '{a=$2-1;print $1,a,$3,$4,$5,$6;}' OFS='\t'|bedtools sort > ${Q}_against_${i}.blast.bed;
      bedtools slop -i ${Q}_against_${i}.blast.bed -g ${i}.chrom.sizes -b 0 > ${Q}_against_${i}.blast.0.bed;
      bedtools getfasta -s -fi ${i} -bed ${Q}_against_${i}.blast.0.bed -fo ${Q}_against_${i}.blast.0.fasta;

      #Create a file which contains the query hit name, subject hit name, location in subject, and the sequence.
      awk 'BEGIN{RS=">";OFS="\t"}NR>1{print $1,$2}' ${Q}_against_${i}.blast.0.fasta |awk  '{gsub(":","\t",$0); print;}' | awk  '{gsub("-","\t",$0); print;}' > ${Q}_against_${i}.blast.table.txt;
      awk 'NR==FNR {a[$1]=$2; next} {print a[$1],"\t",$1,"\t",$2,"\t",$3,"\t",$4}' query_hit_name_${i}.txt ${Q}_against_${i}.blast.table.txt > location_and_seq_${i}_table.txt; done

Where there is more than one BLASTN hit per subject ID, only the last query ID is printed in location_and_seq_${i}_table.txt. There should be different QUERYIDs printed e.g. QUERYID1_sequence1,_20_bases , QUERYID2_sequence2,_60_bases ,etc.

The current output of location_and_seq_${i}_table.txt:

QUERYID2_Sequence5,_50_bases     SeqA_SUBJECT_Sequence1,_50_bases    47      97(+)   ATCGTAGGAGTCTCGAGATTTTTTACACCAGGAGAGTCTAGCTAGCTAGC
QUERYID2_Sequence5,_50_bases     SeqA_SUBJECT_Sequence1,_50_bases    47      97(+)   ATCGTAGGAGTCTCGAGATTTTTTACACCAGGAGAGTCTAGCTAGCTAGC
QUERYID2_Sequence5,_50_bases     SeqA_SUBJECT_Sequence1,_50_bases    47      97(+)   ATCGTAGGAGTCTCGAGATTTTTTACACCAGGAGAGTCTAGCTAGCTAGC
QUERYID2_Sequence5,_50_bases     SeqA_SUBJECT_Sequence1,_50_bases    47      97(+)   ATCGTAGGAGTCTCGAGATTTTTTACACCAGGAGAGTCTAGCTAGCTAGC
QUERYID2_Sequence5,_50_bases     SeqA_SUBJECT_Sequence1,_50_bases    47      97(+)   ATCGTAGGAGTCTCGAGATTTTTTACACCAGGAGAGTCTAGCTAGCTAGC
QUERYID2_Sequence5,_50_bases     SeqA_SUBJECT_Sequence1,_50_bases    47      97(+)   ATCGTAGGAGTCTCGAGATTTTTTACACCAGGAGAGTCTAGCTAGCTAGC
QUERYID2_Sequence5,_50_bases     SeqA_SUBJECT_Sequence1,_50_bases    47      97(+)   ATCGTAGGAGTCTCGAGATTTTTTACACCAGGAGAGTCTAGCTAGCTAGC
QUERYID2_Sequence5,_50_bases     SeqA_SUBJECT_Sequence1,_50_bases    47      97(+)   ATCGTAGGAGTCTCGAGATTTTTTACACCAGGAGAGTCTAGCTAGCTAGC

I think that the solution may lay in the awk commands:

 awk 'NR==FNR {a[$1]=$2; next} {print a[$1],"\t",$1,"\t",$2,"\t",$3,"\t",$4}' query_hit_name_${i}.txt ${Q}_against_${i}.blast.table.txt > location_and_seq_${i}_table.txt; done

I have tried allocating a unique number to each subject ID hit after BLASTN so that the queries can match up, but it then doesn't match up with the ${i}.chrom.sizes file so there is no output from bedtools.

Is there a way to fix the issue so that the queries correctly match, or is there a better way to do the whole thing - BLASTN, create a fasta containing the subject hits - and generate a file which contains the query ID, Subject, ID, locations and the sequences? Perhaps the sequences can be matched and added to the original BLAST file?

I appreciate that this is quite dense, and so is not the easiest thing to follow or understand, so am happy to try and explain in more detail if anyone is willing to tackle this.

BLASTN awk • 692 views

ADD COMMENT • link 4.0 years ago by jamie.pike ▴ 80