Hi everyone :)
I did a genome-guided transcriptome assembly of my human cell lab strain. As the "genome" I used the human transcriptome (160987 sequences) provided by NCBI. Next, I wanted to blast the resulting lab strain transcriptome FASTA file (259991 sequences) against the very same NCBI human transcriptome FASTA that I used to guide the assembly. To do this, I created a BLAST database from the NCBI FASTA using makeblastdb from NCBI's blast-2.9.0+. I ran two independent blastn rounds, using the -num_alignments option to keep the output small: one round reporting only the top hit per sequence and one reporting the top three hits. The command for the 1-top-hit round was:
blastn -query $Path$Query -db $Database -outfmt "6 sallacc" \
  -out resultstab.txt -word_size 20 -evalue 0.000000000000001 \
  -num_threads 8 -num_alignments 1
Subsequently, I used the "uniq" command to remove accessions that appear more than once in the output file:
uniq resultstab.txt >resultstab2.txt
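For anyone reproducing this step, here is a minimal sketch of how the deduplication behaves on a toy list. The file name demo_ids.txt and the accession IDs are made up for illustration and are not from the real data. It compares plain uniq, which only collapses identical lines that are adjacent, with sort -u, which sorts first:

```shell
# Toy accession list with a non-adjacent duplicate (made-up IDs, not real data).
printf 'NM_A\nNM_B\nNM_A\n' > demo_ids.txt

# uniq only collapses identical lines that directly follow each other,
# so all 3 lines survive here.
uniq_count=$(uniq demo_ids.txt | wc -l)

# sort -u sorts first, so the repeated ID is collapsed and 2 lines remain.
sorted_count=$(sort -u demo_ids.txt | wc -l)

echo "uniq: $uniq_count, sort -u: $sorted_count"
```

If resultstab.txt is not sorted, the same comparison on that file might show whether duplicates survive the uniq step.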
And I extracted the sequences:
blastdbcmd -db $Database -dbtype nucl -entry_batch resultstab2.txt \
  -outfmt "%f" -out hitcontigs.fas -line_length 1000000
Now, my output FASTA for the 3-top-hits approach contains 472994 sequences, and the FASTA for the 1-top-hit approach contains 145562 sequences. How can the output contain more sequences than the reference file I blasted against? What am I overlooking?
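A rough sketch of how counts like these could be checked, assuming each record starts with a ">" header line. The file demo_hits.fas and its records are made-up stand-ins for hitcontigs.fas. It counts all records and then the distinct headers, so a gap between the two numbers would point to sequences being extracted more than once:

```shell
# Toy FASTA standing in for hitcontigs.fas (made-up records, not real data).
printf '>seq1 a\nACGT\n>seq2 b\nGGCC\n>seq1 a\nACGT\n' > demo_hits.fas

# Total number of records = number of header lines.
total=$(grep -c '^>' demo_hits.fas)

# Number of distinct header lines.
unique=$(grep '^>' demo_hits.fas | sort -u | wc -l)

echo "total records: $total, unique headers: $unique"
```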
Thanks a lot in advance :)