Get the query names from the BAM file:
samtools view <input file> | cut -f1 | sort > BAM_headers.txt
Get the query names from your FASTQ file (assuming your read length is 101bp):
awk '$0 ~ /^@/ && length($0) < 101' <fastq_file> | sed 's/^@//' | sort > FASTQ_headers.txt
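The length filter above depends on header lines being shorter than the reads, so it can misfire on long headers or on a short quality line that starts with '@'. Here is a rough sketch of an alternative. It assumes a standard four-lines-per-record FASTQ (no wrapped sequences) and that the read name is everything up to the first whitespace. The function name `fastq_names` is made up for illustration.

```shell
# Hypothetical helper: print sorted, unique read names from a
# 4-line-per-record FASTQ supplied on stdin.
fastq_names() {
    # NR % 4 == 1 selects the header line of each record.
    # sub() strips the leading '@' from the first field, and printing
    # only $1 drops any description that follows the first space.
    awk 'NR % 4 == 1 { sub(/^@/, "", $1); print $1 }' | sort -u
}
```

You would call it as `fastq_names < reads.fastq > FASTQ_headers.txt`, where `reads.fastq` is a placeholder for your own file.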
Use diff to compare the two files.
diff BAM_headers.txt FASTQ_headers.txt
Since every read in the BAM headers file should also be in the FASTQ headers file, diff should only show query names that are in the FASTQ file but not in the BAM file.
The output should show something like this:
That's your list of missing read names. You can parse the diff output to get the read names without leading '>' symbols, then use grep to get the actual reads from the FASTQ file if you'd like.
If you want to use this approach and need help figuring out the last two parts, let me know.
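For what it's worth, here is a minimal sketch of those last two parts. It assumes both name lists are sorted the same way and that the FASTQ has four lines per record. The function names `missing_names` and `extract_reads` are hypothetical. I've used awk rather than grep for the extraction, since a plain grep on read names can also match substrings of other names.

```shell
# Hypothetical helper: print names present only in the second (FASTQ) list.
# $1 = sorted BAM name list, $2 = sorted FASTQ name list.
missing_names() {
    # diff marks lines found only in the second file with "> ".
    # diff exits non-zero when the files differ, which is expected here.
    { diff "$1" "$2" || true; } | sed -n 's/^> //p'
}

# Hypothetical helper: print full FASTQ records whose names are in a list.
# $1 = file of read names (one per line), $2 = FASTQ file.
extract_reads() {
    awk '
        NR == FNR { want[$1] = 1; next }   # first file: load the wanted names
        FNR % 4 == 1 {                     # header line of each FASTQ record
            name = $1
            sub(/^@/, "", name)
            keep = (name in want)
        }
        keep                               # print every line of a kept record
    ' "$1" "$2"
}
```

Something like `missing_names BAM_headers.txt FASTQ_headers.txt > missing.txt` followed by `extract_reads missing.txt reads.fastq > missing_reads.fastq` would then give you the reads themselves (`missing.txt`, `reads.fastq` and `missing_reads.fastq` are placeholder file names).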
Every BAM file I've seen contains all reads, including unmapped ones. Are you sure that unmapped reads aren't present in your BAM file? Or am I misunderstanding?
I've mainly used BWA, so that's been my experience. I just checked bowtie, and if you run it as /path/to/bowtie --sam <genome> <input_file>, all reads are still reported, including unmapped reads (FLAG is 4, RNAME is *, POS and MAPQ are 0).
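If you want to check your own file, here is a quick sketch that counts records with the unmapped bit (0x4) set in the FLAG column. It reads SAM text on stdin. The function name `count_unmapped` is made up.

```shell
# Hypothetical helper: count SAM records on stdin whose FLAG (column 2)
# has the 0x4 "unmapped" bit set.
count_unmapped() {
    # Skip header lines (they start with '@').
    # int(flag / 4) % 2 tests bit 0x4 using plain awk arithmetic.
    awk -F '\t' '!/^@/ && int($2 / 4) % 2 == 1 { n++ } END { print n + 0 }'
}
```

Run it as `samtools view <input file> | count_unmapped`. As far as I know, `samtools view -c -f 4 <input file>` gives the same count directly. A result of 0 would suggest the aligner dropped the unmapped reads.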
bwa definitely includes unmapped reads; I'm not sure bowtie does by default.
These BAMs are produced by TopHat. My understanding is that TopHat only reports the aligned reads from the FASTQ, isn't that right?
I've never used TopHat, but a quick peek online makes it look like they only report mapped reads. First time I've seen that...
TopHat only puts in the mapped reads, and it will list a read multiple times if it maps to many locations, so you may need to pipe column 1 through sort and uniq to get unique mapped read names.
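A minimal sketch of that de-duplication step, reading SAM text on stdin (the function name `sam_names` is just for illustration):

```shell
# Hypothetical helper: print sorted, unique query names (column 1)
# from SAM records on stdin, ignoring any '@' header lines.
sam_names() {
    # sort -u collapses reads that appear once per alignment
    # (multi-mapped reads) into a single name.
    awk -F '\t' '!/^@/ { print $1 }' | sort -u
}
```

For example, `samtools view <input file> | sam_names > BAM_headers.txt` would give one line per mapped read name.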