I am working with RNA-seq. After mapping a couple of files with paired-end data against a reference genome I have obtained a .bam file from which I have extracted the unmapped reads into another .bam file with:
samtools view -b -f 4 01_BS-S2.bam > unmapped.bam
and, afterwards, I have sorted it with:
samtools sort -n unmapped.bam myout
Then, an inspection of its content with:
samtools fgstat myout.bam
gives the next result:
19052725 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 0 + 0 mapped (0.00% : N/A) 19052725 + 0 paired in sequencing 9587227 + 0 read1 9465498 + 0 read2 0 + 0 properly paired (0.00% : N/A) 0 + 0 with itself and mate mapped 0 + 0 singletons (0.00% : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)
in which I assume that there are single-end sequences unmapped due to incomplete alignments of some paired-ends (the number of read1 and read2 are different). In order to create a new partial transcriptome I need to convert this myout.bam into fastq files, so I execute:
samtools fastq -1 pe1.fq -2 pe2.fq -s se.fq myout.bam
whose result is (no other output is produced but this):
[M::bam2fq_mainloop_singletontrack] discarded 2125818 singletons [M::bam2fq_mainloop_singletontrack] processed 19052725 reads
in which the number of processed sequences matches with the output of the flagstat command. However, when I analyze the files p1.fastq and p2.fastq, I find that each of them contains 7920630 sequences, whereas the se.fastq contains the expected 2125818. Therefore, the number of sequences in these three files is: 7920630x2 + 2125818= 17967078 what does not match with the 19052725 processed.
Where are the remaining 1085647 reads (19052725 - 17967078)? Any idea?