I need to analyse some RNA-seq data with a special aligner for repetitive elements, but the "raw" data from the cohort I am analysing came as aligned BAM files (mapped.bam + unmapped.bam files). I can obtain the raw FASTQ files from a concatenated BAM file, following this tutorial.
However, the resulting FASTQ files still contain reads from secondary alignments. Would it be OK to keep only read pairs in the BAM file that have primary alignments, thus discarding pairs in which either mate did not align, and reads that have additional alignments (otherwise they would be duplicated in the final FASTQ files)? I can't see this being an issue yet, but please let me know if this sounds correct.
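For what it's worth, the filter described above can be written as a single FLAG mask. Below is a minimal sketch, assuming the goal is to drop unmapped reads (0x4), reads whose mate is unmapped (0x8), secondary records (0x100) and supplementary records (0x800). The variable names and the `keeps_flag` helper are made up for illustration, and the file names in the usage comment are placeholders.

```shell
# FLAG bits as defined in the SAM specification.
UNMAPPED=$((0x4))
MATE_UNMAPPED=$((0x8))
SECONDARY=$((0x100))
SUPPLEMENTARY=$((0x800))

# Any record with at least one of these bits set is excluded by `-F`.
EXCLUDE_MASK=$((UNMAPPED | MATE_UNMAPPED | SECONDARY | SUPPLEMENTARY))

# Hypothetical helper: succeeds if a record with FLAG $1 would survive
# `samtools view -F $EXCLUDE_MASK`.
keeps_flag() {
    [ $(( $1 & EXCLUDE_MASK )) -eq 0 ]
}

printf 'exclude mask: %d (0x%X)\n' "$EXCLUDE_MASK" "$EXCLUDE_MASK"

# Possible usage (placeholder file names):
#   samtools view -b -F "$EXCLUDE_MASK" concatenated.bam > primary_pairs.bam
```

Note that this removes the secondary records themselves, not the primary record of a read that also has secondary alignments. That should already be enough to avoid duplicated entries in the FASTQ output.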
I know there are several posts like this in this and other communities, but I did not manage to find a concise way of doing this yet.
Currently, my concatenated BAM file (mapped + unmapped BAM files) looks like the following:
$ samtools flagstat concatenated.bam
80893332 + 28760 in total (QC-passed reads + QC-failed reads)
5509466 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
74608978 + 0 mapped (92.23% : 0.00%)
75383866 + 28760 paired in sequencing
37950442 + 14107 read1
37433424 + 14653 read2
34757340 + 0 properly paired (46.11% : 0.00%)
65502368 + 0 with itself and mate mapped
3597144 + 0 singletons (4.77% : 0.00%)
723510 + 0 with mate mapped to a different chr
429114 + 0 with mate mapped to a different chr (mapQ>=5)
Hello rodd,
I would just merge mapped.bam and unmapped.bam, sort the resulting file by read name using
samtools sort -n
and extract the reads to FASTQ using
samtools fastq merged_name_sorted.bam | bgzip -c > all.fastq.gz
See also my issue on samtools github.
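Put together, the steps in this answer might look roughly like the sketch below. It is only an outline under a few assumptions: `samtools cat` is used here as one way to combine the two files (it takes the header from the first input, so both files should have compatible headers; `samtools merge` is an alternative), the function name `bam_to_fastq` and the output file names are invented for illustration, and the outputs are left uncompressed.

```shell
# Sketch: combine mapped + unmapped BAMs, name-sort, and write FASTQ files.
# Arguments: <mapped.bam> <unmapped.bam> <output prefix>
bam_to_fastq() {
    mapped=$1
    unmapped=$2
    prefix=$3

    # 1. Concatenate the two BAM files.
    samtools cat -o "$prefix.cat.bam" "$mapped" "$unmapped" &&
    # 2. Sort by read name so that mates end up next to each other.
    samtools sort -n -o "$prefix.namesorted.bam" "$prefix.cat.bam" &&
    # 3. Write read 1 / read 2 to separate files; reads flagged as both or
    #    neither go to -0, and reads without a mate in the file go to -s.
    samtools fastq \
        -1 "$prefix.R1.fastq" \
        -2 "$prefix.R2.fastq" \
        -0 "$prefix.other.fastq" \
        -s "$prefix.singletons.fastq" \
        "$prefix.namesorted.bam"
}

# Possible usage (placeholder names):
#   bam_to_fastq mapped.bam unmapped.bam sample1
```

Keeping the `-0` and `-s` outputs makes it easier to account for every read afterwards, instead of sending everything to a single stream.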
Hi finswimmer, thank you for your prompt response, and for the link to your post on the samtools GitHub. I will be following your advice (and the advice from our colleague who also responded to the thread).
But just out of curiosity, I am still finding some discrepancies in the number of reads after converting to FASTQ. See below the number of reads in the BAM file and in my FASTQ files:
So I have 918,570 reads missing in the fastq files (and they are not in samtools fastq -0 output or -s singletons_output).
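To make this kind of comparison reproducible, the FASTQ side can be counted with a small helper like the one below. This is a sketch that assumes plain four-line FASTQ records (no wrapped sequence lines); `count_fastq_reads` is a made-up name, and files ending in `.gz` are decompressed with `gzip -dc`.

```shell
# Count reads across one or more FASTQ files, assuming 4 lines per record.
count_fastq_reads() {
    total=0
    for f in "$@"; do
        case $f in
            *.gz) lines=$(gzip -dc "$f" | awk 'END { print NR }') ;;
            *)    lines=$(awk 'END { print NR }' "$f") ;;
        esac
        total=$((total + lines / 4))
    done
    echo "$total"
}

# Possible usage (placeholder file names), to compare against
# `samtools view -c -F 0x100 merged_sorted.bam`:
#   count_fastq_reads R1.fastq R2.fastq other.fastq singletons.fastq
```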
Your second command isn't filtering out secondary alignments. Don't you want to do that?
Sorry, that was a typo - I did include it in my command, and have updated my previous reply.
I am only outputting the reads I want to the FASTQ files, which is great. But I am still curious why about 1 million reads are lost in the BAM-to-FASTQ conversion, compared with the output of
samtools view -c -F 0x100 merged_sorted.bam.
Count up how often each flag turns up in your BAM. finswimmer's link suggests that -F 0x900 is turned on whether you want it or not, so maybe that's where your million reads are going.
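One way to do that tally is sketched below. It reads SAM text on stdin, so it would be fed from `samtools view`; the `count_flags` name is made up, and the primary/secondary/supplementary label is just derived from bits 0x100 and 0x800 of each FLAG value.

```shell
# Tally FLAG values from SAM text on stdin.
# Prints "<count> <flag> <class>", sorted numerically by flag.
count_flags() {
    awk -F '\t' '
        /^@/ { next }              # skip SAM header lines
        { n[$2]++ }                # column 2 is the FLAG field
        END {
            for (f in n) {
                cls = "primary"
                if (int(f / 256) % 2 == 1)  cls = "secondary"      # bit 0x100
                if (int(f / 2048) % 2 == 1) cls = "supplementary"  # bit 0x800
                print n[f], f, cls
            }
        }
    ' | sort -k2,2n
}

# Possible usage (placeholder file name):
#   samtools view merged_sorted.bam | count_flags
```

Summing the counts for the rows labelled secondary or supplementary shows how many records a 0x900 filter would remove.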