Question: Converting mapped + unmapped BAM files to raw FASTQ (RNA-seq data)
0
gravatar for rodd
21 months ago by
rodd100
London, United Kingdom
rodd100 wrote:

Hi all,

I need to analyse some RNA-seq data with a special aligner for repetitive elements, but the "raw" data from the cohort I am analysing came as aligned BAM files (mapped.bam + unmapped.bam files). I can obtain the raw FASTQ files from a concatenated BAM file, following this tutorial.

However, this is still resulting in reads with a secondary alignment. I was wondering if it would be ok to keep only read pairs in the BAM file which have primary alignments, thus discarding reads which either one of the pair did not align, or reads that have additional alignments (otherwise they would be duplicated in the end FASTQ files). I can't see this being an issue... yet... but please let me know if this sounds correct.

I know there are several posts like this in this and other communities, but I did not manage to find a concise way of doing this yet.

Currently, my concatenated BAM file (mapped + unmapped BAM files) looks like the following:

$ samtools flagstat concatenated.bam

80893332 + 28760 in total (QC-passed reads + QC-failed reads)  
5509466 + 0 secondary 
0 + 0 supplementary 
0 + 0 duplicates 
74608978 + 0 mapped (92.23% : 0.00%) 
75383866 + 28760 paired in sequencing 
37950442 + 14107 read1
37433424 + 14653 read2
34757340 + 0 properly paired (46.11% : 0.00%) 
65502368 + 0 with itself and mate mapped
3597144 + 0 singletons (4.77% : 0.00%)
723510 + 0 with mate mapped to a different chr
429114 + 0 with mate mapped to a different chr (mapQ>=5)
ADD COMMENTlink modified 21 months ago by swbarnes28.1k • written 21 months ago by rodd100
1

Hello rodd ,

I would just merge the mapped.bam and unmapped.bam, sort the resulting file by read name using samtools sort -n and extract the reads to fastq using samtools fastq merged_name_sorted.bam|bgzip -c > all.fastq.gz.

See also my issue on samtools github.

fin swimmer

ADD REPLYlink modified 21 months ago • written 21 months ago by finswimmer13k

Hi finswimmer, Thanks you for your prompt response, and for the link to your post on samtools github. I will be following your advice (and the advice from our colleague who also responded to the thread).

But just out of curiosity, I am still finding some discrepancies in the number of reads after converting to fastq. See below number of reads in bam file, and reads in my fastq files:

  $ samtools view -c -F 0x100 merged_sorted.bam        # duplications not allowed
  **75,412,626**
  $ samtools fastq -0 merged_sorted_0.fastq -1 merged_sorted_1.fastq -2 merged_sorted_2.fastq -F 0x100 merged_sorted.bam
  [M::bam2fq_mainloop] discarded 0 singletons
  [M::bam2fq_mainloop] processed **75,412,626 reads**
  $ wc -l merged_sorted_1.fq
  148988112

  148988112 /4 = 37,247,028 reads per file
  37247028 * 2 for paired reads = **74,494,056**

So I have 918,570 reads missing in the fastq files (and they are not in samtools fastq -0 output or -s singletons_output).

ADD REPLYlink modified 21 months ago by finswimmer13k • written 21 months ago by rodd100

Your second command isn't filtering out secondary alignments. Don't you want to do that?

ADD REPLYlink written 21 months ago by swbarnes28.1k

Sorry, that was a typo - I did include it in my command, and have updated my previous reply.

I am only outputting the reads I want to the FASTQ files, which is great. But I am still curious as to why it's removing ~1 mi reads after the BAM-FASTQ conversion, when comparing to the output of samtools view -c -F 0x100 merged_sorted.bam.

ADD REPLYlink modified 21 months ago • written 21 months ago by rodd100
1

Count up how often each flag turns up in your bam. Finswimmer's link suggests that that -F 0x900 is turned on whether you want it or not, so maybe that's where your million reads are going.

ADD REPLYlink modified 21 months ago • written 21 months ago by swbarnes28.1k
1
gravatar for swbarnes2
21 months ago by
swbarnes28.1k
United States
swbarnes28.1k wrote:

I'd do

samtools fastq -F 256 mydata.bam > mydata.fastq

This will discard secondary alignments, so each read, including unmapped reads, will only turn up once in your fastq.

ADD COMMENTlink written 21 months ago by swbarnes28.1k

Thank you! (Just for the record, this gave the same number of reads as when I used -F 0x900 or -F 0x100, as suggested by finswimmer. These flags are so confusing...)

ADD REPLYlink written 21 months ago by rodd100
1

0x100 is hexadecimal for 256. 0x900 is also filtering for supplementary alignments, you probably don't have any of those.

ADD REPLYlink written 21 months ago by swbarnes28.1k

Thank you for your answers, it helped me a lot!

ADD REPLYlink written 21 months ago by rodd100
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1294 users visited in the last hour