Question: Converting mapped + unmapped BAM files to raw FASTQ (RNA-seq data)
0
gravatar for rodd
11 months ago by
rodd50
London, United Kingdom
rodd50 wrote:

Hi all,

I need to analyse some RNA-seq data with a special aligner for repetitive elements, but the "raw" data from the cohort I am analysing came as aligned BAM files (mapped.bam + unmapped.bam files). I can obtain the raw FASTQ files from a concatenated BAM file, following this tutorial.

However, this is still resulting in reads with a secondary alignment. I was wondering if it would be ok to keep only read pairs in the BAM file which have primary alignments, thus discarding reads which either one of the pair did not align, or reads that have additional alignments (otherwise they would be duplicated in the end FASTQ files). I can't see this being an issue... yet... but please let me know if this sounds correct.

I know there are several posts like this in this and other communities, but I did not manage to find a concise way of doing this yet.

Currently, my concatenated BAM file (mapped + unmapped BAM files) looks like the following:

$ samtools flagstat concatenated.bam

80893332 + 28760 in total (QC-passed reads + QC-failed reads)  
5509466 + 0 secondary 
0 + 0 supplementary 
0 + 0 duplicates 
74608978 + 0 mapped (92.23% : 0.00%) 
75383866 + 28760 paired in sequencing 
37950442 + 14107 read1
37433424 + 14653 read2
34757340 + 0 properly paired (46.11% : 0.00%) 
65502368 + 0 with itself and mate mapped
3597144 + 0 singletons (4.77% : 0.00%)
723510 + 0 with mate mapped to a different chr
429114 + 0 with mate mapped to a different chr (mapQ>=5)
ADD COMMENTlink modified 11 months ago by swbarnes26.7k • written 11 months ago by rodd50
1

Hello rodd ,

I would just merge the mapped.bam and unmapped.bam, sort the resulting file by read name using samtools sort -n and extract the reads to fastq using samtools fastq merged_name_sorted.bam|bgzip -c > all.fastq.gz.

See also my issue on samtools github.

fin swimmer

ADD REPLYlink modified 11 months ago • written 11 months ago by finswimmer12k

Hi finswimmer, Thanks you for your prompt response, and for the link to your post on samtools github. I will be following your advice (and the advice from our colleague who also responded to the thread).

But just out of curiosity, I am still finding some discrepancies in the number of reads after converting to fastq. See below number of reads in bam file, and reads in my fastq files:

  $ samtools view -c -F 0x100 merged_sorted.bam        # duplications not allowed
  **75,412,626**
  $ samtools fastq -0 merged_sorted_0.fastq -1 merged_sorted_1.fastq -2 merged_sorted_2.fastq -F 0x100 merged_sorted.bam
  [M::bam2fq_mainloop] discarded 0 singletons
  [M::bam2fq_mainloop] processed **75,412,626 reads**
  $ wc -l merged_sorted_1.fq
  148988112

  148988112 /4 = 37,247,028 reads per file
  37247028 * 2 for paired reads = **74,494,056**

So I have 918,570 reads missing in the fastq files (and they are not in samtools fastq -0 output or -s singletons_output).

ADD REPLYlink modified 11 months ago by finswimmer12k • written 11 months ago by rodd50

Your second command isn't filtering out secondary alignments. Don't you want to do that?

ADD REPLYlink written 11 months ago by swbarnes26.7k

Sorry, that was a typo - I did include it in my command, and have updated my previous reply.

I am only outputting the reads I want to the FASTQ files, which is great. But I am still curious as to why it's removing ~1 mi reads after the BAM-FASTQ conversion, when comparing to the output of samtools view -c -F 0x100 merged_sorted.bam.

ADD REPLYlink modified 11 months ago • written 11 months ago by rodd50
1

Count up how often each flag turns up in your bam. Finswimmer's link suggests that that -F 0x900 is turned on whether you want it or not, so maybe that's where your million reads are going.

ADD REPLYlink modified 11 months ago • written 11 months ago by swbarnes26.7k
1
gravatar for swbarnes2
11 months ago by
swbarnes26.7k
United States
swbarnes26.7k wrote:

I'd do

samtools fastq -F 256 mydata.bam > mydata.fastq

This will discard secondary alignments, so each read, including unmapped reads, will only turn up once in your fastq.

ADD COMMENTlink written 11 months ago by swbarnes26.7k

Thank you! (Just for the record, this gave the same number of reads as when I used -F 0x900 or -F 0x100, as suggested by finswimmer. These flags are so confusing...)

ADD REPLYlink written 11 months ago by rodd50
1

0x100 is hexadecimal for 256. 0x900 is also filtering for supplementary alignments, you probably don't have any of those.

ADD REPLYlink written 11 months ago by swbarnes26.7k

Thank you for your answers, it helped me a lot!

ADD REPLYlink written 11 months ago by rodd50
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2080 users visited in the last hour