Using samtools with Long Read RNASeq data
4.0 years ago
Joshi • 0

Hi - Would appreciate help with this one ..

I downloaded this ENCODE RNA-seq dataset (BAM alignment). It is a long-read RNA-seq sample: https://www.encodeproject.org/experiments/ENCSR293MOX/

  • The original file size for ENCFF653FOQ.bam is 300 MB
  • To view the RNA-seq file in IGV, I first needed to index it
  • When I tried to index it with samtools index, it reported that the BAM file wasn't sorted
  • After sorting, the size of ENCFF653FOQ.sorted.bam is 88 MB
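
For reference, the sort-then-index steps in the list above can be sketched as a small shell function. This is a minimal sketch, not necessarily the exact commands that were run: the function name `sort_and_index` and the `.sorted.bam` naming rule are assumptions made for illustration, while `samtools sort -o` and `samtools index` are the standard subcommands.

```shell
# Sketch: coordinate-sort a BAM file and index it so IGV can load it.
# sort_and_index is a hypothetical helper name, not part of samtools.
sort_and_index() {
    in_bam="$1"
    # Derive the output name: sample.bam -> sample.sorted.bam
    out_bam="${in_bam%.bam}.sorted.bam"

    # samtools sort orders records by reference position by default.
    samtools sort -o "$out_bam" "$in_bam" || return 1

    # samtools index requires coordinate-sorted input and writes the
    # index file next to the sorted BAM.
    samtools index "$out_bam" || return 1

    # Report the name of the sorted file.
    printf '%s\n' "$out_bam"
}

# Usage: sort_and_index ENCFF653FOQ.bam
```

IGV expects the index file to sit in the same directory as the sorted BAM.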

I ran samtools flagstat on both the original and the sorted BAM files and see no difference.

What is being lost or removed when sorting the long-read RNA-seq file? Is samtools the right tool for handling long-read RNA-seq data?

$ samtools flagstat ENCFF653FOQ.bam
647063 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
647063 + 0 mapped (100.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

$ samtools flagstat ENCFF653FOQ.sorted.bam
647063 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
647063 + 0 mapped (100.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

$ ls -l ENCFF653FOQ.bam ENCFF653FOQ.sorted.bam
-rw-r--r-- 1 287M Apr 28 14:09 ENCFF653FOQ.bam
-rw-r--r-- 1  84M Apr 28 19:10 ENCFF653FOQ.sorted.bam

$ samtools --version
samtools 1.9
Using htslib 1.9
Copyright (C) 2018 Genome Research Ltd.
RNA-Seq Long read samtools • 1.3k views
4.0 years ago
GenoMax 141k

What is being lost or removed when sorting the Long Read RNASeq file?

Nothing is being lost or gained. Coordinate sorting places reads that align to the same region next to each other, and those reads tend to have similar sequences. BAM files are compressed in blocks, and similar data that sit close together compress better, so that is one likely reason your sorted file is smaller.
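
The effect of record order on compressed size can be seen with a toy experiment that does not involve samtools at all. This is only a sketch under stated assumptions: the data are synthetic (2000 made-up "loci", each with 10 identical reads), `gzip` stands in for BAM's block compression, and all file and variable names are hypothetical.

```shell
# Toy demonstration: the same records, in two different orders,
# compress to different sizes. Synthetic data, not real reads.
tmp=$(mktemp)

# Columns: read copy number, locus number, random 60-base sequence.
# Every locus gets one random sequence that is repeated for 10 reads.
awk 'BEGIN {
    srand(42)
    split("A C G T", base, " ")
    for (locus = 1; locus <= 2000; locus++) {
        seq = ""
        for (k = 0; k < 60; k++) seq = seq base[int(rand() * 4) + 1]
        for (copy = 1; copy <= 10; copy++)
            printf "%d\t%05d\t%s\n", copy, locus, seq
    }
}' > "$tmp"

# Interleaved order: copies of the same sequence end up far apart.
interleaved_bytes=$(sort -k1,1n -k2,2n "$tmp" | gzip -c | wc -c | tr -d ' ')

# Grouped order: copies of the same sequence are adjacent,
# similar to reads from one region after coordinate sorting.
grouped_bytes=$(sort -k2,2n -k1,1n "$tmp" | gzip -c | wc -c | tr -d ' ')

record_count=$(wc -l < "$tmp" | tr -d ' ')
rm -f "$tmp"

echo "records: $record_count"
echo "interleaved order: $interleaved_bytes bytes compressed"
echo "grouped order:     $grouped_bytes bytes compressed"
```

Both orderings contain exactly the same records; only the compressed size differs, because the compressor can reuse a match only when the earlier copy is still nearby.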

As a general suggestion, do not use file size as a metric, except to confirm that a file is not zero bytes, i.e. that a tool ran and produced output.
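
If a stronger check than flagstat is wanted, one option is to compare the alignment records themselves in an order-independent way. The sketch below is an assumption about how one might do that, not something from the original thread: `bam_fingerprint` is a hypothetical helper name, and it relies on `samtools view` printing records without the header when no `-h` option is given.

```shell
# Sketch: order-independent fingerprint of the records in a BAM file.
# bam_fingerprint is a hypothetical helper name, not part of samtools.
bam_fingerprint() {
    # samtools view prints one alignment record per line (no header).
    # Sorting the lines removes any dependence on record order, and
    # cksum reduces the result to a checksum and a byte count.
    samtools view "$1" | LC_ALL=C sort | cksum
}

# Usage:
#   a=$(bam_fingerprint ENCFF653FOQ.bam)
#   b=$(bam_fingerprint ENCFF653FOQ.sorted.bam)
#   [ "$a" = "$b" ] && echo "same records" || echo "records differ"
```

Matching fingerprints indicate that the two files hold the same records regardless of how large each file is on disk.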
