Question: Using samtools with Long Read RNASeq data
Asked 11 weeks ago by Joshi (Australia):

Hi, I would appreciate help with this one.

I downloaded this ENCODE RNA-seq dataset (BAM alignment), which is a long-read RNA-seq sample: https://www.encodeproject.org/experiments/ENCSR293MOX/

  • The original file, ENCFF653FOQ.bam, is 300 MB
  • To view the RNA-seq file in IGV, I first needed to index it
  • When I tried to index it with samtools index, samtools reported that the BAM file wasn't sorted
  • After sorting, ENCFF653FOQ.sorted.bam is 88 MB
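
For reference, the sort-then-index steps described above could look like the following sketch. The helper name `sort_and_index` is made up for illustration; it assumes samtools is on the PATH and that a coordinate sort (the `samtools sort` default) is what IGV needs.

```shell
# Hypothetical helper: coordinate-sort a BAM file, then index the sorted copy.
# $1 = input BAM, $2 = name for the sorted output BAM.
sort_and_index() {
    in_bam=$1
    out_bam=$2
    # samtools sort writes the sorted records to the file given with -o;
    # samtools index then writes an index file next to the sorted BAM.
    samtools sort -o "$out_bam" "$in_bam" &&
    samtools index "$out_bam"
}

# Example call (file names taken from the question):
# sort_and_index ENCFF653FOQ.bam ENCFF653FOQ.sorted.bam
```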

I ran samtools flagstat on both the original and the sorted BAM files and see no difference.

What is being lost or removed when sorting the Long Read RNASeq file? Is samtools the right tool for handling long read rna-seq data?

$ samtools flagstat ENCFF653FOQ.bam
647063 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
647063 + 0 mapped (100.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

$ samtools flagstat ENCFF653FOQ.sorted.bam
647063 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
647063 + 0 mapped (100.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

$ ls -l ENCFF653FOQ.bam ENCFF653FOQ.sorted.bam
-rw-r--r-- 1 287M Apr 28 14:09 ENCFF653FOQ.bam
-rw-r--r-- 1  84M Apr 28 19:10 ENCFF653FOQ.sorted.bam

$ samtools --version
samtools 1.9
Using htslib 1.9
Copyright (C) 2018 Genome Research Ltd.

Answer, 11 weeks ago by genomax (United States):

"What is being lost or removed when sorting the Long Read RNASeq file?"

Nothing is being lost or gained. BAM files are compressed, and coordinate sorting places reads that align to the same region next to each other. Similar sequences that sit close together compress better, so that is one likely reason your sorted file is smaller.
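
The effect can be reproduced on a toy scale without any BAM file. The sketch below is only an illustration on synthetic data: it builds overlapping fake "reads" from a pseudo-random reference, writes them in a scattered order, and compares the gzip size of that file with the gzip size of the same lines after sorting. The read count, read length, and the multiplier 7919 are arbitrary choices made for this example, and plain gzip stands in for the block compression BAM uses.

```shell
# Toy demonstration on synthetic text (not a real BAM file).
workdir=$(mktemp -d)

# 20,000 fake reads of 100 bases, one every 10 bases along a random reference,
# written out in a scattered order. Each line is "<position>\t<bases>".
awk 'BEGIN {
    srand(42)
    n = 20000; step = 10; len = 100
    split("A C G T", base, " ")
    # Build the reference as 100-base chunks to avoid one very long string.
    nchunks = int(n * step / len) + 1
    for (j = 0; j <= nchunks; j++) {
        chunk[j] = ""
        for (c = 0; c < len; c++) chunk[j] = chunk[j] base[int(rand() * 4) + 1]
    }
    for (i = 0; i < n; i++) {
        k = (i * 7919) % n      # 7919 is coprime to n, so every read appears once
        pos = k * step
        j = int(pos / len)
        frag = substr((chunk[j] chunk[j + 1]), pos % len + 1, len)
        printf "%08d\t%s\n", pos, frag
    }
}' > "$workdir/unsorted.txt"

# Same lines, ordered by position (zero-padded, so a plain sort is enough).
LC_ALL=C sort "$workdir/unsorted.txt" > "$workdir/sorted.txt"

unsorted_size=$(gzip -c "$workdir/unsorted.txt" | wc -c | tr -d ' ')
sorted_size=$(gzip -c "$workdir/sorted.txt" | wc -c | tr -d ' ')
echo "gzip size unsorted: $unsorted_size bytes"
echo "gzip size sorted:   $sorted_size bytes"
```

Both files hold exactly the same lines; only their order differs. In the sorted file each read overlaps the one before it, which gives the compressor long repeats to reuse.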

As a general suggestion, do not use file size as a metric, except to check that a file is non-empty, i.e. that a tool ran and produced output.
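
If a stronger check than file size or flagstat is wanted, one option is to compare the records themselves, ignoring their order. The helper below is a sketch, not an established samtools feature: the name `bam_record_checksum` is made up, and it assumes that sorting leaves each alignment record unchanged, so that only the order (and the header, which `samtools view` omits by default) differs.

```shell
# Hypothetical helper: order-independent checksum of the alignment records.
# Dumps the records as SAM text (no header), sorts the lines so the original
# order no longer matters, and checksums the result.
bam_record_checksum() {
    samtools view "$1" | LC_ALL=C sort | cksum
}

# Example comparison (file names taken from the question):
# [ "$(bam_record_checksum ENCFF653FOQ.bam)" = \
#   "$(bam_record_checksum ENCFF653FOQ.sorted.bam)" ] && echo "same records"
```

Matching checksums mean both files contain the same set of records, which is a more direct answer than comparing sizes.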
