Question: how to extract counts from bam/merge-bam without gtf
gravatar for Shahzad
17 months ago by
Shahzad10 wrote:

I have aligned my RNA-seq PE on Transcriptome with no gtf file available using Bowtie2.

Used Samtools to make bam files and sorted.

Used sambamba merge to merge the sorted bam replicates and resorted using samtools.

bowtie2 -x ../refrence-transcriptome -1 rep1_1.fq -2 rep1_2.fq -S sample1.sam --no-unal 
samtools view -Sb 1.sam > 1.bam
samtools sort 1.bam -o 1_sorted.bam
samtools index 1rc_sorted.bam
sambamba merge mergesamples.bam 1.bam 2.bam 3.bam

I want to know what tools i can use to get counts with no gtf/gff/gff3 available and only have ref.fa?

hisat2 rna-seq bowtie2 samtools R • 797 views
ADD COMMENTlink modified 17 months ago by michael.ante3.6k • written 17 months ago by Shahzad10
gravatar for ATpoint
17 months ago by
ATpoint31k wrote:

A couple of things:

  1. You should align your RNA-seq data to the genome rather than the transcriptome, but this only makes sense of you have a GTF and splice sites. Given you had these, also use a splice-aware aligner such as hisat2 or star.
  2. Merging replicates (given these are biological replicates) is not recommended as you lose any information about biological variability between the samples.

EDIT: See also michael.ante's answer below on why alignment is an inferior choice here.

I you situation, given the absence of a GTF, I would use salmon to quantify your reads directly against your transcriptome, which you seem to have in fasta format. Salmon is a alignment-free (pseudoaligner) tool with extensive documentation. Spend some quality time reading it and come back in case of questions. Do the quantification for each fastq files separately, given these are biological replicates.

ADD COMMENTlink modified 17 months ago • written 17 months ago by ATpoint31k

thank you for guiding. reference genome is not available yet. I ll try salmon and let you know about the output. Thank you.

ADD REPLYlink modified 17 months ago • written 17 months ago by Shahzad10
gravatar for michael.ante
17 months ago by
michael.ante3.6k wrote:

Hi Shahzad,

If you have only the transcriptome annotation, I'd go for Salmon .

In case your species has multiple transcript per gene, you might run into troubles with the multimapping reads. Per default bowtie2 reports one alignment per read(-pair). Even if you have several with the same score. Thus, if several transcripts of a gene share an exon, all reads aligning on this exon will be assigned "randomly" to just one transcript.

If your species has only one transcript per gene, you can try to use samtools idxstats on your bam files to get the alignments per transcript.



ADD COMMENTlink written 17 months ago by michael.ante3.6k

how to proceed further with idxstat output. I have done it but I thing I have done any mistake DESeq did not accept it. Can you please guide about how to format the output tables for DESeq or EdgeR?

ADD REPLYlink modified 17 months ago • written 17 months ago by Shahzad10

With samtools idxstat, you get a table of the contig name, the contig's length, the number of alignments, and the number of unmapped reads. Having such a table in R. you only need to select the first column as row names and the third column as the counts. You also might discard the last row.

ADD REPLYlink written 17 months ago by michael.ante3.6k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1224 users visited in the last hour