Question: Extracting Read Count For Each Gene/Exon From Rna-Seq Bam Files
11
gravatar for Sahel
7.3 years ago by
Sahel250
CANADA
Sahel250 wrote:

Hello all,

I have some RNA-seq bam files which I would like to submit to edgeR for differential expression analysis. My problem is how to extract the read count for each gene/exon from the bam files. I have followed what explained here "http://seqanswers.com/wiki/How-to/RNASeq_analysis" but for some reason that I could not figure out, I only get 0 for the read counts. I also tested "samtools idxstats aln.bam" but it only ruterns read counts for each chromosome not genes. So I was wondering if there is an other way for extraction of read counts from bam files? Also it would be very appreciated if you could give an idea what might have gone wrong with the other method that I used and got only 0.

Thanks in advance :)

ADD COMMENTlink modified 3.2 years ago by Biostar ♦♦ 20 • written 7.3 years ago by Sahel250

It would be helpful to paste in the code you ran along with the result of sessionInfo().

ADD REPLYlink written 7.3 years ago by Steve Lianoglou5.0k

Hi there, I have a problem along the same line. I have used Trinity to do a de novo assembly and now I want to generate the counts from all my experimental cells using the de novo assemly as my reference. I want to use these counts doing differential expression work using edgeR, say. So in short, I have a de novo reference transcriptome in fasta format, and PE fastq files from each of my 30 experimental cells. I have not figured out if this is suitable for HTSeq, or if there is any other software suitable out there. Comments/suggestions appreciated. Thanks. jahn

ADD REPLYlink written 5.7 years ago by jahndavik0
9
gravatar for Rm
7.3 years ago by
Rm7.9k
Danville, PA
Rm7.9k wrote:

Try htseq-count: it will take annotation file like gtf or bed and your sam file

"htseq-count [options] <sam_file> <gff_file>"
ADD COMMENTlink written 7.3 years ago by Rm7.9k

Hi, Rm

I am wondering how to extract raw reads from the de novo assembly without a gff file available?Thanks!

ADD REPLYlink written 6.1 years ago by mad.cichlids110

@ Baoqing: Can you post it as saperate question to receive the answers

ADD REPLYlink written 6.1 years ago by Rm7.9k

oh, sorry, I will do that, thanks!

ADD REPLYlink written 6.1 years ago by mad.cichlids110
3
gravatar for David Langenberger
7.3 years ago by
Deutschland
David Langenberger9.0k wrote:

You could also try multicov from BEDtools:

bedtools multicov –bams run.bam -bed genes.bed

If you have more than one BAM file, you can easily get the read counts for all of them:

bedtools multicov –bams run1.bam run2.bam run3.bam -bed genes.bed

It can also handle 12 column BED files (with intron-exon structure, just counting the reads overlapping with the exons)... just download the gene annotations for your species and assembly, using the UCSC Genome Browser (Tables) and safe it as BED file. It's just that simple.

ADD COMMENTlink written 7.3 years ago by David Langenberger9.0k

Hi David, Thanks for the comment. I followed exactly your instructions... my data is hg18 so I got the bed file from UCSC for hg18 and ran: ./BEDTools-Version-2.16.2/bin/bedtools multicov -bams TAnalysis/sample1/sample1.sorted.bam -bed hg18.BED > Sample1_readCounts.txt but it immediately terminates and gives me all 0s in the readcount file... How I can check what is wrong ??? Thank you so much for your help :)

ADD REPLYlink written 7.3 years ago by Sahel250

Did you sort and index your bam file? And did you select "output in bed format" in the Table Browser?

ADD REPLYlink modified 7.3 years ago • written 7.3 years ago by David Langenberger9.0k

Yes, I did sort and index my bam files by doing this: (1) samtools sort sample1-3-lanes.bam sample1.sorted (2) samtools index sample1.sorted.bam . But when I look at the header of the sorted bam, it does not really look sorted! > samtools view -H sample1.sorted.bam | less -- > @HD VN:1.0 SO:coordinate @SQ SN:10 LN:135374737 @SQ SN:11 LN:134452384 @SQ SN:12 LN:132349534 @SQ SN:13 LN:114142980 @SQ SN:14 LN:106368585 @SQ SN:15 LN:100338915 @SQ SN:16 LN:88827254 @SQ SN:17 LN:78774742 @SQ SN:18 LN:76117153 @SQ SN:19 LN:63811651 @SQ SN:1 LN:247249719 @SQ SN:20 LN:62435964 @SQ SN:21 LN:46944323 @SQ SN:22 LN:49691432 @SQ SN:2 LN:242951149 @SQ SN:3 LN:199501827 @SQ SN:4 LN:191273063 @SQ SN:5 LN:180857866 @SQ SN:6 LN:170899992 @SQ SN:7 LN:158821424 @SQ SN:8 LN:146274826 @SQ SN:9 LN:140273252 @SQ SN:M LN:16571 @SQ SN:X LN:154913754 @SQ SN:Y LN:57772954 . Is this the source of problem? Do you know other tools for sorting bams? Thanks alot... :)

ADD REPLYlink written 7.3 years ago by Sahel250

'samtools sort' sorts the mappings by their position and not the header. So, 'bedtools multicov' does not throw any error message? It just terminates and then there are only 0s in the last column? Have you checked only the head of your file, or the complete output? 'multicov' is very fast, so perhaps for the first 1000 entries, there are just no overlaps. You could, by hand, check a gene, that has for sure some overlapping mappings.

ADD REPLYlink written 7.3 years ago by David Langenberger9.0k

"It can also handle 12 column BED files (with intron-exon structure, just counting the reads overlapping with the exons)... "

From what I tested (see below), it's not true; bed12 format is treated same as bed6 (i.e., counting reads overlapping with the whole transcript, not just exons).

$ grep ENST00000002596.5 gencode.v14.annotation.bed15 | cut -f1-12 | bedtools multicov -D -bams ~/scratch/Sample3576H01/acceptedhits.bam -bed stdin chr41139477311431389ENST00000002596.50-1140070511401629026964,1067,0,35549,518

$ grep ENST00000002596.5 gencode.v14.annotation.bed15 | cut -f1-6 | bedtools multicov -D -bams ~/scratch/Sample3576H01/acceptedhits.bam -bed stdin chr41139477311431389ENST00000002596.50-518

$ grep ENST00000002596.5 gencode.v14.annotation.gtf | grep exon | bedtools multicov -D -bams ~/scratch/Sample3576H01/acceptedhits.bam -bed stdin chr4HAVANAexon1143032311431389.-.geneid "ENSG00000002587.5"; transcriptid "ENST00000002596.5"; genetype "proteincoding"; genestatus "KNOWN"; genename "HS3ST1"; transcripttype "proteincoding"; transcriptstatus "KNOWN"; transcriptname "HS3ST1-001"; level 2; tag "basic"; tag "CCDS"; ccdsid "CCDS3408.1"; havanagene "OTTHUMG00000090547.4"; havanatranscript "OTTHUMT00000207073.3";33 chr4HAVANAexon1139477411401737.-.geneid "ENSG00000002587.5"; transcriptid "ENST00000002596.5"; genetype "proteincoding"; genestatus "KNOWN"; genename "HS3ST1"; transcripttype "proteincoding"; transcriptstatus "KNOWN"; transcriptname "HS3ST1-001"; level 2; tag "basic"; tag "CCDS"; ccdsid "CCDS3408.1"; havanagene "OTTHUMG00000090547.4"; havanatranscript "OTTHUMT00000207073.3";415

btw, I am using bedtools v2.16.2.

ADD REPLYlink written 6.7 years ago by Xianjun250
1

OK. I saw bedtools v2.17.0 has already added the option -split to fix the problem. Thanks

ADD REPLYlink written 6.7 years ago by Xianjun250

Hi David. Thanks so much for helping us. I tried to run this command But I did get results. I believe that the problem is the bed file. I would to know what the next columns I need to include to get my table in ucsc and then my genes.bed

name Name of gene chrom Reference sequence chromosome or scaffold strand + or - for strand txStart Transcription start position txEnd Transcription end position cdsStart Coding region start cdsEnd Coding region end exonCount Number of exons exonStarts Exon start positions exonEnds Exon end positions proteinID UniProt display ID, UniProt accession, or RefSeq protein ID alignID Unique identifier (GENCODE transcript ID for GENCODE Basic

Thanks

ADD REPLYlink written 2.8 years ago by jleandroj0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 542 users visited in the last hour