Question: Read Counts For Genes
1
gravatar for Varun Gupta
8.6 years ago by
Varun Gupta1.2k
United States
Varun Gupta1.2k wrote:

Hi everyone

I am working on RNA SEQ data for pombe genome. I mapped the reads with the pombe genome and created sam/bam files.

Now my question is i have a list of pombe genes with its coordinates.

I want to find the number of read counts for those genes.

How can i do that. Any code would be really helpful.

I will explain a bit more.

Suppose my gene coordinate for gene SPAC227.11c is 567876 568100

Now i want to find the read coordinates for the gene SPAC227.11c.

How should i do it.

Hope to hear from you guys soon

Regards Varun

read counts • 6.1k views
ADD COMMENTlink written 8.6 years ago by Varun Gupta1.2k
8
gravatar for David Langenberger
8.6 years ago by
Deutschland
David Langenberger9.5k wrote:

First of all you need a contig identifier, which is the header of the fasta file you mapped your data to (in human it would be the chromosome).

If your gene coordinate for SPAC227.11c is then something like:

contig = ??? (let's assume it is chr1 here)

start = 567876

end = 568100

Just call:

samtools view yourBAMfile.bam chr1:567876-568100

This command will give you all mapped read positions overlapping with SPAC227.11c.

samtools view yourBAMfile.bam chr1:567876-568100 | wc -l

will give you the read count then.

To make that thingy high throuhput, you can try BEDtools:

bedtools multicov -bams yourBAMfile.bam -bed yourGenes.bed

This call will return your bed file with a new column at the end, containing the read counts overlapping with your genes.

You can create a bed out of your coordinates by trying something like this:

   perl -ane 'print "yourContig\t$F[1]\t$F[2]\t$F[0]\t0\t*\n";' yourGenes.file >
    yourGenes.bed
ADD COMMENTlink written 8.6 years ago by David Langenberger9.5k

actually the contig-name your read was mapped to is written in the third column of your sam file.

ADD REPLYlink written 8.6 years ago by David Langenberger9.5k

Hey David

This works. But i will explain you how my yourGenes.file looks like column1 = chromosome column2 = gene_start(ofcourse 1-based) column3 = gene_end(ofcourse 1 -based)

Now for a bed file we just want to change column2 such that we have value 1 less than the original number i.e if column2 = gene_start is 567876 then in the output of bed file it would be 567875 and column3 = gene_end would remain the same which is 568100.

Is Your perl code above doing the same thing?

Let me know if i am wrong somewhere.

Varun

ADD REPLYlink written 8.6 years ago by Varun Gupta1.2k

Sorry, but I actually have no idea what you are talking about! ;)

But if you just want to subtract 1 of your start or end position, you can do it like this:

perl -ane 'print "$F[0]\t".($F[1]-1)."\t".($F[2]-1)."\t$F[0]\t0\t*\n";' yourGenes.file > yourGenes.bed

ADD REPLYlink written 8.6 years ago by David Langenberger9.5k

sorry David,

i have a bed file came from converting the accepted_hits.bam (tophat2 output) but in the column one there is no gene feature or just Latin numbers I, II, II, IV, so one...i used gtf file but i don't know where is the gene coordinate... then how i can get how many reads have mapped to each gene????

thank you

ADD REPLYlink modified 5.1 years ago • written 5.1 years ago by A3.9k
3
gravatar for Damian Kao
8.6 years ago by
Damian Kao15k
USA
Damian Kao15k wrote:

You can use [?]htseq-count[?]. It will take a .gtf annotation file and your .sam file as input.

ADD COMMENTlink written 8.6 years ago by Damian Kao15k

Hi Dk I don't have a gff file just the gene name and start and end coordinates.

ADD REPLYlink written 8.6 years ago by Varun Gupta1.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 693 users visited in the last hour