Question: Count the number of reads for multiple genomic locations on different (not all) chromosomes
0
gravatar for vkgaur25
13 days ago by
vkgaur250
vkgaur250 wrote:

I have a Illumina cancer panel data and after generating a bam file I want to find the total number of reads mapped to genome on some coordinates that I got from a bed file. I know that I can do that using samtools view while mentioning the exact position but I have 212 coordinates to look at and if I give a bed file mentioning 212 coordinates with bam file it would output a file containing reads and other details with coordinates but not total counts of reads in every single coordinate. How can I do that?

rna-seq • 212 views
ADD COMMENTlink modified 10 days ago by finswimmer1.5k • written 13 days ago by vkgaur250

Thanks a lot everyone ! :)

ADD REPLYlink written 12 days ago by vkgaur250

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.
Upvote|Bookmark|Accept

ADD REPLYlink modified 10 days ago • written 10 days ago by WouterDeCoster27k

I am sorry , might sound silly but I am not finding upvote link.

ADD REPLYlink written 3 days ago by vkgaur250
1

See those images in screenshot above corresponding to the top left of each answer. Only upvote available on comments.

ADD REPLYlink written 3 days ago by genomax46k

This is what happens after no sleep at all. I am so stupid. Thanks a lot @genomax

ADD REPLYlink written 3 days ago by vkgaur250
3
gravatar for Alex Reynolds
12 days ago by
Alex Reynolds23k
Seattle, WA USA
Alex Reynolds23k wrote:

It's pretty easy with BEDOPS bedmap --count and bam2bed:

$ bedmap --echo --count --delim '\t' regions.bed <(bam2bed < reads.bam) > answer.bed

The file answer.bed will contain regions from the sorted reference file regions.bed, along with the counts of reads from reads.bam that overlap (or map to) those regions in the last column.

One base of overlap is considered when doing counts. If you want something more stringent, use --fraction-map or --fraction-ref or other options, depending on what you need. See bedmap --help or the online docs for more information on overlap criteria.

If you want to do more than --count reads, there are other operators, like --bases, --bases-uniq, --echo-overlap-size, and others, which can be useful for calculating overlap, coverage, and other statistics derived from mapped reads. See the online documentation for a full discussion.

The file regions.bed should be sorted. If you don't know the sort order of your regions of interest, just sort them first with sort-bed:

$ sort-bed regions.unsorted.bed > regions.bed

Then run bedmap as described. Easy-peasy, lemon-squeezy.

ADD COMMENTlink modified 10 days ago • written 12 days ago by Alex Reynolds23k

That sounds promising. Thanks a lot.

ADD REPLYlink written 3 days ago by vkgaur250
0
gravatar for finswimmer
13 days ago by
finswimmer1.5k
Germany
finswimmer1.5k wrote:

Hello,

I think featureCounts is your friend.

fin swimmer

ADD COMMENTlink written 13 days ago by finswimmer1.5k
1

featureCounts won't take a BED file. So coverageBed from BEDtools may be an easier option.

ADD REPLYlink written 12 days ago by genomax46k

Dear genomax,

in BEDtools coverageBED, one gets the following output (supposing A is a gene and B is reads from a bam file):

1 The number of features in B that overlapped (by at least one base pair) the A interval. 2 The number of bases in A that had non-zero coverage from features in B. 3 The length of the entry in A. 4 The fraction of bases in A that had non-zero coverage from features in B.

Is it correct to say that summing up the column 2 will give the total genic length (i.e. total number of base pairs along genes) that are targeted, for example, in an IP?

This is assuming that the genes are provided in a bed file with two bam files: IP.bam and Input.bam. and the bins enriched for reads in IP are filtered for further calculation of genic length.

Many thanks for your advice!

ADD REPLYlink written 10 days ago by Zee_S0

I tried bedtools. Worked, Thanks

ADD REPLYlink modified 3 days ago • written 3 days ago by vkgaur250
0
gravatar for finswimmer
10 days ago by
finswimmer1.5k
Germany
finswimmer1.5k wrote:

Hello,

just for fun here is another way using gnu parallel, samtools view and awk.

awk -v OFS="\t" '{print $1, $2+1,$3,$4}' genes.bed|parallel --colsep '\t' -k "samtools view your.bam {1}:{2}-{3}|wc -l|awk -v OFS=\"\t\" '{print \"{1}\",int({2})-1,\"{3}\",\"{4}\",\$1}'"

<del>It has just a minor flaw: The coordinates in your bam file are 0-based. samtools view uses 0-based. So your region will start 1 base early. But this might be ok in your case.</del>

fin swimmer

ADD COMMENTlink modified 10 days ago • written 10 days ago by finswimmer1.5k

Dear finswimmer,

Could you please elaborate why my method using bedtools coverageBED may be flawed? and why in the end you say it might not matter after all for the gene length calculation?

Im new to this type of analyses and it would help me a lot to understand your approach. Many thanks.

ADD REPLYlink written 10 days ago by Zee_S0

Hello annymous,

are you vkgaur25? Than why you changed your name?

I didn't say that bedtools coverageBED may be flawed. I described the restriction of my variant. In the original post, there was no question about gene length calculation. You (or vkgaur25) just ask for the number of reads mapped to a specific region. The number will not differ significant if the region is 1 base taller. Depending on the goal of this task this might not be crucial.

fin swimmer

ADD REPLYlink modified 10 days ago • written 10 days ago by finswimmer1.5k

Dear fin,

Thank you for your reply, I am not vkgaur25. My apologies if this was misleading from the original question. I happen to be doing a similar analysis as vkgaur25 so I was just wandering how one could go one step further and do an estimation of the overall length occupied by the number of reads in a specified region such as a gene.

could you please guide me on how to correct for this base change when calculating the gene length? as I describe above, my gene coordinates are in a bed file and my read coordinates are in a bam file.

Many thanks.

ADD REPLYlink written 10 days ago by Zee_S0
1

Maybe I misunderstand you, but if you have your start/end coordinates of the gene already in a bed file you just need to calculate the difference.

One can do this e.g. with awk:

awk -v OFS="\t" '{print $1, $2, $3, $4, $3-$2}' genes.bed

If this is not what you want, please start a new topic as this is not part of the original question here.

fin swimmer

ADD REPLYlink modified 10 days ago • written 10 days ago by finswimmer1.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 957 users visited in the last hour