Question

Rna seq extracting read counts from BAM file - noob question

0

Entering edit mode

7.3 years ago

wbonxx • 0

Hello,

I have mapped rnaSeq data (illumina) with bwa and produced a bam file (using the human genome 38 as reference).

I would like to filter the bam file for the aligned sequences with the higher number of reads.

I want to use bedtools to produce the counts and then filter out the most expressed genes, but I'm having troubles processing the BAM file, could anyone explain me how to do it in a simple way (i.e. how do I sort for bed)?

At the end I'm looking for the 3 most expressed genes (I would then need to annotate the aligned sequences to look for the genes of interest). Thus what would be the fastest way to filter out and annotate the sequences with the highest number of mapped reads?

Sorry for the newbie question.

edit: 1- I need to sort and add counts to the mapped file 2- extract the top 3 genes after adding annotations

edit2: There may be something wrong here :/
+ 0 secondary 0 + 0 supplementary 0 + 0 duplicates 2 + 0 mapped (0.00% : N/A) 101118 + 0 paired in sequencing 50559 + 0 read1 50559 + 0 read2 2 + 0 properly paired (0.00% : N/A) 2 + 0 with itself and mate mapped 0 + 0 singletons (0.00% : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)

RNA-Seq sequencing alignment next-gen • 5.2k views

ADD COMMENT • link 7.3 years ago by wbonxx • 0

0

Entering edit mode

I'm not sure if I understood your question. Do you want to do the following steps?

Find the 3 most expressed genes in the dataset
Subset your bam to only show those 3 genes

ADD REPLY • link 7.3 years ago by WouterDeCoster 47k

0

Entering edit mode

Add the number of reads x mapped sequence.
Annotate the sequences.
Sort out the first 3-5 more expressed genes.

ADD REPLY • link 7.3 years ago by wbonxx • 0

score 0 · Answer 1 · 2017-01-03

0

Entering edit mode

7.3 years ago

bruce.moran ▴ 960

I would recommend using the featureCounts tool which produces the output that you want. It takes sorted or unsorted BAMs, and produces read counts per feature. It is pretty easy to set up, but any specific questions leave a comment.

To remove the top three most expressed genes, you would sort on the final column (column 7) of output which is the read counts using something like

sort -k7,7n <featureCounts.output>

then edit manually, or use grep to remove from output.

ADD COMMENT • link 7.3 years ago by bruce.moran ▴ 960

0

Entering edit mode

Hi Bruce,

I would like to get a grip on a standard tool, like bedtools, but I will surely give a look at the posted link.

What about adding the counts with bedtools and then annotating the mapped bam file?

ADD REPLY • link 7.3 years ago by wbonxx • 0

0

Entering edit mode

I'd say featureCounts is a standard tool, albeit with a limited functionality compared to BEDTools. HTSeq-counts is also a standard tool that does what you want, but it can take quite a long time comparatively. BEDTools is great, and I use it a lot, but the previous tools specifically annotate and count reads for features, which is what you want to do. I would say that getting used to running a lot of different tools is reasonably important training in bioinformatics.

ADD REPLY • link 7.3 years ago by bruce.moran ▴ 960

0

Entering edit mode

Hi Bruce,

I need specifically to lear how to use bedtools.

But my first impression of featureCounts or the whole package is rather bad. Documentation is poor, website is poor, requires mathlab...

My feeling is that is better to focus on the best suite available for a given function, rather then to keep using tools developed trashing public money, that do something redundant that others do better.

ADD REPLY • link 7.3 years ago by wbonxx • 0

0

Entering edit mode

I need specifically to lear how to use bedtools.

Ok, then ask how to accomplish the task with BEDTools in the first instance. Your question 'supposes' to use BEDTools.