Question

Outputting Unassigned reads from Rsubread featureCounts

0

Entering edit mode

3.2 years ago

David M ▴ 10

I am fairly new RNA-seq analysis and recently analyzed my first few datasets. To assign reads from my mapped bam file I used the featureCounts function of Rsubread in R which worked really well with >98% of my reads being assigned to features.

That being said I work in an organism with a genome that may have gaps in its annotation in particular for the type of features that I am interested in. What I am hoping to find out is if there is a way to output the reads that were not assigned to features by featureCounts so that I can take them and process them using another software to determine if there is particular enrichment in certain unannotated regions that may represent novel RNAs that have not been characterized.

Any advice on if this is possible and how to do it would be greatly appreciated. Alternatively if anyone has a suggestion on another way of doing this I would also certainly be open to that!

Thank you.

RNA-Seq R featureCounts Rsubread • 2.4k views

ADD COMMENT • link updated 3.2 years ago by Carlo Yague 8.7k • written 3.2 years ago by David M ▴ 10

GenoMax · Answer 1 · 2021-01-25

1

Entering edit mode

3.2 years ago

Carlo Yague 8.7k

One way to solve this is to use the reportReads option of featureCounts.

reportReads: logical indicating if read counting result for each read/fragment is saved to a file. If TRUE, read counting results for reads/fragments will be saved to a tab-delimited file that contains four columns including name of read/fragment, status(assigned or the reason if not assigned), name of target feature/meta-feature and number of hits if the read/fragment is counted multiple times. Name of the file is the same as name of the input read file except a suffix `.featureCounts' is added. Multiple files will be generated if there is more than one input read file.

From there, you can parse the file to extract the names of the unassigned reads and save it into unassigned_IDs.txt. Then you could use picard FilterSamReads with READ_LIST_FILE=unassigned_IDs.txt. A simple grep -f unassigned_IDs.txt aligned.bam would also work, but is slower than picard for this task.

ADD COMMENT • link 3.2 years ago by Carlo Yague 8.7k

1

Entering edit mode

I would suggest using filterreadsbyname.sh from BBMap for extracting reads from your original data files. Should be plenty fast and no need for reads-groups etc.

ADD REPLY • link 3.2 years ago by GenoMax 141k

0

Entering edit mode

I will give this a try once I have the report reads output as well! Thank you!

ADD REPLY • link 3.2 years ago by David M ▴ 10

0

Entering edit mode

Thank you very much for the response! I will give this a try. I had try doing something with reportReads option before, but not what you are recommending!

ADD REPLY • link 3.2 years ago by David M ▴ 10

0

Entering edit mode

I was hoping to perhaps get some more advice here. I have generated a txt file containing the names of all reads that were not assigned to features, and the number corresponds perfectly with the number of Unassigned_NoFeature reads I get from featureCounts. However, after I run picard on the original bam file to get out these reads it brings pretty large number of additional reads that would not be in the txt file list.

When I run my new picard output bam file back through featureCounts I get the same number of Unassigned_NoFeature reads (suggesting that my list has all of these as expected), in the case of one example of mine approximately 500,000 but I also have around 3,400,000 reads that are assigned to features, which should not have been pulled out by picard.

The general outline of the picard command I used is:

java -jar picard.jar FilterSamReads INPUT=Totalmappedlibrary.bam OUTPUT=NoFeatures_output.bam READ_LIST_FILE=Unassigned_NoFeatures.txt FILTER=includeReadList

Is there a possible reason for this? Perhaps an additional command I should be using here or something? Any additional advice would be greatly appreciated!

ADD REPLY • link 3.2 years ago by David M ▴ 10

1

Entering edit mode

That is weird... A possibility is that you have additional alignments for some reads (supplementary alignments or secondary alignments), so one read ID corresponds to multiple sam/bam entries.

what is the output of:

samtools flagstat NoFeatures_output.bam

ADD REPLY • link 3.2 years ago by Carlo Yague 8.7k

1

Entering edit mode

I went back and used BBmap to pull out the reads from the Fastq files and then remapped it. Upon inspecting the mapping output and trying this with featureCounts I believe it likely was not an issue with picard but instead a setting on featureCounts and you were correct it was how multimapping reads were being interpreted I think.

This should essential resolve my issue! Thank you very much for your help!

ADD REPLY • link 3.2 years ago by David M ▴ 10

0

Entering edit mode

I thought maybe something similar but was not certain how to check this. Based no the output I am not sure that is the case however.

Samtools flagstat on one of the output.bam files gives:

7614031 + 0 in total (QC-passed reads + QC-failed reads)

41579 + 0 secondary

0 + 0 supplementary

0 + 0 duplicates

7614031 + 0 mapped (100.00% : N/A)

7572452 + 0 paired in sequencing

3786226 + 0 read1

3786226 + 0 read2

7572452 + 0 properly paired (100.00% : N/A)

7572452 + 0 with itself and mate mapped

0 + 0 singletons (0.00% : N/A)

0 + 0 with mate mapped to a different chr

0 + 0 with mate mapped to a different chr (mapQ>=5)

Thank you!

ADD REPLY • link 3.2 years ago by David M ▴ 10

1

Entering edit mode

Can you post output of head -5 Unassigned_NoFeatures.txt?

At least with BBMap you need to remove the starting @ in the fastq ID's in the list file. Not sure if that is what picard needs as well.

ADD REPLY • link 3.2 years ago by GenoMax 141k

0

Entering edit mode

Yes certainly! The output is:

GWNJ-0957:691:GW2011143517th:5:2122:3660:21298
GWNJ-0957:691:GW2011143517th:5:1108:31659:34922
GWNJ-0957:691:GW2011143517th:5:1120:16853:53574
GWNJ-0957:691:GW2011143517th:5:1208:14966:16516
GWNJ-0957:691:GW2011143517th:5:1209:17056:43026

I had looked into BBmap and could only find the ability to input Fastq files. I am hoping to used the already mapped bam files as input. Does BBmap accept .bam as input rather than Fastq with the same settings, or perhaps is there a flag that I for some reason could not find for this?

Thank you again!

ADD REPLY • link updated 3.2 years ago by GenoMax 141k • written 3.2 years ago by David M ▴ 10

1

Entering edit mode

In case you are not able to make picard work then you can get the reads from your original data file and simply map them again. To extract the reads you can run the following

filterbyname.sh in=R1.fq.gz in2=R2.fq.gz out=filt_R1.fq.gz out2=filt_R2.fq.gz names=Unassigned_NoFeatures.txt include=t

ADD REPLY • link 3.2 years ago by GenoMax 141k

0

Entering edit mode

Great thank you! It is good to know there is an alternative option if picard does not work! I appreciate the help!