Removing uncovered transcripts from multi FASTA reference file
27 days ago
Hi everyone :)

For RNA-seq analyses, I have a reference file containing multiple transcript sequences (it's a subset of the NCBI human hg38 transcriptome). I found that some of the transcripts are not covered by even a single read (especially where there are several transcript variants), and I would like to exclude them from the file. Is there a way to filter out those records of my multi-FASTA file that are covered by fewer than X reads?

I searched for an answer but didn't find any helpful posts.

Thanks a lot in advance and have a nice day :)

RNASeq NGS Mapping reference FASTA
"that some of the transcripts are not even covered by a single read"

What did you use to align the data and how did you handle multi-mapping reads?

You should consider using a program like salmon (LINK), which will distribute reads across a set of transcripts. It is fast and is the appropriate way of handling data that came from a set of alternatively spliced transcripts.

Seconding this. Salmon will produce a table with the transcript name and the counts for each transcript. From there, it is just a matter of picking out the transcripts that have zero counts and removing them from the original FASTA file, either with grep, samtools faidx or a similar approach. Alternatively, you can load the file into R and use the Biostrings package for the subsetting. Does that make sense to you?
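If you would rather do the filtering step in a small script, here is a minimal Python sketch of the idea, not a tested pipeline. It assumes salmon's quant.sf output (a tab-separated table with the columns Name and NumReads) and that the FASTA record IDs (header text up to the first whitespace) match the names in that table. The function names and the min_reads threshold (your "X") are made up for illustration. Note that NumReads is an estimated count and can be fractional.

```python
import csv


def read_covered_ids(quant_path, min_reads=1.0):
    """Return the transcript names whose NumReads is at least min_reads."""
    keep = set()
    with open(quant_path, newline="") as handle:
        # quant.sf is tab-separated with a header line
        for row in csv.DictReader(handle, delimiter="\t"):
            if float(row["NumReads"]) >= min_reads:
                keep.add(row["Name"])
    return keep


def filter_fasta(fasta_in, fasta_out, keep_ids):
    """Copy only records whose ID is in keep_ids; return how many were kept."""
    kept = 0
    writing = False
    with open(fasta_in) as src, open(fasta_out, "w") as dst:
        for line in src:
            if line.startswith(">"):
                # Record ID = header text up to the first whitespace
                fields = line[1:].split()
                record_id = fields[0] if fields else ""
                writing = record_id in keep_ids
                kept += writing
            if writing:
                # Header and all following sequence lines of a kept record
                dst.write(line)
    return kept
```

You would call it along the lines of keep = read_covered_ids("quant.sf", min_reads=5) followed by filter_fasta("reference.fa", "reference.filtered.fa", keep), where the file names are placeholders for your own paths.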

Thank you as well ATpoint, sounds like this was exactly what I was looking for :) I will try that!

I'm using BBMap for mapping. Since I did not set the "ambig=" option, it should by default assign multi-mapping reads to the first best site encountered.

Great, thanks for the suggestion! I will take a closer look at salmon :)