Question

Removing uncovered transcripts from multi FASTA reference file

0

Entering edit mode

2.6 years ago

ella • 0

Hi everyone :)

for RNASeq analyses, I have a reference file, containing multiple transcript sequences (it´s a subset of the NCBI human hg38 transcriptome). I found, that some of the transcripts are not even covered by a single read (especially if there are several transcript variants) and would like to exclude them from the file. Is there a way how I could filter those sub-FASTAS of my FASTA, that are covered by less than X reads?

I tried to search for an answer, but didn´t find any helpful posting, yet.

Thanks a lot in advance and have a nice day :)

RNASeq NGS Mapping reference FASTA • 1.2k views

ADD COMMENT • link updated 2.4 years ago by GenoMax 141k • written 2.6 years ago by ella • 0

score 1 · Answer 1 · 2021-09-22

1

Entering edit mode

2.6 years ago

GenoMax 141k

that some of the transcripts are not even covered by a single read

What did you use to align the data and how did you handle multi-mapping reads?

You should consider using a program like salmon (LINK) that will distribute reads across a set of transcripts. It is fast and is the appropriate way of handling data that came from a set of alternately spliced transcripts.

ADD COMMENT • link 2.6 years ago by GenoMax 141k

1

Entering edit mode

Seconding this. It will produce a table with the transcript name and the counts for that transcript. From there it is just parsing out the transcripts that have a zero and then remove them from the original fasta file, either with grep, samtools faidx or similar approaches. Does that make sense to you? Or you load the file into R and use the Biostrings package for the subsetting.

ADD REPLY • link 2.6 years ago by ATpoint 81k

0

Entering edit mode

Thank you as well ATpoint, sounds like this was exactly what I was looking for :) I will try that!

ADD REPLY • link 2.6 years ago by ella • 0

0

Entering edit mode

Thanks for your fast answer.

I´m using BBMap for mapping. As I did not use the "ambig=" option, it should by default map the multi-mapping reads to the first encountered best site.

Great, thanks for the suggestion! I will have a closer look to salmon :)

ADD REPLY • link 2.6 years ago by ella • 0

0

Entering edit mode

Hi GenoMax, I was able to install and use salmon, as suggested by you. :) After a storage node crash at the computing center I am using, I have to re-install salmon. Unfortunately error after error after error occurs and I did not yet manage to set it up again. I think the cmake installation is causing the main problem... Is there any other program or R package to simply give me the number of reads, mapped to each fasta of my reference file (using a BAM file as input)? Or is there a way to do it via samtools?

Thanks again for your help!

ADD REPLY • link 2.4 years ago by ella • 0

1

Entering edit mode

Consider using conda to re-install salmon.

to simply give me the number of reads, mapped to each fasta of my reference file (using a BAM file as input)?

You can try samtools idxstats with a sorted/indexed BAM file.

ADD REPLY • link 2.4 years ago by GenoMax 141k