Question

featureCounts cDNA vs DNA files

0

Entering edit mode

2.2 years ago

camppatrick ▴ 10

featureCounts from the subRead package works quite well with cDNA reads aligned to genomic DNA_ref.fa files, but I am having difficulty getting featureCounts to work with cDNA files aligned to the cDNA_ref.fa.

mRatBN7 GTF File:

https://www.ensembl.org/Rattus_norvegicus/Info/Index?db=core

DNA ref.fa File:

http://ftp.ensembl.org/pub/release-105/fasta/rattus_norvegicus/dna/Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa.gz

cDNA ref File:

http://ftp.ensembl.org/pub/release-105/fasta/rattus_norvegicus/cdna/Rattus_norvegicus.mRatBN7.2.cdna.all.fa.gz

minimap2 used for mapping Nanopore cDNA reads over to the different reference files. The data revealed by samtools flagstat indicates that my cDNA data aligns better to the cDNA?_ref.fa file, and would like to use this data... if only featureCounts would cooperate. But I always get 0.00% reads aligned with the cDNA .

featureCounts( DNA.sam / cDNA.sam, annot.ext = GTF_mRatBN7, isGTFAnnotationFile = TRUE)

It was suggested to me to use samtools idxstats and using the table directly from there as my gene output, but it doesn't quite feel right.

Any suggestions?

PS. I have to use R for subReads package as it is not yet fluid with the newer Apple M1 chips, however I do not suspect this is the problem as I experienced this also on Ubuntu terminal.

cDNA featureCounts subRead RNA • 1.7k views

ADD COMMENT • link 2.1 years ago by camppatrick ▴ 10

0

Entering edit mode

featureCounts can also be run as a separate tool, but as GenoMax stated, even the R version should run seamlessly on M1 chips

ADD REPLY • link 2.2 years ago by Istvan Albert 100k

score 2 · Answer 1 · 2022-02-21

2

Entering edit mode

2.2 years ago

GenoMax 141k

Problem in this case is likely that your GTF file references the chromosome names where as your cDNA file has names for cDNA's. So those two don't match. You could create a simple SAF format file for your cDNA simply listing the names of cDNA's and their start/stops. You could do that using the suggestion you got

It was suggested to me to use samtools idxstats and using the table directly from there as my gene output, but it doesn't quite feel right.

SAF file format example :

GeneID      Chr Start   End Strand
497097      chr1    3204563 3207049 -
497097      chr1    3411783 3411982 -
497097      chr1    3660633 3661579 -
100503874   chr1    3637390 3640590 -
100503874   chr1    3648928 3648985 -
100038431   chr1    3670236 3671869 -

I have to use R for subReads package as it is not yet fluid with the newer Apple M1 chips

Not sure what this means. Anything coded for intel chips should work on M1 via rosetta translation.

ADD COMMENT • link 2.2 years ago by GenoMax 141k

0

Entering edit mode

Header format from bam file

Created SAF file

Hello, Thanks for the input. I created a SAF file of this format using the GTF file and some regex magic in R. I now am running:

featureCounts(<data.bam>, annot.ext = <saf_file.txt> )

And featureCounts runs immediately, returning 0.00% reads.

Things I have tried with no success:

saving the saf_file.txt as saf_file.saf
Putting SN: in front of each transcript name for the saf file
leaving "chr" out of the Chr column

ADD REPLY • link 2.2 years ago by camppatrick ▴ 10

0

Entering edit mode

Subread

Attached error message from trying to make the Subreader file for MacOS. Rosetta is installed on my computer, checked by forcing some apps to open using rosetta.

But I believe this issue is probably better suited for its own thread.

ADD REPLY • link 2.2 years ago by camppatrick ▴ 10

1

Entering edit mode

I created a SAF file of this format using the GTF file and some regex magic in R.

This is not correct. cDNA entries in your transcript files are standalone sequences. They are of length indicated in first screenshot by LN field. You can't substitute chromosomal coordinates in your SAF file. So the entry in your SAF files should look like this.

ENSRNOT00000033201.6 chr12 1 276 -

When you count the BAM file aligned against the cDNA reference these are the sequences/lengths you need to use.

Edit: Added chr field that was missing.

ADD REPLY • link 2.1 years ago by GenoMax 141k

0

Entering edit mode

The subread user guide says these columns are necessary : GeneID Chr Start End Strand

Leaving out the Chr column generates the error that the data is not in saf format.

Unfortunately, adding in the chromosome column (and playing around with the formatting) still results in 0.0% alignments. Am I just on the wrong path here? Should Nanopore cDNA reads just be mapped against the genome and not the transcriptome? Programs such as IGV, htseq, and featureCounts all seem to work a lot smoother with the genomic mapping, and there really aren't any forum threads online of people facing this problem.

ADD REPLY • link 2.1 years ago by camppatrick ▴ 10

1

Entering edit mode

My apologies for missing that.. Can you add chr names from your other file but use the lengths of transcripts as start and stop?

ADD REPLY • link 2.1 years ago by GenoMax 141k

0

Entering edit mode

saf file

featureCounts output

I am posting the example saf file I made. To note is that I tried many different variations of formatting on the GeneID, Chr, and Strand columns, ie: changing +/-, with/without "chr", etc... I also made sure to include transcripts that had high counts in the idxstats output.

These all yield 0% reads.

ADD REPLY • link 2.1 years ago by camppatrick ▴ 10

0

Entering edit mode

Keep in mind that featureCounts does not count multi-mapping reads by default. If you have more than one transcript for genes, this is likely happening. You can add -M option to count multi-mapped reads.

In any case you transcripts are small and not sure how long your nanopore reads are. Examining alignments carefully is in order.

ADD REPLY • link 2.1 years ago by GenoMax 141k

0

Entering edit mode

Hm. Unfortunately I don't think it is an issue of multi mapping, but still rather featureCounts is not interpreting some syntax somewhere.

The option -M also produces 0 reads.

To note: I am using a small subset of data for quick alignment and pipeline testing. Hence the low number of reads (this is also just barcode 1 of 12. My reads are 100-1000bp in length, and I have a few thousand reads mapping to the genome (minimap -ax splice ref = genomic file), and ~30000 total primary alignments to the cDNA reference.

Perhaps another column of information is needed in the .saf file? Or a specific syntax of some column is false. I am at a loss.

ADD REPLY • link 2.1 years ago by camppatrick ▴ 10