Hi all!
I have a scRNA-seq dataset and I use STAR for the alignment.
In the experimental conditions, my colleague added a certain gene and I would like to detect its transcripts levels. Therefore I added this bacterial gene to my genome annotation (I am working with the zebrafish) as follows at the end of the gtf file:
77 AddedGenes exon 1 1422 . + 0 gene_id "dcm"; gene_name "dcm"; transcript_id "dcm"; exon_id "dcm"; exon_number "1";
My final output is a count matrix, but here is the problem: I don't get to see the "dcm" gene name in the list of genes (while even genes with 0 expression are still reported in the count matrix).
To debug the source of the problem, I manually aligned my fastq files to a fasta file containing just the dcm gene using bwa. This reported matches.
But then I tried with STAR as this is the aligner I use in my workflow. Doing the following:
STAR --runMode genomeGenerate --genomeDir star_dcm_index --genomeFastaFiles dcm.fa --runThreadN 4 --genomeSAindexNbases 5
STAR --genomeDir star_dcm_index --readFilesIn myfile_R1.fastq.gz --readFilesCommand zcat --runThreadN 4 --outSAMtype SAM --outFileNamePrefix star_alignment/
samtools view star_alignment/Aligned.out.sam
But this does not report any match. So I think the problem of not detecting the dcm in the list of genes in the count matrix might come from here.
Has anyone any idea about what is occurring? I guess someone experienced with aligners would understand the cause of the issue.
Thank you very much for your help and time:)
What is this gene supposed to do? Is it being integrated in zebrafish genome and then expected to be expressed?
Are there other genes in zebrafish genome that have sequence similarity to this gene?
You could try using
bbduk.sh
from BBMap suite in filter mode to see if there are any reads that match the sequence ofdcm
like below.Hello,
Thanks for your replies.
GenoMax concerning your questions "What is this gene supposed to do? Is it being integrated in zebrafish genome and then expected to be expressed?" -->The gene I am trying to add is a bacterial gene and is called DCM. It is therefore not included in the genome of my model under study (the zebrafish). I am trying to assess the levels of DCM because it is of interest in our experiment. It is expected to be expressed but in low quantities. Moreover according to blast results, there is no similar sequence in the zebrafish genome.
Using the bbduk.sh I get two reads matching the dcm and this is the same result as when I use bwa to align; while STAR outputs 0 match.
rfran010 If I dig in a bit in the results of the alignment, for example for R1 of a given cell I get two reads (out of 759'252 reads) matching the DCM gene and having the output SAM file is as follows:
So there is very few read matching but high score. Moreover, my pipeline generating the count matrix does:
Thanks again for your time!
Looking at the reads, perhaps parameters like
--outFilterScoreMinOverLread
or--outFilterMatchNminOverLread
would need adjustment? See manual, but default seems to be set at 0.66 of total read length (for PE, includes total paired length).First read, seems like would score 40/61 = 0.656, so maybe is expected to be filtered out in STAR alignment. Technically, I think the second read should be okay, since it seems like it would score high enough, but paired end mapping may change that?
I don't know how scoring with STAR works exactly, but I took a quick look at some of my star alignments. I get total alignment score of 300 for paired reads that have 151 matches (302 total matches with no gaps/mismatches/clipping etc.). So, maybe even your second alignment isn't expected to qualify with default settings.
and
Then it appears that your gene is not expressed since the alignments you show are only for 2/3rd part of the read (that matches DCM).
If the gene is not stably integrated into zebrafish genome then this result may not be surprising. Unless you were driving the expression of the gene with a promoter that works in zebrafish (I am speculating since we don't know the exact experiment you are doing).
More information on this would be helpful. If you aligned to a fasta with only the dcm gene, then maybe it's matches are more spurious. Did you check into this?
How are you getting the count matrix? If it's in the GTF, it should come up as a count, but depends on the command. For example, it may not count if you don't have gene/exon structure in the GTF (meaning gene and exon entries in column 3)...
Hi both GenoMax and rfran010
I tried using lower --outFilterScoreMinOverLread or --outFilterMatchNminOverLread as suggested by rfran010 but even for extremely low thresholds it does not output any match.
As stressed out by GenoMax , having only 2/3rd of the read matching the gene (for only two reads or less depending on the cell) might suggest that by gene is not expressed. The experiment I am doing is the following: a Decylmethyltransferase (DCM) is coupled to the Polymerase, such that when the Polymerase starts the transcription, the DCM should be expressed and its translated protein should methylate the active genes (as it is coupled to the polymerase). While we indeed observe DCM methylation in our data, we might not have DCM transcripts levels high enough to detect them especially as the timepoint of cells sequencing is somehow advanced so the transcripts signal might have decreased. What I don't understand tho is why some genes with zero expression levels are in the count matrix while the DCM is not.
Thanks again!
That is likely because you only added one line for the gene with attribute
exon
.As noted by @rfran010, if you add another line with attribute
gene
then you should see counts in your matrix. So the complete "gene" entry will look likeIf that is the case then you are below detection limit for tha transcript in the amount of sequencing you have done. You could do deeper sequencing of the library or try qPCR to show that the transcripts for DCM are present.
You are right, I will try those things.
Also I realized that the genes I am adding to my zebrafish genome are bacterial, which means they don't have splicing and therefore STAR might not be working because it is splice-aware.
I will run my pipeline normally with STAR and add a rule to run bwa to align my RNAseq data against my two bacterial genes and finally add the corresponding expression levels to the count matrix derived from STAR alignment.
Thanks again for your time:)
I managed to detect my gene, I think the problem was related to the gtf missing some fields, but with
it worked!
However, I have very low levels of dcm. As this is a bacterial gene (=not spliced) while STAR is splice-aware, I suspect these very low transcripts levels to be due to STAR expecting some introns and splice junctions. I would like to align my RNAseq reads to just my bacterial gene DCM with an aligner that works fine with non-spliced genomes (such as bwa mem or bbmap). But I am worried that it will be biased if I use as a reference just the gene of interest. Has anyone already done that in the past?
I don't think this has anything to do with
STAR
and its splice-aware nature. This sequence is not present in your data as we proved withbbduk.sh
which looked reads that matches DCM sequence.You can try
bwa
orbbmap.sh
(the aligner) against just the DCM reference but the result should not change significantly. If it does then you will want to examine the alignments closely. They may be spurious short stretches that map by chance.I wouldn't focus too much about using a different aligner in your pipeline, instead worrying about expression efficiency, sequencing depth or detection sensitivity if you plan to run the experiment again.
In theory, STAR handles unspliced reads perfectly fine, indeed many genes don't have introns and there's longer exons/RNAs that still need efficient alignment. Plus, although I'm not highly technical, the way I understand it, STAR finds a seed, then extends the read. Once the read doesn't align anymore, it then does a seed search with the remaining sequence. Feeding STAR splice junctions just helps STAR assign reads to the splice junction instead of calling them as unaligned. ( I could be mistaken)
In my experience, the problem is usually unspliced reads being aligned as spliced, but it seems that's mostly at repeats. Also, I didn't have any noticeable problem when I used E coli RNA as a spike in with mammalian RNA extracts.
I'm not sure if 2/3rd of the read alignment would be expected. If this can be accounted for by library fragment size, how you add UMIs, how adapters are added, or some other preparation step, then 2/760k reads maybe isn't too bad, but greater seq depth could give more confidence.
That said, I'm curious how you couple DCM to the polymerase? If it's a fusion, then maybe there's some more adjustments you can make on the analysis end?