STAR alignment not detecting some transcripts / custom reference
0
0
Entering edit mode
4 weeks ago
npont ▴ 10

Hi all!

I have a scRNA-seq dataset and I use STAR for the alignment.

In the experimental conditions, my colleague added a certain gene and I would like to detect its transcripts levels. Therefore I added this bacterial gene to my genome annotation (I am working with the zebrafish) as follows at the end of the gtf file:

77      AddedGenes      exon    1       1422    .       +       0       gene_id "dcm"; gene_name "dcm"; transcript_id "dcm"; exon_id "dcm"; exon_number "1";

My final output is a count matrix, but here is the problem: I don't get to see the "dcm" gene name in the list of genes (while even genes with 0 expression are still reported in the count matrix).

To debug the source of the problem, I manually aligned my fastq files to a fasta file containing just the dcm gene using bwa. This reported matches.

But then I tried with STAR as this is the aligner I use in my workflow. Doing the following:

STAR --runMode genomeGenerate --genomeDir star_dcm_index --genomeFastaFiles dcm.fa --runThreadN 4 --genomeSAindexNbases 5

STAR --genomeDir star_dcm_index --readFilesIn myfile_R1.fastq.gz --readFilesCommand zcat --runThreadN 4 --outSAMtype SAM --outFileNamePrefix star_alignment/

samtools view star_alignment/Aligned.out.sam

But this does not report any match. So I think the problem of not detecting the dcm in the list of genes in the count matrix might come from here.

Has anyone any idea about what is occurring? I guess someone experienced with aligners would understand the cause of the issue.

Thank you very much for your help and time:)

star bwa rna-seq alignment scrna-seq • 1.4k views
ADD COMMENT
0
Entering edit mode

What is this gene supposed to do? Is it being integrated in zebrafish genome and then expected to be expressed?

Are there other genes in zebrafish genome that have sequence similarity to this gene?

You could try using bbduk.sh from BBMap suite in filter mode to see if there are any reads that match the sequence of dcm like below.

bbduk.sh -Xmx4g in=your_R1.fastq.gz ref=dcm.fa outm=reads_for_dcm.fastq.gz
ADD REPLY
0
Entering edit mode

Hello,

Thanks for your replies.

GenoMax concerning your questions "What is this gene supposed to do? Is it being integrated in zebrafish genome and then expected to be expressed?" -->The gene I am trying to add is a bacterial gene and is called DCM. It is therefore not included in the genome of my model under study (the zebrafish). I am trying to assess the levels of DCM because it is of interest in our experiment. It is expected to be expressed but in low quantities. Moreover according to blast results, there is no similar sequence in the zebrafish genome.

Using the bbduk.sh I get two reads matching the dcm and this is the same result as when I use bwa to align; while STAR outputs 0 match.

rfran010 If I dig in a bit in the results of the alignment, for example for R1 of a given cell I get two reads (out of 759'252 reads) matching the DCM gene and having the output SAM file is as follows:

VH00505:409:AAGHFNHM5:1:1104:60154:31594        0       77      1217    60      21S40M  *       0       0       ATTGCGCAATGTGGGGAACGACTCGGGAATGCGCGCGCTTAATGGGTTTTGAAGCGCCGGG   IIIIIIIIIIIIIIIII-IIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIIII9III       NM:i:0  MD:Z:40 AS:i:40 XS:i:0

VH00505:409:AAGHFNHM5:1:2303:8442:32749 0       77      1217    60      20S41M  *       0       0       ATTGCGCAATGGGGGCACGACTCGGGAATGCGCGCGCTTAATGGGTTTTGAAGCGCCGGGA   9IIIIIIIII9IIII9IIIIIIIIIII9IIIIIIIIIIII99IIIIIIIIIIIIII9IIII       NM:i:0  MD:Z:41 AS:i:41 XS:i:0

So there is very few read matching but high score. Moreover, my pipeline generating the count matrix does:

  • add UMI
  • align with STAR as follows STAR --runThreadN 20 --genomeDir {input.genomedir} --genomeLoad LoadAndRemove --readFilesCommand zcat --outFileNamePrefix {params.prefix} --outSAMtype BAM Unsorted --readFilesIn {input.R1} {input.R2}
  • sort reads by name for HTseq to be used in umicount
  • calls umicount of https://github.com/leoforster/umicount/blob/master/umicount/umicount.py --> looking at the code it considers both genes and exons so I don't think this is the problem

Thanks again for your time!

ADD REPLY
0
Entering edit mode

Looking at the reads, perhaps parameters like --outFilterScoreMinOverLread or --outFilterMatchNminOverLread would need adjustment? See manual, but default seems to be set at 0.66 of total read length (for PE, includes total paired length).

First read, seems like would score 40/61 = 0.656, so maybe is expected to be filtered out in STAR alignment. Technically, I think the second read should be okay, since it seems like it would score high enough, but paired end mapping may change that?

ADD REPLY
0
Entering edit mode

I don't know how scoring with STAR works exactly, but I took a quick look at some of my star alignments. I get total alignment score of 300 for paired reads that have 151 matches (302 total matches with no gaps/mismatches/clipping etc.). So, maybe even your second alignment isn't expected to qualify with default settings.

ADD REPLY
0
Entering edit mode

Using the bbduk.sh I get two reads matching the dcm and this is the same result as when I use bwa to align; while STAR outputs 0 match.

and

It is expected to be expressed but in low quantities.

Then it appears that your gene is not expressed since the alignments you show are only for 2/3rd part of the read (that matches DCM).

If the gene is not stably integrated into zebrafish genome then this result may not be surprising. Unless you were driving the expression of the gene with a promoter that works in zebrafish (I am speculating since we don't know the exact experiment you are doing).

ADD REPLY
0
Entering edit mode

To debug the source of the problem, I manually aligned my fastq files to a fasta file containing just the dcm gene using bwa. This reported matches.

More information on this would be helpful. If you aligned to a fasta with only the dcm gene, then maybe it's matches are more spurious. Did you check into this?

How are you getting the count matrix? If it's in the GTF, it should come up as a count, but depends on the command. For example, it may not count if you don't have gene/exon structure in the GTF (meaning gene and exon entries in column 3)...

ADD REPLY
0
Entering edit mode

Hi both GenoMax and rfran010

I tried using lower --outFilterScoreMinOverLread or --outFilterMatchNminOverLread as suggested by rfran010 but even for extremely low thresholds it does not output any match.

As stressed out by GenoMax , having only 2/3rd of the read matching the gene (for only two reads or less depending on the cell) might suggest that by gene is not expressed. The experiment I am doing is the following: a Decylmethyltransferase (DCM) is coupled to the Polymerase, such that when the Polymerase starts the transcription, the DCM should be expressed and its translated protein should methylate the active genes (as it is coupled to the polymerase). While we indeed observe DCM methylation in our data, we might not have DCM transcripts levels high enough to detect them especially as the timepoint of cells sequencing is somehow advanced so the transcripts signal might have decreased. What I don't understand tho is why some genes with zero expression levels are in the count matrix while the DCM is not.

Thanks again!

ADD REPLY
0
Entering edit mode

What I don't understand tho is why some genes with zero expression levels are in the count matrix while the DCM is not.

That is likely because you only added one line for the gene with attribute exon.

As noted by @rfran010, if you add another line with attribute gene then you should see counts in your matrix. So the complete "gene" entry will look like

77      AddedGenes      gene    1       1422    .       +       0       gene_id "dcm"; gene_name "dcm"
77      AddedGenes      exon    1       1422    .       +       0       gene_id "dcm"; gene_name "dcm"; transcript_id "dcm"; exon_id "dcm"; exon_number "1";

While we indeed observe DCM methylation in our data,

If that is the case then you are below detection limit for tha transcript in the amount of sequencing you have done. You could do deeper sequencing of the library or try qPCR to show that the transcripts for DCM are present.

ADD REPLY
0
Entering edit mode

You are right, I will try those things.

Also I realized that the genes I am adding to my zebrafish genome are bacterial, which means they don't have splicing and therefore STAR might not be working because it is splice-aware.

I will run my pipeline normally with STAR and add a rule to run bwa to align my RNAseq data against my two bacterial genes and finally add the corresponding expression levels to the count matrix derived from STAR alignment.

Thanks again for your time:)

ADD REPLY
0
Entering edit mode

I managed to detect my gene, I think the problem was related to the gtf missing some fields, but with

77      AddedGenes      gene    1       1422    .       +       0       gene_id "dcm"; gene_name "dcm"
77      AddedGenes      exon    1       1422    .       +       0       gene_id "dcm"; gene_name "dcm"; transcript_id "dcm"; exon_id "dcm"; exon_number "1";

it worked!

However, I have very low levels of dcm. As this is a bacterial gene (=not spliced) while STAR is splice-aware, I suspect these very low transcripts levels to be due to STAR expecting some introns and splice junctions. I would like to align my RNAseq reads to just my bacterial gene DCM with an aligner that works fine with non-spliced genomes (such as bwa mem or bbmap). But I am worried that it will be biased if I use as a reference just the gene of interest. Has anyone already done that in the past?

ADD REPLY
0
Entering edit mode

I don't think this has anything to do with STAR and its splice-aware nature. This sequence is not present in your data as we proved with bbduk.sh which looked reads that matches DCM sequence.

You can try bwa orbbmap.sh (the aligner) against just the DCM reference but the result should not change significantly. If it does then you will want to examine the alignments closely. They may be spurious short stretches that map by chance.

ADD REPLY
0
Entering edit mode

I wouldn't focus too much about using a different aligner in your pipeline, instead worrying about expression efficiency, sequencing depth or detection sensitivity if you plan to run the experiment again.

In theory, STAR handles unspliced reads perfectly fine, indeed many genes don't have introns and there's longer exons/RNAs that still need efficient alignment. Plus, although I'm not highly technical, the way I understand it, STAR finds a seed, then extends the read. Once the read doesn't align anymore, it then does a seed search with the remaining sequence. Feeding STAR splice junctions just helps STAR assign reads to the splice junction instead of calling them as unaligned. ( I could be mistaken)

In my experience, the problem is usually unspliced reads being aligned as spliced, but it seems that's mostly at repeats. Also, I didn't have any noticeable problem when I used E coli RNA as a spike in with mammalian RNA extracts.

I'm not sure if 2/3rd of the read alignment would be expected. If this can be accounted for by library fragment size, how you add UMIs, how adapters are added, or some other preparation step, then 2/760k reads maybe isn't too bad, but greater seq depth could give more confidence.

That said, I'm curious how you couple DCM to the polymerase? If it's a fusion, then maybe there's some more adjustments you can make on the analysis end?

ADD REPLY

Login before adding your answer.

Traffic: 3898 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6