Why my CD31+ endothelial cells RNAseq didn't express CD31?
1
0
Entering edit mode
3.0 years ago
Youyy • 0

Hi, I was doing RNAseq for CD31 selected human endothelial cells. Unfortunately, My cells didn't express CD31. I don't know why. The RNAseq method is SMART-Seq Stranded Kit (Takara).

I followed the general pipeline, Quality Control-Cutadapt-HISAT2-Stringtie. For the Stringtie, I used the reference file (hg19) to guide assembly.

For the gene abundance estimate of the Stringtie, the CD31(PECAM1) expression(FPKM) is 0.

Has anyone ever met this problem? What's the possible issue? Thanks.

CD31 HISAT2 endothelialcells Stringtie RNAseq • 3.7k views
ADD COMMENT
1
Entering edit mode

Why did you use Stringtie after HISAT2? For the purpose meassuring gene expression it might be sufficient to apply the tool featureCounts on the BAM files from your alignment with HISAT2? This would give you the gene count matrix per sample.

ADD REPLY
0
Entering edit mode

Hi, thank you for your response. I just followed some papers that they had used Stringtie after HISAT2. I will try the FeatureCounts.

ADD REPLY
1
Entering edit mode

Yeah, that "tuxedo" pipeline hisat2-stringtie-ballgown was published in a high impact journal by a well-known group but was actually (at that time) a pipeline for transcript-level rather than gene-level analysis. It is (imho) overly complicated and mostly unnecessary for gene-level analysis. These days the lightweight quantifiers such as salmon and kallisto seem to be preferred by many users, which quantify reads directly against a transcriptome, and then aggregate transcript abundance estimation to the gene level, e.g. via tximport from Bioconductor. No need for stringtie assembly when working with well-annotated organisms unless you are specifically interested in novel transcripts.

ADD REPLY
0
Entering edit mode

I used the featurecounts, however, the geneid of the PECAM1(CD31) is 5175, right? I did search on NCBI. However, the aligned reads for this one is 0.

ADD REPLY
0
Entering edit mode

Is this single-cell RNA-seq? If yes, what platform (e.g. 10X Chromium)?

ADD REPLY
0
Entering edit mode

Thank you for your response.

Not sc RNAseq, but by followed SMART-Seq Stranded Kit (Takara).

ADD REPLY
0
Entering edit mode

Just to double-check, was alignment also done with hg19 as reference? As suggested below try to simply plug the bam file into featureCounts making sure that the genome version of the GTF is the same as your reference genome and then see whether CD31 comes out with some counts. I assume CD31 was confirmed (or even used as selection marker) via FACS?

ADD REPLY
0
Entering edit mode

Yes, I used hg19, a gtf file. I am trying the featureCounts, I never used it before. I am not sure what method they used to confirm the CD31, I only know that they selected the endothelial cells based on CD31 marker.

ADD REPLY
1
Entering edit mode
3.0 years ago
Michael 54k

Are you sure there is no mismatch between annotation file and genome build? I wouldn't rely on a single pipeline in this case, and specifically not the pipeline you are using.

  1. Try to look at the BAM files from the alignment, extract the genomic regions of CD31 from the BAM files and count reads using samtools and visualize the genomic region in IGV. If good coverage is detected, the problem is with stringtie, goto 5 If still no or very few reads are found in this region:
  2. check genomic coordinates again,
  3. run alignment from scratch with a different (more sensitive) aligner (e.g. STAR),
  4. do a QC on the raw reads and the results, if sufficient:
  5. recount, eg. with HTseq, featureCount, if transcript counts needed try replacing step 3,5 by Salmon or Kallisto first.
ADD COMMENT
2
Entering edit mode

You could also load the BAM file into IGV and check whether reads cover CD31, and if so what mapping quality these have. Maybe the gene has some biases such as low complexity that make alignment difficult. Is the kit enriching for the UTRs or is this a full length method?

Edit: Sorry, Michael was already suggesting that, did not properly read his repsonse.

ADD REPLY
1
Entering edit mode

Thant's what I am proposing in step 1. I wasn't quite ready typing :) For practical purposes I would also make small versions (restricted to region of interest) of the BAM files with samtools on the server first, and also count them first, before transfering and loading the whole files. If they are huge, it's just difficult to load them in IGV.

ADD REPLY
0
Entering edit mode

Thank you for your response.

ADD REPLY
0
Entering edit mode

Thank you for your response. let me look at the BAM file.

ADD REPLY
2
Entering edit mode
samtools view -c bamfile.bam chr17:62400411-62407083

should do the counting for hg19

ADD REPLY
0
Entering edit mode

Extract read depth at those coordinates using standard tools.

ADD REPLY
0
Entering edit mode

that is what the command is supposed to do

ADD REPLY
0
Entering edit mode

It was meant for OP, not to your post.. sorry. Instead of loading BAM (subset or not) in IGV, I was suggesting OP to calculate coverage for OP's comment ("let me look at the BAM file")

ADD REPLY
0
Entering edit mode

Ok, no problem. anyway TMTOWTDI

ADD REPLY
0
Entering edit mode

Thank you so much. I ran the command line, the result is 117. Any suggestions for that?

ADD REPLY
2
Entering edit mode

That means there are 117 reads aligned to the CD31 region, which is not 0 and therefore a good result. Now, you could extract the region into a small bam file and visualize it with IGV.

samtools view -o cd31only.bam bamfile.bam chr17:62400411-62407083

Load the small file into IGV together with the genome reference and annotation file and visualize the region.

One more thing: what about strandedness? The read pairs are likely "second (reverse) strand first" did you take this into account when running the pipeline?

ADD REPLY
0
Entering edit mode

Sure, Thank you so much. The strandness is Forward, for every step, I set up the forward strand if there is an option. Because the Smart-seq protocol says it is forward, as follows: Strand-of-Origin Information • Read 1 matches the antisense sequence of the input RNA. • If you are performing paired-end sequencing, Read 2 will correspond to the sense strand (see Figure 3).

ADD REPLY
0
Entering edit mode

Isn't what you describe the opposite: "reverse or 2nd strand first"? I think we are reaching the core of the problem. That is due to the way genome annotations work. When a gene is annotated, the mRNA sequence is of the same directionality as the gene body sequence, such that the mRNA sequence is matching the same strandednes as the gene is annotate to have. That means, if you get antisense sequence of the input RNA, that means it is "reverse" not "forward" sense.

ADD REPLY
0
Entering edit mode

Hi, read1 matches the antisense sequence, so it should be sense(Forward), right? I switched my Stringtie and FeatureCounts with reverse, but I still can not find the CD31 expression. Now I am re-running the HISAT2 with reverse.

ADD REPLY
0
Entering edit mode

No, this means that read1 IS the antisense sequence and therefore is in antisense (reverse) orientation also wrt the gene sequence. "Matches" should be interpreted as has the identical sequence, what you are thinking of is Watson-Crick pairing, that is not the case here.

ADD REPLY
0
Entering edit mode

I went through HISAT2 with reverse, featurecounts with reverse, the 5175 is still 0. I think there must be some other problems.

ADD REPLY
0
Entering edit mode

Indeed, possibly all those reads are intronic. Please do the visualization of the region with gene models in IGV and post a screen shot, alternatively provide the bam file(s).

ADD REPLY
0
Entering edit mode

Sure, thanks, I will do it. But I it's SMART-Seq, by following RNA isolation and cDNA synthesis, I think the reads from intronic couldn't be counted, right?

ADD REPLY
0
Entering edit mode

I have another quesiton. I used featurecount for the bam file. The gene ID of PCAM1 is 5175, right? I could be wrong. https://www.ncbi.nlm.nih.gov/gene/5175 It's 0 for the 5175

ADD REPLY
0
Entering edit mode

I am not sure about the gene id. This is what I got from UCSC genome browser. You mentioned hg19 in the beginning. It might well be that you are mixing up assemblies and annotation versions: (GRCh37 == hg19) vs. (GRCh38/hg38).

Human Gene PECAM1 (uc002jef.2) Description and Page Index DSequence Note: Due to a gap in the GRCh37 reference genome sequence, only a portion of the 3'-most exon of NM_000442.4 aligns to the reference genome sequence. Transcript (Including UTRs)

Position: hg19
chr17:62,396,777-62,407,083 Size: 10,307 Total Exon Count: 2 Strand: -

The Note might also be relevant here. I'd try to download the latest assembly and annotation file from ENSEMBL and start from scratch. The following two should match each other.

Genomic sequence: http://ftp.ensembl.org/pub/release-103/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz

Annotation file: http://ftp.ensembl.org/pub/release-103/gff3/homo_sapiens/Homo_sapiens.GRCh38.103.gff3.gz

ADD REPLY
0
Entering edit mode

Please check the quality (phred score) of the mapped reads. Check the reads mapped to house keeping genes in experimental condition. (like beta-actin, GAPDH if it is human sample) and also coverage of spike-ins if there are any.

ADD REPLY
0
Entering edit mode

Thank you for your response. The result of FastQC was good. I will check the things you mentioned.

ADD REPLY
0
Entering edit mode

Thank you for your response. Let me look at the BAM file.

ADD REPLY

Login before adding your answer.

Traffic: 2601 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6