Low assigned alignments
1
0
Entering edit mode
6 weeks ago
Simon Ahn ▴ 10

Basecalls performed using CASAVA version v1.8.2 Trimmed reads with fastx_quality_trimmer 0.0.13 with a quality treshhold of 18 and a length of 20 Aligned with Bowtie 2.1.0 and Tophat 2.0.10 using Gencode v19 junctions Samtools 0.1.19-44428cd to make a bam, sort, index Raw counts were generated using htseq_count 0.6.1 using the UCSC HG19 known gene transcripts from Illumina iGenomes edgeR 3.2.4 in R 3.1.0 were used to normalize counts between samples and to make comparisons between groups. Genome_build: hg19 Supplementary_files_format_and_content: Normalized reads in CPM for each detected gene along with differential gene expression log2 (fold change) and p-values for each time point

This is the data processing detail for a paper. I downloaded fastq files from this paper and used STAR alignment and featureCounts to get my read matrix with the reference of Homo_sapiens.GRCh38 from ENSEMBL. However, with featureCounts, only 3% reads were aligned. (This is single-end fastq file so I used -s 0, -s 1, -s 2 option for featureCounts but all results were bad) What went wrong with my pipeline? Is it related to reference genome that I used?

fastq-alignment RNAseq raw-count • 408 views
ADD COMMENT
0
Entering edit mode

Whatever version of the genome you are using, you should see comparable mapping rates and assigned rates. But, and here is the point, you must use the same genome build for reference genome and GFF/GTF annotation file. I am guessing this is the most likely explanation of what went wrong. If you are using ENSEMBL, download both, assembly and GFF file from the same source. Check also that both files use the same identifiers for chromosomes ("1" vs. "chr1" is the archetypical source of confusion).

ADD REPLY
0
Entering edit mode

Thanks to your comment. But I downloaded same genome build and it worked fine with other dataset.

ADD REPLY
0
Entering edit mode

Possibly, if you could provide all the details, exact SRA id of dataset, complete download paths of assembly and gff (ftp://...), and all parameters used with STAR and featureCounts, that might become reproducible. I could offer to give it a short try tomorrow.

ADD REPLY
0
Entering edit mode

I really appreciate your help. Here are some specific details.

  1. fastq downloaded from SRA: SRP050036 using fasterq-dump (only used SRR1657054 for test)
  2. ENSEMBL reference file created with STAR

    FASTA: http://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz

    GTF: http://ftp.ensembl.org/pub/release-104/gtf/homo_sapiens/Homo_sapiens.GRCh38.104.gtf.gz

    STAR --runThreadN 64 --runMode genomeGenerate --genomeDir [output folder name] \ --genomeFastaFiles [Homo_sapiens.GRCh38.dna.toplevel.fa path] \ --sjdbGTFfile [Homo_sapiens.GRCh38.104.gtf path]

  3. STAR aligned (version: 2.7.9a)

    STAR --runThreadN 32 --genomeDir [ENSEMBL reference file path] \ --readFilesIn [SRR1657054.fastq path] \ --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts \ --outFileNamePrefix [output file name]

  4. Extract count matrix using featureCounts

    featureCounts -T 10 -s 0 -a [Homo_sapiens.GRCh38.104.gtf path]\ -o [output file name] \ [SRR1657054Aligned.sortedByCoord.out.bam path]

    (I tried -s 1 and -s 2 also, but didn't get better result)

ADD REPLY
1
Entering edit mode
6 weeks ago
KoppesEA ▴ 70

From the text copied in your post it appears you are referring to this data set from GEO https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1550110

It sounds like you tried to run each -s option which specifies specifies single-end read strand specificity. You can directly check this with ReSQC infer-experiment.py http://rseqc.sourceforge.net/#infer-experiment-py

use some code like

infer_experiment.py -r /YourDIR/Homo_sapiens.GRCh38.104.bed -i /BAMDIR/YourBAMalignment.bam

however even if you had the reads in wrong direction with the -s 0 (unstranded) you should have gotten 40-50% mapped to features and then with either -s 1 or -s 2 giving a higher or lower % so like the comment above check that your reference genome and annotation file match. Perhaps even try another read count program like HTSeq.

ADD COMMENT

Login before adding your answer.

Traffic: 2479 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6