Hi all,
I'd like to generate gene count files for downstream DESeq2 analysis and heirarchical clustering from my RNA-seq data but I'm not sure how to go about it with what I have already done.
I have 100bp PE RNA-seq data generated with the TruSeq Stranded mRNA Library Prep Kit. Because there were many overlapping reads in the paired data (median insert size was 200bp), we used PEAR to merge PE read overlaps into a SE fastq file. Reads which were not overlapping were retained in two fastq files, R1 and R2 for PE mapping. The three files were then mapped using STAR. It should be noted that PE files were very small compared to SE files.
for PEAR generated SE:
STAR --genomeDir /Resource/ --readFilesIn ../wz745.assembled.fastq --outFileNamePrefix ./wz745.assembled.fastq --outTmpDir ./wz745.assembled.fastq/TMP --readFilesCommand cat --runThreadN 3 --outSAMtype BAM SortedByCoordinate
for PEAR PE:
STAR --genomeDir /Resource/ --readFilesIn ../wz745.unassembled.forward.fastq ../wz745.unassembled.reverse.fastq --outFileNamePrefix ./wz745.unassembled.forward.fastq --outTmpDir ./wz745.unassembled.forward.fastq/TMP --readFilesCommand cat --runThreadN 3 --outSAMtype BAM SortedByCoordinate
The resulting bam files were merged using samtools 'merge' into a single bam.
Next, I wanted to use htseq-count to generate .tsv count files for each gene. I've done this but one file failed (of 53) to produce output and gave a mate record missing warning Warning: Mate records missing for 1 records; first such record:
This made me realize htseq-count probably doesn't support both paired-end and single-end data in the same file. So I separated the PE and SE reads as suggested previously in another thread. Error at the end of HT-seq Analysis and output.txt with 0 bytes in size
I then ran htseq-count for each individual file:
SE:
htseq-count --format bam --order pos --mode intersection-strict --stranded yes --minaqual 1 --type exon --idattr gene_id wz745.se.bam /Resource/Homo_sapiens.GRCh37.75.gtf > wz745.se.yes.tsv
PE:
htseq-count --format bam --order pos --mode intersection-strict --stranded reverse --minaqual 1 --type exon --idattr gene_id wz745.pe.bam /Resource/Homo_sapiens.GRCh37.75.gtf > wz745.pe.tsv
What I don't understand is when I run --stranded reverse or --stranded yes on the singular PE file I again get the mate record missing warning and a file of mostly 0 counts. This might be a problem with the PEAR output or a legitimate count since the input files are very small.
If I run --stranded yes on the SE file, I get a file with very few genes with counts above 0. But if I run the --stranded reverse on the SE file I get almost the same output as when I run --stranded reverse on the merged file with SE and PE together. I'm not sure which flags I should be using or if it is even necessary to split out the files into SE and PE from PEAR input format. I was naively thinking individual files SE --stranded yes and PE --stranded reverse should give me all the counts per gene when added up, but if I do that, I have almost no counts. I'm definitely doing something wrong.
Any advice is appreciated,
Tesa