Question

htseq-count using overlapping reads assembled with PEAR

0

Entering edit mode

8.4 years ago

t2 ▴ 60

Hi all,

I'd like to generate gene count files for downstream DESeq2 analysis and hierarchical clustering from my RNA-seq data but I'm not sure how to go about it with what I have already done.

I have 100bp PE RNA-seq data generated with the TruSeq Stranded mRNA Library Prep Kit. Because there were many overlapping reads in the paired data (median insert size was 200bp), we used PEAR to merge PE read overlaps into a SE fastq file. Reads which were not overlapping were retained in two fastq files, R1 and R2 for PE mapping. The three files were then mapped using STAR. It should be noted that PE files were very small compared to SE files.

For PEAR generated SE:

STAR \
  --genomeDir /Resource/ \
  --readFilesIn ../wz745.assembled.fastq \
  --outFileNamePrefix ./wz745.assembled.fastq \
  --outTmpDir ./wz745.assembled.fastq/TMP \
  --readFilesCommand cat \
  --runThreadN 3 \
  --outSAMtype BAM SortedByCoordinate

For PEAR PE:

STAR \
  --genomeDir /Resource/ \
  --readFilesIn ../wz745.unassembled.forward.fastq ../wz745.unassembled.reverse.fastq \
  --outFileNamePrefix ./wz745.unassembled.forward.fastq \
  --outTmpDir ./wz745.unassembled.forward.fastq/TMP \
  --readFilesCommand cat \
  --runThreadN 3 \
  --outSAMtype BAM SortedByCoordinate

The resulting bam files were merged using samtools 'merge' into a single bam.

Next, I wanted to use htseq-count to generate .tsv count files for each gene. I've done this but one file failed (of 53) to produce output and gave a mate record missing warning Warning: Mate records missing for 1 records; first such record: This made me realize htseq-count probably doesn't support both paired-end and single-end data in the same file. So I separated the PE and SE reads as suggested previously in another thread. Error at the end of HT-seq Analysis and output.txt with 0 bytes in size

I then ran htseq-count for each individual file:

SE:

htseq-count \
  --format bam \
  --order pos \
  --mode intersection-strict \
  --stranded yes \
  --minaqual 1 \
  --type exon \
  --idattr gene_id \
  wz745.se.bam \
  /Resource/Homo_sapiens.GRCh37.75.gtf > wz745.se.yes.tsv

PE:

htseq-count \
  --format bam \
  --order pos \
  --mode intersection-strict \
  --stranded reverse \
  --minaqual 1 \
  --type exon \
  --idattr gene_id \
  wz745.pe.bam \
  /Resource/Homo_sapiens.GRCh37.75.gtf > wz745.pe.tsv

What I don't understand is when I run --stranded reverse or --stranded yes on the singular PE file I again get the mate record missing warning and a file of mostly 0 counts. This might be a problem with the PEAR output or a legitimate count since the input files are very small.

If I run --stranded yes on the SE file, I get a file with very few genes with counts above 0. But if I run the --stranded reverse on the SE file I get almost the same output as when I run --stranded reverse on the merged file with SE and PE together. I'm not sure which flags I should be using or if it is even necessary to split out the files into SE and PE from PEAR input format. I was naively thinking individual files SE --stranded yes and PE --stranded reverse should give me all the counts per gene when added up, but if I do that, I have almost no counts. I'm definitely doing something wrong.

Any advice is appreciated

Tesa

RNA-Seq • 2.8k views

ADD COMMENT • link updated 20 months ago by Ram 43k • written 8.4 years ago by t2 ▴ 60

Ram · Accepted Answer · 2015-11-23

2

Entering edit mode

8.4 years ago

h.mon 35k

First, STAR has a parameter to provide counts: --quantMode GeneCounts, see section 7 on page 14 of STAR manual, which states:

The counts coincide with those produced by htseq-count with default parameters.

Second, merging overlapping reads may be really helpful for assembling the transcriptome, but I am not sure it is wise to merge reads for quantification of expression: you are introducing biases which will be difficult to account for later.

ADD COMMENT • link updated 20 months ago by Ram 43k • written 8.4 years ago by h.mon 35k

0

Entering edit mode

Hi h.mon,

Thanks for your input. I will take a look at the STAR GeneCounts methodology. I will go back to the original fastqs and re-align with STAR without merging overlapping reads to avoid introducing bias.

Cheers,
Tesa

ADD REPLY • link updated 20 months ago by Ram 43k • written 8.4 years ago by t2 ▴ 60