Question

Extract read counts from fastq files

0

Entering edit mode

7.8 years ago

feng0049 • 0

I got a set of human RNA fastq files and was asked to convert them to read counts to feed into edgeR package for further analysis. I am very new to RNA analysis, and I have done a lot of researches online to find out the process. However, I am still not confirmed about the correctness of my process. Below is how I convert pair-end fastq files to read counts and I want to confirm whether I am in the right track.

First, I am using bowtie2 to get .sam files from paired fastq files.The reference I am using is H.sapiens, NCBI GRCh38 provided from bowtie2 website link. Following is the code to extract .sam files from 2 fastq files obtained from the same cells:

bowtie2 -p16 -x -1 A.fastq.gz -2 B.fastq.gz -S X.sam

The alignment result summary is as follows:

3014347 reads; of these:
  3014347 (100.00%) were paired; of these:
    1417553 (47.03%) aligned concordantly 0 times
    812147 (26.94%) aligned concordantly exactly 1 time
    784647 (26.03%) aligned concordantly >1 times
    ----
    1417553 pairs aligned concordantly 0 times; of these:
      232981 (16.44%) aligned discordantly 1 time
    ----
    1184572 pairs aligned 0 times concordantly or discordantly; of these:
      2369144 mates make up the pairs; of these:
        1790438 (75.57%) aligned 0 times
        319377 (13.48%) aligned exactly 1 time
        259329 (10.95%) aligned >1 times
70.30% overall alignment rate

May I know whether this a good alignment?

Then the generated sam files are processed using samtools with the following steps:

samtools view -h -b X.sam > X.bam
samtools sort -o X_sorted.bam -O bam -n X.bam
samtools view X_sorted.bam > X_sorted.sam

Then I am using htseq to extract read counts from sam files:

htseq-count -s no -a 10 X_sorted.sam hg38.gtf > X.count

The index hg38.gtf using here is obtained from UCSC table with the following process:

 UCSC->Table->clade(Mammal)->genome(Human)->assembly(DEC.2013(GRch38/hg38))->group(Genes and Gene Prediction Tracks)->track(GENCODE 22)->table(knownGene)->output format(selected fields from primary and related tables)->output file: GTF->get output

Below is one of the counts summary:

__no_feature     89693
__ambiguous 395107
__too_low_aQual 757839
__not_aligned     900386
__alignment_not_unique  0

Is this considerably good or not.

Thank you for any help in advance!

RNA-Seq alignment • 5.6k views

ADD COMMENT • link 7.8 years ago by feng0049 • 0

score 1 · Answer 1 · 2016-06-27

For learning purpose for the beginning it is fine however you do not to create an intermediate sam file. You can directly map with bowtie2 and create bam file which you can sort and then use htseq or featurecounts to extract count data for each sample. If you have properly extracted the gene annotation fike (gff/gtf) file and you have series of samples in bam format then use all of them to construct a matrix of count data for transcript id which can be then summarized to gene count using tximport.

Take a look at the below links: rnaseqGene

Generating Counts Data From Fastq Sequence Files

Extracting Read Count For Each Gene/Exon From Rna-Seq Bam Files

You have to search properly the tutorials online and understand each steps carefully.

Try to follow this publication and this one as well for new methods for quantification, benchmarking and consequences in DE analysis, should be able to give you information of latest tools in use.