Question: Extract read counts from fastq files
gravatar for feng0049
2.8 years ago by
feng00490 wrote:

I got a set of human RNA fastq files and was asked to convert them to read counts to feed into edgeR package for further analysis. I am very new to RNA analysis, and I have done a lot of researches online to find out the process. However, I am still not confirmed about the correctness of my process. Below is how I convert pair-end fastq files to read counts and I want to confirm whether I am in the right track.

First, I am using bowtie2 to get .sam files from paired fastq files.The reference I am using is H.sapiens, NCBI GRCh38 provided from bowtie2 website link. Following is the code to extract .sam files from 2 fastq files obtained from the same cells:

bowtie2 -p16 -x -1 A.fastq.gz -2 B.fastq.gz -S X.sam

The alignment result summary is as follows:

3014347 reads; of these:
  3014347 (100.00%) were paired; of these:
    1417553 (47.03%) aligned concordantly 0 times
    812147 (26.94%) aligned concordantly exactly 1 time
    784647 (26.03%) aligned concordantly >1 times
    1417553 pairs aligned concordantly 0 times; of these:
      232981 (16.44%) aligned discordantly 1 time
    1184572 pairs aligned 0 times concordantly or discordantly; of these:
      2369144 mates make up the pairs; of these:
        1790438 (75.57%) aligned 0 times
        319377 (13.48%) aligned exactly 1 time
        259329 (10.95%) aligned >1 times
70.30% overall alignment rate

May I know whether this a good alignment?

Then the generated sam files are processed using samtools with the following steps:

samtools view -h -b X.sam > X.bam
samtools sort -o X_sorted.bam -O bam -n X.bam
samtools view X_sorted.bam > X_sorted.sam

Then I am using htseq to extract read counts from sam files:

htseq-count -s no -a 10 X_sorted.sam hg38.gtf > X.count

The index hg38.gtf using here is obtained from UCSC table with the following process:

 UCSC->Table->clade(Mammal)->genome(Human)->assembly(DEC.2013(GRch38/hg38))->group(Genes and Gene Prediction Tracks)->track(GENCODE 22)->table(knownGene)->output format(selected fields from primary and related tables)->output file: GTF->get output

Below is one of the counts summary:

__no_feature     89693
__ambiguous 395107
__too_low_aQual 757839
__not_aligned     900386
__alignment_not_unique  0

Is this considerably good or not.

Thank you for any help in advance!

rna-seq alignment • 2.2k views
ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by feng00490
gravatar for ivivek_ngs
2.8 years ago by
Seattle,WA, USA
ivivek_ngs4.8k wrote:

For learning purpose for the beginning it is fine however you do not to create an intermediate sam file. You can directly map with bowtie2 and create bam file which you can sort and then use htseq or featurecounts to extract count data for each sample. If you have properly extracted the gene annotation fike (gff/gtf) file and you have series of samples in bam format then use all of them to construct a matrix of count data for transcript id which can be then summarized to gene count using tximport.

Take a look at the below links: rnaseqGene

Generating Counts Data From Fastq Sequence Files

Extracting Read Count For Each Gene/Exon From Rna-Seq Bam Files

You have to search properly the tutorials online and understand each steps carefully.

Try to follow this publication and this one as well for new methods for quantification, benchmarking and consequences in DE analysis, should be able to give you information of latest tools in use.

ADD COMMENTlink written 2.8 years ago by ivivek_ngs4.8k

Thanks for the information and suggestions. I will look into these!

ADD REPLYlink written 2.8 years ago by feng00490
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1636 users visited in the last hour