I got a set of human RNA fastq files and was asked to convert them to read counts to feed into edgeR package for further analysis. I am very new to RNA analysis, and I have done a lot of researches online to find out the process. However, I am still not confirmed about the correctness of my process. Below is how I convert pair-end fastq files to read counts and I want to confirm whether I am in the right track.
First, I am using bowtie2 to get .sam files from paired fastq files.The reference I am using is H.sapiens, NCBI GRCh38 provided from bowtie2 website link. Following is the code to extract .sam files from 2 fastq files obtained from the same cells:
bowtie2 -p16 -x -1 A.fastq.gz -2 B.fastq.gz -S X.sam
The alignment result summary is as follows:
3014347 reads; of these:
3014347 (100.00%) were paired; of these:
1417553 (47.03%) aligned concordantly 0 times
812147 (26.94%) aligned concordantly exactly 1 time
784647 (26.03%) aligned concordantly >1 times
----
1417553 pairs aligned concordantly 0 times; of these:
232981 (16.44%) aligned discordantly 1 time
----
1184572 pairs aligned 0 times concordantly or discordantly; of these:
2369144 mates make up the pairs; of these:
1790438 (75.57%) aligned 0 times
319377 (13.48%) aligned exactly 1 time
259329 (10.95%) aligned >1 times
70.30% overall alignment rate
May I know whether this a good alignment?
Then the generated sam files are processed using samtools with the following steps:
samtools view -h -b X.sam > X.bam
samtools sort -o X_sorted.bam -O bam -n X.bam
samtools view X_sorted.bam > X_sorted.sam
Then I am using htseq to extract read counts from sam files:
htseq-count -s no -a 10 X_sorted.sam hg38.gtf > X.count
The index hg38.gtf using here is obtained from UCSC table with the following process:
UCSC->Table->clade(Mammal)->genome(Human)->assembly(DEC.2013(GRch38/hg38))->group(Genes and Gene Prediction Tracks)->track(GENCODE 22)->table(knownGene)->output format(selected fields from primary and related tables)->output file: GTF->get output
Below is one of the counts summary:
__no_feature 89693
__ambiguous 395107
__too_low_aQual 757839
__not_aligned 900386
__alignment_not_unique 0
Is this considerably good or not.
Thank you for any help in advance!
Thanks for the information and suggestions. I will look into these!