Question

HTseq low counts

0

Entering edit mode

4.0 years ago

aka ▴ 10

Hello community,

I'm coming to you because I'm a bit lost. I made a snakemake pipeline to process RNAseq data which are Truseq Stranded Illumina.

I have done the classic processing FASTQC, TRIMMOMATIC, FASTQC2, BWAmem2, and it is for the count part that I have done with HTseq and with salmon where I have a problem.

I have very few counts, barely 20-30%.

For Htseq I tried sorting with samtools sort by name, position, without sorting and I always get around 26% except once I made a mistake and I sorted with samtools by position and counted with HTseq count by name and then I got 50% mapping. I guess it must be wrong...

Here is my command:

rule count_matrix:
    input:
        gff="../resources/reference/gff_gtf/25K_for_greg.gff",
        alg_file= "../results/sam_file/{sample}_sort_mapping.sam"
    output:
        inter= "../results/stat/Intersec/{sample}_count_matrix_position.csv",
    message:
        """
        --- Count genes per reads on {wildcards.sample} in process ---
        """
    shell:
        "htseq-count -m intersection-nonempty -f sam -t CDS -r pos -s reverse -i gene_id {input.alg_file} {input.gff} > {output.inter} "

For salmon I have this command:

rule count_matrix:
    input:
        alg_file="../results/mapped_reads/bwa_mem/{sample}_sort_mapping.bam",
        index = "../resources/reference/Qrob_H2.3_Genes_v2.2_20161004.CDS_nuc.fsa",
    output:
       temp("../results/stat/salmon/{sample}_mock.txt")
    params:
        out2= directory("../results/stat/salmon/{sample}_A"),
    threads: 10
    resources:
        mem_mb=25000
    message:
        """
        --- Count genes per reads {wildcards.sample} with salmon in process ---
        """
    shell:
        "salmon quant -t {input.index} -l A -a {input.alg_file} -o {params.out2} -p {threads} && touch {output} {log}"

I get with the auto library the ISR library 19% of mapping....

To check, I have generated my flagstat with this command:

rule align:
    priority: 0
    input:
        reference=config["genome_ref"],
        reads=["../results/trimmed/{sample}_R1_trimmed.fastq.gz","../results/trimmed/{sample}_R2_trimmed.fastq.gz"],
        faidx=config["genome_ref_faidx"]
    output:
        final_bam="../results/mapped_reads/bwa_mem/{sample}_sort_mapping.bam",
        flag= "../results/mapped_reads/bwa_mem/flagstat/{sample}_sort_mapping.bam.flagstat"
    message:
        """
        Mapping with BWA mem2 : {wildcards.sample} on ref
        """
    threads: 10
    resources:
        mem_mb=25000
    shell:
        "bwa-mem2 mem -M -t {threads} -v 2 {input.reference} {input.reads} | samtools view -q 20 -f 2 -F 3840 --threads {threads} -Sb -> {output.final_bam} && samtools flagstat {output.final_bam} > {output.flag} "

The flagstat gives me a good alignment 50%: The flagstat is only the proper pairs. On R1+R2 in total after trimming I had 171095182 reads and on the flagstat I have 91366309 .

91366309 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
91366309 + 0 mapped (100.00% : N/A)
91366309 + 0 paired in sequencing
45695163 + 0 read1
45671146 + 0 read2
91366309 + 0 properly paired (100.00% : N/A)
91366309 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

My BAM file is like this

@HD VN:1.6 SO:coordinate
@SQ SN:Qrob_P0000010.2 LN:693
@SQ SN:Qrob_P0000020.2 LN:1188
@SQ SN:Qrob_P0000030.2 LN:660
@SQ SN:Qrob_P0000290.2 LN:204
@SQ SN:Qrob_P0000440.2 LN:561
@SQ SN:Qrob_P0000460.2 LN:696

I really paid attention to the option for the sorting of the BAM, the direction: reverse in my case ect...

I confess I don't understand, if someone could enlighten me

Thanks in advance and have a nice day,

Aka

snakemake RNASeq htseq • 687 views

ADD COMMENT • link 4.0 years ago by aka ▴ 10