Question: Problems getting read number per gene after sequencing
0
gravatar for ThePresident
2.8 years ago by
ThePresident140
ThePresident140 wrote:

I am dealing with Illumina genomic (not transcriptomic) paired-end sequencing. For some purpose, I'd like to have a number of reads that cover a particular feature (let's say a gene) defined in a BED or GFF file.

I have sorted SAM and BAM files.

My BED files looks like this:

RandomSeqSTD    500000  505001
RandomSeqSTD    1000000 1001001
RandomSeqSTD    1500000 1500501
RandomSeqSTD    2000000 2000251
RandomSeqSTD    2500000 2500101

My GFF file (which I made) looks like this:

##gff-version 3
##sequence-region RanSeqSTD 1 4659625
RanSeqSTD   random  CDS 500000  505000  .   .   .   gene_id=5000bp;transcript_id=same
RanSeqSTD   random  CDS 1000000 1001000 .   .   .   gene_id=1000bp;transcript_id=same
RanSeqSTD   random  CDS 1500000 1500500 .   .   .   gene_id=500bp;transcript_id=same
RanSeqSTD   random  CDS 2000000 2000250 .   .   .   gene_id=250bp;transcript_id=same
RanSeqSTD   random  CDS 2500000 2500100 .   .   .   gene_id=100bp;transcript_id=same

First I tried getting read number for each of these features with:

bedtools multicov -bams *.bam -bed BEDfile.bed

I got a read count and in theory this might be enough, however the problem is that I have limited control over what is considerate to be part of the feature. So I tried using HTSeq with the sorted SAM file and the above GFF file with the following command line:

htseq-count -s no -m union -i gene_id -t CDS STD100_sorted.sam RanSeqGFF.gff3 > HTSeq.txt

However, I keep getting count "0" for all gene_id features.

Any idea why HTSeq gives me this and any idea how this can be done differently?

Note: I am fully aware that HTSeq was not developed for genomic sequencing and thus should not be used for this.

Thanks TP

multicov bedtools htseq • 1.2k views
ADD COMMENTlink modified 2.8 years ago by geek_y11k • written 2.8 years ago by ThePresident140

I think, gene_id and transcript_id need to be separated by a space (no '=' sign) and the actual IDs are escaped using " characters:

gene_id "ENSG00000223972"; transcript_id "ENST00000456328";

That's at least how I parse these IDs in my own tool alfred which should do the read counting you are looking for (requires a position-sorted BAM file):

/src/alfred count -i gene_id -f CDS -g RanSeqGFF.gtf.gz STD100_sorted.bam

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by trausch1.5k

What you are describing is a GTF file, I believe. It's worth trying though. I'll update the comment when I get the result.

ADD REPLYlink written 2.8 years ago by ThePresident140
1
gravatar for glihm
2.8 years ago by
glihm620
France
glihm620 wrote:

You did not specified the sorting order of your pair-end alignments.

  -r {pos,name}, --order {pos,name}
                    'pos' or 'name'. Sorting order of <alignment_file>
                    (default: name). Paired-end sequencing data must be
                    sorted either by position or by read name, and the
                    sorting order must be specified. Ignored for single-
                    end data.

I guess that you did a default sorting like:

samtools sort -o sorted.sam input.sam

So, you probably need to add "-r" option like:

htseq-count -r pos -s no -m union -i gene_id -t CDS STD100_sorted.sam RanSeqGFF.gff3 > HTSeq.txt

If you already sorted your sam file with names, and not positions, you should check the references used in the sam file and the seqid of you GFF3 formatted file.

ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by glihm620
1

You were right! Initially, I did sort by the name however I did a rookie mistake of not looking at my reference and seqid. In my GFF file I use "RanSeqSTD" but my sam file harbors "RandomSeqSTD". Anyway, problem solved. Thanks!

ADD REPLYlink written 2.8 years ago by ThePresident140

we all did ! :D Happy to see problem solved ! ;)

ADD REPLYlink written 2.8 years ago by glihm620
1
gravatar for geek_y
2.8 years ago by
geek_y11k
Barcelona
geek_y11k wrote:

Just convert the BED to SAF and use featureCounts. Its pretty straightforward.

awk -v OFS="\t" 'BEGIN {"GeneID","Chr","Start","End","Strand"} { print "Gene_"NR, $1, $2, $3, "."}' in.bed > out.saf
ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by geek_y11k

featureCounts seems very similar to bedtools multicov. My only concern with these tools is that they count a read as long as it overlaps the feature (gene, exon etc.) by 1bp which in my opinion is a little bit too loose as cutoff. Therefore, the idea to use HTSeq as it seems more stringent. However, it might be enough for what I want to do. Thanks!

ADD REPLYlink written 2.8 years ago by ThePresident140
1

You can always change the default parameters −−minOverlap and also if you want to make the quantification bit stringent by utilizing other arguments like −−fracOverlap , −−largestOverlap ... in my personal opinion featureCounts is more flexible by having options for many parameters and also more efficient compared to HTSeq.

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by EagleEye6.6k

I missed these options when I read about featureCounts. It looks good indeed, I'll give it a try. Thanks!

ADD REPLYlink written 2.8 years ago by ThePresident140
1

Definitely its not very similar to multiCov

ADD REPLYlink written 2.8 years ago by geek_y11k

Why do you say so? They perform identical function which is to count the number of reads that overlap a set of features defined by genomic boundaries. Additionally

featureCounts: "A read is said to overlap a feature if at least one read base is found to overlap the feature" multicovs: "-f argument : Minimum overlap required as a fraction of each A. Default is 1E-9 (i.e., 1bp)."

ADD REPLYlink written 2.8 years ago by ThePresident140
1

Performing the identical function, doesn't mean they both are very similar. The limitations of multiCov and the options in featureCounts make a big difference,

ADD REPLYlink written 2.8 years ago by geek_y11k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1170 users visited in the last hour