Question: Problems getting read number per gene after sequencing
0
gravatar for ThePresident
22 months ago by
ThePresident110
ThePresident110 wrote:

I am dealing with Illumina genomic (not transcriptomic) paired-end sequencing. For some purpose, I'd like to have a number of reads that cover a particular feature (let's say a gene) defined in a BED or GFF file.

I have sorted SAM and BAM files.

My BED files looks like this:

RandomSeqSTD    500000  505001
RandomSeqSTD    1000000 1001001
RandomSeqSTD    1500000 1500501
RandomSeqSTD    2000000 2000251
RandomSeqSTD    2500000 2500101

My GFF file (which I made) looks like this:

##gff-version 3
##sequence-region RanSeqSTD 1 4659625
RanSeqSTD   random  CDS 500000  505000  .   .   .   gene_id=5000bp;transcript_id=same
RanSeqSTD   random  CDS 1000000 1001000 .   .   .   gene_id=1000bp;transcript_id=same
RanSeqSTD   random  CDS 1500000 1500500 .   .   .   gene_id=500bp;transcript_id=same
RanSeqSTD   random  CDS 2000000 2000250 .   .   .   gene_id=250bp;transcript_id=same
RanSeqSTD   random  CDS 2500000 2500100 .   .   .   gene_id=100bp;transcript_id=same

First I tried getting read number for each of these features with:

bedtools multicov -bams *.bam -bed BEDfile.bed

I got a read count and in theory this might be enough, however the problem is that I have limited control over what is considerate to be part of the feature. So I tried using HTSeq with the sorted SAM file and the above GFF file with the following command line:

htseq-count -s no -m union -i gene_id -t CDS STD100_sorted.sam RanSeqGFF.gff3 > HTSeq.txt

However, I keep getting count "0" for all gene_id features.

Any idea why HTSeq gives me this and any idea how this can be done differently?

Note: I am fully aware that HTSeq was not developed for genomic sequencing and thus should not be used for this.

Thanks TP

multicov bedtools htseq • 978 views
ADD COMMENTlink modified 21 months ago by geek_y9.6k • written 22 months ago by ThePresident110

I think, gene_id and transcript_id need to be separated by a space (no '=' sign) and the actual IDs are escaped using " characters:

gene_id "ENSG00000223972"; transcript_id "ENST00000456328";

That's at least how I parse these IDs in my own tool alfred which should do the read counting you are looking for (requires a position-sorted BAM file):

/src/alfred count -i gene_id -f CDS -g RanSeqGFF.gtf.gz STD100_sorted.bam

ADD REPLYlink modified 22 months ago • written 22 months ago by trausch1.2k

What you are describing is a GTF file, I believe. It's worth trying though. I'll update the comment when I get the result.

ADD REPLYlink written 22 months ago by ThePresident110
1
gravatar for glihm
22 months ago by
glihm600
France
glihm600 wrote:

You did not specified the sorting order of your pair-end alignments.

  -r {pos,name}, --order {pos,name}
                    'pos' or 'name'. Sorting order of <alignment_file>
                    (default: name). Paired-end sequencing data must be
                    sorted either by position or by read name, and the
                    sorting order must be specified. Ignored for single-
                    end data.

I guess that you did a default sorting like:

samtools sort -o sorted.sam input.sam

So, you probably need to add "-r" option like:

htseq-count -r pos -s no -m union -i gene_id -t CDS STD100_sorted.sam RanSeqGFF.gff3 > HTSeq.txt

If you already sorted your sam file with names, and not positions, you should check the references used in the sam file and the seqid of you GFF3 formatted file.

ADD COMMENTlink modified 22 months ago • written 22 months ago by glihm600
1

You were right! Initially, I did sort by the name however I did a rookie mistake of not looking at my reference and seqid. In my GFF file I use "RanSeqSTD" but my sam file harbors "RandomSeqSTD". Anyway, problem solved. Thanks!

ADD REPLYlink written 21 months ago by ThePresident110

we all did ! :D Happy to see problem solved ! ;)

ADD REPLYlink written 21 months ago by glihm600
1
gravatar for geek_y
21 months ago by
geek_y9.6k
Barcelona/CRG/London/Imperial
geek_y9.6k wrote:

Just convert the BED to SAF and use featureCounts. Its pretty straightforward.

awk -v OFS="\t" 'BEGIN {"GeneID","Chr","Start","End","Strand"} { print "Gene_"NR, $1, $2, $3, "."}' in.bed > out.saf
ADD COMMENTlink modified 21 months ago • written 21 months ago by geek_y9.6k

featureCounts seems very similar to bedtools multicov. My only concern with these tools is that they count a read as long as it overlaps the feature (gene, exon etc.) by 1bp which in my opinion is a little bit too loose as cutoff. Therefore, the idea to use HTSeq as it seems more stringent. However, it might be enough for what I want to do. Thanks!

ADD REPLYlink written 21 months ago by ThePresident110
1

You can always change the default parameters −−minOverlap and also if you want to make the quantification bit stringent by utilizing other arguments like −−fracOverlap , −−largestOverlap ... in my personal opinion featureCounts is more flexible by having options for many parameters and also more efficient compared to HTSeq.

ADD REPLYlink modified 21 months ago • written 21 months ago by EagleEye6.3k

I missed these options when I read about featureCounts. It looks good indeed, I'll give it a try. Thanks!

ADD REPLYlink written 21 months ago by ThePresident110
1

Definitely its not very similar to multiCov

ADD REPLYlink written 21 months ago by geek_y9.6k

Why do you say so? They perform identical function which is to count the number of reads that overlap a set of features defined by genomic boundaries. Additionally

featureCounts: "A read is said to overlap a feature if at least one read base is found to overlap the feature" multicovs: "-f argument : Minimum overlap required as a fraction of each A. Default is 1E-9 (i.e., 1bp)."

ADD REPLYlink written 21 months ago by ThePresident110
1

Performing the identical function, doesn't mean they both are very similar. The limitations of multiCov and the options in featureCounts make a big difference,

ADD REPLYlink written 21 months ago by geek_y9.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1890 users visited in the last hour