HTseq for filtering genes by expression given custom transcript file
1
0
Entering edit mode
8.5 years ago

Hi everyone,

I am a newbie in the field of Bioinformatics and have very little experience and knowledge of tools and applications used in pipelines. I apologize if my question may come off as inadequate. I will clarify the question or give more information as requested. I know I have omitted some of the steps in my workflow below. I have painstakingly spent a lot of time trying to figure out how to reach the next step in my RNA-seq pipeline. I have ran through most of the "conventional" pipeline workflow.

I am looking for novel transcripts. I have paired-end reads from timepoint samples. I merged the bam files resulting from the alignment to increase signal of counts. I then proceeded to use Cufflinks. I want to utilize the resulting transcript file from Cuffcompare which I have filtered to find novel transcripts, for the counts which only show increased expression in genes across the timepoints (used IGV ). At this point I want to filter for those counts that have expression increase in these unknown(potential novel) transcripts. I was told I could use HTseq from python to do the job by using the BAM and GTF files. But I cannot seem to figure out which method to use. I must mention that I have indeed looked at http://www-huber.embl.de/HTSeq/doc/overview.html, perhaps I am missing something or am I even in the right direction? Can anyone please point me to the right direction?

In a gist my question is: How do I use HTseq to assign counts to a custom transcript file based on increased expression across time points?

I appreciate your time and I hope to learn more from your responses

Thank you in advance!

RNA-Seq Novel-transcripts HTseq • 2.0k views
ADD COMMENT
1
Entering edit mode
8.5 years ago

You have the steps slightly confused. They are:

  1. Assign counts for EVERYTHING.
  2. Perform differential expression
  3. See which genes are up-regulated

If you only care about novel transcripts then I would recommend filtering everything except them out after performing library size normalization, since I suspect you're hoping to see a LOT of novel things changing in the same direction and not doing it this way could mask that effect.

BTW, use featureCounts rather than htseq-count, it's much faster.

ADD COMMENT
0
Entering edit mode

Oh sorry, I did not clarify too well. Steps 1 to 3 have already been carried out by a lab partner who used HTSeq to get the counts and DESeq2 for the differential expression analysis (I should note: Gfold and Cuffdiff were also utilized but we were not convinced of the results). After the above I returned to the original sorted aligned bam files merged them and ran the above workflow. I used cuffcompare to compare the refseq.gtf with our data transcript file. I obtained the transcripts.gtf.tmap and filtered by unknown intergenic transcripts, FPKM and base-pair length. I then looked at these transcripts on IGV, and loaded with them original bam files. I saw most transcripts had no gene expression for specific time points. I now just want to select those transcripts which have increase or significant gene expression across all timepoints.

Sorry for not being clear

ADD REPLY
1
Entering edit mode

Ah, HTSeq-count and featureCounts are far from ideal for transcript level counting. If you find cuffdiff's transcript-level results funky then I would do the following:

  1. Use cuffmerge to get a merged GTF file.
  2. Use gtf2fasta or gtf_to_fasta (I think that one comes with tophat) to make a transcriptome fasta file.
    1. Make sure to include everything in this, though make a note of the unknown intergenic transcripts (N.B., in my experience these tend to be repeat regions, but YMMV).
  3. Use Salmon or Kallisto to align the original fastq files against this. Note that these are VASTLY faster than what you've used already, so don't worry that this will take a lot of time.
  4. Use Sleuth, presumably filter out the non-novel stuff at some point (I've never used it myself, so I can't advise you on exactly when it's convenient to do so...though it's always possible).

This isn't directly answering your exact question, but I think you'll get better results this way.

ADD REPLY
0
Entering edit mode

Thank you very much! I will give this a try!

ADD REPLY

Login before adding your answer.

Traffic: 2073 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6