Question: TCGA raw counts DexSeq
Newbie0 wrote:

Hi,

I have a question regarding the input files for DEXSeq. I've read the DEXSeq manual and see that it requires SAM/BAM files to generate the count files. The problem is that I don't have SAM/BAM files: I took RNA-Seq level 3 data from the TCGA site, which contains the chromosome, start and end position of each exon along with the raw counts for each exon. So I'm guessing I can skip the first two steps outlined in the DEXSeq manual because I already have my raw counts. I also have my GTF file. My question is: how can I use my GTF file to get the transcript IDs from the chromosome positions in the TCGA dataset, so that I can input them into DEXSeq and carry out my analysis? Apologies if this has an obvious solution, I'm new to bioinformatics.

rna-seq tcga dexseq • 1.0k views
modified 2.2 years ago by H.Hasani780 • written 2.2 years ago by Newbie0
H.Hasani780 wrote:

IMHO, you can annotate your RNA-Seq file. For example, extract the transcript entries from your GTF, then use bedtools (intersect, overlap, or closest) to map the two files onto each other. With some basic bash scripting you can then transform the results into the format you need.

hth

written 2.2 years ago by H.Hasani780

Should I convert my RNA-Seq file to a BED file first? And regarding bedtools, how would I use it to map the two files? Do you have any examples of how to do it? Apologies for the questions. I'm new to bioinformatics but have experience with basic bash commands and R, and this probably has an obvious solution for anyone with more experience.

written 2.2 years ago by Newbie0

no problem, no one was born knowing it all ;)

just go to their page and check how each tool works and which formats it accepts (they have plenty of examples). For example, intersect takes files in BAM/BED/GFF/VCF format, so you need to make sure both files (your counts and your GTF) follow one of those formats without losing the information you need (of course, they don't both need to be in the same format, e.g. both BED or both BAM).

modified 2.2 years ago • written 2.2 years ago by H.Hasani780
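As an illustration of the format conversion discussed above, here is a minimal Python sketch that turns a TCGA-style exon ID such as `chr1:11874-12227:+` into a BED6 line. The function name `tcga_exon_to_bed` is hypothetical, and the sketch assumes the TCGA coordinates are 1-based and inclusive, which should be checked against the data description before relying on it.

```python
def tcga_exon_to_bed(exon_id, score=0):
    """Convert an ID such as 'chr1:11874-12227:+' into a tab-separated BED6 line."""
    chrom, span, strand = exon_id.split(":")
    start, end = (int(x) for x in span.split("-"))
    # Assumption: the input is 1-based and inclusive; BED uses 0-based,
    # half-open intervals, so only the start is shifted by one.
    return "\t".join([chrom, str(start - 1), str(end), exon_id, str(score), strand])


print(tcga_exon_to_bed("chr1:11874-12227:+"))
```

Writing one such line per exon gives a file that could then be compared with the GTF, for instance with something like `bedtools intersect -a exons.bed -b annotation.gtf -wa -wb -s` (the file names here are placeholders). The original exon ID is kept in the name column so the counts can be joined back afterwards.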

Great, thank you, much appreciated :-) One more question: my GTF file needs to be converted to GFF for DEXSeq, which is fine because they provide a Python script to do that. Would I be better off extracting the exon coordinates and transcript IDs from the GFF file, and what would be the best way to do that? I've been looking online and there are a few suggestions using awk/grep, but I was wondering what, in your opinion, would be the best way. Thanks again.

written 2.2 years ago by Newbie0
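One possible alternative to awk/grep is a short Python script. The sketch below assumes a standard nine-column GTF in which exon records carry a `transcript_id "..."` attribute; the function name `exon_to_transcripts` and the example records are made up for illustration and are not taken from any real annotation.

```python
import re
from collections import defaultdict

TRANSCRIPT_RE = re.compile(r'transcript_id "([^"]+)"')


def exon_to_transcripts(gtf_lines):
    """Map (chrom, start, end, strand) of each exon record to its transcript IDs."""
    lookup = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # keep exon records only
        match = TRANSCRIPT_RE.search(fields[8])
        if match:
            key = (fields[0], int(fields[3]), int(fields[4]), fields[6])
            lookup[key].add(match.group(1))
    return lookup


# Hypothetical GTF records, for illustration only
example = [
    'chr1\tsrc\texon\t11874\t12227\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tsrc\texon\t11874\t12227\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
    'chr1\tsrc\tCDS\t12000\t12100\t.\t+\t0\tgene_id "G1"; transcript_id "T1";',
]
lookup = exon_to_transcripts(example)
print(sorted(lookup[("chr1", 11874, 12227, "+")]))
```

Because an open file yields lines, the same function can be called as `exon_to_transcripts(open("annotation.gtf"))` (placeholder file name). An exon shared by several transcripts maps to several IDs, so the lookup returns a set rather than a single value. Exact coordinate matching only works if the TCGA exons were defined from the same annotation version.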
Istvan Albert ♦♦ 81k wrote:

The manual is misleading: DEXSeq does not actually require BAM files to operate. The counting is performed by another tool, HTSeq, and the resulting counts are then loaded into DEXSeq.

What you could do is reshape your downloaded files into the same format as the count files produced by HTSeq, then continue with the instructions in the vignette from the step where it loads the count data.

http://www-huber.embl.de/HTSeq/doc/count.html

written 2.2 years ago by Istvan Albert ♦♦ 81k
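A rough sketch of the reshaping step suggested above: split a tab-separated TCGA-style matrix (one column per sample) into one list of (feature, count) pairs per sample, which can then be written out as two-column files. The function name `split_counts`, the sample names `S1`/`S2`, and the assumption that the matrix has two header rows and integer counts are all illustrative. The IDs in the first column would still need to correspond to the features in the annotation given to DEXSeq, which this sketch does not address.

```python
import csv
import io


def split_counts(matrix_text):
    """Return {sample: [(feature_id, count), ...]} from a tab-separated matrix."""
    rows = list(csv.reader(io.StringIO(matrix_text), delimiter="\t"))
    samples = rows[0][1:]  # first header row holds the sample names
    per_sample = {sample: [] for sample in samples}
    for row in rows[2:]:  # rows[1] is the second header row ("exon", "raw_counts", ...)
        if not row:
            continue  # ignore blank lines
        for sample, count in zip(samples, row[1:]):
            per_sample[sample].append((row[0], int(count)))
    return per_sample


# Hypothetical two-sample matrix using exon IDs of the kind found in TCGA level 3 files
example = (
    "Hybridization REF\tS1\tS2\n"
    "exon\traw_counts\traw_counts\n"
    "chr1:11874-12227:+\t29\t23\n"
    "chr1:12595-12721:+\t7\t10\n"
)
per_sample = split_counts(example)
for feature_id, count in per_sample["S1"]:
    print(f"{feature_id}\t{count}")
```

Each sample's pairs can be written to its own file, one `feature<TAB>count` line per row, which is the general shape of a per-sample count file.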

Hi, thanks for replying. I'll have a look at doing that, although I'm not sure it will work. The main issue I'm having is getting a transcript ID: each row in my data file corresponds to one exon, giving the chromosome, start and end, so I'm having trouble understanding how I can use my annotation file and dataset to get the corresponding transcript ID. Sorry if this sounds confusing. Thanks in advance.

written 2.2 years ago by Newbie0

This is the file I'm currently working with; I thought it might help to show what I mean. It's level 3 exon quantification data from TCGA.

Hybridization REF   TCGA-3C-AAAU-01A-11R-A41B-07  TCGA-3C-AALI-01A-11R-A41B-07  TCGA-3C-AALJ-01A-31R-A41B-07
exon                raw_counts                    raw_counts                    raw_counts                    raw_counts
chr1:11874-12227:+  29                            23                            18                            2
chr1:12595-12721:+  7                             10                            1                             0
chr1:12613-12721:+  7                             10                            1                             0
chr1:12646-12697:+  6                             5                             1                             0

I need to turn this into something I can use in DEXSeq, so I presume I need to convert the exon locations into transcript IDs in order to carry out my analysis, and that's what I'm having trouble with at the moment.

modified 2.2 years ago • written 2.2 years ago by Newbie0

Hi! I wonder if you were able to turn your raw exon count file into something you can use with DEXSeq? I also have a project with TCGA raw exon counts and don't know how to make it work with DEXSeq. Any suggestions you can give are much appreciated!

written 2.1 years ago by ren.yingxue0

Hi Newbie,

Actually, I don't think this conversion can work. DEXSeq first produces "counting bin" annotation data from the GTF annotation file using the script "dexseq_prepare_annotation.py", and then performs the differential analysis based on the reads mapped to these bins. Bins can be smaller than exons, because overlapping exons are split into non-overlapping bins. You therefore cannot recover the bin-level quantification from the level-3 exon quantification data downloaded from TCGA alone.

modified 15 months ago • written 15 months ago by Hao Zhang0
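To make the point about bins concrete, here is a simplified Python illustration of how overlapping exons break up into disjoint bins. It is not the logic of dexseq_prepare_annotation.py, only a sketch of the idea; the function name `disjoint_bins` is hypothetical, and the three intervals are the overlapping chr1 exons from the file excerpt posted earlier in the thread.

```python
def disjoint_bins(exons):
    """Split possibly overlapping (start, end) intervals (1-based, inclusive) into disjoint bins."""
    # A new bin begins at every exon start and right after every exon end.
    cuts = sorted({s for s, _ in exons} | {e + 1 for _, e in exons})
    bins = []
    for left, right in zip(cuts, cuts[1:]):
        # Keep the segment only if at least one exon covers it entirely.
        if any(s <= left and right - 1 <= e for s, e in exons):
            bins.append((left, right - 1))
    return bins


exons = [(12595, 12721), (12613, 12721), (12646, 12697)]
print(disjoint_bins(exons))
```

The three exons yield four bins: 12595-12612, 12613-12645, 12646-12697 and 12698-12721. A read counted once for the exon 12595-12721 could lie in any of those four bins, which is why per-exon counts cannot simply be redistributed to bins.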