TCGA raw counts DexSeq
2
0
5.0 years ago
Newbie • 0

Hi,

I have a question about input files for DEXSeq. I've read the DEXSeq manual and see that it requires SAM/BAM files to generate the count files. The problem is that I have no SAM/BAM files: I took RNASeq level 3 data from the TCGA site, which contains the chromosome, start, and end position of each exon along with raw counts for each exon. So I'm guessing I can skip the first two steps outlined in the DEXSeq manual because I already have my raw counts. I also have my gtf file. My question is: how can I use my gtf file to get the transcript IDs from the chromosome positions in the TCGA dataset, so that I can input them into DEXSeq and carry out my analysis? Apologies if this has an obvious solution, I'm new to bioinformatics.

RNA-Seq TCGA DEXSEQ • 1.9k views
1
5.0 years ago
H.Hasani ▴ 990

IMHO, you can annotate your RNA-Seq file. For example, extract the transcript entries from your gtf, then use bedtools (intersect, overlap, or closest) to map the two files onto each other. With some basic bash scripting you can then transform the results into the format you need.

hth

0

Should I convert my RNA-seq file to a bed file first? And regarding bedtools, how would I use it to map both files? Do you have any examples of how to do it? Apologies for the questions; I'm new to bioinformatics but have experience with basic bash commands and R, and this is probably an obvious solution to anyone with more experience.

0

no problem, no one was born knowing it all ;)

Just go to their page and check how each tool works and which formats it accepts (they have plenty of examples). For example, intersect takes files in BAM/BED/GFF/VCF format, so you need to make sure both files (your counts and gtf) follow one of those formats without losing the information you need (of course, they don't both need to be bed or bam, etc.).
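
To make the format point concrete, here is a minimal Python sketch that turns TCGA-style exon identifiers such as `chr1:11874-12227:+` into BED6 lines that bedtools can read. The function name is made up for illustration, and the sketch assumes the TCGA coordinates are 1-based and inclusive (BED starts are 0-based), so please check that against your own data.

```python
def tcga_exon_to_bed(exon_id):
    """Convert an exon ID like 'chr1:11874-12227:+' into one BED6 line."""
    chrom, span, strand = exon_id.split(":")
    start, end = (int(x) for x in span.split("-"))
    # Assumption: TCGA coordinates are 1-based inclusive.
    # BED uses a 0-based start and an exclusive end, so only the start shifts.
    # The original exon ID is kept in the name column so counts can be joined back.
    return "\t".join([chrom, str(start - 1), str(end), exon_id, "0", strand])


if __name__ == "__main__":
    for exon in ["chr1:11874-12227:+", "chr1:12595-12721:+"]:
        print(tcga_exon_to_bed(exon))
```

With the exons written to a file (say `exons.bed`, a placeholder name), something like `bedtools intersect -a exons.bed -b annotation.gtf -wa -wb -s` should report each exon next to the gtf records it overlaps on the same strand.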

0

Great, thank you, much appreciated :-) Also one more question: my gtf file needs to be converted to gff for DEXSeq, which is fine because they provide a Python script to do that. So would it be better to extract the exon coordinates and transcript IDs from the gff file, and what would be the best way to do that? I've been looking online and there are a few suggestions using awk/grep, but I was wondering what you think the best way would be. Thanks again.
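
One possible alternative to awk/grep, sketched in Python: read the exon records from the original gtf (not the flattened gff) and index them by coordinates, so that a TCGA exon ID can be looked up by exact match. The function names and the two example records (including the transcript IDs `TX_A` and `TX_B`) are invented for illustration. The sketch also assumes that the TCGA file and the gtf use the same genome build, the same chromosome naming, and 1-based inclusive coordinates.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

# Invented example records, only to show the expected layout (tab-separated).
EXAMPLE_GTF = [
    'chr1\tsrc\texon\t11874\t12227\t.\t+\t.\tgene_id "GENE_1"; transcript_id "TX_A";',
    'chr1\tsrc\texon\t11874\t12227\t.\t+\t.\tgene_id "GENE_1"; transcript_id "TX_B";',
    'chr1\tsrc\tCDS\t12190\t12227\t.\t+\t0\tgene_id "GENE_1"; transcript_id "TX_A";',
]


def index_gtf_exons(gtf_lines):
    """Map (chrom, start, end, strand) to the set of transcript IDs with that exon."""
    index = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # skip CDS, start_codon, and other feature types
        attrs = dict(ATTR_RE.findall(fields[8]))
        transcript = attrs.get("transcript_id")
        if transcript is not None:
            key = (fields[0], int(fields[3]), int(fields[4]), fields[6])
            index[key].add(transcript)
    return index


def transcripts_for(exon_id, index):
    """Look up a TCGA-style ID such as 'chr1:11874-12227:+' by exact coordinates."""
    chrom, span, strand = exon_id.split(":")
    start, end = (int(x) for x in span.split("-"))
    return sorted(index.get((chrom, start, end, strand), ()))


if __name__ == "__main__":
    idx = index_gtf_exons(EXAMPLE_GTF)
    print(transcripts_for("chr1:11874-12227:+", idx))
```

An exon shared by several transcripts returns several IDs, so the mapping from exon to transcript is generally one-to-many.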

0
5.0 years ago

The manual is misleading: DEXSeq does not actually require BAM files to operate. The counting is done by a separate Python script built on HTSeq, and the resulting counts are then loaded into DEXSeq.

What you could do is paste your downloaded files together so that they have the same format as the counts produced by the HTSeq tool, then continue with the instructions in the vignette from the step where it loads the count data.
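
A rough sketch of that reshaping in Python, under several assumptions: the TCGA table is tab-separated with two header rows, every column after the first holds raw counts, and the target is one two-column file per sample (feature ID, then count). The function names and the tiny example table are made up. Note that the exon coordinate strings are used as placeholder feature IDs here; the IDs DEXSeq expects have to match the ones in its flattened gff, which this sketch does not solve.

```python
import csv
import io

# Made-up miniature of a TCGA level 3 exon table (tab-separated, two header rows).
EXAMPLE_TABLE = (
    "Hybridization REF\tsample_A\tsample_B\n"
    "exon\traw_counts\traw_counts\n"
    "chr1:11874-12227:+\t29\t23\n"
    "chr1:12595-12721:+\t7\t10\n"
)


def split_count_matrix(handle):
    """Return {sample: [(feature_id, count), ...]} from a multi-sample table."""
    reader = csv.reader(handle, delimiter="\t")
    samples = next(reader)[1:]  # first header row: sample names
    next(reader)                # second header row: 'exon raw_counts ...'
    per_sample = {sample: [] for sample in samples}
    for row in reader:
        if not row:
            continue
        for sample, value in zip(samples, row[1:]):
            # Counts must be integers; round in case a value is stored as a float.
            per_sample[sample].append((row[0], round(float(value))))
    return per_sample


def to_count_file_text(pairs):
    """Format one sample as 'feature<TAB>count' lines."""
    return "".join(f"{feature}\t{count}\n" for feature, count in pairs)


if __name__ == "__main__":
    per_sample = split_count_matrix(io.StringIO(EXAMPLE_TABLE))
    for sample, pairs in per_sample.items():
        print(f"# {sample}")
        print(to_count_file_text(pairs), end="")
```

Each sample's text could then be written to its own file, for example with `pathlib.Path(...).write_text(...)`.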

0

Hi, thanks for replying. I'll have a look at doing that, although I'm not sure it will work. The main issue I'm having is getting a transcript ID: each row in my data file corresponds to one exon, giving the chromosome, start, and end, so I'm having trouble understanding how I can use my annotation file and dataset to get the corresponding transcript ID. Sorry if this sounds confusing. Thanks in advance.

0

This is the file I'm currently working with; I thought it might help explain what I mean. It's level 3 exon quantification data from the TCGA.

Hybridization REF TCGA-3C-AAAU-01A-11R-A41B-07 TCGA-3C-AALI-01A-11R-A41B-07 TCGA-3C-AALJ-01A-31R-A41B-07
exon raw_counts raw_counts raw_counts raw_counts
chr1:11874-12227:+ 29 23 18 2
chr1:12595-12721:+ 7 10 1 0
chr1:12613-12721:+ 7 10 1 0
chr1:12646-12697:+ 6 5 1 0

I need to turn this into something I can use in DEXSeq, so I presume I need to convert the exon locations into transcript IDs in order to carry out my analysis in DEXSeq, and that's what I'm having trouble with at the moment.

0

Hi! I wonder if you were able to turn your raw exon count file into something you can use with DEXSeq? I also have a project with TCGA raw exon counts and don't know how to make it work with DEXSeq. Any suggestions you can give are much appreciated!

0

Hi Newbie,

Actually, I don't think you can make this conversion work. DEXSeq first produces "counting bin" annotation data from the GTF annotation file using the script "dexseq_prepare_annotation.py", and then performs the differential analysis based on the reads mapped to these bins. Where exons overlap, the bins are smaller than the exons. You cannot recover these bin-level counts from the level-3 exon quantification data downloaded from TCGA.
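
To illustrate the point about bins, here is a simplified Python sketch of how overlapping exons break into disjoint counting bins. It is only a model of the idea, not the logic of "dexseq_prepare_annotation.py" itself, and the function name is made up. It uses three overlapping exons from the table posted above, with 1-based inclusive coordinates assumed.

```python
def counting_bins(exons):
    """Split possibly overlapping exons (start, end) into disjoint bins."""
    # Every exon start, and every position just past an exon end, is a boundary.
    boundaries = sorted({s for s, _ in exons} | {e + 1 for _, e in exons})
    bins = []
    for left, right in zip(boundaries, boundaries[1:]):
        lo, hi = left, right - 1
        # Keep the interval only if at least one exon covers it entirely
        # (this drops the gaps between non-overlapping exons).
        if any(s <= lo and hi <= e for s, e in exons):
            bins.append((lo, hi))
    return bins


if __name__ == "__main__":
    exons = [(12595, 12721), (12613, 12721), (12646, 12697)]
    for lo, hi in counting_bins(exons):
        print(f"chr1:{lo}-{hi}")
```

The three exons yield four bins (12595-12612, 12613-12645, 12646-12697, 12698-12721). The exon-level counts do not say how many reads fall into each of those pieces, which is why the bin counts cannot be reconstructed from the TCGA table.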