gdc RNA-Seq Pipeline gene expression Quantification parameters
6.4 years ago
young yu • 0

Hi everyone! I downloaded RNA-Seq quantification data (HTSeq-FPKM) from GDC, but almost all of it is cancer data. I also got some normal data from SRAdb, in FASTQ format. Now I want to process the normal FASTQ files the same way GDC did, but I don't know the processing standards or the software parameters. I have read some basic material, such as https://gdc.nci.nih.gov/about-data/data-harmonization-and-generation/genomic-data-harmonization/high-level-data-generation/rna-seq-quantification. Could anyone help me? Thanks.

RNA-Seq GDC TCGA gene expression quantification • 3.5k views

I've looked really hard, and I can't find any documentation for their RNA-seq analysis pipeline other than a statement that they followed the ICGC STAR 2-pass RNA-seq SOP, which, as far as I can find, is not documented on the ICGC site at the moment. I've even gone as far as looking at the ICGC and GDC GitHub repositories to see if I can find the commands they used, but thus far, no luck.


You can ask the GDC Help Desk for help. Their e-mail is support@nci-gdc.datacommons.io

6.4 years ago

I am not entirely clear on what you want, but my interpretation is that you also want normal samples among the FPKM files. Technically, the FPKM data on GDC includes both normal and tumor samples. When you download samples from the GDC portal, there is also a metadata file under the download option. Download it; it links each sample to its file name. The metadata file also gives a TCGA barcode, which lets you characterise the samples. If you split a barcode on '-', the fourth element has the form 01A, 01B, 11A, 07A, etc. The two digits of this element tell you whether the sample is normal or tumor: 01-09 stands for tumor and 10-19 for normal (20-29 are control samples).

E.g., if the barcode is TCGA-P4-A5E8-01A-11R-A28H-07, the fourth element is 01A and the sample is tumor, whereas for TCGA-P4-A5E8-11A-11R-A28H-07 the fourth element is 11A and the sample is normal. The first three elements are the same in both, meaning the tumor and normal samples are from the same patient.
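The barcode rule above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the ranges used (01-09 tumor, 10-19 normal, 20-29 control) are the TCGA sample-type ranges as I understand them.

```python
def classify_tcga_sample(barcode: str) -> str:
    """Return 'tumor', 'normal', 'control' or 'unknown' for a TCGA barcode."""
    parts = barcode.split("-")
    # The fourth element looks like '01A': a two-digit sample type plus a vial letter.
    if len(parts) < 4 or not parts[3][:2].isdigit():
        raise ValueError(f"barcode has no sample field: {barcode!r}")
    code = int(parts[3][:2])
    if 1 <= code <= 9:
        return "tumor"
    if 10 <= code <= 19:
        return "normal"
    if 20 <= code <= 29:
        return "control"
    return "unknown"


def patient_id(barcode: str) -> str:
    """The first three elements identify the patient (e.g. 'TCGA-P4-A5E8')."""
    return "-".join(barcode.split("-")[:3])


if __name__ == "__main__":
    for bc in ("TCGA-P4-A5E8-01A-11R-A28H-07", "TCGA-P4-A5E8-11A-11R-A28H-07"):
        print(bc, patient_id(bc), classify_tcga_sample(bc))
```

Grouping the barcodes by `patient_id` and then checking for both a tumor and a normal entry is one way to find matched pairs.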

Hope this is what you were looking for.


hello noorpratap.singh,

Maybe I haven't explained my meaning well. What I actually want is to convert SRA data obtained from another database (SRAdb) into gene expression FPKM values, following the same steps GDC used, and then compare that data with what I downloaded from GDC. For now, I have found a way to do this, but for some reason I can't carry out the concrete steps here. Thank you for your advice, but I think you may have misunderstood my purpose. Thank you all the same. @noorpratap.singh


As far as I can tell, GDC does not yet contain the matched normals for all cancer samples as the old portal did. If you do the barcode translation step you recommend, you'll find that the matched normal barcode is often unrecognized.


TCGA does have matched normal DNA-Seq for variant calls, but it's not common for RNA-Seq.

6.3 years ago
Zhenyu Zhang ▴ 690

The ICGC pipeline is explained in the OICR wiki, if you have access to it. It's STAR 2-pass alignment followed by HTSeq-count, assuming all libraries are unstranded. GDC is working on making all pipelines public (not in weeks, more likely months), if you can wait.


"STAR 2-pass" could cover a multitude of sins. Fortunately, I found that the exact command IS contained in the header of the BAM file, at least for the second pass (but unfortunately not the first). But it's a start. Key points to note: they allow 10 mismatches, or up to a third of the aligned read; up to 20 multi-maps; and a minimum overhang of 1 for a known splice junction. They also assign strand based on the intron motif, written to the XS attribute. Here is an example:

STAR --genomeDir /alignment/scratch94CyXB/star_genomedir_1st_qqActG
/alignment/scratch94CyXB/1ca3c03e-20a1-401e-be97-12d0e506afb8_fastq_files/140513_UNC15-SN850_0365_BC4BYLACXX_TGACCA_L001_2.fastq
--outFilterMultimapScoreRange 1
--outFilterMultimapNmax 20
--outFilterMismatchNmax 10
--alignIntronMax 500000
--alignMatesGapMax 1000000
--sjdbScore 2
--alignSJDBoverhangMin 1
--limitBAMsortRAM 0
--sjdbOverhang 100
--outSAMstrandField intronMotif
--outSAMattributes NH HI NM MD AS XS
--outSAMunmapped Within
--outSAMtype BAM SortedByCoordinate
--outSAMattrRGline ID::140513_UNC15-SN850_0365_BC4BYLACXX_TGACCA_L001 SM:
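For anyone who wants to pull the recorded command out of their own files, here is a rough Python sketch. It assumes you have already dumped the header as text (for example with `samtools view -H file.bam > header.txt`) and that the aligner wrote its command line into the `CL` tag of an `@PG` record, which is where the SAM format puts it. The function name `program_commands` is just something I made up for this example.

```python
def program_commands(header_text: str) -> list:
    """Return (program ID, command line) pairs from @PG records of a SAM header."""
    result = []
    for line in header_text.splitlines():
        if not line.startswith("@PG\t"):
            continue
        tags = {}
        # Header fields are tab-separated 'TAG:value' pairs; split on the first ':' only,
        # because the command line itself may contain colons.
        for field in line.split("\t")[1:]:
            key, sep, value = field.partition(":")
            if sep:
                tags[key] = value
        if "CL" in tags:
            result.append((tags.get("ID", ""), tags["CL"]))
    return result


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py header.txt
    with open(sys.argv[1]) as handle:
        for program, command in program_commands(handle.read()):
            print(f"{program}:\n  {command}")
```

If the header holds several `@PG` records (for example from later sorting or merging steps), each one is listed separately, so you still have to pick out the STAR entry by eye.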

5.7 years ago
Fill ▴ 70

And this is the formula for FPKM:

N = the count of all reads that are aligned to protein-coding genes in that alignment.

See how to calculate N: A: Calculating FPKM after htseq-count
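To make the calculation concrete, here is a small Python sketch. It assumes the usual FPKM definition, FPKM = C * 10^9 / (N * L), where C is the read count for the gene, L is the gene length in base pairs, and N is the protein-coding total described above. The names C and L and the function name `fpkm` are my own labels, not something taken from the GDC documentation.

```python
def fpkm(gene_count: float, gene_length_bp: float, protein_coding_total: float) -> float:
    """FPKM = C * 1e9 / (N * L).

    gene_count           -- C, reads counted for this gene (e.g. from htseq-count)
    gene_length_bp       -- L, gene length in base pairs
    protein_coding_total -- N, reads aligned to all protein-coding genes
    """
    if gene_length_bp <= 0 or protein_coding_total <= 0:
        raise ValueError("gene length and N must be positive")
    return gene_count * 1e9 / (protein_coding_total * gene_length_bp)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(fpkm(gene_count=100, gene_length_bp=2000, protein_coding_total=1_000_000))
```

Note that N here is the protein-coding total rather than the total number of mapped reads, so values computed with a different denominator will not be directly comparable.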