Question

How to look for specific nucleotides of a lncRNA sequence in TCGA data?

0

3.8 years ago

newbie ▴ 120

Based on BLAT results, I have a lncRNA sequence in which some stretches of nucleotides are common to both mice and humans.

chromosome 12:

tctccagtag gggtatcaat atgtagtgat gccccggcTT TCAGAGAtAG  145026799
aAGTGTCCTG AGATACCAaa aatAAAATCA TTCaACATTA AcAAGTCcAG  145026849
TAGaGtcAGG AGAgGGgAAA gAtGGAGtaG ACACATTCAA TACAAGtCAA  145026899
GCATatGtTT GCATaTGTGA GTTTTGTCTA AAAGcAAtTc agTaAaTCCA  145026949
CATCTGGACT CaGCaTTGGC CCgTCCCaca tATTATTAAA tAAGTTCAAA  145026999
GCCAgTAAaT

In the above sequence, a few stretches of nucleotides are common to mice and humans. I'm interested in looking at those common stretches in the TCGA lung cancer RNA-seq data, comparing normal and tumor samples.
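To make the "common stretches" explicit, here is a minimal Python sketch that pulls the uppercase runs out of the pasted alignment text. It assumes the BLAT details-page convention that matching bases are capitalized and mismatches are lowercase; the function name `conserved_runs` and the `min_len` threshold are illustrative choices, not part of BLAT. Only the first two lines of the sequence are used.

```python
import re

def conserved_runs(aligned_text, min_len=8):
    """Return (offset, run) for each uppercase run of at least min_len bases.

    Assumption: uppercase = base shared between the two species,
    lowercase = mismatch. Offsets are 0-based within the pasted block.
    """
    # Keep letters only, so spaces, newlines and trailing coordinates are dropped
    bases = re.sub(r"[^A-Za-z]", "", aligned_text)
    pattern = re.compile(r"[ACGTN]{%d,}" % min_len)
    return [(m.start(), m.group()) for m in pattern.finditer(bases)]

block = """
tctccagtag gggtatcaat atgtagtgat gccccggcTT TCAGAGAtAG  145026799
aAGTGTCCTG AGATACCAaa aatAAAATCA TTCaACATTA AcAAGTCcAG  145026849
"""

runs = conserved_runs(block)
for offset, run in runs:
    print(offset, run)
```

Lowering `min_len` returns shorter shared stretches, but very short runs will match many unrelated reads.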

Can anyone tell me how I should proceed with this? Thank you.

RNA-Seq sequence alignment blat tcga • 1.1k views

ADD COMMENT • link updated 3.8 years ago by Kevin Blighe 87k • written 3.8 years ago by newbie ▴ 120

score 1 · Answer 1 · 2020-07-27

1

3.8 years ago

Kevin Blighe 87k

With the open-access (Level 3) TCGA data, your options are limited unless this lncRNA corresponds to an already-known gene. Does it?

If this is a 'novel' sequence, then I would obtain the raw FASTQ files from the TCGA (requires access approval) and then add this sequence as a new transcript against which reads would be pseudo-aligned via Salmon or Kallisto.

Kevin

ADD COMMENT • link 3.8 years ago by Kevin Blighe 87k

0

Thank you. Sorry, do you mean I have to add my lncRNA sequence to the FASTQ files of all the TCGA samples?

ADD REPLY • link 3.8 years ago by newbie ▴ 120

1

Perhaps it is more important to first answer these questions:

do you have access to the controlled TCGA data (FASTQs and BAMs)?
does this lncRNA sequence represent an already-known gene?

On the second question, the sequence aligns to TNC and RP11-523L1

ADD REPLY • link 3.7 years ago by Kevin Blighe 87k

0

Yes, I do have the TCGA BAM files. No, it doesn't represent any known gene. The sequence I have shown here is an example. So, should I add the sequence to the BAMs of all TCGA samples?

OR

Should I create a placeholder row in the annotation file using the coordinates of this lncRNA sequence, and then use Salmon or Kallisto as you said?

ADD REPLY • link 3.7 years ago by newbie ▴ 120

1

I would convert the BAMs to FASTQ, and then, yes, create a new entry in the GENCODE FASTA sequence file that is then used with Kallisto or Salmon. In this way, when you pseudo-align the FASTQs to the reference transcriptome FASTA, you will also gauge counts over your lncRNA of interest.

The GENCODE FASTA references are available here: https://www.gencodegenes.org/human/ (see, for example, 'Protein-coding transcript sequences' under the 'Fasta files' header).
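As a rough illustration of the "new entry in the FASTA" step, the Python sketch below formats a sequence as a FASTA record and appends it to a copy of a reference transcriptome. The function names, the file names, and the record name `LNC_CANDIDATE_1` are hypothetical, and the sketch assumes the GENCODE file has already been decompressed to plain text.

```python
from pathlib import Path

def fasta_record(name, sequence, width=60):
    """Format a sequence as one FASTA record with fixed-width lines."""
    # Remove whitespace from the pasted sequence and normalise to uppercase
    seq = "".join(sequence.split()).upper()
    unexpected = set(seq) - set("ACGTN")
    if not seq or unexpected:
        raise ValueError(f"empty sequence or unexpected characters: {sorted(unexpected)}")
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"

def append_transcript(reference_fasta, out_fasta, name, sequence):
    """Write a copy of reference_fasta with one extra record at the end."""
    reference = Path(reference_fasta).read_text()
    if reference and not reference.endswith("\n"):
        reference += "\n"
    Path(out_fasta).write_text(reference + fasta_record(name, sequence))

# Hypothetical usage (file names are placeholders):
# append_transcript("gencode.transcripts.fa", "gencode.plus_lnc.fa",
#                   "LNC_CANDIDATE_1", "tctccagtag gggtatcaat ...")
print(fasta_record("LNC_CANDIDATE_1", "acgt acgt ac", width=4), end="")
```

The combined FASTA would then be indexed as usual, for example with `kallisto index -i <index> <fasta>` or `salmon index -t <fasta> -i <index_dir>`, before quantifying the FASTQs against it.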

The sequence that you're showing, though: I presume it is the genomic sequence of the gene?

There is likely another approach, too.

ADD REPLY • link 3.7 years ago by Kevin Blighe 87k

0

Thanks a lot, I will try this approach first. But do you think I should first visualise the reads in IGV to see whether there is any evidence of expression?

ADD REPLY • link 3.7 years ago by newbie ▴ 120

0

Yes, I would already check to see what is in these BAMs - it could be that they already have reads aligned to your gene of interest.

ADD REPLY • link 3.7 years ago by Kevin Blighe 87k

0

newbie: If the sequence is already present in the human genome, creating another entry will cause problems.

You can just search your BAM files at the right alignment coordinates (since I assume you know where your sequences are in the human genome).
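A minimal sketch of such a coordinate lookup, written in Python around `samtools view -c`, which counts the alignments overlapping a region. The BAM file name and the coordinates are placeholders, not values from this thread. It also assumes the BAM is coordinate-sorted and indexed, and that the contig name matches the BAM header ("chr9" versus "9").

```python
import shlex

def region_count_command(bam_path, chrom, start, end):
    """Build a samtools command that counts alignments overlapping a region."""
    if start < 1 or end < start:
        raise ValueError("expected 1-based coordinates with start <= end")
    region = f"{chrom}:{start}-{end}"
    # 'samtools view -c' prints the number of matching alignments
    return ["samtools", "view", "-c", bam_path, region]

# Placeholder sample name and coordinates, for illustration only
command = region_count_command("sample_tumor.bam", "chr9", 1000, 1260)
print(shlex.join(command))

# To actually run it (needs samtools on PATH and a .bai index next to the BAM):
# import subprocess
# n_reads = int(subprocess.run(command, capture_output=True, text=True, check=True).stdout)
```

Looping this over the normal and tumor BAMs gives a quick per-sample read count for the region, which can be normalised by library size before comparing groups.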

ADD REPLY • link 3.7 years ago by GenoMax 141k

0

@Kevin Blighe and @genomax: based on the coordinates of the novel lncRNA sequence, I extracted that region from one TCGA sample BAM and visualized it in IGV. I see that the lncRNA sequence aligns with an exon of the coding gene TNC. How do I proceed now?

ADD REPLY • link 3.7 years ago by newbie ▴ 120

1

Yes, I 'BLASTed' the sequence and saw that it aligns to TNC (mentioned above). In this case, you have a problem because it will be difficult to determine which gene the reads derive from. This is an issue with NGS technology, and it effectively means that all expression studies are biased to a certain degree in this respect.

You could try my approach, but then you'd encounter the same problem.

If you check the alignments [in IGV] along TNC, do you see reads over the other exons at roughly the same depth of coverage as over exon1? If you do not, then, given that TNC is known to be lost in some metastatic cancers, one could infer that the reads over exon1 of TNC derive solely from your lncRNA.
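The visual check can be turned into a rough number. The Python sketch below assumes you have already computed a mean read depth per TNC exon for one sample (for instance from `samtools depth` output). The exon labels and depth values are made up for illustration, and the function name `exon1_enrichment` is hypothetical.

```python
from statistics import median

def exon1_enrichment(mean_depth_by_exon, first_exon="exon1"):
    """Ratio of the first exon's mean depth to the median depth of the other exons."""
    others = [depth for name, depth in mean_depth_by_exon.items() if name != first_exon]
    if not others:
        raise ValueError("need at least one other exon for comparison")
    baseline = median(others)
    target = mean_depth_by_exon[first_exon]
    if baseline == 0:
        # No coverage on the other exons: any exon1 signal is maximally enriched
        return float("inf") if target > 0 else 0.0
    return target / baseline

# Made-up depths for one sample
depths = {"exon1": 40.0, "exon2": 2.0, "exon3": 1.0, "exon4": 3.0}
ratio = exon1_enrichment(depths)
print(ratio)
```

A ratio far above 1 in many samples would be consistent with the exon1 reads coming mostly from the lncRNA rather than from TNC, while a ratio near 1 would suggest TNC itself is expressed. Where to place the cutoff is a judgment call.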

In this case, you could try DEXSeq, which can take BAM files as input and might help here. You'd have to add your lncRNA to the input GTF, though.
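As a sketch of what adding the lncRNA to the GTF could look like, the Python snippet below builds one tab-separated `exon` line in the standard nine-column GTF layout. The coordinates, strand, and the `LNC_CANDIDATE_1` identifiers are placeholders. It assumes that an exon feature carrying `gene_id` and `transcript_id` attributes is sufficient for the annotation-preparation step.

```python
def gtf_exon_line(chrom, start, end, strand, gene_id, transcript_id, source="custom"):
    """Build one GTF 'exon' line (1-based, inclusive coordinates)."""
    if start < 1 or end < start:
        raise ValueError("expected 1-based coordinates with start <= end")
    if strand not in {"+", "-"}:
        raise ValueError("strand must be '+' or '-'")
    attributes = f'gene_id "{gene_id}"; transcript_id "{transcript_id}";'
    # Columns: seqname, source, feature, start, end, score, strand, frame, attributes
    fields = [chrom, source, "exon", str(start), str(end), ".", strand, ".", attributes]
    return "\t".join(fields)

# Placeholder coordinates and identifiers, for illustration only
line = gtf_exon_line("chr9", 1000, 1260, "-", "LNC_CANDIDATE_1", "LNC_CANDIDATE_1.1")
print(line)
```

The line would be appended to a copy of the annotation GTF before the annotation is flattened for exon-level counting. Because the new feature overlaps a TNC exon, how the two genes are merged or separated at that step needs checking.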

ADD REPLY • link 3.7 years ago by Kevin Blighe 87k