Coordinates long non-coding RNAs
3 months ago
luffy ▴ 20

Dear All,

I am trying to get long non-coding RNA coordinates in GTF format. I have downloaded the file from here, but I want to filter it based on a couple of conditions:

• Remove records shorter than 200 bp
• Keep the records that intersect a coding region, allowing 100 bp upstream and downstream

The first one was easily achievable using Python:

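import pandas as pd
# GTF has nine tab-separated columns; comment='#' skips the GENCODE header lines
df_nc = pd.read_csv('gencode.v37.hg38.long_noncoding_RNAs.gtf', sep='\t', comment='#', names=['CHROM', 'HAVANA', 'TYPE', 'START', 'END', 'ID', 'STRAND', 'ID1', 'DETAILS'])
# GTF coordinates are 1-based and inclusive, so length = END - START + 1
df_nc_len = df_nc[df_nc['END'] - df_nc['START'] + 1 >= 200]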


How can I go about the next condition?

Also, why do I find exons in the non-coding GTF?

df_nc_len['TYPE'].value_counts()


The 3rd column gives me:

exon 69042
transcript 48673
gene 17882

Any help would be much appreciated

gtf python bedtools rna-seq hg38 • 390 views

I would tackle this by getting the file in BED format from the UCSC Table Browser and then using BEDtools intersect. Manually coding genome arithmetic functions in Python is reinventing the wheel at this point.
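If the lncRNA annotation is only at hand as a GTF, one possible route to BED records for bedtools is to convert the coordinates directly. The following is a minimal sketch, assuming standard nine-column GTF lines with a `transcript_id` attribute; the helper name `gtf_line_to_bed` and its parameters are made up for illustration. GTF coordinates are 1-based and inclusive while BED is 0-based and half-open, so only the start shifts by one.

```python
import re

def gtf_line_to_bed(line, feature="transcript", min_len=200):
    """Convert one GTF line to a BED6 tuple, or return None if it is skipped.

    Hypothetical helper: keeps only rows of the requested feature type
    whose genomic span is at least min_len bases.
    """
    if line.startswith("#"):              # header/comment lines
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] != feature:
        return None
    chrom, start, end, strand = fields[0], int(fields[3]), int(fields[4]), fields[6]
    if end - start + 1 < min_len:         # GTF is 1-based, inclusive
        return None
    match = re.search(r'transcript_id "([^"]+)"', fields[8])
    name = match.group(1) if match else "."
    # BED is 0-based, half-open: shift the start, keep the end
    return (chrom, start - 1, end, name, ".", strand)
```

Note that the length here is the genomic span of the row, introns included, not the spliced transcript length.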


@heskett, thank you for your input. Could you please let me know which tracks to choose in the UCSC Table Browser to get only the lncRNA coordinates for the hg38 assembly, and those that intersect coding regions with 100 bp upstream and downstream?

Thank you


I won't do the work for you, but I can point you in the right direction. The GENCODE track has both coding and noncoding genes, and it looks like there is a transcriptClass column that says coding or nonCoding. You can download these different files from gencode -> gene and gene predictions -> knowngene on the Table Browser site. Then use bedtools to find intersections and limit the overlap to 100 bp. Learning how to use these tools will be very helpful if you continue doing genomics analysis.
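To make the "within 100 bp" logic concrete, here is a small pure-Python sketch of what a windowed intersection computes. It is only an illustration under stated assumptions: intervals are plain `(chrom, start, end)` tuples in BED-style 0-based, half-open coordinates, and the function name `within_window` is hypothetical. For real data, bedtools does this far more efficiently; as far as I know, `bedtools window` with `-w 100` is the closest built-in equivalent.

```python
from collections import defaultdict

def within_window(lnc, coding, window=100):
    """Return lncRNA intervals lying within `window` bp of any coding interval.

    Intervals are (chrom, start, end) tuples, 0-based and half-open.
    Each matching lncRNA is reported once, however many coding
    intervals it touches.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in coding:
        by_chrom[chrom].append((start, end))

    hits = []
    for chrom, start, end in lnc:
        # Extend the lncRNA by `window` on both sides, clamped at 0
        lo, hi = max(0, start - window), end + window
        if any(c_start < hi and c_end > lo for c_start, c_end in by_chrom[chrom]):
            hits.append((chrom, start, end))
    return hits
```

This scans every coding interval on the chromosome for each lncRNA, which is fine for a toy example but too slow for a whole annotation.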


Dear heskett, there seems to be a bit of a misunderstanding; that was not my intention. I had already tried a similar idea, which is why I asked you to elaborate.

Things I have tried:

1. Downloaded known coding regions from UCSC (RefSeq track) and used bedtools to intersect the noncoding coordinates (from GENCODE) with the coding regions (from UCSC). I then imported the result into pandas, filtered the overlaps that were less than 200, and did a pandas merge (coding and noncoding). There were duplicates, so I dropped them (I was not sure about removing duplicates).

2. I also tried to filter on transcript type after importing the GENCODE GTF into a DataFrame. The 3rd column has different types (exon, transcript, etc.), and the last column, separated by ';', again has different types (lncRNA, misc_RNA, processed_transcript, transcribed_unprocessed_pseudogene, etc.). I am confused about what to keep or drop.
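The two kinds of "type" above are different things. The 3rd GTF column is the feature level: the same locus appears as a gene row, one or more transcript rows, and their exon rows (lncRNAs are spliced, so they have exons too). The biotype sits in the attribute column. A minimal sketch of one way to pick rows, assuming GENCODE-style attributes with a `transcript_type` key and assuming that one record per lncRNA transcript is what is wanted; the function names are hypothetical.

```python
import re

# Matches quoted attributes such as: gene_id "ENSG..."; unquoted ones (level 2) are ignored
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def parse_attributes(attr_field):
    """Parse the 9th GTF column into a dict (first value wins for repeated keys)."""
    attrs = {}
    for key, value in ATTR_RE.findall(attr_field):
        attrs.setdefault(key, value)
    return attrs

def select_lncrna_transcripts(gtf_lines):
    """Yield (transcript_id, chrom, start, end) for lncRNA transcript rows only."""
    seen = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "transcript":
            continue                      # skip gene and exon rows
        attrs = parse_attributes(fields[8])
        if attrs.get("transcript_type") != "lncRNA":
            continue
        tid = attrs.get("transcript_id")
        if tid in seen:                   # report each transcript once
            continue
        seen.add(tid)
        yield (tid, fields[0], int(fields[3]), int(fields[4]))
```

Keying on `transcript_id` gives each transcript a stable identity, which makes it easier to see whether later duplicates are true repeats or one lncRNA overlapping several coding features.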

A few more attempts I made were also unsuccessful.

Sorry and Thank you