Question

GTF file procession

0

Entering edit mode

2.9 years ago

storm1907 ▴ 30

Hello,

I try to intersect GTF files with tped files with purpose to switch chromosomal location to gene location in tped files.

I created some topics before about steps needed:

So, I did steps suggested:

Got gtf from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gtf.gz
extracted only genes annotation with awk -F"\t" '$3=="gene"' my_file.gtf > my_genes_only_file.gtf
converted nc chromosome symbols from NC_xxx to X with sed -r 's/^NC_[0]*//;s/\.[0-9]\+//;/^N[WT]_|12920/d' GCF_000001405.39_GRCh38.p13_genomic.gtf > new.gtf
converted to bed with gtf2bed < foo.gtf: gtf2bed < GCF_000001405.39_GRCh38.p13_genomic_geneonly_1copy.gtf > out
removed unwanted symbols from final intersected file with cat test sed 's/"//g' | sed 's/;//g' test > fintest
printed out only the needed columns from intersected file: cat fintest | awk '{print $4,$33,$35,$34,$36,$37}' > fintestnew

In the end I have file with different content of columns:

KLHL17 0 C C  
KLHL17 0 A A   
PLEKHN1 966179 0 966179 A A
PLEKHN1 966227 0 966227 G G
PERM1 "protein_coding" "C1orf170" gene_synonym 1 976215
PERM1 "protein_coding" "C1orf170" gene_synonym 1 976669

However the desired structure should be like this:

PLEKHN1 966179 0 966179 A A
PLEKHN1 966227 0 966227 G G
PLEKHN1 966272 0 966272 G G

Is it because of inconsistency in GTF files?

Thank you!

ncbi • 1.4k views

ADD COMMENT • link 2.9 years ago by storm1907 ▴ 30

0

Entering edit mode

can you please post the output from awk -F "\t" '{print $4,$33,$35,$34,$36,$37}' fintest > fintestnew ? Please post few example lines from out file.

ADD REPLY • link 2.9 years ago by cpad0112 21k

0

Entering edit mode

awk -F "\t" '{print $4,$33,$35,$34,$36,$37}' fintest > fintestnew

KCNT1
KCNT1
KCNT1
KCNT1
KCNT1
CAMSAP1

and here is the converted out.bed file

1       11873   14409   DDX11L1 .       +       BestRefSeq      gene    .       gene_id "DDX11L1"; transcript_id ""; db_xref "GeneID:100287102"; db_xref "HGNC:HGNC:37102"; description "DEAD/H-box helicase 11 like 1 (pseudogene)"; gbkey "Gene"; gene "DDX11L1"; gene_biotype "transcribed_pseudogene"; pseudo "true";
1       14361   29370   WASH7P  .       -       BestRefSeq      gene    .       gene_id "WASH7P"; transcript_id ""; db_xref "GeneID:653635"; db_xref "HGNC:HGNC:38034"; description "WASP family homolog 7, pseudogene"; gbkey "Gene"; gene "WASH7P"; gene_biotype "transcribed_pseudogene"; gene_synonym "FAM39F"; gene_synonym "WASH5P"; pseudo "true";
1       17368   17436   MIR6859-1       .       -       BestRefSeq      gene    .       gene_id "MIR6859-1"; transcript_id ""; db_xref "GeneID:102466751"; db_xref "HGNC:HGNC:50039"; db_xref "miRBase:MI0022705"; description "microRNA 6859-1"; gbkey "Gene"; gene "MIR6859-1"; gene_biotype "miRNA"; gene_synonym "hsa-mir-6859-1";
1       29925   31295   MIR1302-2HG     .       +       Gnomon  gene    .       gene_id "MIR1302-2HG"; transcript_id ""; db_xref "GeneID:107985730"; db_xref "HGNC:HGNC:52482"; gbkey "Gene"; gene "MIR1302-2HG"; gene_biotype "lncRNA";
1       30365   30503   MIR1302-2       .       +       BestRefSeq      gene    .       gene_id "MIR1302-2"; transcript_id ""; db_xref "GeneID:100302278"; db_xref "HGNC:HGNC:35294"; db_xref "miRBase:MI0006363"; description "microRNA 1302-2"; gbkey "Gene"; gene "MIR1302-2"; gene_biotype "miRNA"; gene_synonym "hsa-mir-1302-2"; gene_synonym "MIRN1302-2";
1       34610   36081   FAM138A .       -       BestRefSeq      gene    .       gene_id "FAM138A"; transcript_id ""; db_xref "GeneID:645520"; db_xref "HGNC:HGNC:32334"; description "family with sequence similarity 138 member A"; gbkey "Gene"; gene "FAM138A"; gene_biotype "lncRNA"; gene_synonym "F379"; gene_synonym "FAM138F";
1       52452   53396   OR4G4P  .       +       Curated Genomic gene    .       gene_id "OR4G4P"; transcript_id ""; db_xref "GeneID:79504"; db_xref "HGNC:HGNC:14822"; description "olfactory receptor family 4 subfamily G member 4 pseudogene"; gbkey "Gene"; gene "OR4G4P"; gene_biotype "pseudogene"; pseudo "true";
1       63015   63885   OR4G11P .       +       Curated Genomic gene    .       gene_id "OR4G11P"; transcript_id ""; db_xref "GeneID:403263"; db_xref "HGNC:HGNC:31276"; description "olfactory receptor family 4 subfamily G member 11 pseudogene"; gbkey "Gene"; gene "OR4G11P"; gene_biotype "pseudogene"; pseudo "true";
1       65418   71585   OR4F5   .       +       BestRefSeq      gene    .       gene_id "OR4F5"; transcript_id ""; db_xref "GeneID:79501"; db_xref "HGNC:HGNC:14825"; description "olfactory receptor family 4 subfamily F member 5"; gbkey "Gene"; gene "OR4F5"; gene_biotype "protein_coding";
1       89005   267354  LOC100996442    .       -       Gnomon  gene    .       gene_id "LOC100996442"; transcript_id ""; db_xref "GeneID:100996442"; gbkey "Gene"; gene "LOC100996442"; gene_biotype "lncRNA";

ADD REPLY • link 2.9 years ago by storm1907 ▴ 30

0

Entering edit mode

I think you are either using wrong input or expecting a wrong output. In expected output, there are alleles (singletons), bed (out file) doesn't have such entries (in copy/pasted out.bed). In addition, cat test sed 's/"//g' | sed 's/;//g' test > fintest code is totally incorrect unless there is a copy/paste mistake. Assuming that function is correct, the sed code removes field separators which is incorrect, for downstream processing. You need to explain where third column (0 s), 5 and 6 coming from. Output bed file you shared here doesn't catch all those fields. Without correct data, it is difficult to address the issue.

ADD REPLY • link 2.9 years ago by cpad0112 21k

0

Entering edit mode

Sorry, forgot one step, bedtools intersect

Before step 5 bedtools intersect -wb was used, to get the file with both gene names, gene positions and SNPs gtf file was intersected with sample bed file (obtained from Plink tped file) Sample bed file:

chr1    183189  183189
chr1    609407  609407

bed file from gtf:

gtf2bed < GCF_000001405.39_GRCh38.p13_genomic_geneonly_1copy.gtf > out

bedtools command: bedtools intersect -wb -a out -b sample.bed >  fintest

ADD REPLY • link 2.9 years ago by storm1907 ▴ 30

0

Entering edit mode

storm1907 : Please do not delete threads once they have received comments/answers.

ADD REPLY • link 2.9 years ago by GenoMax 141k

0

Entering edit mode

Sorry. Nobody answered, so I created another post. So, how can I get help to my question then, once this thread seems to be ignored?

ADD REPLY • link 2.9 years ago by storm1907 ▴ 30