Hello, I try to intersect GTF files with tped files with purpose to switch chromosomal location to gene location in tped files. I created some topics before about steps needed: conversion from genomic to gene coordinates in a large file bedtools intersect combined with filtering after particular word converting spaces to tabs in gtf files conversion from genomic to gene coordinates in a large file tped files to bed with awk
So, I did steps suggested:
1) Got gtf from
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gtf.gz
2) extracted only genes annotation with
awk -F"\t" '$3=="gene"' GCF_000001405.39_GRCh38.p13_genomic.gtf.gz > aaa.gtf
3) converted nc chromosome symbols from NC_xxx to X with
sed -r 's/^NC_[0]*//;s/\.[0-9]\+//;/^N[WT]_|12920/d' aaa.gtf > bbb.gtf
4) converted to bed
gtf2bed < bbb.gtf > ccc.bed
5) intersected with sample bed file (obtained from Plink tped file) Sample bed file:
chr1 183189 183189
chr1 609407 609407
bedtools intersect -wb -a ccc.bed-b sample.bed > ddd.bed
5) removed unwanted symbols from final intersected file with
cat test sed 's/"//g' | sed 's/;//g' ddd.bed > eee.bed
6) printed out only the needed columns from intersected file
cat fintest | awk '{print $4,$33,$35,$34,$36,$37}' eee.bed > fff.bed
In the end I have file with different content of columns:
KLHL17 0 C C
KLHL17 0 A A
PLEKHN1 966179 0 966179 A A
PLEKHN1 966227 0 966227 G G
PERM1 "protein_coding" "C1orf170" gene_synonym 1 976215
PERM1 "protein_coding" "C1orf170" gene_synonym 1 976669
However the desired structure should be like this:
PLEKHN1 966179 0 966179 A A
PLEKHN1 966227 0 966227 G G
PLEKHN1 966272 0 966272 G G
Is it because of inconsistency in GTF files? It should be succesfully intersected with my sample file, so I can switch NC position to gene name and position in sample file and save tped file structure.
Thank you!