GTF file procession
0
0
Entering edit mode
2.9 years ago
storm1907 ▴ 30

Hello,

I try to intersect GTF files with tped files with purpose to switch chromosomal location to gene location in tped files.

I created some topics before about steps needed:

So, I did steps suggested:

  1. Got gtf from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gtf.gz
  2. extracted only genes annotation with awk -F"\t" '$3=="gene"' my_file.gtf > my_genes_only_file.gtf
  3. converted nc chromosome symbols from NC_xxx to X with sed -r 's/^NC_[0]*//;s/\.[0-9]\+//;/^N[WT]_|12920/d' GCF_000001405.39_GRCh38.p13_genomic.gtf > new.gtf
  4. converted to bed with gtf2bed < foo.gtf: gtf2bed < GCF_000001405.39_GRCh38.p13_genomic_geneonly_1copy.gtf > out
  5. removed unwanted symbols from final intersected file with cat test sed 's/"//g' | sed 's/;//g' test > fintest
  6. printed out only the needed columns from intersected file: cat fintest | awk '{print $4,$33,$35,$34,$36,$37}' > fintestnew

In the end I have file with different content of columns:

KLHL17 0 C C  
KLHL17 0 A A   
PLEKHN1 966179 0 966179 A A
PLEKHN1 966227 0 966227 G G
PERM1 "protein_coding" "C1orf170" gene_synonym 1 976215
PERM1 "protein_coding" "C1orf170" gene_synonym 1 976669

However the desired structure should be like this:

PLEKHN1 966179 0 966179 A A
PLEKHN1 966227 0 966227 G G
PLEKHN1 966272 0 966272 G G

Is it because of inconsistency in GTF files?

Thank you!

ncbi • 1.4k views
ADD COMMENT
0
Entering edit mode

can you please post the output from awk -F "\t" '{print $4,$33,$35,$34,$36,$37}' fintest > fintestnew ? Please post few example lines from out file.

ADD REPLY
0
Entering edit mode
awk -F "\t" '{print $4,$33,$35,$34,$36,$37}' fintest > fintestnew

KCNT1
KCNT1
KCNT1
KCNT1
KCNT1
CAMSAP1

and here is the converted out.bed file

1       11873   14409   DDX11L1 .       +       BestRefSeq      gene    .       gene_id "DDX11L1"; transcript_id ""; db_xref "GeneID:100287102"; db_xref "HGNC:HGNC:37102"; description "DEAD/H-box helicase 11 like 1 (pseudogene)"; gbkey "Gene"; gene "DDX11L1"; gene_biotype "transcribed_pseudogene"; pseudo "true";
1       14361   29370   WASH7P  .       -       BestRefSeq      gene    .       gene_id "WASH7P"; transcript_id ""; db_xref "GeneID:653635"; db_xref "HGNC:HGNC:38034"; description "WASP family homolog 7, pseudogene"; gbkey "Gene"; gene "WASH7P"; gene_biotype "transcribed_pseudogene"; gene_synonym "FAM39F"; gene_synonym "WASH5P"; pseudo "true";
1       17368   17436   MIR6859-1       .       -       BestRefSeq      gene    .       gene_id "MIR6859-1"; transcript_id ""; db_xref "GeneID:102466751"; db_xref "HGNC:HGNC:50039"; db_xref "miRBase:MI0022705"; description "microRNA 6859-1"; gbkey "Gene"; gene "MIR6859-1"; gene_biotype "miRNA"; gene_synonym "hsa-mir-6859-1";
1       29925   31295   MIR1302-2HG     .       +       Gnomon  gene    .       gene_id "MIR1302-2HG"; transcript_id ""; db_xref "GeneID:107985730"; db_xref "HGNC:HGNC:52482"; gbkey "Gene"; gene "MIR1302-2HG"; gene_biotype "lncRNA";
1       30365   30503   MIR1302-2       .       +       BestRefSeq      gene    .       gene_id "MIR1302-2"; transcript_id ""; db_xref "GeneID:100302278"; db_xref "HGNC:HGNC:35294"; db_xref "miRBase:MI0006363"; description "microRNA 1302-2"; gbkey "Gene"; gene "MIR1302-2"; gene_biotype "miRNA"; gene_synonym "hsa-mir-1302-2"; gene_synonym "MIRN1302-2";
1       34610   36081   FAM138A .       -       BestRefSeq      gene    .       gene_id "FAM138A"; transcript_id ""; db_xref "GeneID:645520"; db_xref "HGNC:HGNC:32334"; description "family with sequence similarity 138 member A"; gbkey "Gene"; gene "FAM138A"; gene_biotype "lncRNA"; gene_synonym "F379"; gene_synonym "FAM138F";
1       52452   53396   OR4G4P  .       +       Curated Genomic gene    .       gene_id "OR4G4P"; transcript_id ""; db_xref "GeneID:79504"; db_xref "HGNC:HGNC:14822"; description "olfactory receptor family 4 subfamily G member 4 pseudogene"; gbkey "Gene"; gene "OR4G4P"; gene_biotype "pseudogene"; pseudo "true";
1       63015   63885   OR4G11P .       +       Curated Genomic gene    .       gene_id "OR4G11P"; transcript_id ""; db_xref "GeneID:403263"; db_xref "HGNC:HGNC:31276"; description "olfactory receptor family 4 subfamily G member 11 pseudogene"; gbkey "Gene"; gene "OR4G11P"; gene_biotype "pseudogene"; pseudo "true";
1       65418   71585   OR4F5   .       +       BestRefSeq      gene    .       gene_id "OR4F5"; transcript_id ""; db_xref "GeneID:79501"; db_xref "HGNC:HGNC:14825"; description "olfactory receptor family 4 subfamily F member 5"; gbkey "Gene"; gene "OR4F5"; gene_biotype "protein_coding";
1       89005   267354  LOC100996442    .       -       Gnomon  gene    .       gene_id "LOC100996442"; transcript_id ""; db_xref "GeneID:100996442"; gbkey "Gene"; gene "LOC100996442"; gene_biotype "lncRNA"; 
ADD REPLY
0
Entering edit mode

I think you are either using wrong input or expecting a wrong output. In expected output, there are alleles (singletons), bed (out file) doesn't have such entries (in copy/pasted out.bed). In addition, cat test sed 's/"//g' | sed 's/;//g' test > fintest code is totally incorrect unless there is a copy/paste mistake. Assuming that function is correct, the sed code removes field separators which is incorrect, for downstream processing. You need to explain where third column (0 s), 5 and 6 coming from. Output bed file you shared here doesn't catch all those fields. Without correct data, it is difficult to address the issue.

ADD REPLY
0
Entering edit mode

Sorry, forgot one step, bedtools intersect

Before step 5 bedtools intersect -wb was used, to get the file with both gene names, gene positions and SNPs gtf file was intersected with sample bed file (obtained from Plink tped file) Sample bed file:

chr1    183189  183189
chr1    609407  609407

bed file from gtf:

gtf2bed < GCF_000001405.39_GRCh38.p13_genomic_geneonly_1copy.gtf > out

bedtools command: bedtools intersect -wb -a out -b sample.bed >  fintest  
ADD REPLY
0
Entering edit mode

storm1907 : Please do not delete threads once they have received comments/answers.

ADD REPLY
0
Entering edit mode

Sorry. Nobody answered, so I created another post. So, how can I get help to my question then, once this thread seems to be ignored?

ADD REPLY

Login before adding your answer.

Traffic: 1855 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6