converting spaces to tabs in gtf files
0
0
Entering edit mode
3.0 years ago
storm1907 ▴ 30

Hello, I have bed file, looking like this: (original is NCBi GTF file, converted to bed format)

enter image description here

I need to convert all delimiters to tabs in the given file for further usage with bedtools. I tried commands, according to this thread: https://stackoverflow.com/questions/1424126/replace-whitespaces-with-tabs-in-linux

However, after running bedtools, I get the following error

***** ERROR: illegal character 't' found in integer conversion of string "transcript". Exiting...

I cannot understand, why basic txt conversion commands dont work on this file. Also - is there a way to switch NC chromosomal code to chromosome number (i.e. NCXXX to 1)?

Thank you!

NCBI • 3.4k views
ADD COMMENT
0
Entering edit mode

what method from that SO thread did you apply? and can you then post an example of the output file of it.

ADD REPLY
0
Entering edit mode

I tried awk -v OFS="\t" '$1=$1' out_tabs and

sed 's/[:blank:]+/,/g' thefile.txt > out_tabs

The output was:

cat -t out_tabs
enter code here

NC_000001.11^I59638228^I59638375^IFGGY^I.^I+
NC_000001.11^I59641280^I59641351^IFGGY^I.^I+
NC_000001.11^I59660219^I59660293^IFGGY^I.^I+

sed "s/[[:space:]]+/$T/g" gave output like this:

cat -t out_tabs
NC_000003.121914854219534788KCNH8.+
NC_000003.121914854219148795KCNH8.+
tr -s '[:blank:]' '\t'

and perl -p -i -e 's/ /\t/g' file.txt ended with empty output

ADD REPLY
0
Entering edit mode

Is this what you are looking for? Also make sure that you do not have duplicates in the file and if it is windows originated file, look at hidden characters. Try dos2unix on windows generated text files.

$ cat test.txt                                                                                                                                                     
NC_000001.11^I59638228^I59638375^IFGGY^I.^I+
NC_000001.11^I59641280^I59641351^IFGGY^I.^I+
NC_000001.11^I59660219^I59660293^IFGGY^I.^I+

$ sed 's/\^I/\t/g' test.txt                                                                                                                                        
NC_000001.11    59638228    59638375    FGGY    .   +
NC_000001.11    59641280    59641351    FGGY    .   +
NC_000001.11    59660219    59660293    FGGY    .   +
ADD REPLY
0
Entering edit mode

I tried this command, but still get the same issue

bedtools intersect -a out_tabs.bed -b sample.bed  > intersected
***** ERROR: illegal character 't' found in integer conversion of string "transcript". Exiting...
ADD REPLY
0
Entering edit mode

can you try this in Vim?

https://superuser.com/questions/1188594/control-i-characters-in-my-text-file. Also check if your file has any invisible characters. Hint is given in the same post.

Can you have a look at the file by running sed -n 'l' input.txt. This would print the invisible characters in the file. (l is small letter L) `

ADD REPLY
0
Entering edit mode

this is the

output of sed -n 'l' sample.bed

    24\t20756240\t20756240$
    24\t21548215\t21548215$
    24\t24206185\t24206185$

and output of

sed -n 'l' out_tabs
NC_000002.12\t11788387\t11788456\tLPIN1\t.\t+$

NC_000002.12\t11791914\t11792006\tLPIN1\t.\t+$
ADD REPLY
0
Entering edit mode

I tried using the files above by removing $ and \t and using the function bedtools intersect -a sample.bed -b out.bed. It didn't throw any error. However, it seems there is another issue with your files. First column (of chromosomes/contigs) do not match in your files.

You can join biostar slack and share the part or full files privately with the members, it is okay with you.

This is what I did with the data furnished by you:

$ cat sample.bed                                                                                                                                                   
2   11788387    11788456
24  20756240    20756240
24  21548215    21548215
24  24206185    24206185

$ cat out.bed                                                                                                                                                      
2   11788387    11788456    LPIN1   .   +
2   11791914    11792006    LPIN1   .   +

$ bedtools intersect -a sample.bed -b out.bed -wb                                                                                                                  
2   11788387    11788456    2   11788387    11788456    LPIN1   .   +

$ bedtools intersect  -a out.bed -b sample.bed                                                                                                                     
2   11788387    11788456    LPIN1   .   +

Note: I added 2 entry to sample bed and I wrote a small sed script to trim 1 column in out.bed.

ADD REPLY
0
Entering edit mode

I checked out.txt file one more time, and indeed, in the middle of the file there are rows like:

NC_000022.11 transcript 22179165 gene_id 22179488 .
NC_000022.11 exon 22179165 gene_id 22179488 .
NC_000022.11 CDS 22179165 gene_id 22179485 .
NC_000022.11 start_codon 22179165 gene_id 22179167 .
NC_000022.11 stop_codon 22179486 gene_id 22179488 .
NC_000022.11 gene 22182583 gene_id 22182890 .

How can I get rid of them? The file was obtained from NCBI GTF files, with

 cat GCF_000001405.39_GRCh38.p13_genomic.gtf  | awk '{print $1,$4,$5,$10,$6,$7}' | sed 's/"//g' | sed 's/;//g' > out

file was downloaded from here: https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39 however it is quite complex, and could take too much space, if I post a fragment of it here

ADD REPLY
0
Entering edit mode

Please do the following:

  1. Download gtf from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gtf.gz

  2. Unzip (gzip -d) the file, convert first column as per the your file and use it direct with bedtools

  3. However, if you want to use bed file only, edit the first column (from gtf as per your input file), save the new file as gtf and convert it to bed by using 'gtf2bed' from bedops. Use the output bed.

Note that gtf file will have annotations at multiple levels (eg. gene, transcript, CDS, exon). Accordingly, extract those features of your interest from gtf or bed. In the example below, I have used full gtf and hits were too many. I printed only 2 lines.

$ cat sample.bed                                                                                                                                                   
1   959193  959193
1   961945  961945
1   962080  962080

$ bedtools intersect  -a GCF_000001405.39_GRCh38.p13_genomic1.gtf -b sample.bed | head -2                                                                          

1   BestRefSeq  gene    959193  959194  .   -   .   gene_id "NOC2L"; transcript_id ""; db_xref "GeneID:26155"; db_xref "HGNC:HGNC:24517"; db_xref "MIM:610770"; description "NOC2 like nucleolar associated transcriptional repressor"; gbkey "Gene"; gene "NOC2L"; gene_biotype "protein_coding"; gene_synonym "NET15"; gene_synonym "NET7"; gene_synonym "NIR"; gene_synonym "PPP1R112"; 
1   BestRefSeq  transcript  959193  959194  .   -   .   gene_id "NOC2L"; transcript_id "NM_015658.4"; db_xref "Ensembl:ENST00000327044.7"; db_xref "GeneID:26155"; gbkey "mRNA"; gene "NOC2L"; product "NOC2 like nucleolar associated transcriptional repressor"; tag "MANE Select"; transcript_biotype "mRNA"; 
ADD REPLY
0
Entering edit mode

Thank you! Sorry, but how should I edit gtf file then? I need only gene level

ADD REPLY
0
Entering edit mode

Keep only those lines from the result of bedtools intersect with word gene in column 3?

ADD REPLY
0
Entering edit mode

that's correct. If you do that to gtf, intersect operation would be faster as new gtf file is smaller in size. To format gtf (first column), use this code:

$ sed -r 's/^NC_[0]*//;s/\.[0-9]\+//;/^N[WT]_|12920/d' GCF_000001405.39_GRCh38.p13_genomic.gtf > new.gtf

ADD REPLY
0
Entering edit mode

Thank you! But got 1.11 instead of 1 with this command line

1.11    BestRefSeq      gene    11874   14409   .       +       .       gene_id "DDX11L1"; transcript_id ""; db_xref "GeneID:100287102"; db_xref "HGNC:HGNC:37102"; description "DEAD/H-box helicase 11 like 1 (pseudogene)"; gbkey "Gene"; gene "DDX11L1"; gene_biotype "transcribed_pseudogene"; pseudo "true";
ADD REPLY
0
Entering edit mode

what's your input file? If you are using the gtf file from the link I shared with you, it should work. Please paste first line from your input gtf.

ADD REPLY
0
Entering edit mode

Hello, so I downloaded file via wget -c https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gtf.gz

Then gunzipped it (with -d), executed command

sed -r 's/^NC_[0]*//;s/\.[0-9]\+//;/^N[WT]_|12920/d' GCF_000001405.39_GRCh38.p13_genomic.gtf > new.gtf

runned bedtools command:

bedtools intersect -a new.gtf -b sample.bed  > intersected

and got empty output, because this is the output of sed command line

1.11    BestRefSeq      gene    11874   14409   .       +       .       gene_id "DDX11L1"; transcript_id ""; db_xref "GeneID:100287102"; db_xref "HGNC:HGNC:37102"; description "DEAD/H-box helicase 11 like 1 (pseudogene)"; gbkey "Gene"; gene "DDX11L1"; gene_biotype "transcribed_pseudogene"; pseudo "true";
1.11    BestRefSeq      transcript      11874   14409   .       +       .       gene_id "DDX11L1"; transcript_id "NR_046018.2"; db_xref "GeneID:100287102"; gbkey "misc_RNA"; gene "DDX11L1"; product "DEAD/H-box helicase 11 like 1 (pseudogene)"; pseudo "true"; transcript_biotype "transcript";
1.11    BestRefSeq      exon    11874   12227   .       +       .       gene_id "DDX11L1"; transcript_id "NR_046018.2"; db_xref "GeneID:100287102"; gene "DDX11L1"; product "DEAD/H-box helicase 11 like 1 (pseudogene)"; pseudo "true"; transcript_biotype "transcript"; exon_number "1";
ADD REPLY
0
Entering edit mode

try this and post the results here:

$ sed -r 's/\.[0-9]+//;s/^NC_[0]*//;s/\.[0-9]\+//;/^N[WT]_|12920/d' GCF_000001405.39_GRCh38.p13_genomic.gtf > new.gtf
ADD REPLY
0
Entering edit mode

This worked, thank you!

1       BestRefSeq      gene    11874   14409   .       +       .       gene_id "DDX11L1"; transcript_id ""; db_xref "GeneID:100287102"; db_xref "HGNC:HGNC:37102"; description "DEAD/H-box helicase 11 like 1 (pseudogene)"; gbkey "Gene"; gene "DDX11L1"; gene_biotype "transcribed_pseudogene"; pseudo "true";
1       BestRefSeq      transcript      11874   14409   .       +       .       gene_id "DDX11L1"; transcript_id "NR_046018.2"; db_xref "GeneID:100287102"; gbkey "misc_RNA"; gene "DDX11L1"; product "DEAD/H-box helicase 11 like 1 (pseudogene)"; pseudo "true"; transcript_biotype "transcript";
1       BestRefSeq      exon    11874   12227   .      
ADD REPLY
0
Entering edit mode

Hello, sorry is this bedtools option? How can I write this command?

ADD REPLY
0
Entering edit mode

without posting example data and commands used, it is difficult to work with image.

ADD REPLY
0
Entering edit mode

Sorry about that. I posted commands above, in previous commands

And this is what i get when using bedtools

bedtools intersect -a out_tabs.bed -b sample.bed  > intersected
***** ERROR: illegal character 't' found in integer conversion of string "transcript". Exiting...
ADD REPLY
0
Entering edit mode

Post actual data. Not a screenshot.

ADD REPLY
0
Entering edit mode

My apologies

NC_000001.11^I149063633^I149063805^INBPF9^I.^I-
NC_000001.11^I149062862^I149062913^INBPF9^I.^I-
NC_000001.11^I149062093^I149062265^INBPF9^I.^I-
NC_000001.11^I149061332^I149061383^INBPF9^I.^I-
ADD REPLY
0
Entering edit mode

This i guess is from out_tabs.bed. What about sample.bed? Does any one of these files have string "transcript" in them?

ADD REPLY
0
Entering edit mode

not really

vi sample.bed
/transcript

1       959193  959193
1       961945  961945
1       962080  962080
E486: Pattern not found: transcript   
ADD REPLY

Login before adding your answer.

Traffic: 1860 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6