Hi all,
I need a suggestion for annotating a huge file with gene information. The file is gzipped (~120 MB); after unzipping, it is about 30 GB and has ~40,000,000 rows. It is a text file with several columns, e.g. chr, start, end, site ID, and other information. Is there a tool that lets me easily annotate every site with gene information?
Thanks a lot. Best regards
If you put your input into BED format, you can use bedmap to associate those intervals with genes converted to BED via gtf2bed or gff2bed. Search Biostars for those keywords and you'll find a number of answers that demonstrate this for Gencode and other annotation sets.
Some example code:
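A minimal sketch of that workflow with BEDOPS, under some assumptions that are not in the question: the unzipped table is tab-delimited with no header line, its first four columns are chr, start, end, and site ID, the coordinates are already 0-based and half-open as BED expects, and the annotation is a GTF file. The function names and file names are placeholders for illustration.

```shell
# Sketch only: column layout and file names are assumptions, not from the post.
# Requires BEDOPS (gtf2bed, sort-bed, bedmap) on the PATH.

# genes_to_bed GTF: write a sorted BED file of gene records to stdout.
genes_to_bed() {
    # Some BEDOPS releases expect a transcript_id attribute on every record,
    # so pad records that lack one before converting.
    awk '{ if ($0 ~ /transcript_id/) print $0; else print $0 " transcript_id \"\";" }' "$1" \
        | gtf2bed \
        | awk -F'\t' '$8 == "gene"'   # column 8 of gtf2bed output is the feature type
}

# annotate_sites SITES_GZ GENES_BED: append overlapping gene IDs to each site.
annotate_sites() {
    gunzip -c "$1" \
        | sort-bed - \
        | bedmap --echo --echo-map-id-uniq --delim '\t' - "$2"
}
```

For example, `genes_to_bed genes.gtf > genes.bed` followed by `annotate_sites sites.txt.gz genes.bed > sites.annotated.bed`. Each output line is the original site followed by the IDs of the genes it overlaps; sites that overlap no gene get an empty last column. For an input of this size, sort-bed's `--max-mem` option may be worth a look.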
You can pipe things in via a process substitution, e.g.:
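A sketch of what that might look like, assuming a gzipped, tab-delimited, header-free table whose first four columns are chr, start, end, and site ID, and a GFF3 annotation. The function name and arguments are placeholders, and process substitution needs bash or zsh rather than plain sh.

```shell
#!/usr/bin/env bash
# Sketch only: column layout and file names are assumptions, not from the post.
# Requires BEDOPS (gff2bed, sort-bed, bedmap) on the PATH.

# annotate_streaming SITES_GZ GENES_GFF3: map sites to gene IDs without
# writing intermediate BED files to disk.
annotate_streaming() {
    bedmap --echo --echo-map-id-uniq --delim '\t' \
        <(gunzip -c "$1" | sort-bed -) \
        <(gff2bed < "$2" | awk -F'\t' '$8 == "gene"')   # keep gene records only
}
```

It would be called as `annotate_streaming sites.txt.gz genes.gff3 > sites.annotated.bed`. This avoids keeping a 30 GB unzipped copy around, at the cost of redoing the sort on every run.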