annotate sites with gene info
2
0
Entering edit mode
4 months ago
Rdeveloop • 0

Hi all,

I need a suggestion for annotating with the gene information a huge file. The file is in the gz format (dimension ~ 120 Mb) after unzip it, the dimension of the file is about 30 Giga and there are ~ 40 000 000 rows. In the file (text format), there are several columns, e.g. chr, start, end, ID sites and other information. Is there a tool that can allow me to easily annotate every sites with the gene information?

Thank a lot. Best regards

annotation • 444 views
ADD COMMENT
0
Entering edit mode

If you put your input into BED format, you can use bedmap to associate those intervals with genes converted to BED via gtf2bed or gff2bed.

Search biostars for those keywords and you'll find a number of answers that demonstrate this for Gencode and other annotation sets.

ADD REPLY
0
Entering edit mode

Some example code:

  1. https://bioinformatics.stackexchange.com/a/897/776
  2. Obtain intronic coordinates from GENCODE database

You can pipe things in via a process substitution, e.g.:

$ someTool <(gunzip -c genes.gz) | ...
ADD REPLY
0
Entering edit mode
4 months ago

annotate every sites w

"every sites" ? do you mean a vcf file ? sort+bzip+tabix your text file and have a look at bcftools annotate.

ADD COMMENT
0
Entering edit mode
4 months ago

If you have your gene information in a bed file (chr, 0-base start, 1-base end, gene) and your input file is a vcf file (it looks so from your very limited description; just describe it deeper if not) you can use bedtools intersect as follows:

bedtools intersect -loj -a input.vcf.gz -b genes.bed.gz

Note that both input files should be sorted previously. Note also that you don't need to uncompress input files in order to feed tools like bedtools or bcftools

ADD COMMENT
0
Entering edit mode

Thanks a lot. my file is not a vcf file, it is in a file in which there are listed several sites. For each site, it is reported information like chromosome, the start, the end position, the ID position. The format is tsv.gzs. So, I would like to map each site to the corresponding gene and add that information. Maybe I can use you above command but using a gtf file instead of the genes.bed.gz. What do you think? Could it work?

ADD REPLY
0
Entering edit mode

If the input is not a vcf file but a tab delimited table you can try modifying your input file on the fly to a bed-like format (if first 3 columns are chr, start and end you can convert the start second column to 0-base position), and modify the output back to the original format:

zcat input.tsv.gz \
| awk 'OFS="\t"{$2--;print}' \
| bedtools intersect -loj -a - -b genes.bed.gz \
| awk 'OFS="\t"{$2++; print}'
ADD REPLY

Login before adding your answer.

Traffic: 1415 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6