Question

GWAS results: base pair annotation

0

Entering edit mode

2.6 years ago

hemasekhar09 • 0

Hello,

I have a question regarding the annotation of GWAS results.

Here are the sample files

File1 Below is File1 with the HG38id column name that contains chromosome number and base pair position.

HG38id
13:21325916
5:128327887
19:44906745
1:88820990
17:44981703

File2 Below is File2 with the HG38id column name that contains chromosome number and base pair position.

HG38id
13:21325516
5:128327100
19:44906111
1:88820200
17:44981903

I would like to know the ways to check the nearest base pair position between File1 and File2. For instance, 500 base pairs downstream and upstream.

The output for above given File1 and File2 with 500 base pair downstream and the upstream is

13:21325916
17:44981703

because these two base pair positions are nearby 500 bases in both files.

What is the way to find the nearest base pair positions (for instance, 500 or 200 base pairs downstream and upstream) in large files?

Thanks in advance

Sincerely,
Hema

GWAS Annotation results • 796 views

ADD COMMENT • link 2.6 years ago by hemasekhar09 • 0

score 1 · Answer 1 · 2021-09-03

You could convert your files into bed files using awk

awk -F ":" ' {print $1 "\t" $2 "\t" $2}' your_file.txt >your_files.bed

Run that for both files to make two bed-files. Then you can use bedtools. There are various bedtools commands which are useful for your problem.

For example, bedtools closest will give you the closest interval in the first file for each interval in the second file, here's their example:

$ cat a.bed
chr1  10  20  a1  1 -

$ cat b.bed
chr1  7   8   b1  1 -
chr1  15  25  b2  2 +

$ bedtools closest -a a.bed -b b.bed
chr1  10  20  a1  1 - chr1  15  25  b2  2 +

For fine-grained control, you could also expand all your SNPs by a certain bp amount and then see whether they intersect using bedtools slop:

bedtools slop -i your_file.bed -b 5000  > your_expanded_regions.bed # expand the SNP feature to be 5000 bp in both directions
bedtools intersect -a your_SNPs.bed -b your_expanded_regions.bed # will report all SNPs in your_SNPs.bed that overlap with the new 10,000 bp windows in your_expanded_regions.bed