Question

Tab Delimited Files Comparison

0

Entering edit mode

11.0 years ago

Raghav ▴ 100

I have two files, file1 contain, information regarding gene start positions and gene end positions with their chromosome locations and file 2 contain variant information in particular chromosome location it may be in CDS/5'UTRs/3'UTRs or some where else, I want to compare file2 (variant information) to file1 (complete gen information) and want to filter only those gene from file1 that have variation according to file2. for example

file2

Don-0    1    288    A    G    3    1    1    1   
Don-0    1    291    T    A    5    2    1    1
Don-0    1    303    T    C    18    3    1    1
Don-0    1    310    C    G    14    2    1    1
Don-0    1    317    C    T    23    4    1    1
Don-0    1    331    A    T    32    6    1    1
Don-0    1    344    A    C    32    9    1    1
Don-0    1    352    G    C    32    7    1    1
Don-0    1    365    C    G    32    6    1    1
Don-0    1    **4000**    A    G    25    6    1    1
`

file1

chromosome    1    30427671    .    .    .    ID=Chr1;Name=Chr1    
**gene            3631    5899    .**    +    .    ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010    
mRNA    3631    5899    .    +    .    ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1    
protein    3760    5630    .    +    .    ID=AT1G01010.1-Protein;Name=AT1G01010.1;Derives_from=AT1G01010.1    
exon    3631    3913    .    +    .    Parent=AT1G01010.1    
five_prime_UTR    3631    3759    .    +    .    Parent=AT1G01010.1    
CDS    3760    3913    .    +    0    Parent=AT1G01010.1,AT1G01010.1-Protein;    
exon    3996    4276    .    +    .    Parent=AT1G01010.1    
CDS    3996    4276    .    +    2    Parent=AT1G01010.1,AT1G01010.1-Protein;

We want to filter only gene which contains variant position from file2 like

 **gene            3631    5899    .** **4000**    **A    G**

Need your valuable comments and suggestions

bash • 2.9k views

ADD COMMENT • link updated 12 months ago by Ram 43k • written 11.0 years ago by Raghav ▴ 100

1

Entering edit mode

I'm sure you can find many answers on this site for your question. Search for "bedtools" please.

ADD REPLY • link 11.0 years ago by Pierre Lindenbaum 161k

score 2 · Answer 1 · 2013-04-05

Can you get your file1 data into UCSC BED format? The file1 data looks almost like GFF.

If you can work this into GFF, it can be converted into sorted BED with BEDOPS gff2bed and BEDOPS sort-bed (and optionally filtered for genes with grep):

$ gff2bed < file1 \
    | grep gene - \
    > file1.genes.bed

Once in BED format, it can be mapped against a modified, sorted file2 with BEDOPS bedmap and BEDOPS sort-bed (assuming Don-0 is the same as Chr1).

First, modify file2 to make sure it uses the same chromosome ID as in file1 and is in correct sort order:

$ awk '{res="Chr1"; for(i=2;i<=NF;i++){res=res"\t"$i}; print res}' file2 \
    | sort-bed - \
    > file2.mod.bed

The bedmap call reports elements of file2 that overlap each file1 element:

$ bedmap --echo --echo-map --delim '\t' file1.genes.bed file2.mod.bed \
    > answer.bed

In answer.bed, you will get each file1 element, along with any qualifying file2 elements which overlap the file1 element by one or more bases. For example (taking liberties with your data as shown, in order to demonstrate a possible result):

...
Chr1    3631    5899    {rest of file1 element...}    Chr1    1    4000    A    G    {rest of file2 element...}
...

It looks like you might first need to work your data into a standard format that can be easily used with tools of this sort.