Question: Tab Delimited Files Comparison
0
gravatar for Raghav
6.5 years ago by
Raghav100
Allahabad, India
Raghav100 wrote:

I have two files, file1 contain, information regarding gene start positions and gene end positions with their chromosome locations and file 2 contain variant information in particular chromosome location it may be in CDS/5'UTRs/3'UTRs or some where else, I want to compare file2 (variant information) to file1 (complete gen information) and want to filter only those gene from file1 that have variation according to file2. for example

file2

Don-0    1    288    A    G    3    1    1    1   
Don-0    1    291    T    A    5    2    1    1
Don-0    1    303    T    C    18    3    1    1
Don-0    1    310    C    G    14    2    1    1
Don-0    1    317    C    T    23    4    1    1
Don-0    1    331    A    T    32    6    1    1
Don-0    1    344    A    C    32    9    1    1
Don-0    1    352    G    C    32    7    1    1
Don-0    1    365    C    G    32    6    1    1
Don-0    1    **4000**    A    G    25    6    1    1

file1

chromosome    1    30427671    .    .    .    ID=Chr1;Name=Chr1    
**gene            3631    5899    .**    +    .    ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010    
mRNA    3631    5899    .    +    .    ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1    
protein    3760    5630    .    +    .    ID=AT1G01010.1-Protein;Name=AT1G01010.1;Derives_from=AT1G01010.1    
exon    3631    3913    .    +    .    Parent=AT1G01010.1    
five_prime_UTR    3631    3759    .    +    .    Parent=AT1G01010.1    
CDS    3760    3913    .    +    0    Parent=AT1G01010.1,AT1G01010.1-Protein;    
exon    3996    4276    .    +    .    Parent=AT1G01010.1    
CDS    3996    4276    .    +    2    Parent=AT1G01010.1,AT1G01010.1-Protein;

we want to filter only gene which contains variant position from file2. like gene 3631 5899 . 4000 A G

need your valuable comments and suggesstions

file • 2.0k views
ADD COMMENTlink modified 6.5 years ago by Alex Reynolds29k • written 6.5 years ago by Raghav100
1

I'm sure you can find many answers on this site for your question. Search for "bedtools" please.

ADD REPLYlink written 6.5 years ago by Pierre Lindenbaum123k
2
gravatar for Alex Reynolds
6.5 years ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

Can you get your file1 data into UCSC BED format? The file1 data looks almost like GFF.

If you can work this into GFF, it can be converted into sorted BED with BEDOPS gff2bed and BEDOPS sort-bed (and optionally filtered for genes with grep):

$ gff2bed < file1 \
    | grep gene - \
    > file1.genes.bed

Once in BED format, it can be mapped against a modified, sorted file2 with BEDOPS bedmap and BEDOPS sort-bed (assuming Don-0 is the same as Chr1).

First, modify file2 to make sure it uses the same chromosome ID as in file1 and is in correct sort order:

$ awk '{res="Chr1"; for(i=2;i<=NF;i++){res=res"\t"$i}; print res}' file2 \
    | sort-bed - \
    > file2.mod.bed

The bedmap call reports elements of file2 that overlap each file1 element:

$ bedmap --echo --echo-map --delim '\t' file1.genes.bed file2.mod.bed \
    > answer.bed

In answer.bed, you will get each file1 element, along with any qualifying file2 elements which overlap the file1 element by one or more bases. For example (taking liberties with your data as shown, in order to demonstrate a possible result):

...
Chr1    3631    5899    {rest of file1 element...}    Chr1    1    4000    A    G    {rest of file2 element...}
...

It looks like you might first need to work your data into a standard format that can be easily used with tools of this sort.

ADD COMMENTlink modified 5.4 years ago • written 6.5 years ago by Alex Reynolds29k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1397 users visited in the last hour