Question

Extract gene lists using two files

0

6.5 years ago

bk11 ★ 2.4k

Hi,

I have two files: file1 contains coordinates and file2 contains a list of genes with their positions. How can I extract the genes in file2 that overlap the coordinates given in file1?

File1:

CNV_type  Coordinates size    val1    val2    val3    val4    val5    val6    cp
deletion    chr10:1726501-1755000   28500   0.586226    9.73037E-05 715.754 0.00171548  3546.87 0.114216    1.17241

File2:

Chr10   NC_029525.1 gene    1672245 1676954 -   LOC107318572
Chr10   NC_029525.1 gene    1677076 1682931 -   C10H15orf39
Chr10   NC_029525.1 gene    1690899 1710413 -   PPCDC
Chr10   NC_029525.1 gene    1710723 1714472 -   LOC107318577
Chr10   NC_029525.1 gene    1714558 1714977 -   LOC107318579
Chr10   NC_029525.1 gene    1717116 1719122 +   RPP25
Chr10   NC_029525.1 gene    1721742 1725395 +   LOC107318578
Chr10   NC_029525.1 gene    1725935 1728167 +   FAM219B
Chr10   NC_029525.1 gene    1728336 1731151 -   MPI
Chr10   NC_029525.1 gene    1731194 1739576 +   LOC107318570
Chr10   NC_029525.1 gene    1739821 1743801 +   ULK3
Chr10   NC_029525.1 gene    1744568 1747749 -   CPLX3
Chr10   NC_029525.1 gene    1752411 1759515 -   CSK
Chr10   NC_029525.1 gene    1763792 1766892 -   LOC107318670
Chr10   NC_029525.1 gene    1768556 1772353 +   LOC107318671
Chr10   NC_029525.1 gene    1773117 1795190 +   EDC3
Chr10   NC_029525.1 gene    1795424 1803058 -   CLK3
Chr10   NC_029525.1 gene    1803181 1830203 -   ARID3B
Chr10   NC_029525.1 gene    1830313 1832341 +   LOC107318625
Chr10   NC_029525.1 gene    1832912 1837647 +   UBL7
Chr10   NC_029525.1 gene    1837777 1868463 +   SEMA7A
Chr10   NC_029525.1 gene    1871806 1875769 +   LOC107318663
Chr10   NC_029525.1 gene    1875800 1895830 -   CCDC33

Output/Result:

Chr10   NC_029525.1 gene    1725935 1728167 +   FAM219B
Chr10   NC_029525.1 gene    1728336 1731151 -   MPI
Chr10   NC_029525.1 gene    1731194 1739576 +   LOC107318570
Chr10   NC_029525.1 gene    1739821 1743801 +   ULK3
Chr10   NC_029525.1 gene    1744568 1747749 -   CPLX3
Chr10   NC_029525.1 gene    1752411 1759515 -   CSK

I would appreciate your help. Thanks.

Unix awk shell script • 2.3k views

ADD COMMENT • link updated 4.9 years ago by Biostar 20 • written 6.5 years ago by bk11 ★ 2.4k

0

I removed the headers from both files using sed (sed '1d' file1.txt). File1 now looks like this:

deletion        chr10:742501-778500     36000   0.654632        5.86769E-05     4097.15 0.00078794      12578   0.0298329       1.30923
deletion        chr10:1726501-1755000   28500   0.586226        9.73037E-05     715.754 0.00171548      3546.87 0.114216        1.17241
deletion        chr10:1761001-2217000   456000  0.690132        3.49501E-13     47504300        3.51816E-13     48803700        0.000937598     1.38026
deletion        chr10:14281501-14355000 73500   0.686171        7.20884E-08     2557540 4.2113E-07      3406590 0.166855        1.37233
deletion        chr10:17359501-17797500 438000  0.721442        3.63864E-13     612200000       3.66374E-13     618714000       0.00241719      1.44288

And file2 becomes:

Chr1    NC_029516.1     967     5379    -       RNF212B
Chr1    NC_029516.1     5341    19961   -       LRRC16B
Chr1    NC_029516.1     20428   23408   -       LOC107314597
Chr1    NC_029516.1     23722   31174   +       LOC107305688
Chr1    NC_029516.1     34904   43887   -       LOC107312337
Chr1    NC_029516.1     49744   56456   -       LOC107317046

Output of the command tail -n+2 first.txt | cut -f2 | awk -v OFS="\t" -F'[:-]' '{ print $1, $2, $3; }' | sort-bed - > first.bed:

LGE22C19W28_E50C23      1       99000
LGE64   355501  421500
chr1    547501  571500
chr1    8761501 8773500
chr1    18082501        18099000
chr1    26757001        26775000
chr1    29340001        29350500
chr1    44872501        44920500

Output of the command awk -v OFS="\t" '{ print $1, $4, $5, $2, $3, $6, $7; }' second.txt > second.bed:

Chr1    5379    -       NC_029516.1     967     RNF212B
Chr1    19961   -       NC_029516.1     5341    LRRC16B
Chr1    23408   -       NC_029516.1     20428   LOC107314597
Chr1    31174   +       NC_029516.1     23722   LOC107305688
Chr1    43887   -       NC_029516.1     34904   LOC107312337
Chr1    56456   -       NC_029516.1     49744   LOC107317046

But when I ran the final command:

bedops -e 1 second.bed first.bed | awk -v OFS="\t" '{ print $1, $4, $5, $2, $3, $6, $7; }' > answer.txt

the answer file was empty.

ADD REPLY • link 6.5 years ago by bk11 ★ 2.4k

1

The formats of these files are different from what you posted originally.

To answer your question based off of these inputs, you could convert the first file to sorted BED via:

$ cut -f2 first.txt | awk -v OFS="\t" -F'[:-]' '{ print $1, $2, $3; }' | sort-bed - > first.bed

Convert the second file:

$ awk -v OFS="\t" '{ print $1, $3, $4, $2, $6, $5; }' second.txt | sort-bed - > second.bed

Then you can run bedops and permute columns:

$ bedops -e 1 second.bed first.bed | awk -v OFS="\t" '{ print $1, $4, $2, $3, $6, $5; }' > answer.txt

Check the files at every step, especially if your input format changes, because that changes the behavior of tools like awk and bedops.
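One way to do that kind of check, as a rough sketch rather than part of the original reply: count the fields on each line and confirm that the columns headed for the BED start and end positions are numeric. The two-row second.txt written here is a hypothetical stand-in using the six-column layout from the follow-up, and field_counts.txt and bad_rows.txt are made-up file names.

```shell
# Hypothetical stand-in for second.txt, using the six-column layout from the follow-up.
printf 'Chr1\tNC_029516.1\t967\t5379\t-\tRNF212B\n'   >  second.txt
printf 'Chr1\tNC_029516.1\t5341\t19961\t-\tLRRC16B\n' >> second.txt

# Tally the number of tab-separated fields per line; a consistent file gives one tally line.
awk -F'\t' '{ print NF }' second.txt | sort -n | uniq -c > field_counts.txt
cat field_counts.txt

# Count rows whose would-be BED start/end columns (3 and 4 here) are not plain integers.
awk -F'\t' '$3 !~ /^[0-9]+$/ || $4 !~ /^[0-9]+$/ { bad++ } END { print bad + 0 }' second.txt > bad_rows.txt
cat bad_rows.txt
```

A header row left in the file would be counted as a bad row by the second check, which is the situation sort-bed reports as a non-numeric start coordinate.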

ADD REPLY • link 6.5 years ago by Alex Reynolds 35k

0

Thank you, Alex Reynolds! It worked now. I appreciate your help.

ADD REPLY • link 6.5 years ago by bk11 ★ 2.4k

0

Please use ADD REPLY/ADD COMMENT when responding to existing posts to keep threads logically formatted. This belongs under @Alex's answer.

ADD REPLY • link 6.5 years ago by GenoMax 141k

1

6.5 years ago

Pierre Lindenbaum 161k

Convert file1 to BED:

cut -f 2 file1.tsv | tr ":" "\t" | tr "-" "\t" | sort -t $'\t' -k1,1 -k2,2n > file1.bed

Convert file2 to BED:

awk '{printf("%s\t%s\t%s\t%s\n",$1,$4,$5,$0);}' file2.tsv | sort -t $'\t' -k1,1 -k2,2n > file2.bed

Get the intersection using bedtools intersect: http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html
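As an illustration only, and not a command given in the original answer, an intersection call might look like the sketch below. It assumes bedtools is installed and that chromosome names match exactly between the two files; the sample rows in the question show chr10 in one file and Chr10 in the other, which would have to be reconciled first. The tiny BED files are hypothetical, simplified to four columns, and overlaps.bed is a made-up output name.

```shell
# Hypothetical four-column BED files; chromosome names are assumed to match exactly.
printf 'chr10\t1726501\t1755000\n' > file1.bed
printf 'chr10\t%s\t%s\t%s\n' \
  1721742 1725395 LOC107318578 \
  1725935 1728167 FAM219B \
  1752411 1759515 CSK \
  1763792 1766892 LOC107318670 > file2.bed

# -a is the file whose records are reported, -b the file they are compared against;
# -u writes each -a record once if it overlaps anything in -b.
if command -v bedtools >/dev/null 2>&1; then
  bedtools intersect -a file2.bed -b file1.bed -u > overlaps.bed
  cat overlaps.bed
fi
```

Putting the gene file in -a is the design choice that matters here, since the goal is to report gene rows rather than the coordinate intervals.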

ADD COMMENT • link 6.5 years ago by Pierre Lindenbaum 161k

0

Hi Pierre, Thank you for your reply. Now I am able to create both the bed files. Which command line of bedtools will you suggest for the result?

ADD REPLY • link 6.5 years ago by bk11 ★ 2.4k

0

As stated: bedtools intersect.

You asked: "Which command line of bedtools will you suggest for the result?"

What did you try?

ADD REPLY • link 6.5 years ago by Pierre Lindenbaum 161k

GenoMax · Accepted Answer · 2017-10-16

0

6.5 years ago

Alex Reynolds 35k

Convert the first file to a sorted BED file:

$ tail -n+2 first.txt | cut -f2 | awk -v OFS="\t" -F'[:-]' '{ print $1, $2, $3; }' | sort-bed - > first.bed

Convert the second file to a sorted BED file:

$ tail -n+2 second.txt | awk -v OFS="\t" '{ print $1, $4, $5, $2, $3, $6, $7; }' | sort-bed - > second.bed

Run set operations on the two sorted BED files with BEDOPS (bedops -e 1 keeps the elements of the first file that overlap the second by at least one base), and permute the output back to the desired format with awk:

$ bedops -e 1 second.bed first.bed | awk -v OFS="\t" '{ print $1, $4, $5, $2, $3, $6, $7; }' > answer.txt
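For readers without BEDOPS at hand, the same overlap test can be sketched in plain awk. This is a swapped-in technique, not the answer's method: it reads the intervals into memory and compares every gene against each of them, so it suits small files. The file names cnv.txt, genes.txt and answer.txt are placeholders, the rows are copied from the question's samples (the first file is cut down to three columns), and the lower-casing of chromosome names is an assumption added because those samples show chr10 in one file and Chr10 in the other.

```shell
# Sample rows from the question, written tab-separated; file names are placeholders.
printf 'CNV_type\tCoordinates\tsize\n'               >  cnv.txt
printf 'deletion\tchr10:1726501-1755000\t28500\n'    >> cnv.txt

printf 'Chr10\tNC_029525.1\tgene\t%s\t%s\t%s\t%s\n' \
  1721742 1725395 + LOC107318578 \
  1725935 1728167 + FAM219B \
  1728336 1731151 - MPI \
  1731194 1739576 + LOC107318570 \
  1739821 1743801 + ULK3 \
  1744568 1747749 - CPLX3 \
  1752411 1759515 - CSK \
  1763792 1766892 - LOC107318670 > genes.txt

awk -F'\t' '
  # First file: skip the header, then store each interval with a lower-cased chromosome name.
  NR == FNR {
    if (FNR == 1) next
    split($2, c, "[:-]")
    n++; chrom[n] = tolower(c[1]); lo[n] = c[2] + 0; hi[n] = c[3] + 0
    next
  }
  # Second file: print a gene row once if it overlaps any stored interval by at least one base.
  {
    for (i = 1; i <= n; i++)
      if (tolower($1) == chrom[i] && $4 <= hi[i] && $5 >= lo[i]) { print; break }
  }
' cnv.txt genes.txt > answer.txt

cat answer.txt
```

On these sample rows the sketch keeps the six genes from FAM219B through CSK, matching the result requested in the question, and leaves the gene rows in their original column order so no permutation step is needed.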

NOTE: @Alex posted a correct answer that has worked, but it sits in an odd location in this thread. I am going to toggle this answer as accepted, but the actual working answer is in that comment --> C: Extract gene lists using two files

@Alex: Rearrange as you wish.

ADD COMMENT • link updated 6.5 years ago by GenoMax 141k • written 6.5 years ago by Alex Reynolds 35k

0

Hi Alex, the second command produced an error like this:

awk -v OFS="\t" '{ print $1, $4, $5, $2, $3, $6, $7; }' second.txt | sort-bed - > second.bed

Non-numeric start coordinate. See line 1 in -. (remember that chromosome names should not contain spaces.)

ADD REPLY • link 6.5 years ago by bk11 ★ 2.4k

0

What does:

$ awk -v OFS="\t" '{ print $1, $4, $5, $2, $3, $6, $7; }' second.txt | head

look like?

ADD REPLY • link 6.5 years ago by Alex Reynolds 35k

0

Second bed looks like:

Chrom   End     Strand  Accession       Start   Gene_Name
Chr1    5379    -       NC_029516.1     967     RNF212B
Chr1    19961   -       NC_029516.1     5341    LRRC16B
Chr1    23408   -       NC_029516.1     20428   LOC107314597
Chr1    31174   +       NC_029516.1     23722   LOC107305688
Chr1    43887   -       NC_029516.1     34904   LOC107312337
Chr1    56456   -       NC_029516.1     49744   LOC107317046
Chr1    93908   +       NC_029516.1     57601   GOLGB1
Chr1    106680  +       NC_029516.1     94008   HCLS1
Chr1    125933  -       NC_029516.1     108350  LOC107313979

First bed looks like

LGE22C19W28_E50C23      1       99000
LGE64   355501  421500
chr1    547501  571500
chr1    8761501 8773500
chr1    18082501        18099000
chr1    26757001        26775000
chr1    29340001        29350500
chr1    44872501        44920500
chr1    51655501        51690000
chr1    59677501        59686500

When I ran the code:

bedops -e 1 second.bed first.bed | awk -v OFS="\t" '{ print $1, $4, $5, $2, $3, $6, $7; }' > answer.txt

answer.txt file has nothing.

ADD REPLY • link 6.5 years ago by bk11 ★ 2.4k

0

You have to get rid of that header. Please see my modified answer for stripping the header with tail -n+2.

ADD REPLY • link 6.5 years ago by Alex Reynolds 35k

0

And you're missing a column, somewhere, compared with what was in your original question. Which is the correct input? This one, or the input in your question?

ADD REPLY • link 6.5 years ago by Alex Reynolds 35k