How to merge chr+strand+coordinate+score from multiple files?
3
0
Entering edit mode
4.8 years ago
Rae • 0

I have many files (~50) with the same format of four columns: chr, strand, coordinate, score

Each file has about 1 Million lines. Now I want to merge these files by coordinates (strand specific) while preserve the score value of each sample. So the final file will be: chr, strand, coord, scoreTotal, score1, score2, ...., scoreN

For example:

File001:
chr strand  coord   score
chr1    +   100 1
chr1    -   200 2

File002:
chr1    +   100 3
chr1    +   101 4
chr1    -   200 5
chr1    -   201 6


Merged_file001_file001:

chr strand  coord   scoreSum    score1  score2
chr1    +   100 4   1   3
chr1    +   101 4   0   4
chr1    -   200 7   2   5
chr1    -   201 6   0   6


I'm thinking of using R package like dplyr to group by coordinates, but for so many files with so many lines, it may not be efficient. I'm also thinking of converting each file to BED format, and then use tool like multiIntersectBed to merge files, but it seems that it can not preserve the score column.

Is there any tool can do this merging quickly?

Thank you very much!

genome RNA-Seq • 2.1k views
0
Entering edit mode

You can use "bedtools groupby" command. Show some example files, will behelpfulr to get precise answer.

0
Entering edit mode

Thank you, but I checked the "bedtools groupby" command, it seems that it is not suitable for my case.

0
Entering edit mode

Hello,

are the coordinates in every file the same or do they differ?

fin swimmer

0
Entering edit mode

I have added example tables. Coordinates in every file are not exactly the same.

0
Entering edit mode

I'm thinking of using R package like dplyr to group by coordinates, but for so many files with so many lines, it may not be efficient.

50 million records isn't that big a data set. How many zeros are you expecting once you spread() you data into a matrix?

0
Entering edit mode

There will not be many zeroes. Because coordinates in different files are largely the same with some difference.

1
Entering edit mode
4.8 years ago

Hello,

my experiences with the pandas modul for python is limited. So I didn't know if it can handle your amount of data. But you should give it a try.

Run the following script like this:

$python join.py File* > result.csv  EDIT: Based on the suggestion by cpad0112 I changed the way how to specify the files. One can use the wildcard format and/or just list all files that should be used. I also adress the info by the OP that there are no headers in the file. fin swimmer ADD COMMENT 1 Entering edit mode Code works absolutely fine. Use of ls can be totally avoided as one can list files (with pattern..for eg files starting with File*) with glob (IMO). Some thing like files=glob.glob ("File*.). Rest of the code works as expected. User simply has to run the code: python join.py > result.csv ADD REPLY 0 Entering edit mode You always find something to improve my code ;) Thanks, I edited it. fin swimmer ADD REPLY 0 Entering edit mode I am learning from your code. I am not good at python @finswimmer ADD REPLY 0 Entering edit mode Thank you! I'm not familiar with python, I will have a try! ADD REPLY 0 Entering edit mode Hello Rae, than it's a good point to get familiar with python ;) The easiest way of installing pandas modul after you have install python is to use pip. In another comment you wrote that your files doesn't contain a header. If so you have to change my code to work. Change header=0 in header=None. fin swimmer ADD REPLY 0 Entering edit mode 4.8 years ago Make forward-stranded BED for each of N (~50) files: $ awk -vOFS="\t" '($2 == "+"){ print$1, $3-1,$3, ".", $4 }' file001.txt | sort-bed - > file001.for.bed ...  Make reverse-stranded BED for each of N files: $ awk -vOFS="\t" '($2 == "-"){ print$1, $3-1,$3, ".", $4 }' file001.txt | sort-bed - > file001.rev.bed ...  Map forward-stranded elements: $ bedmap --echo --sum --echo-map-score <(bedops -i file*.for.bed) <(bedops -u file*.for.bed) > answer.for.bed


Map reverse-stranded elements:

$bedmap --echo --sum --echo-map-score <(bedops -i file*.rev.bed) <(bedops -u file*.rev.bed) > answer.rev.bed  The result of --echo-map-score will be delimited by semi-colons. Add --multidelim "\t" to this option to replace semi-colons with tabs. Use awk to add strand back to answer.*.bed, if needed. ADD COMMENT 0 Entering edit mode Thank you! But I tested the above two example tables, but no result was returned. I put data in file01.txt and file02.txt, and then run your commands: To generate rev and forward strand for each file: for file in *.txt; do echo$file
awk -vOFS="\t" '($2 == "+"){ print$1, $3-1,$3, ".", $4 }' "$file" | sort-bed - > "$file".for.bed done for file in *.txt; do echo$file
awk -vOFS="\t" '($2 == "-"){ print$1, $3-1,$3, ".", $4 }' "$file" | sort-bed - > "$file".rev.bed done  Then run bedopt (here I also added the --exact option to see difference): bedmap --echo --sum --echo-map-score --multidelim "\t" <(bedops -i file0*.txt.for.bed) <(bedops -u file0*.txt.for.bed) > file.for.bed bedmap --echo --sum --echo-map-score --multidelim "\t" --exact <(bedops -i file0*.txt.rev.bed) <(bedops -u file0*.txt.rev.bed) > file.rev.bed  These two commands can return the union or intersect bed regions: bedops -u file0*.txt.for.bed bedops -i file0*.txt.for.bed  But the output file.for.bed and file.rev.bed are both empty. I don't know why? ADD REPLY 0 Entering edit mode Can you remove the column headers from your input files? I'd be curious to see if that fixes the issue for you. ADD REPLY 0 Entering edit mode There is no column header in my input files. ADD REPLY 0 Entering edit mode Sorry, I'm just looking at what you posted. Feel free to ignore. ADD REPLY 0 Entering edit mode 4.8 years ago with individual scores in a single column: $ cat test1.txt <(tail -n+2 test2.txt) | datamash  -sH -g 1,2,3  sum 4 collapse 4
GroupBy(chr)    GroupBy(strand) GroupBy(coord)  sum(score)  collapse(score)
chr1    +   100 4   1,3
chr1    +   101 4   4
chr1    -   200 7   2,5
chr1    -   201 6   6


input:

\$ cat test1.txt test2.txt
chr strand  coord   score
chr1    +   100 1
chr1    -   200 2
chr strand  coord   score
chr1    +   100 3
chr1    +   101 4
chr1    -   200 5
chr1    -   201 6