Question

Bash- Sort a file based on a list in another file

0

Entering edit mode

7.3 years ago

jblommaert92 ▴ 70

I know that similar questions have been asked about sorting a file by a specific column, but none of them seem to answer my question.

My input file shows coverage in 500bp windows along my genome, and it looks like

OHJ07_1_contig_10   0   500 130 500 500 1.0000000
OHJ07_1_contig_10   500 1000    180 500 500 1.0000000
OHJ07_1_contig_10   1000    1500    171 500 500 1.0000000
OHJ07_1_contig_10   1500    2000    79  380 500 0.7600000
OHJ07_1_contig_10   2000    2500    62  500 500 1.0000000
OHJ07_1_contig_10   2500    3000    96  500 500 1.0000000
OHJ07_1_contig_10   3000    3500    76  500 500 1.0000000
OHJ07_1_contig_10   3500    4000    87  500 500 1.0000000
OHJ07_1_contig_10   4000    4500    60  500 500 1.0000000
OHJ07_1_contig_10   4500    5000    64  500 500 1.0000000
OHJ07_1_contig_10   5000    5468    213 468 468 1.0000000
OHJ07_1_contig_100  0   500 459 500 500 1.0000000
OHJ07_1_contig_100  500 1000    156 500 500 1.0000000
OHJ07_1_contig_100  1000    1314    77  305 314 0.9713376
OHJ07_1_contig_1000 0   500 239 500 500 1.0000000
OHJ07_1_contig_1000 500 1000    226 500 500 1.0000000
OHJ07_1_contig_1000 1000    1500    238 500 500 1.0000000
OHJ07_1_contig_1000 1500    2000    263 500 500 1.0000000

I would like to sort it based on contig length. I have another file of ordered contig names in the first column. The other file has other information, like contig length in column 2 (this file was produced with samtools faidx), eg:

OHJ07_1_contig_25270    888266  96530655    60  61
OHJ07_1_contig_36751    583964  120924448   60  61
OHJ07_1_contig_44057    504884  134192571   60  61
OHJ07_1_contig_21721    415942  87354744    60  61
OHJ07_1_contig_46339    411691  143341916   60  61
OHJ07_1_contig_44022    330441  133783765   60  61

Since each contig is a different length, and has a different number of rows in the first column, what's the best way to go about this?

genome sequence • 5.3k views

ADD COMMENT • link 7.3 years ago by jblommaert92 ▴ 70

1

Entering edit mode

Which one is the column with contigs lengths?

ADD REPLY • link 7.3 years ago by iraun 6.2k

0

Entering edit mode

I see you suggest bash, but to me this is the type of question that asks for a custom (python) script since it's a bit more complicated than simple text manipulations.

ADD REPLY • link 7.3 years ago by WouterDeCoster 47k

score 1 · Answer 1 · 2017-01-12

--updated answer--

I got OP's original question.

The second file is a FASTA index file sorted by sequence length (2nd column). She wants to sort file 1 according to 2nd column in file 2. It could be easier to understand if she mentioned what the file 2 is.

It can be easily done by csvtk, a cross-platform, efficient, practical and pretty CSV/TSV toolkit in Go. csvtk provides executable binary files for Linux/OS X/Windows, just download, decompress and directly run.

Here's the idea:

Merging the FASTA index file to the first one using csvtk join (usage) (like shell join but does not need pre-sorting data)
Sorting according to the length column using shell sort or csvtk sort (usage)
Removing the new column using shell cut or csvtk cut (usage) (like shell cut but supports unselect)

For a example:

$ more data.tsv 
OHJ07_1_contig_10       1500    2000    79      380     500     0.7600000
OHJ07_1_contig_10       5000    5468    213     468     468     1.0000000
OHJ07_1_contig_1000     0       500     239     500     500     1.0000000
OHJ07_1_contig_1000     500     1000    226     500     500     1.0000000
OHJ07_1_contig_100      0       500     459     500     500     1.0000000
OHJ07_1_contig_100      500     1000    156     500     500     1.0000000

$ more seqs.fa.fai 
OHJ07_1_contig_1000 888266  96530655    60  61
OHJ07_1_contig_10   504884  134192571   60  61
OHJ07_1_contig_100  415942  87354744    60  61

Commands and result:

$ csvtk -H -t join -f "1;1" data.tsv seqs.fa.fai | sort -k8,8nr | csvtk -H -t cut -f 1-7
OHJ07_1_contig_1000     0       500     239     500     500     1.0000000
OHJ07_1_contig_1000     500     1000    226     500     500     1.0000000
OHJ07_1_contig_10       1500    2000    79      380     500     0.7600000
OHJ07_1_contig_10       5000    5468    213     468     468     1.0000000
OHJ07_1_contig_100      0       500     459     500     500     1.0000000
OHJ07_1_contig_100      500     1000    156     500     500     1.0000000

Long-option version:

$ csvtk --no-header-row --tabs join --fields "1;"  data.tsv seqs.fa.fai \
$   | sort -k8,8nr \
$   | csvtk --no-header-row --tabs cut --fields 1-7

score 1 · Answer 2 · 2017-01-12

1

Entering edit mode

7.3 years ago

Pierre Lindenbaum 161k

Use linux join:

in brief : sort both files on chromosome name, join on chromosome name, sort on chromosome length, remove the 'length' column.

join -t $'\t' -1 1 -2 1 <(cut -f1,2 fasta.fai | sort -t $'\t' -k1,1 ) <(sort -t $'\t' -k1,1 input.bed)  | sort -t $'\t'  -k2,2n -k3,3n | cut -f 1,3-

ADD COMMENT • link 7.3 years ago by Pierre Lindenbaum 161k

score 0 · Answer 3 · 2017-01-12

0

Entering edit mode

7.3 years ago

iraun 6.2k

Although I'm not quite sure that I've completely understood your question... Here is one way to solve your problem:

awk 'BEGIN {OFS="\t"} $3>max[$1] {max[$1]=$3} END {for (i in max){print i,max[i]}}' file1.txt | sort -nrk2,2 | while read c1 c2; do grep -w $c1 file1.txt;done

This command sorts your coverage file (file1.txt) attending to the contig length. As can be seen, I'm not using file2 for anything. The command calculates the length of the contigs attending to the third column of the last row of the coverage per each contain (which is the contig length). The contigs are sorted attending to their length and then using a while loop, the file1.txt is processed and printed out using the sorted list of contigs.

ADD COMMENT • link 7.3 years ago by iraun 6.2k

0

Entering edit mode

Why do I need the first step if the second step sorts file1 based on my list of contigs anyway?

ADD REPLY • link 7.3 years ago by jblommaert92 ▴ 70

0

Entering edit mode

If you run only the first part of the command:

awk 'BEGIN {OFS="\t"} $3>max[$1] {max[$1]=$3} END {for (i in max){print i,max[i]}}' file1.txt | sort -nrk2,2

you'll get the name of the contig and the length of the contig. But as I've understood, you want the complete file1 sorted by contig length. So you need to process again the file1. There are a lot of ways of doing this. I've shown you only one approach. I'd suggest you to run the both commands separately, and check the output to better understand how it works.

ADD REPLY • link 7.3 years ago by iraun 6.2k

0

Entering edit mode

Sorry, I forgot to mention that file2 has contig length in column 2, is there any reason why can't I use this? Is it because there is only one row per contig in file2 and file1 has multiple rows per contig?

ADD REPLY • link 7.3 years ago by jblommaert92 ▴ 70

1

Entering edit mode

I have not used file2 since I didn't know which column was length column :). In that case, you can skip the first part of my command. You can just do:

while read c1; do grep -w $c1 file1.txt;done<file2.txt

ADD REPLY • link 7.3 years ago by iraun 6.2k

0

Entering edit mode

the 2nd column in FASTA index file is sequence length.

ADD REPLY • link 7.3 years ago by shenwei356 8.4k

0

Entering edit mode

Sorry for the delayed response, I tried this method and it works well, thanks! (Biostars told me I was posting too much and I couldn't reply anymore)

ADD REPLY • link 7.3 years ago by jblommaert92 ▴ 70

0

Entering edit mode

Glad to help :). It takes a little bit of time and learning to get used to Biostars. However there are nice post regarding the correct use of the forum, how to ask, how to reply/comment and so on...So I'd recommend you to read them if you are planing to use the forum in the future ;)

ADD REPLY • link 7.3 years ago by iraun 6.2k

0

Entering edit mode

see my updated answer

ADD REPLY • link 7.3 years ago by shenwei356 8.4k

0

Entering edit mode

Yeah, but since I have contig name and length already in a file, I'm still not sure why the first processing of file1 is necessary?