Filter files based on number in the filenames
13 months ago
vanessagpds ▴ 10

Hi everyone,

I am new to bioinformatics and I have the following question to resolve:

In one of the projects, the genome assembler stored each contig in an individual FASTA file. The files follow the name pattern “Contig_[number]_cov_[number].fasta”.

The “cov” information refers to the coverage obtained for that contig. How would you go about summarizing this data? How would you obtain only the contigs with coverage greater than 500 and store them in a multi-FASTA file?

regex bash • 298 views

Are these contigs from SPAdes output? If so, be aware that coverage in that case refers to k-mer coverage, not sequencing depth. This may also apply to other assemblers that I'm not aware of.

13 months ago
Fatima ▴ 960

This might give you what you're looking for, but I'm not sure!

ls *.fasta | sed 's/.*_//' | sed 's/\.fasta$//' | sort -u | awk '($1>500)' | while read -r line ; do cat *_"${line}".fasta >> cov_above_500.fasta ; done


or if you want to be more specific

ls Contig_*_cov_*.fasta | sed 's/.*_//' | sed 's/\.fasta$//' | sort -u | awk '($1>500)' | while read -r line ; do cat Contig_*_cov_"${line}".fasta >> cov_above_500.fasta ; done


So, basically, the first sed removes everything (.*) up to and including the last underscore. There is an underscore before the coverage value and none after it, so you can use it as a marker. The second sed then deletes what follows the number, which is ".fasta". After that, awk keeps only the values above 500, and the loop goes over them and concatenates the corresponding files. When mapping the numbers back to the file names, make sure to place *_ before the number and .fasta after it (as in the commands); otherwise you would also pick up 6005 and 5600 when searching for 600.
If two contigs have the same coverage, say 600, the glob matches both files, so both are selected. For that to work cleanly, the list of values has to be deduplicated (sort -u) before the loop; otherwise 600 appears twice in the list and both files get concatenated twice.
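
The same idea can also be written as a loop over the files themselves, which avoids mapping coverage values back to file names. This is only a rough sketch, not a tested pipeline: the function name filter_contigs_by_cov and its three arguments (directory, threshold, output file) are made up for illustration, and it assumes the coverage is the last field of the name, directly before ".fasta". The comparison is handed to awk so that decimal coverages such as 512.7 also work.

```shell
#!/usr/bin/env bash
# Sketch: concatenate every Contig_<n>_cov_<coverage>.fasta in a directory
# whose coverage exceeds a threshold. The function name and its arguments
# are illustrative assumptions, not part of the original answer.
filter_contigs_by_cov() {
    local dir="$1" threshold="$2" out="$3"
    local f cov
    : > "$out"                      # start from an empty output file
    for f in "$dir"/Contig_*_cov_*.fasta; do
        [ -e "$f" ] || continue     # the glob matched nothing
        cov=${f##*_cov_}            # drop everything up to and including "_cov_"
        cov=${cov%.fasta}           # drop the extension, leaving the coverage
        # numeric comparison in awk, so 512.7 > 500 is handled correctly
        if awk -v c="$cov" -v t="$threshold" 'BEGIN { exit !(c + 0 > t + 0) }'; then
            cat "$f" >> "$out"
        fi
    done
}
```

It could be called as filter_contigs_by_cov . 500 cov_above_500.fasta. Because each file is visited exactly once, two contigs with the same coverage are each written once, and the output file is emptied at the start so rerunning does not append duplicates.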