Filter files based on number in the filenames
3.5 years ago
vanessagpds ▴ 10

Hi everyone,

I am new to bioinformatics and I have the following question to resolve:

In one of the projects, the genome assembler stored each contig in an individual FASTA file. The file names follow the pattern “Contig_[number]_cov_[number].fasta”.

The “cov” information refers to the coverage obtained for that contig. How would you go about summarizing this data? How would you go about obtaining only the contigs with coverage greater than 500 and storing them in a single multi-FASTA file?

regex bash • 698 views

Are these contigs from SPAdes output? If so, be aware that coverage in that case refers to k-mer coverage, not 'sequencing depth'. This may also apply in other cases that I'm not aware of.

3.5 years ago
Fatima ▴ 1000

This might give you what you're looking for, but I'm not sure!

ls *.fasta | sed 's/.*_//' | sed 's/\.fasta$//' | sort -u | awk '$1 > 500' | while read -r line ; do cat *_"${line}".fasta >> cov_above_500.fasta ; done

Or, if you want to be more specific:

ls Contig_*_cov_*.fasta | sed 's/.*_//' | sed 's/\.fasta$//' | sort -u | awk '$1 > 500' | while read -r line ; do cat Contig_*_cov_"${line}".fasta >> cov_above_500.fasta ; done

So, basically, the first sed removes everything (.*) up to and including the last underscore. Because there is an underscore before the coverage value but not after it, you can use it as a marker. The second sed deletes what follows the number, which is ".fasta". awk then keeps only the values above 500, and the loop goes over them and concatenates the corresponding files. When mapping the numbers back to the file names, make sure to place *_ before the number and .fasta after it (as in the commands); otherwise you would also pick up 6005 and 5600 when searching for 600.
If two contigs both have a coverage of 600, both are selected, which is fine I guess.
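
If mapping coverage values back to file names feels fragile, another option is to loop over the files directly and cut the coverage out of each name with shell parameter expansion. The following is only a sketch: it assumes the "Contig_[number]_cov_[number].fasta" pattern from the question, and the function names filter_by_cov and summarize_cov are made up for illustration. The second function is one possible way to get a quick per-contig summary.

```shell
#!/usr/bin/env bash
# Sketch only: assumes file names like Contig_<number>_cov_<number>.fasta.
# Both function names are hypothetical helpers, not part of any tool.

# filter_by_cov MIN OUT [DIR]
# Concatenate every contig whose coverage is greater than MIN into OUT.
filter_by_cov() {
    local min="$1" out="$2" dir="${3:-.}"
    local f cov
    : > "$out"                        # start from an empty output file
    for f in "$dir"/Contig_*_cov_*.fasta; do
        [ -e "$f" ] || continue       # the glob matched nothing
        cov=${f##*_cov_}              # drop everything up to and including "_cov_"
        cov=${cov%.fasta}             # drop the extension, leaving the coverage
        # awk does the numeric comparison, so decimal coverages work too
        if awk -v c="$cov" -v m="$min" 'BEGIN { exit !(c + 0 > m + 0) }'; then
            cat "$f" >> "$out"
        fi
    done
}

# summarize_cov [DIR]
# Print "file<TAB>coverage", one line per contig, highest coverage first.
summarize_cov() {
    local dir="${1:-.}"
    local f cov
    for f in "$dir"/Contig_*_cov_*.fasta; do
        [ -e "$f" ] || continue
        cov=${f##*_cov_}
        printf '%s\t%s\n' "${f##*/}" "${cov%.fasta}"
    done | sort -t "$(printf '\t')" -k2,2nr
}
```

After sourcing or pasting the functions, something like `filter_by_cov 500 cov_above_500.fasta` followed by `summarize_cov | head` would write the filtered file and show the highest-coverage contigs. Each file is read exactly once, so contigs that share a coverage value are not written twice.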
