Question

error in extracting one chromosome from multiple fasta files to a single fasta file

0

Entering edit mode

2.9 years ago

evafinegan • 0

Hello All,

I have multiple fasta files. I want to make output files for each chromosome from all the fasta files. For example, output file all_ch1.fasta will have ch1 sequences from all the fasta files and so on. I tried:

samtools faidx *fasta.gz ch1 > all_ch1.fasta

But I am getting this error:

[W::fai_get_val] Reference sample2.fasta.gz not found in FASTA file, returning empty sequence
[faidx] Failed to fetch sequence in sample2.fasta.gz

I checked sample2.fasta.gz file but it is not empty. Thank you for any help!

fasta • 2.7k views

ADD COMMENT • link updated 2.9 years ago by cpad0112 21k • written 2.9 years ago by evafinegan • 0

0

Entering edit mode

if this format is correct, ch1 is sample1_rhg1.0ch1 in sample1.fasta.gz and ch1 is sample2_rhg1.0ch1 in sample2.fasta.gz, try:

with seqkit (dry-run)

$ seqkit seq *.gz | seqkit split -di --id-regexp ".*0(.*)"

New files would be in stdin.split directory. File names would be stdin.id_ch1.fasta, stdin.id_ch2.fasta for each ch and each fasta sequence name will be exactly as it is in each ch fasta. For eg. >sample1_rhg1.0ch1 and >sample2_rhg1.0ch1 for ch1. Download seqkit from https://bioinf.shenwei.me/seqkit/download/. Remove d from -di once you are okay with dry-run output.

without seqkit (assuming that sequences are flattened and all files have equal number ch entities)

$ zgrep "^>" sample1.fasta.gz | awk -F "\.[0-9]+" '{print $2}' | sort -Vu | while read line; do zgrep -A 1 -h $line *.gz > $line.fasta ; done

Files would be named ch1.fasta, ch2.fasta etc.

ADD REPLY • link 2.9 years ago by cpad0112 21k

score 0 · Answer 1 · 2021-05-12

0

Entering edit mode

2.9 years ago

Pierre Lindenbaum 161k

your running a command line like

samtools faidx *fasta.gz ch1 > all_ch1.fasta

so if you have multiple fasta.gz file in the directory (like sample1.fasta.gz sample2.fasta.gz sample3.fasta.gz ), the command above exapnds to

samtools faidx sample1.fasta.gz sample2.fasta.gz sample3.fasta.gz  ch1 > all_ch1.fasta

and so samtools search for sample2.fasta.gz in sample1.fasta.gz

so , may be , you want a loop like

for F in  *fasta.gz ; do samtools faidx $F ch1 ; done > all_ch1.fasta

ADD COMMENT • link 2.9 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thank you, Pierre! I tried and it gives an error:

[W::fai_get_val] Reference ch1 not found in FASTA file, returning empty sequence
[faidx] Failed to fetch sequence in ch1
[W::fai_get_val] Reference ch1 not found in FASTA file, returning empty sequence
[faidx] Failed to fetch sequence in ch1
[W::fai_get_val] Reference ch1 not found in FASTA file, returning empty sequence
[faidx] Failed to fetch sequence in ch1

The chromosome names in each of the fasta files are like: sample1_rhg1.0ch1 sample2_rhg1.0ch1 sample3_rhg1.0ch1

How can I mention that in the loop so that it recognizes ch1 in each of the fasta file? Thank you!

ADD REPLY • link 2.9 years ago by evafinegan • 0

0

Entering edit mode

there is no ch1 in your fasta file.

check this with

grep  -F -w ch1 *.fai

just look at the fai files to find the correct name.

ADD REPLY • link 2.9 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Yes, I checked that the

ch1 is sample1_rhg1.0ch1 in sample1.fasta.gz and
ch1 is sample2_rhg1.0ch1 in sample2.fasta.gz

How can I specify to read ch1 with different prefixes for each of the sample fasta file using the loop. Thank you!

ADD REPLY • link 2.9 years ago by evafinegan • 0

0

Entering edit mode

for S in 1 2 ; do samtools faidx "sample${S}.fasta.gz" "sample${S}_rhg1.0ch1" ; done > all_ch1.fasta

ADD REPLY • link 2.9 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thank you! All the fasta files do not have the same prefix as "sample". So, based on your suggestion, I tried

for S in *fasta.gz ; do samtools faidx "${S}" "${S}ch1" ; done > all_ch1.fasta

But I got this error that it looks for sample1_rhg1.0.fasta.gzch01 instead of sample1_rhg1.0ch01

ADD REPLY • link 2.9 years ago by evafinegan • 0

1

Entering edit mode

why would it look for sample1_rhg1.0.fasta.gzch01 when your file name is sample1.fasta.gz (looking at "${S}ch1" part of the code) ? From your code, it should be looking for sample1.fasta.gzch1, not sample1_rhg1.0.fasta.gzch01. Try following code: (from Pierre and mine)

$ for S in 1 2 3; do echo "sample${S}.fasta.gz" "sample${S}_rhg1.0ch1" ; done 

sample1.fasta.gz sample1_rhg1.0ch1
sample2.fasta.gz sample2_rhg1.0ch1
sample3.fasta.gz sample3_rhg1.0ch1

$ for S in *fasta.gz ; do  echo $S ${S/.fasta.gz/}_rhg1.0ch1 ; done  

sample1.fasta.gz sample1_rhg1.0ch1
sample2.fasta.gz sample2_rhg1.0ch1
sample3.fasta.gz sample3_rhg1.0ch1

ADD REPLY • link 2.9 years ago by cpad0112 21k