Question: Faster way to get sequences from large fasta files?
0
gravatar for emilywynn6
29 days ago by
emilywynn60
emilywynn60 wrote:

Hello,

I am working with Illumina paired end reads. I am trying to use Blast to find read pairs in which one pair aligns to the nucleus and the other pair aligns to the mitochondria. Any reference genome I could align the reads to would not have the gene transfers I'm looking for, so the unaligned fastq files are my best bet.

Basically what I've done so far is turn the fastq files into fasta files and blasted them to the mitochondrial genome, then I've taken the read names from the blast output and used sed to change the ".1's" to ".2's" so that I have the name of that read's pair.

So I have a text file with 100000 read names, and I need to get the sequences of these reads. Originally I used grep to find the names and print the next line (grep -f -A1) however, because my files are so large, this took too much RAM and I had to cancel the task.

I found that using: while read line; do grep -wF -A1 "$line" query.file; done < subject.file gives me the output I need and doesn't crash our server, but it is also EXTREMELY slow.

Any advice on how to more efficiently perform this task? Even using grep -m 1 will take over a day to finish each sample.

paired-ends grep fasta • 144 views
ADD COMMENTlink modified 29 days ago by mike-zx110 • written 29 days ago by emilywynn60
1

Wouldn't STAR-Fusion or Tophat-Fusion give you the insight you need?

With these tools, you'll detect chimeric reads and search for those with one read in the nucleus and the other in mt.

ADD REPLYlink written 29 days ago by michael.ante2.7k

I think you can use samtools faidx to index the FASTA, that'll make the access faster, although I'm not sure if any tools exist that can use the faidx to speed up their operation as well as use identifiers to get the sequence.

You could try samtools faidx file.fasta; samtools faidx file.fasta <identifier>

ADD REPLYlink modified 29 days ago • written 29 days ago by RamRS19k
3
gravatar for Pierre Lindenbaum
29 days ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum114k wrote:

linearize your fasa file, sort and use joiin.

Not tested but it should be something like

join  -t $'\t' -1 1 -2 1 \
   <(cat query.file | paste - - | cut -c 2- | sort -t $'\t' -k1,1) \
   <(sort subject.file) |\
tr "\t" "\n"
ADD COMMENTlink modified 29 days ago • written 29 days ago by Pierre Lindenbaum114k

I've tried this out on a small test file, and I think this strategy will work and be much faster. Thank you!!!

ADD REPLYlink written 29 days ago by emilywynn60

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.

Upvote|Bookmark|Accept

ADD REPLYlink written 28 days ago by Pierre Lindenbaum114k
1
gravatar for mike-zx
29 days ago by
mike-zx110
Mexico/Service of Alimentary Quality SENASICA
mike-zx110 wrote:

if you only need all the sequences without headers you could:

cat fasta | paste - - | cut -f2 > output

if you need each sequence in a separate file with the fasta header you could:

for id in $(cat file_with_headers_list)
do
  filename=$(cat test | grep "$id" | cut -d\> -f2)
  cat fasta | paste - - | grep $id | tr '\t' '\n' > ${filename}_output
done

These should be faster I think, don't know if I understood what you wanted correctly though so I apologize in advance if I misunderstood.

ADD COMMENTlink modified 29 days ago • written 29 days ago by mike-zx110

obviously to make the "file_with_headers_list" you would:

cat fasta | grep \> > file_with_headers_list
ADD REPLYlink written 29 days ago by mike-zx110
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 760 users visited in the last hour