Question: Faster way to get sequences from large fasta files?
0
gravatar for emilywynn6
5 months ago by
emilywynn610
emilywynn610 wrote:

Hello,

I am working with Illumina paired end reads. I am trying to use Blast to find read pairs in which one pair aligns to the nucleus and the other pair aligns to the mitochondria. Any reference genome I could align the reads to would not have the gene transfers I'm looking for, so the unaligned fastq files are my best bet.

Basically what I've done so far is turn the fastq files into fasta files and blasted them to the mitochondrial genome, then I've taken the read names from the blast output and used sed to change the ".1's" to ".2's" so that I have the name of that read's pair.

So I have a text file with 100000 read names, and I need to get the sequences of these reads. Originally I used grep to find the names and print the next line (grep -f -A1) however, because my files are so large, this took too much RAM and I had to cancel the task.

I found that using: while read line; do grep -wF -A1 "$line" query.file; done < subject.file gives me the output I need and doesn't crash our server, but it is also EXTREMELY slow.

Any advice on how to more efficiently perform this task? Even using grep -m 1 will take over a day to finish each sample.

paired-ends grep fasta • 235 views
ADD COMMENTlink modified 5 months ago by mike-zx130 • written 5 months ago by emilywynn610
1

Wouldn't STAR-Fusion or Tophat-Fusion give you the insight you need?

With these tools, you'll detect chimeric reads and search for those with one read in the nucleus and the other in mt.

ADD REPLYlink written 5 months ago by michael.ante3.2k

I think you can use samtools faidx to index the FASTA, that'll make the access faster, although I'm not sure if any tools exist that can use the faidx to speed up their operation as well as use identifiers to get the sequence.

You could try samtools faidx file.fasta; samtools faidx file.fasta <identifier>

ADD REPLYlink modified 5 months ago • written 5 months ago by RamRS20k
3
gravatar for Pierre Lindenbaum
5 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum118k wrote:

linearize your fasa file, sort and use joiin.

Not tested but it should be something like

join  -t $'\t' -1 1 -2 1 \
   <(cat query.file | paste - - | cut -c 2- | sort -t $'\t' -k1,1) \
   <(sort subject.file) |\
tr "\t" "\n"
ADD COMMENTlink modified 5 months ago • written 5 months ago by Pierre Lindenbaum118k

I've tried this out on a small test file, and I think this strategy will work and be much faster. Thank you!!!

ADD REPLYlink written 5 months ago by emilywynn610

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.

Upvote|Bookmark|Accept

ADD REPLYlink written 5 months ago by Pierre Lindenbaum118k
1
gravatar for mike-zx
5 months ago by
mike-zx130
Mexico/Service of Alimentary Quality SENASICA
mike-zx130 wrote:

if you only need all the sequences without headers you could:

cat fasta | paste - - | cut -f2 > output

if you need each sequence in a separate file with the fasta header you could:

for id in $(cat file_with_headers_list)
do
  filename=$(cat test | grep "$id" | cut -d\> -f2)
  cat fasta | paste - - | grep $id | tr '\t' '\n' > ${filename}_output
done

These should be faster I think, don't know if I understood what you wanted correctly though so I apologize in advance if I misunderstood.

ADD COMMENTlink modified 5 months ago • written 5 months ago by mike-zx130

obviously to make the "file_with_headers_list" you would:

cat fasta | grep \> > file_with_headers_list
ADD REPLYlink written 5 months ago by mike-zx130
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1738 users visited in the last hour