I have the GRCh38 microRNA sequences for human. I want to convert the Us in the sequences to Ts so that I can match them against my own FASTQ files.
The first few lines of the miRNA FASTA look like this:
>hsa-let-7a-3p MIMAT0004481 Homo sapiens let-7a-3p
CUAUACAAUCUACUGUCUUUC
>hsa-let-7a-2-3p MIMAT0010195 Homo sapiens let-7a-2-3p
CUGUACAGCCUCCUAGCUUUCC
>hsa-let-7b-5p MIMAT0000063 Homo sapiens let-7b-5p
UGAGGUAGUAGGUUGUGUGGUU
>hsa-let-7b-3p MIMAT0004482 Homo sapiens let-7b-3p
The first few lines of the FASTQ file I want to align look like this:
@SRR8248790.1401 HWI-D00306:1090:HKVGMBCX2:1:1101:6697:2269/1
CGCGACCTAGATCGGAAGAGCACACGTCT
+
DDDDDIIIIIIHIIIIIIIIIIIIIIIII
@SRR8248790.1402 HWI-D00306:1090:HKVGMBCX2:1:1101:6630:2272/1
CTCGCTGCGATCTATTGAAAGTCAGCCCTCGACACAAGGGTTTGAAGATCGGAAGAGCACACGTCTGAACTCCAGT
+
DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIGHGHHIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SRR8248790.1403 HWI-D00306:1090:HKVGMBCX2:1:1101:6516:2280/1
TGAGGTAGTAGGTTGTGTGGTTTAGATCGGAAGAGCACACGTCT
As you can see, the header line of each FASTA sequence is different. I want to convert each sequence in the FASTA file so that the header is left untouched and only the sequence lines are converted.
I have tried both awk and sed commands for this conversion without much success.
The sed script I used is:
sed '/^[^>]/s/u/t/g' Homo_sapiens.GRCh38.miRNA.fasta >newfile.fasta
The awk script to do the same is:
awk '/^[^>]/{ gsub(/u/,"t") }1' Homo_sapiens.GRCh38.miRNA.fasta > newfile.fasta
Any help will be useful.
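At first glance, both your awk and sed commands are close; once corrected, they should work.
The /^[^>]/ address in sed is valid as written, and a pattern this simple does not need the --extended-regexp flag. In awk, gsub already operates on $0 when no target is given, so adding $0 explicitly is harmless but not required.
The real problem is that awk and sed are both case-sensitive: u is not U. Your sequences use uppercase U, so the substitution has to replace U with T rather than u with t.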
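For what it's worth, here is a minimal sketch of the case fix applied to both of your commands. It assumes the sequence lines only ever use uppercase U. The file names sample_mirna.fasta, sample_sed.fasta and sample_awk.fasta are made up for illustration; the two records are copied from your FASTA excerpt.

```shell
# Build a tiny test file in the same layout as the miRNA FASTA
# (sample_mirna.fasta is a hypothetical name used only for this sketch).
printf '%s\n' \
  '>hsa-let-7a-3p MIMAT0004481 Homo sapiens let-7a-3p' \
  'CUAUACAAUCUACUGUCUUUC' \
  '>hsa-let-7b-5p MIMAT0000063 Homo sapiens let-7b-5p' \
  'UGAGGUAGUAGGUUGUGUGGUU' > sample_mirna.fasta

# sed: on lines that do not start with ">", replace every uppercase U with T.
sed '/^[^>]/s/U/T/g' sample_mirna.fasta > sample_sed.fasta

# awk: same idea; gsub works on the whole line ($0) by default,
# and the trailing 1 prints every line, headers included.
awk '/^[^>]/{ gsub(/U/,"T") }1' sample_mirna.fasta > sample_awk.fasta

# Show the converted records; headers should be unchanged.
cat sample_sed.fasta
```

If this behaves as expected on the sample, the same two commands should work on Homo_sapiens.GRCh38.miRNA.fasta by swapping in the real file name. The converted hsa-let-7b-5p sequence should then match the start of your third FASTQ read.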