Hunting invisible characters?
22 days ago
geneticatt ▴ 120

Hi all,

I have a set of adapters that a collaborator gave me in a regular text file (i5R.txt). I moved the file onto my institution's Linux HPC and tried to use it to pull sequences from a fastq with grep -f, like so:

grep -f i5R.txt myseqs.fastq


This returned nothing, which was surprising: I know the adapters are there because I can match them in vim. Suspecting some pesky invisible characters, I retyped the sequences in vim into a new text file called i5R.seqs. This fixed the pattern matching issue with grep.

Here is the diff of the two files, to show that they appear identical.

[geneticatt]$ diff i5R.txt i5R.seqs
1,8c1,8
< CCTGATAC
< TTAAGTTG
< CGGACAGT
< GCACTACA
< TGGTGCCT
< TCCACGGC
< ATGTCGTG
< CCACGACA
---
> CCTGATAC
> TTAAGTTG
> CGGACAGT
> CGACTACA
> TGGTGCCT
> TCCACGGC
> ATGTCGTG
> CCACGACA


What type of character could be the culprit? I searched for \r because I've had problems with that one before, but found nothing, so I assume this is a different invisible character. How does one go about hunting down and removing the invisible characters that plague a workflow? Further, what preventative measures can I take to make sure I don't get hung up on something like this again?


You could have looked at the file with cat -vet, which shows every character in the file, printable or not.
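As a rough illustration of what that looks like, here is a small sketch. The one-line input is made up for the demo and assumes the file has Windows-style CRLF line endings, which may or may not be what is in i5R.txt.

```shell
# Hypothetical demo input: one adapter followed by a CRLF line ending.
# -v prints control characters in caret notation, -e marks each line end
# with $, and -t prints tabs as ^I.
shown=$(printf 'CCTGATAC\r\n' | cat -vet)

# The carriage return shows up as ^M just before the end-of-line marker.
printf '%s\n' "$shown"   # prints CCTGATAC^M$
```

On a clean Unix file, each line would end with a bare $ and nothing before it.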


Another way to see hidden characters is to pipe the file through an octal dump: cat infile | od -c. This prints every byte, with hidden characters, newlines, etc. shown as escapes.
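A minimal sketch of the same idea, again with a made-up one-line input that assumes CRLF line endings rather than the real adapter file:

```shell
# Hypothetical demo input: one adapter followed by a CRLF line ending.
# od -c prints each byte as a character, with control characters written
# as backslash escapes such as \r and \n.
dump=$(printf 'CCTGATAC\r\n' | od -c)

# Look for \r (carriage return) in the dump; bytes with no printable form
# would appear as three-digit octal numbers instead.
printf '%s\n' "$dump"
```

Unlike cat -vet, this works byte by byte, so it can be easier to read when the stray bytes are not ordinary control characters.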

22 days ago
Mensur Dlakic ★ 11k

You may want to read this. I think you may be able to fix your adapter file by typing:

dos2unix i5R.txt


If an error pops up saying that the command doesn't exist, this should work:

sed -i 's/\r//' i5R.txt
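On the preventative side, one option is to check a pattern file for unexpected bytes before handing it to grep -f. The sketch below is only an illustration: has_hidden_chars is a hypothetical helper name, the demo files are throwaway, and it assumes the pattern file should contain nothing but printable ASCII (so tabs and any non-ASCII text would also be flagged).

```shell
# Hypothetical helper (not a standard command): succeeds if the file contains
# any byte outside printable ASCII (space through tilde), such as \r or a tab.
# LC_ALL=C makes grep compare raw bytes instead of locale-aware characters.
has_hidden_chars() {
    LC_ALL=C grep -q '[^ -~]' "$1"
}

# Throwaway demo files: one with CRLF line endings, one with plain newlines.
dirty=$(mktemp)
clean=$(mktemp)
printf 'CCTGATAC\r\nTTAAGTTG\r\n' > "$dirty"
printf 'CCTGATAC\nTTAAGTTG\n' > "$clean"

for f in "$dirty" "$clean"; do
    if has_hidden_chars "$f"; then
        echo "$f: hidden characters found, inspect with cat -vet first"
    else
        echo "$f: looks clean"
    fi
done
```

Running a check like this on any file received from a collaborator, before it goes into a pipeline, would catch this class of problem early.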


Thank you, using dos2unix worked perfectly!