How can I prevent the grep output from stripping the accented characters?
grep itself does not strip accented characters; it outputs matching lines exactly as they are in the input file. It is your terminal (terminal emulator) that does not interpret bytes encoded as ISO-8859-1 as characters it should display with accents.
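To see that grep passes the bytes through unchanged, one can inspect its output with od. This is only a minimal sketch: the file name sample.txt and the word in it are hypothetical, not taken from the question, and LC_ALL=C is used so that the result does not depend on which locales are installed.

```shell
# Hypothetical sample: "hipótese" in ISO-8859-1, where ó is the single byte 0xF3 (octal 363).
printf 'hip\363tese\n' > sample.txt

# LC_ALL=C plus -a makes grep treat the input as plain bytes; od dumps the output in hex.
LC_ALL=C grep -a 'ese$' sample.txt | od -An -tx1
```

The hex dump still contains the byte f3, so the accented character survives grep; it is only the display step that goes wrong.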
Your terminal most likely expects UTF-8. The rest of this answer assumes the terminal does expect UTF-8 and the locale is something.UTF-8 (e.g. pt_PT.UTF-8). This is the default on many modern Unix-like systems, certainly in Linux.
Possible solutions:
You may be able to configure your terminal emulator to use ISO-8859-1, run the command, and then reconfigure it back to UTF-8 (e.g. in konsole select from the menu: View, Set Encoding, and so on). I wouldn't call this the right way though.
Alternatively, convert the output of grep to UTF-8 on the fly:
LC_ALL=pt_PT.ISO-8859-1 grep -a ese\$ wordsList | iconv -f ISO-8859-1 -t UTF-8
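As a self-contained illustration of the on-the-fly conversion, the following sketch builds a tiny ISO-8859-1 file and pipes the output of grep through iconv. The file name sample-latin1.txt and its contents are hypothetical; LC_ALL=C is used here instead of pt_PT.ISO-8859-1 only so that the sketch does not depend on that locale being installed.

```shell
# Hypothetical three-word list in ISO-8859-1 (octal 363 = ó, octal 355 = í).
printf 'hip\363tese\ns\355ntese\nteste\n' > sample-latin1.txt

# grep selects lines as raw bytes; iconv then re-encodes them for a UTF-8 terminal.
LC_ALL=C grep -a 'ese$' sample-latin1.txt | iconv -f ISO-8859-1 -t UTF-8
```

On a UTF-8 terminal this should display the two words ending in "ese" with their accents intact.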
If you plan to work with the file a lot, convert the content to UTF-8*:
<wordsList iconv -f ISO-8859-1 -t UTF-8 >wordsList-utf8
Then work with the new file without tricks, e.g.:
grep ese\$ wordsList-utf8
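One way to sanity-check a converted file is to ask iconv to read it as UTF-8, since iconv normally exits with a non-zero status when it meets an invalid byte sequence. A minimal sketch, assuming GNU iconv and using hypothetical files check-latin1.txt and check-utf8.txt in place of wordsList and wordsList-utf8:

```shell
# Hypothetical ISO-8859-1 input (octal 363 = ó).
printf 'hip\363tese\n' > check-latin1.txt

# Convert it the same way as the real file.
<check-latin1.txt iconv -f ISO-8859-1 -t UTF-8 >check-utf8.txt

# The converted file should decode cleanly as UTF-8; the original should not.
if iconv -f UTF-8 -t UTF-8 check-utf8.txt >/dev/null 2>&1; then
    echo "converted: valid UTF-8"
fi
if ! iconv -f UTF-8 -t UTF-8 check-latin1.txt >/dev/null 2>&1; then
    echo "original: not valid UTF-8"
fi
```

This check only confirms that the result is well-formed UTF-8; it cannot tell whether ISO-8859-1 was the right source encoding to begin with.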
Now you can even grep for accented characters in a straightforward way, e.g.:
grep ó wordsList-utf8
In general Unicode equivalence may be a problem; but here, since the file is a conversion from ISO-8859-1, I expect consistency: every ó should be U+00F3 (0xC3B3 in UTF-8, the above grep will find it), not U+006F followed by U+0301 (0x6FCC81 in UTF-8, the above grep would not find it); similarly for other accented characters.
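The difference between the two forms can be checked at the byte level. The sketch below writes both encodings of ó into a hypothetical file forms.txt and counts the lines that match the precomposed form; LC_ALL=C and -F are used so that grep compares raw bytes and the result does not depend on the locale.

```shell
# Line 1: precomposed ó, U+00F3, bytes C3 B3 (octal 303 263).
# Line 2: decomposed ó, U+006F U+0301, bytes 6F CC 81 (o, then octal 314 201).
printf '\303\263\no\314\201\n' > forms.txt

# Build the precomposed pattern from bytes so the script itself stays ASCII.
pattern=$(printf '\303\263')

# Count the lines that contain the precomposed form.
LC_ALL=C grep -a -c -F "$pattern" forms.txt
```

The count is 1: only the first line matches, even though both lines look identical on screen.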
* I notice you used grep -a, as if you needed grep to treat binary files like text. If your wordsList is truly non-text, converting the whole of it to UTF-8 may fail or give you mangled non-text parts. Since you did not link to a single specific file, I cannot investigate further without guessing. I guess you meant the file linked under "just the file", i.e. the file one can extract from wordsList.zip. With this particular file I do not need -a for grep, provided I tell grep to use the right encoding (this is what LC_ALL=pt_PT.ISO-8859-1 does).
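Setting LC_ALL to pt_PT.ISO-8859-1 only has the intended effect if that locale is actually installed, and the exact spelling of locale names varies between systems. A quick, hedged way to check what is available:

```shell
# List installed locales and keep those for pt_PT; print a note if there are none.
locale -a | grep -i '^pt_PT' || echo "no pt_PT locale installed"
```

If no ISO-8859-1 variant shows up in the list, the locale has to be generated or installed first; how to do that depends on the distribution.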