Channel: User Kamil Maciorowski - Super User

Answer by Kamil Maciorowski for Grep search for text in an ISO-8859-1 encoded file


How can I prevent the grep output from stripping the accented characters?

grep itself does not strip accented characters; it outputs matching lines exactly as they are in the input file. It's your terminal (terminal emulator) that does not interpret bytes encoded as ISO-8859-1 as accented characters it should display.

Your terminal most likely expects UTF-8. The rest of this answer assumes the terminal does expect UTF-8 and the locale is something.UTF-8 (e.g. pt_PT.UTF-8). This is the default on many modern Unix-like systems, Linux in particular.
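
To see that the bytes survive grep untouched, here is a minimal sketch. The sample word and the temporary file are my own illustration, not the file from the question. I use LC_ALL=C only so that the sketch does not depend on which locales are installed, and I assume an iconv that validates its input (the implementations I know do).

```shell
# Hypothetical sample: "hipótese" with ó stored as the single
# ISO-8859-1 byte 0xF3 (octal 363), plus one non-matching line.
sample=$(mktemp)
printf 'hip\363tese\ncasa\n' > "$sample"

# Dump what grep emits as hex: the 0xF3 byte is still there.
matched_hex=$(LC_ALL=C grep -a 'ese$' "$sample" | od -An -tx1 | tr -d ' \n')
echo "$matched_hex"    # 686970f3746573650a

# The same output is not valid UTF-8, so reading it as UTF-8 fails.
# This is the situation a UTF-8 terminal is in.
if LC_ALL=C grep -a 'ese$' "$sample" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
    valid_utf8=yes
else
    valid_utf8=no
fi
echo "$valid_utf8"     # no

rm -f "$sample"
```

So nothing is lost in grep; the lone 0xF3 byte simply is not a complete UTF-8 sequence.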

Possible solutions:

  • You may be able to switch your terminal emulator to ISO-8859-1, run the command, and then switch back to UTF-8 (e.g. in Konsole select from the menu: View, Set Encoding; and so on). I wouldn't call this the right way though.

  • Alternatively convert the output of grep to UTF-8 on the fly:

    LC_ALL=pt_PT.ISO-8859-1 grep -a ese\$ wordsList | iconv -f ISO-8859-1 -t UTF-8

  • If you plan to work with the file a lot, convert the content to UTF-8*:

    <wordsList iconv -f ISO-8859-1 -t UTF-8 >wordsList-utf8

    Then work with the new file without tricks, e.g.:

    grep ese\$ wordsList-utf8

    Now you can even grep for accented characters in a straightforward way, e.g.:

    grep ó wordsList-utf8

    In general Unicode equivalence may be a problem; but here, since the file is a conversion from ISO-8859-1, I expect consistency: every ó should be the precomposed U+00F3 (bytes 0xC3 0xB3 in UTF-8, the above grep will find it), not U+006F followed by the combining U+0301 (bytes 0x6F 0xCC 0x81 in UTF-8, the above grep would not find it); similarly for other accented characters.
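
The conversion and the equivalence caveat can be checked with a small sketch. The sample content is hypothetical, and I append a decomposed spelling by hand just to show what a straightforward grep would miss. I force LC_ALL=C for the search so grep compares raw bytes regardless of installed locales; I would expect the same count in a UTF-8 locale, because grep does not normalize Unicode.

```shell
workdir=$(mktemp -d)

# Hypothetical ISO-8859-1 input: ó is the single byte 0xF3 (octal 363).
printf 'hip\363tese\n' > "$workdir/wordsList"

# Convert the whole file to UTF-8.
iconv -f ISO-8859-1 -t UTF-8 < "$workdir/wordsList" > "$workdir/wordsList-utf8"
converted_hex=$(od -An -tx1 < "$workdir/wordsList-utf8" | tr -d ' \n')
echo "$converted_hex"    # 686970c3b3746573650a

# Append a decomposed spelling by hand:
# o (U+006F) followed by U+0301 (bytes 0xCC 0x81, octal 314 201).
printf 'hipo\314\201tese\n' >> "$workdir/wordsList-utf8"

# Search for the precomposed ó (U+00F3, bytes 0xC3 0xB3, octal 303 263).
precomposed=$(printf '\303\263')
hits=$(LC_ALL=C grep -c "$precomposed" "$workdir/wordsList-utf8")
echo "$hits"             # 1: only the converted line matches

rm -rf "$workdir"
```

The line that came out of iconv matches; the hand-made decomposed line does not, even though both would look the same in a terminal.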


* I notice you used grep -a, as if you needed grep to treat binary files like text. If your wordsList is truly non-text, converting the whole of it to UTF-8 may fail or give you mangled non-text parts. Since you did not link to a single specific file, I cannot investigate further without guessing. I guess you meant the file linked under "just the file", i.e. the file one can extract from wordsList.zip. With this particular file I do not need -a for grep, as long as I tell grep to use the right encoding (this is what LC_ALL=pt_PT.ISO-8859-1 does).
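
One rough way to judge whether a file looks like plain ISO-8859-1 text before converting all of it is to count bytes that are neither printable nor common whitespace in that encoding. This is only a heuristic sketch; the function name count_suspicious_bytes and the sample files are made up for illustration.

```shell
# Count bytes that are NOT tab, LF, CR, printable ASCII (0x20-0x7E)
# or ISO-8859-1 graphic characters (0xA0-0xFF).
# tr deletes the acceptable bytes; wc counts whatever is left.
count_suspicious_bytes() {
    LC_ALL=C tr -d '\11\12\15\40-\176\240-\377' < "$1" | wc -c | tr -d ' '
}

text_file=$(mktemp)
binary_file=$(mktemp)

# Hypothetical plain-text sample with ó as 0xF3.
printf 'hip\363tese\ncasa\n' > "$text_file"
# Hypothetical non-text sample: same word followed by NUL and 0x01.
printf 'hip\363tese\000\001\n' > "$binary_file"

text_count=$(count_suspicious_bytes "$text_file")
binary_count=$(count_suspicious_bytes "$binary_file")
echo "$text_count"      # 0
echo "$binary_count"    # 2

rm -f "$text_file" "$binary_file"
```

A count of 0 suggests that converting the whole file is reasonable; a non-zero count is a hint that the file has non-text parts worth looking at first.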

