Tuesday, December 05, 2006

UTF-8 Checklist

This is getting tedious. Once again I wasted a good 40 minutes working out why my perfectly fine UTF-8 text, submitted through a GET form, was being garbled by the server. So, as a checklist, I'm going to sum up a few things you can quickly run through to see why UTF-8 might not be working. Mind you, this is going to be Java oriented, since that's where the source of my UTF-8 woes lies. I don't have much doubt, however, that getting a website to use UTF-8 properly is a pain in the ass in other languages as well. I'll gladly have anyone prove me wrong on that last assumption, though.

What you should know first

The first thing you should know is that There Ain't No Such Thing As Plain Text. There, that's said. Now go read the article and then come back.

Checklist

The following checklist is mainly a shorter version of stuff I stole from a Sun article called Character Conversions from Browser to Database. You will learn more if you read that article, but my list is shorter.

  • First of all, make sure your container knows that the default encoding of any GET request it receives is UTF-8. In Tomcat, this is done by setting the attribute URIEncoding="UTF-8" inside your Connector definition. For other containers, there is always Google, 'cause I'm lazy like that.

  • For some containers, setting the equivalent of the above configuration will make both GET and POST work properly. This is not the case with Tomcat, however. To get POST requests decoded as UTF-8 on Tomcat, start by setting the following in your application's web.xml:

    <context-param>
        <param-name>PARAMETER_ENCODING</param-name>
        <param-value>UTF-8</param-value>
    </context-param>

    Okay, so now you have the encoding available as a context parameter throughout your application, which means you can set it on each request before reading any parameters, like so:

    String paramEncoding = application.getInitParameter("PARAMETER_ENCODING");
    request.setCharacterEncoding(paramEncoding);

    If you are using a Servlet instead of a JSP, you can spread this code over the Servlet's init() and processRequest() methods:

    private String encoding;

    public void init() throws ServletException {
        // A <context-param> lives on the ServletContext, not on the ServletConfig.
        encoding = getServletContext().getInitParameter("PARAMETER_ENCODING");
    }

    protected void processRequest(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // Must run before the first getParameter() call to have any effect.
        if (encoding != null) {
            request.setCharacterEncoding(encoding);
        }
        ...
    }

    Alternatively, you might put this code in a filter.

  • When using JSPs, make sure you have your page directive set up correctly, and by correctly I mean:
    <%@page pageEncoding="UTF-8" contentType="text/html; charset=UTF-8"%>
    If you are going to enter anything in your JSP page directive, the above should be it.

  • Write proper HTML. The <meta> tag in your header should declare the charset in its content attribute:

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

    It's also generally a good idea to set the accept-charset attribute of any forms you define:

    <form name="form" action="action.do" method="post" accept-charset="UTF-8"></form>

    The accept-charset attribute will tell confused clients what charsets the server will accept.
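
To see why the first item on the list matters, here is a minimal sketch using only the JDK's URLEncoder and URLDecoder. It assumes a container that falls back to ISO-8859-1 for the query string, as older Tomcat versions do when URIEncoding is not set. The class name Utf8QueryDemo, its two helper methods, and the sample word are made up for illustration and are not part of any container API.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class Utf8QueryDemo {

    // Percent-encodes a value the way a browser would when the page is UTF-8.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Stands in for the container: decodes the percent-encoded bytes
    // using whatever charset it has been configured (or left) to use.
    static String decode(String raw, String charset) {
        try {
            return URLDecoder.decode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with an acute accent on the e.
        String raw = encode("caf\u00e9");
        System.out.println(raw); // caf%C3%A9

        // Wrong charset: each of the two UTF-8 bytes becomes its own character.
        String garbled = decode(raw, "ISO-8859-1");
        System.out.println(garbled.equals("caf\u00c3\u00a9")); // true

        // Right charset: the original text comes back.
        String fixed = decode(raw, "UTF-8");
        System.out.println(fixed.equals("caf\u00e9")); // true
    }
}
```

The two stray characters you get from the ISO-8859-1 decode are the classic symptom: if one accented letter turns into two odd-looking ones, the bytes were most likely UTF-8 and the decoder was not.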

Alright, I think that will do nicely as an initial checklist. If I come across any other pointers, I'll edit this post and add them.