Tuesday, December 05, 2006

UTF-8 Checklist

This is getting tedious. Once again I wasted a good 40 minutes figuring out why my perfectly fine UTF-8 text, submitted through a GET form, was being garbled by the server. So, by way of a checklist, I'm going to sum up a few things that can be quickly run through to see why UTF-8 might not be working. Mind you, this is going to be Java oriented, since that's where the source of my UTF-8 woes lies. However, I don't have much doubt that getting a website to use UTF-8 properly is a pain in the ass in other languages as well. I'll gladly have anyone prove me wrong on that last assumption though.

What you should know first

The first thing you should know is that There Ain't No Such Thing As Plain Text. There, that's said. Now go read the article and then come back.

Checklist

The following checklist is mainly a shorter version of stuff I stole from a Sun article called Character Conversions from Browser to Database. You will learn more if you read that article, but my list is shorter.

  • First of all, make sure your container knows that the GET requests it receives will be encoded in UTF-8. In Tomcat, this is done by setting the attribute URIEncoding="UTF-8" on your Connector definition in server.xml. For other containers, there is always Google 'cause I'm lazy like that.

  • For some containers, setting the equivalent of the above configuration will make both GET and POST work properly. This is not the case with Tomcat, however, where URIEncoding only covers the URI and its query string, not the request body. To use UTF-8 for POST requests as well, start by defining the encoding as a parameter in your application's web.xml:

    <context-param>
        <param-name>PARAMETER_ENCODING</param-name>
        <param-value>UTF-8</param-value>
    </context-param>

    Okay, so now you have the encoding available as a context parameter throughout your application, which means you can set it on each request before reading any parameters, like so:

    String paramEncoding = application.getInitParameter("PARAMETER_ENCODING");
    request.setCharacterEncoding(paramEncoding);

    If you are using a Servlet instead of a JSP, you can spread this code over the Servlet's init() and processRequest() methods:

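    private String encoding;

    @Override
    public void init() throws ServletException {
        // PARAMETER_ENCODING is a <context-param>, so read it from the
        // ServletContext; ServletConfig only sees the servlet's own <init-param>s.
        encoding = getServletContext().getInitParameter("PARAMETER_ENCODING");
    }

    protected void processRequest(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        if (encoding != null) {
            // Must run before the first getParameter() call to have any effect.
            request.setCharacterEncoding(encoding);
        }
        ...
    }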

    Alternatively, you might put this code in a filter.

  • When using JSPs, make sure you have your page directive set up correctly, and by correctly I mean:
    <%@page pageEncoding="UTF-8" contentType="text/html; charset=UTF-8"%>
    If you are going to enter anything in your JSP page directive, the above should be it.

  • Write proper HTML. The <meta> tag in your header should declare the charset in its content attribute:

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

    It's also generally a good idea to set the accept-charset attribute of any forms you define:

    <form name="form" action="action.do" method="post" accept-charset="UTF-8"></form>

    The accept-charset attribute will tell confused clients what charsets the server will accept.
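
To see what the first checklist item protects you from, here is a minimal sketch using only the JDK (Java 10 or later for the Charset overloads). The class and method names are made up for illustration, and the ISO-8859-1 fallback is an assumption about how a container without URIEncoding set decodes the query string. It percent-encodes a value the way a browser would on a UTF-8 page, then decodes it both ways:

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GetEncodingDemo {

    // What a browser does with a form value on a UTF-8 page:
    // percent-encode the UTF-8 bytes of the text.
    static String encodeAsBrowser(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // What the container does with the query string: turn the
    // percent-encoded bytes back into text using whatever charset
    // it has been configured with.
    static String decodeAsContainer(String encoded, Charset charset) {
        return URLDecoder.decode(encoded, charset);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "cafe" with an acute accent on the e
        String onTheWire = encodeAsBrowser(original);

        // Assumed misconfiguration: the two UTF-8 bytes of the accented e
        // are read as two separate ISO-8859-1 characters.
        String garbled = decodeAsContainer(onTheWire, StandardCharsets.ISO_8859_1);
        // Correct configuration: the round trip gives back the original text.
        String intact = decodeAsContainer(onTheWire, StandardCharsets.UTF_8);

        System.out.println(onTheWire);
        System.out.println("garbled equals original: " + garbled.equals(original));
        System.out.println("intact equals original:  " + intact.equals(original));
    }
}
```

Decoded as ISO-8859-1, café comes back as cafÃ©, which is exactly the kind of garbling a wrong container setting produces.
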
Alright, I think that will do nicely as an initial checklist. If I come across any other pointers, I'll edit this post and add them.

Friday, November 17, 2006

Slimy emacs

This blog entry is being typed in emacs21. I finally decided to clear the hurdle and start teaching myself emacs, and to try to wrestle through Paul Graham's "On Lisp". Starting out with emacs has been fun: I've been working through the included tutorial, and the shortcut system already feels powerful. Writing and programming without having to move my hand from keyboard to mouse is a really enticing prospect to me, and I'm intent on getting through the 36% of TUTORIAL remaining. This also seems like a really good time to finally trade in my azerty keyboard for a real one, which will have the added bonus of no longer having to adjust when switching from typing at home to typing at work.

Anyway, emacs and lisp. On to the lisp part: I first downloaded slime for emacs, and after including the following short lisp snippet in my .emacs file, I tried to launch it:


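;; Make the slime sources visible to emacs and load them
(add-to-list 'load-path "/home//slime-2.0")
(require 'slime)
;; The lisp implementation that M-x slime will start
(setq inferior-lisp-program "clisp")
;; In lisp-mode: enable slime, indent on RET, use Common Lisp indentation, no tabs
(add-hook 'lisp-mode-hook
          (lambda ()
            (slime-mode t)
            (local-set-key "\r" 'newline-and-indent)
            (setq lisp-indent-function 'common-lisp-indent-function)
            (setq indent-tabs-mode nil)))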


Obviously, no clisp could be found since I'd been dumb enough not to install it first. So on to the next step: installing lisp. Now, the first version of my .emacs file did not refer to clisp; rather, it referred to lisp, as I was following the short tutorial on unmutual.info and had chosen to use CMU Common Lisp. Right, so how to install CMU Common Lisp on my Ubuntu install? I visited the CMU Common Lisp website but then decided I'd rather use the Synaptic package manager. After a search for "lisp" presented me with a huge list of packages to install, I spent a good 10 minutes just skimming through them all until I finally settled on the clisp package. A few seconds later, clisp and its dependencies were installed and all I had to do was edit my .emacs file to update my inferior-lisp-program from "lisp" to "clisp". Restart emacs, M-x slime, a quick compile and tada, slime! Cool, now I just have 20% of the emacs TUTORIAL and Paul Graham's "On Lisp" left to digest.

Sunday, October 29, 2006

An inconvenient truth

I finally went to see Al Gore's "An Inconvenient Truth" last night, and it's really one of those must-see movies because it's quite an eye-opener. There aren't many things that stress the urgency of turning around our climate crisis as much as seeing a piece of ice the size of Rhode Island break away from Antarctica over the course of days. The documentary does an excellent job of breaking down the scepticism that surrounds global warming: it emphasizes that there is scientific consensus and that global warming isn't just some theory. The idea that there is doubt over the veracity of global warming is being spoonfed to us through the political and economic propaganda machine. The main idea to take away from the movie is to talk about global warming, to help dispel the myth of it being a theory as opposed to fact, and to change. Make small changes to how you live, and encourage others to do the same.
On the way back home I had a lengthy discussion with my dad and grandfather over how nothing would really change unless everyone participated, and it was hard to explain to them the irony of saying: "Well, this won't work unless everyone changes, so I'm not changing". In the end though, I managed to explain how the upward trend in global warming cannot, in fact, be turned into a downward trend overnight, but that it would be a huge step forward if only we managed to slow it down, then let it stagnate, and then, finally, let it diminish. I still think Al Gore said it better, and certainly more tersely, when he said that some people go from denial to despair without considering the steps in between. Despair will promote our inaction, and it doesn't seem like we have a lot of time left for that. If you want to do something right now, the climatecrisis.com website has a nice pdf with 10 easy tips to pollute less.

Ironically and sadly, the only people that walked out during this movie were a bunch of teenagers.

Thursday, October 19, 2006

Reddit and subreddits

There's currently a discussion going on at reddit about the concept of subreddits. A subreddit is basically a clone of the reddit concept, but limited to a certain topic; a good example is the programming subreddit. To be completely fair, the discussion isn't really about the concept of subreddits per se, but for the purpose of this blog entry, I'm going to pretend it is.

Anyway, the discussion seems to be largely between two groups of people. One group argues that subreddits are fragmenting the community, while the other group loves subreddits for the way they (apparently) succinctly categorize submissions by topic. Now, it's obvious that both groups have valid arguments: fragmenting a community is obviously not a good thing, while being able to group submissions by topic obviously is. To a certain extent people could argue that it is possible to group submissions by using the search, but the way search depends on the submission's title pretty much makes that an invalid argument in the long run.

So what could be a solution?
One possible solution that would allow people to both organize submissions by topic while at the same time keeping the community together on a single reddit would be a tagging system of some form. If people want to view mainly programming related submissions, they could hit the programming tag.

Right now, subreddits are a broken way of categorizing submissions: first, they fragment the community, and second, they fragment discussion of a submission by allowing people to submit the same link to both a subreddit and the main reddit. I do personally read the programming reddit frequently, but it is not visible at all to new people, and so it does a good job of preventing people who might have only a latent interest in programming from reading the better programming related submissions. On the other hand, since there are no restrictions on what link you can submit to what reddit (how could there be?), it also happens that good programming related submissions are only submitted to the main reddit. This basically forces people to read both the main reddit and the subreddit of their interest if they want maximum coverage of their favorite topic.

But, but, tags are broken!
Ok, so there are some problems with tagging content, the largest issue being the fact that people do not use the same tag for the same thing, so there would still be a good amount of fragmentation. However, taking into account the great asset that the reddit community is, there should be no reason not to leverage that asset. It would be perfectly plausible to implement this in a way that people can only select tags from a predetermined pool. With this approach, the community could file requests for new tags, and if such a request is popular enough, it could be added to the existing pool.

Tag based filtering and modifying
Once there is a consistent way to apply tags, users could start filtering on those tags. If there's a topic people don't like (a lot of people complain about the number of submissions on politics), they could set their preferences to block submissions with that tag outright. A more subtle approach could be to allow users to configure their preferences so that submissions with a certain tag are only visible after they get a certain rating from the rest of the community. Or what about attributing a default modifier to particular tags: you're into programming but not into politics? Well then, why not apply a standard +5 modifier to submissions with the programming tag, and a -5 to those with a politics tag? How exactly to implement the weight of these modifiers is up for debate, but looking at the great job that the reddit creators did with their recommendation engine, I'm sure they could figure out a kickass way to factor in personal weights with community approval.
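
To make the three preference types concrete, here is a rough sketch in Java. Everything in it is hypothetical: the TagPreferences and Submission names, the plain additive weighting, and the rule that the visibility threshold is checked against the raw community score are my own assumptions for illustration, not anything reddit actually does.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TagPreferences {

    /** Hypothetical submission: a community score plus tags from the shared pool. */
    public static final class Submission {
        final int communityScore;
        final Set<String> tags;

        public Submission(int communityScore, String... tags) {
            this.communityScore = communityScore;
            this.tags = new HashSet<>(Arrays.asList(tags));
        }
    }

    private final Set<String> blockedTags = new HashSet<>();
    private final Map<String, Integer> modifiers = new HashMap<>();
    private final Map<String, Integer> minimumScores = new HashMap<>();

    /** Never show submissions carrying this tag. */
    public TagPreferences block(String tag) {
        blockedTags.add(tag);
        return this;
    }

    /** Add delta to the score of every submission carrying this tag. */
    public TagPreferences modify(String tag, int delta) {
        modifiers.put(tag, delta);
        return this;
    }

    /** Only show submissions with this tag once the community rates them at least this high. */
    public TagPreferences requireScore(String tag, int minimum) {
        minimumScores.put(tag, minimum);
        return this;
    }

    /** Community score plus the user's modifier for each tag on the submission. */
    public int personalScore(Submission s) {
        int score = s.communityScore;
        for (String tag : s.tags) {
            score += modifiers.getOrDefault(tag, 0);
        }
        return score;
    }

    /** Hidden if any tag is blocked or any tag's threshold is not yet met. */
    public boolean isVisible(Submission s) {
        for (String tag : s.tags) {
            if (blockedTags.contains(tag)) {
                return false;
            }
            if (s.communityScore < minimumScores.getOrDefault(tag, Integer.MIN_VALUE)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        TagPreferences prefs = new TagPreferences()
                .modify("programming", 5)
                .modify("politics", -5)
                .requireScore("politics", 50);

        Submission lisp = new Submission(12, "programming");
        Submission election = new Submission(30, "politics");

        System.out.println(prefs.personalScore(lisp));     // 17
        System.out.println(prefs.personalScore(election)); // 25
        System.out.println(prefs.isVisible(election));     // false
    }
}
```

A simple sum is the dumbest possible weighting, but it shows that all three ideas fit in a per-user preferences object without touching the community score itself.
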

Wednesday, October 18, 2006

Stack traces as coordinates

I picked up on Scott Rosenberg's code reads a while ago by following a post on reddit, and have started cursorily following them. I've skipped out on reading the first book, The Mythical Man-Month, but I do intend to read up on that at some point in the future. Anyway, the reading material for this week consists of no fewer than three of Edsger Dijkstra's articles / papers, and I'm fully intent on keeping up with the several reads to come from now on.

So I'm reading through "Notes on Structured Programming", in which Dijkstra lays out the concept of a coordinate system to identify a discrete point in a computation process. Basically, this is what you want your stack trace to be: something blows up, and you want to know as much as possible about when and where exactly things went awry.

In the context of understanding programs he notes how a mathematical theory and a computer program are both structured, timeless objects, but while the mathematical theory makes sense on its own, the program doesn't make sense until its execution. With this in mind, he describes a way to decompose programs so that it becomes easier for humans to make sense of them. For this purpose, he distinguishes three types of decomposition: concatenation, selection and repetition.

We have now seen three types of decomposition; we could call them "concatenation", "selection" and "repetition" respectively. The first two are understood by enumerative reasoning, the last one by mathematical induction.
- Dijkstra, Notes on Structured Programming.

Concatenation is used to decompose computations directly following each other, selection is a way of decomposing computations that are executed based on a condition, and repetition is the decomposition of computations that are in a loop.
Now, with these three types of decomposition, it becomes easier to read and understand a program, because once you decompose a number of computations, you can then mentally abstract them and view them as a single step.
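
As a small illustration of my own (not taken from Dijkstra's text), here are the three decompositions as tiny Java methods. The class and method names are invented for the example.

```java
public class Decomposition {

    // Concatenation: two computations that directly follow each other.
    // Understood by enumerating the steps in order.
    static int squareThenIncrement(int x) {
        int squared = x * x; // first computation
        return squared + 1;  // second computation
    }

    // Selection: which computation runs depends on a condition.
    // Understood by enumerating the two cases.
    static int absolute(int x) {
        if (x < 0) {
            return -x;
        } else {
            return x;
        }
    }

    // Repetition: the same computation executed in a loop.
    // Understood by induction: after the pass for i, total is 1 + 2 + ... + i.
    static int sumUpTo(int n) {
        int total = 0;
        for (int i = 1; i <= n; i++) {
            total += i;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(squareThenIncrement(3)); // 10
        System.out.println(absolute(-4));           // 4
        System.out.println(sumUpTo(4));             // 10
    }
}
```

Once each method is understood, a caller can treat it as a single step, which is the mental abstraction described above.
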

Once we have a properly decomposed program, we want to be able to make assertions about its computations: we want to know what value a certain variable has at a certain point in the program, during its execution. To be able to do this, Dijkstra explains:

In short, we are looking for a co-ordinate system in terms of which the discrete points of computation progress can be identified,... We can state our problem in another way. Given a program in action and suppose that before completion of the computation the latter is stopped at one of the discrete points of progress. How can we identify the point of interruption, for instance if we want to redo the computation up to the very same point ?
- Dijkstra, Notes on Structured Programming.

As far as the concatenation and selection decompositions are concerned, it is trivial to do this by using what Dijkstra calls the textual index of the program text, in other words, the line number. However, as we all know, due to the nature of loops, the same does not hold true for the repetition decomposition. Because a loop iterates over the same computations (with the same textual index) several times, the textual index alone is of no use for indicating where we are in the computation progress. The solution? Introduce a dynamic index, a variable independent of the computation that keeps track of where exactly in the repetition we are. OK, so now we almost have a coordinate system as described above. There's just one more concept that isn't covered yet: functions. When a language allows for functions, we need a way to represent the exact point of progress where our function is called. The textual index inside the function won't suffice, since it is lacking context; however, we can pass along the textual index of where our function was called.
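
A Java stack trace already carries two of these coordinates: the line number of each frame and the chain of call sites. The dynamic index can be bolted on by hand. The sketch below is just one way to do that, with a made-up IterationException that records which pass of the loop failed; none of it comes from Dijkstra's notes.

```java
public class CoordinateDemo {

    /** Carries the dynamic index (the loop iteration) that a plain stack trace omits. */
    static final class IterationException extends RuntimeException {
        final int iteration;

        IterationException(int iteration, Throwable cause) {
            super("failed in iteration " + iteration, cause);
            this.iteration = iteration;
        }
    }

    static int parseAll(String[] inputs) {
        int sum = 0;
        for (int i = 0; i < inputs.length; i++) {
            try {
                sum += Integer.parseInt(inputs[i]);
            } catch (NumberFormatException e) {
                // Attach the dynamic index before the exception leaves the loop.
                throw new IterationException(i, e);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        try {
            parseAll(new String[] {"1", "2", "three", "4"});
        } catch (IterationException e) {
            // Dynamic index: which pass through the loop blew up.
            System.out.println("dynamic index: " + e.iteration);
            // Textual indices: one line number per frame, innermost call first.
            for (StackTraceElement frame : e.getStackTrace()) {
                System.out.println(frame.getMethodName() + " at line " + frame.getLineNumber());
            }
        }
    }
}
```

Here the failing input is the third one, so the recorded iteration is 2, and the frames give the line inside parseAll followed by the line in main that called it.
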

Mix all the above together, and you get a perfect stack trace which would give you:
  • the line number of the call producing the error
  • if the error-producing call is part of a function, the line number of the call invoking this function
  • if the error-producing call is nested in any loops, the dynamic index indicating the specific iteration of those loops
Hmm, now I wonder why the loop index is never provided in any stack traces I've seen up to now. Is it a design decision to reduce overhead from tracking dynamic index variables, or am I missing something obvious?
Regardless, that's an interesting bit of history right there, and I'm only on page 29 :)