Edge of the desert

Sunday, June 20, 2010

Death and taxes

It's that time of the year again to start filing taxes, and here in Belgium we are "blessed" to be able to use our Belgian eID to file those taxes electronically at our government's Tax-on-web site.

Linux support

Officially, there are some drivers available for Red Hat and Ubuntu 9.10. However, remembering last year's debacle of trying to follow those instructions, I decided not to stumble into that tarpit again and just did a bit of googling around. After pulling together various bits and pieces, this is what I did to make it work:
  1. $ sudo apt-get install libacr38u libacr38ucontrol0 beid-tools pcscd libpcsclite-dev beidgui libbeid2 libbeidlibopensc2
      
  2. Reboot. Yes, I know this is Linux and I know it is 2010. The problem is that the pcscd service won't start properly without a reboot. If you have a better clue than I do, you can probably fix this without a reboot (a possible workaround is sketched at the end of this post), but there it is.
  3. Launch Firefox and navigate to file:///usr/share/beid/beid-pkcs11-register.html. Depending on your version of Firefox, that may or may not properly install the PKCS#11 module. For me it did not, so I moved on to the next step.
  4. In Firefox, navigate to Edit > Preferences > Advanced > Security Devices. Click on Load and fill in /usr/lib/libbeidpkcs11.so.2 as the Module filename. Then click OK. (A command-line alternative is sketched below.)
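
For what it's worth, this last step can also be done from the command line with NSS's modutil tool (shipped in the libnss3-tools package); a sketch, assuming a default Firefox profile directory:

  $ sudo apt-get install libnss3-tools
  $ modutil -dbdir ~/.mozilla/firefox/<your-profile> -add "Belgian eID" -libfile /usr/lib/libbeidpkcs11.so.2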
After these steps I was able to log in to the Tax-on-web site. Hope this is helpful to somebody else out there.
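
As for the reboot in step 2: the workaround I would try first is restarting the pcscd service by hand after plugging in the card reader. A sketch, untested on my side (hence the reboot advice above):

  $ sudo /etc/init.d/pcscd restart
  $ ps aux | grep pcscd   # check that the daemon actually came up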

Wednesday, March 31, 2010

Visualizing tweets over a 16-hour span, a Twitterswarm

Every second Thursday of the month, the Media and Design Academy organizes its puntKom event. A few work colleagues have attended these events on and off, and my employer, Nascom, has been known to provide the attendees with free snacks and beer.

During one puntKom, people challenged us to do an actual presentation for their next theme, which was to be "20 lines of code". The main idea was to come up with something cool or quirky which could be done in 20 lines of code and give a short presentation about it. After the usual discussion on how worthless LOC is as a measurement and how most of the heavy lifting would always be done by frameworks, a few colleagues and I had a short brainstorming session to see if we could think of something interesting.

The plan

It quickly became clear that to do something visually interesting, we would have to find some data and glue it into an interesting visualization scheme. So we had to come up with three components: glue, data and a visualization idea.

Glue, data and visualizations

As any Linux user will tell you, the best way to glue stuff together on a computer is a quick and dirty bash script. Finding the glue was easy, and from the start I had wanted to do something with code_swarm. The only thing left to do was to come up with a good dataset. After considering a few different sources, we ended up with the plan to pull down all the tweets we could find for a few predefined search queries and feed those to code_swarm.

Approach

After the brainstorming, I was given some time to hack together our little idea. I started out by breaking the problem down into the following subproblems:

  • Fetch a bunch of tweets from Twitter
  • Convert those tweets to a format usable by code_swarm
  • Run code_swarm to generate a bunch of PNGs
  • Use ffmpeg to generate a video out of those PNGs

The first problem wasn't too hard: a simple wget to the Twitter search API to fetch the .atom version worked just fine. Until it didn't. It turns out that Twitter auto-throttles IPs that make a lot of requests to their search API, and while normal Twitter API usage is linked to a specific Twitter account, search API usage is linked to an IP. A pretty harsh constraint when there are about 80 colleagues with various Twitter clients hitting that same API.

Luckily, the Twitter engineers were kind enough to return an HTTP 420 code and a Retry-After header whenever a request was throttled, which allowed me to just put the script to sleep for the specified time whenever a 420 occurred. After this fix, fetching tweets was a breeze.

As for the second problem, converting an atom feed to a custom XML format is annoying enough without a 20-LOC constraint, so I googled around a bit for XSLT tools. Pretty soon I discovered the excellent XMLStarlet tool, which would allow me to generate and apply XSLT from the command line; the perfect tool to glue into my script.

One thing that annoyed me was how to convert the dates of the tweets to the millisecond format required by code_swarm, so I ended up tweaking code_swarm's source to also accept the standard tweet date format.
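
Since code_swarm is Java, the tweak boils down to a fallback parse along these lines (a sketch, not the actual diff; the class name is made up, and the date format matches what the script below massages the atom dates into):

import java.text.ParseException;
import java.text.SimpleDateFormat;

public class EventDates {
    // code_swarm expects epoch milliseconds in the "date" attribute;
    // if the value isn't a plain long, fall back to the tweet-style
    // timestamp, e.g. "2010-03-11 14:23:01".
    public static long parse(String value) throws ParseException {
        try {
            return Long.parseLong(value);
        } catch (NumberFormatException e) {
            return new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
                    .parse(value).getTime();
        }
    }
}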

From this point, it was fairly straightforward to come up with a semi-decent configuration file for code_swarm, feed my XML to it, and then let ffmpeg loose on the generated PNGs as per the code_swarm wiki instructions.

Results

The code ended up looking a bit like this:

#!/bin/bash
function twitter_to_swarm { xmlstarlet sel --text -N a=http://www.w3.org/2005/Atom -N twitter=http://api.twitter.com/ --template --match '//a:entry' -o '<event filename="' -v 'a:author/a:name' -o 'lang:' -v 'twitter:lang' -o '"' -o ' date="' -v 'a:updated' -o '" ' -o 'author="' -v '/a:feed/a:title' -o '"/>' -n $1 | sed -e 's/ since:.*- Twitter Search//' | sed -e 's/date="\([^T.]*\)T\([^Z.]*\)Z"/date="\1 \2"/' | sed -e 's/&/\&amp;/g' >> $2; }

read PAGE STATUS <<<$(echo 1 200)
for p in "$@"; do QUERY=$(perl -MURI::Escape -e 'foreach $argnum (0 .. $#ARGV) {print uri_escape($ARGV[$argnum])."+";}' $p)
    until [ $PAGE -gt 15 ]; do
        RESPONSE=$(wget --server-response -Oout.wget http://search.twitter.com/search.atom?q=$QUERY\&rpp=100\&page=$PAGE\&since=`date +%Y-%m-%d` 2>&1)
        STATUS=$(echo $RESPONSE | grep 'HTTP/' | sed -e 's/.*HTTP\/[^ ]* \([^ ]*\).*/\1/' | tail -1)
        case $STATUS in
            200) echo STATUS $STATUS for page $PAGE && twitter_to_swarm out.wget out.xml && let PAGE+=1;;
            420) SLEEP=$(echo $RESPONSE | grep 'Retry-After' | sed -e 's/.*Retry-After[^ ]* \([^ ]*\).*/\1/') && until [ $SLEEP -le 0 ]; do echo sleeping $SLEEP seconds && sleep 10 && let SLEEP-=10; done;;
            *) echo STATUS $STATUS for page $PAGE && let PAGE=17;;
        esac; done; let PAGE=1
done
echo '<file_events>' > result.xml && cat out.xml | sort -t '"' -k 3 -r >> result.xml && echo '</file_events>' >> result.xml
echo Generating images... && cd codeswarm/ && codeswarm ../mpeg.config && cd .. && echo Done.
echo Generating video... && ffmpeg -f image2 -r 24 -i images/twitter-%05d.png -sameq ./result.mov -pass 2 && echo Done. Generated result.mov
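
For reference, you run the script with one search query per argument, along these lines (twitterswarm.sh is just whatever you saved the script as):

$ ./twitterswarm.sh "google buzz OR buzz" haiti snow ipad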

I ran the script with parameters to search for tweets containing "google buzz" or "buzz", "haiti", "snow" and "iPad". The resulting video ended up looking like this:


The terms that were searched for float through space; every colored dot that moves towards a search term is a tweet matching that term. It's a bit hard to see, but tweets are also colored depending on the language they're in; the following legend was used:
  • yellow for English
  • cyan for German
  • violet for French
  • red for Portuguese
  • blue for Dutch
  • purple for Spanish
  • brown for Japanese
  • pink for Italian
  • grey for Other

Rerun dammit

I wasn't completely satisfied with the first result, since it turned out to cover only a very short timespan. This was due to the fact that the Twitter API returns a maximum of 1500 tweets per query, so I modified the script above a bit so that I could schedule it as a cronjob and run it every hour.
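
The schedule itself was just an hourly crontab entry along these lines (the path and script name are made up):

0 * * * * /home/user/twitterswarm/twitterswarm.sh "google buzz OR buzz" haiti snow ipad >> /home/user/twitterswarm/cron.log 2>&1

I then let the cronjob run for about 16 hours and produced the video below with the results: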

Monday, May 07, 2007

Social tips and a semantic wikipedia

Two barcamp talks in particular interested me enough that I just had to find out more about them.

John Baeyens on Not so-so

First of these was John Baeyens' talk on his soon-to-go-live social tips website "not so-so", of which you can find the current test version right here.
Basically, not so-so is kind of a mix between blogging and delicious, in that you write up a short review or tip on something (anything!) you like, which you then automatically share with the rest of the community. You can stay up to date with the tips other people submit by following those people, à la the delicious network feature. Extra features currently in the works are a calendar system and a rating system. The rating system in particular seems very interesting to me, as it could be tied to a recommendation system to put you in touch with people that share your interests, or surface interesting tips from people not currently in your network. John's reply to the question "I want to see all events today within a radius of 20km from where I live" was along the lines of: "We have the data, we just need to implement it."

Will Moffat on Freebase.com

The other cool talk was given by Will Moffat. It was a short introduction to Freebase, which is basically a queryable service containing a structured version of the Wikipedia data. The database is queryable through JavaScript and returns its results as JSON, allowing for some interesting mash-up ideas. Better yet, it's also possible to update the database through JavaScript. Currently Freebase is still in alpha and registration is invite-only. There are still some open questions though. Mainly: how are the Freebase people going to deal with vandalism, as the data currently doesn't appear to be moderated? And will the project manage to attract a large enough userbase to actually fill in all the currently missing data?

Pictures by Bert Heymans; for more pictures, check out his flickr stream.

Friday, May 04, 2007

Barcamp Brussels #3

Tomorrow I will be attending the third Barcamp Brussels event, which will mark the end of my barcamp virginity. A look over the barcamp wiki suggests there will be plenty of interesting topics to mull over. I'm particularly interested in Frank Louwers' talk on OpenID and Jan de Poorter's talk on Rails.
As for myself, I had been planning on doing something around Collaborative Filtering and Slope One, but a busy schedule has kept me from putting together anything even remotely cohesive. I'm thinking of taking some pictures instead, and if anyone is interested in the Slope One stuff, I can always annoy anyone who asks :)

Thursday, April 26, 2007

Tomcat UTF-8 CharsetMapper

Just a quickie I thought I'd share: if you're running into encoding problems using Tomcat and JSTL's fmt:xxx tags, you might want to implement your own Tomcat CharsetMapper. The following example will fix things for UTF-8 encoding:


package your.package.here;

import java.util.Locale;

import org.apache.catalina.util.CharsetMapper;

public class UTF8CharsetMapper extends CharsetMapper {

    public UTF8CharsetMapper() {
        super();
    }

    public UTF8CharsetMapper(String name) {
        super(name);
    }

    // Skip the locale-based lookup entirely and always answer UTF-8.
    @Override
    public String getCharset(Locale locale) {
        return "UTF-8";
    }
}


You should then edit Tomcat's server.xml file, pointing your Context's charsetMapperClass attribute at your custom-made CharsetMapper:


<Context charsetMapperClass="your.package.here.UTF8CharsetMapper" otherParam="xxx">
...
</Context>
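
One thing worth spelling out: the Context is loaded by Tomcat itself, so the compiled class has to end up on the container's classpath rather than inside your webapp. For a stock Tomcat 5.5 layout that could look like this (a sketch; adjust the paths to your installation):

$ javac -d $CATALINA_HOME/server/classes -cp $CATALINA_HOME/server/lib/catalina.jar UTF8CharsetMapper.java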

Tuesday, December 05, 2006

UTF-8 Checklist

This is getting tedious: once again I wasted a good 40 minutes looking for a solution as to why my perfectly fine UTF-8 text, submitted through a GET form, was being garbled by the server. So, by way of a checklist, I'm going to sum up a few things you can quickly run through to see why UTF-8 might not be working. Mind you, this is going to be Java oriented, since that's where the source of my UTF-8 woes lies; however, I don't have much doubt that getting a website to properly use UTF-8 is a pain in the ass in other languages as well. I'll gladly have anyone prove me wrong on that last assumption, though.

What you should know first

The first thing you should know is that There Ain't No Such Thing As Plain Text. There, that's said; now go read the article and then come back.

Checklist

The following checklist is mainly a shorter version of stuff I stole from a Sun article called Character Conversions from Browser to Database. You will learn more if you read that article, but my list is shorter.

  • First of all, make sure your container understands that the de facto encoding of any GET request it receives will be UTF-8. In Tomcat, this is done by setting the property URIEncoding="UTF-8" inside your Connector definition in server.xml; see the snippet below. For other containers, there is always Google, 'cause I'm lazy like that.
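
    A minimal sketch of what that looks like (a real Connector will carry more attributes than this):

    <Connector port="8080" URIEncoding="UTF-8"/>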

  • For some containers, setting the equivalent of the above configuration will make both GET and POST work properly; this is not the case with Tomcat, however. To let Tomcat use UTF-8 as the default encoding for POST requests, set the following in your application's web.xml:

    <context-param>
        <param-name>PARAMETER_ENCODING</param-name>
        <param-value>UTF-8</param-value>
    </context-param>

    Okay, so now you have the encoding available as a context parameter, which means you can set it on each request before reading any of its parameters, like so:

    String paramEncoding = application.getInitParameter("PARAMETER_ENCODING");
    request.setCharacterEncoding(paramEncoding);

    If you are using a Servlet instead of a JSP, you can spread this code over the Servlet's init() and processRequest() methods:

    private String encoding;

    public void init() throws ServletException {
        // PARAMETER_ENCODING is a context-param, so read it from the
        // ServletContext; ServletConfig.getInitParameter() only sees
        // servlet-specific init-params and would return null here.
        encoding = getServletContext().getInitParameter("PARAMETER_ENCODING");
    }

    protected void processRequest(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        if (encoding != null) {
            request.setCharacterEncoding(encoding);
        }
        ...
    }

    Alternatively, you might put this code in a filter, along the lines of the sketch below.
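
    A minimal sketch of such a filter (the class name is mine, and you would still need to declare and map it in web.xml):

    import java.io.IOException;
    import javax.servlet.*;

    public class EncodingFilter implements Filter {
        private String encoding;

        public void init(FilterConfig config) {
            // Same context-param as above, read once at startup.
            encoding = config.getServletContext().getInitParameter("PARAMETER_ENCODING");
        }

        public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
                throws IOException, ServletException {
            // Only force the encoding if the client didn't declare one itself.
            if (encoding != null && request.getCharacterEncoding() == null) {
                request.setCharacterEncoding(encoding);
            }
            chain.doFilter(request, response);
        }

        public void destroy() {
        }
    }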

  • When using JSPs, make sure you have your page directive set up correctly, and by correctly I mean:
    <%@page pageEncoding="UTF-8" contentType="text/html; charset=UTF-8"%>
    If you are going to enter anything in your JSP page directive, the above should be it.

  • Write proper HTML. The <meta> tag in your page's <head> should define the charset in its content attribute:

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

    It's also generally a good idea to set the accept-charset attribute of any forms you define:

    <form name="form" action="action.do" method="post" accept-charset="UTF-8"></form>

    The accept-charset attribute will tell confused clients what charsets the server will accept.
Alright, I think that will do nicely as an initial checklist. If I come across any other pointers, I'll edit this post and add them.