Wednesday, March 31, 2010

Visualizing tweets over a 16-hour span: a Twitterswarm

Every second Thursday of the month, the Media and Design Academy organizes its puntKom event. A few work colleagues have attended these events on and off, and my employer, Nascom, has been known to provide the attendees with free snacks and beer.

During one puntKom, people challenged us to do an actual presentation for their next theme, which was to be "20 lines of code". The main idea was to come up with something cool or quirky that could be done in 20 lines of code and to give a short presentation about it. After the usual discussion about how worthless lines of code are as a measurement and how most of the heavy lifting would always be done by frameworks, a few colleagues and I had a short brainstorming session to see if we could think of something interesting.

The plan

It quickly became clear that to do something visually interesting, we would have to find some data and glue it into an interesting visualization scheme. So we had to come up with three components: glue, data and a visualization idea.

Glue, data and visualizations

As any Linux user will tell you, the best way to glue stuff together on a computer is a quick and dirty bash script. Finding the glue was easy, and from the start I had wanted to do something with code_swarm. The only thing left was to come up with a good dataset. After considering a few different sources, we ended up with a plan: pull down all the tweets we could find for a few predefined search queries and feed those to code_swarm.

Approach

After the brainstorming, I was given some time to hack together our little idea. I started out by breaking the problem down into the following subproblems:

  • Fetch a bunch of tweets from Twitter
  • Convert those tweets to a format usable by code_swarm
  • Run code_swarm to generate a bunch of PNGs
  • Use ffmpeg to generate a video out of those PNGs

The first problem wasn't too hard: a simple wget to the Twitter search API to fetch the .atom version worked just fine. Until it didn't. It turns out that Twitter automatically throttles IPs that make a lot of requests to its search API, and while normal Twitter API usage is linked to a specific Twitter account, search API usage is linked to an IP address. A pretty harsh constraint when there are about 80 colleagues with various Twitter clients hitting that same API.
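
Concretely, one such request looks something like this (the query and output file are just examples; rpp and page are the search API's paging parameters):

# One page of up to 100 tweets matching the query, as an Atom feed,
# restricted to tweets from today; the search API serves at most 15 such pages.
QUERY="iPad"
wget -O out.atom "http://search.twitter.com/search.atom?q=${QUERY}&rpp=100&page=1&since=$(date +%Y-%m-%d)"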

Luckily, the Twitter engineers were kind enough to return an HTTP 420 status code and a Retry-After header whenever a request was throttled, which allowed me to simply put the script to sleep for the specified time whenever a 420 occurred. After this fix, fetching tweets was a breeze.
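
Unrolled from the full script below, that handling boils down to roughly this:

URL="http://search.twitter.com/search.atom?q=iPad&rpp=100&page=1"
# --server-response makes wget print the response headers on stderr,
# so capture stderr and fish the status code out of it.
RESPONSE=$(wget --server-response -O out.atom "$URL" 2>&1)
STATUS=$(echo $RESPONSE | grep 'HTTP/' | sed -e 's/.*HTTP\/[^ ]* \([^ ]*\).*/\1/' | tail -1)
if [ "$STATUS" = "420" ]; then
    # Throttled: Twitter says how long to back off in the Retry-After header.
    SLEEP=$(echo $RESPONSE | grep 'Retry-After' | sed -e 's/.*Retry-After[^ ]* \([^ ]*\).*/\1/')
    echo "throttled, sleeping $SLEEP seconds"
    sleep $SLEEP
fi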

As for the second problem, converting an Atom feed to a custom XML format is annoying enough without a 20-line constraint, so I googled around a bit for XSLT tools. Pretty soon I discovered the excellent XMLStarlet tool, which would allow me to generate and apply XSLT from the command line, the perfect tool to glue into my script.
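
A stripped-down version of that conversion, pulling out just the author and date of each entry, gives a feel for how XMLStarlet is used here:

# Turn every Atom <entry> in tweets.atom into one code_swarm <event/> line.
xmlstarlet sel --text -N a=http://www.w3.org/2005/Atom \
    --template --match '//a:entry' \
    -o '<event filename="' -v 'a:author/a:name' \
    -o '" date="' -v 'a:updated' -o '"/>' -n tweets.atom
# which prints lines such as:
#   <event filename="some_user" date="2010-03-31T09:12:45Z"/>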

One thing that annoyed me was converting the tweet dates to the millisecond timestamp format required by code_swarm, so I ended up tweaking code_swarm's source to also accept the standard tweet date format.
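
For reference, the conversion itself isn't hard in the shell either: once the sed in the script below has massaged a date into the "YYYY-MM-DD hh:mm:ss" form, GNU date can turn it into milliseconds since the epoch, along these lines:

# Milliseconds since the epoch for a single (UTC) tweet timestamp, using GNU date.
TS="2010-03-31 09:12:45"
MS=$(( $(date -u -d "$TS" +%s) * 1000 ))
echo $MS   # 1270026765000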

From this point it was fairly straightforward to come up with a semi-decent configuration file for code_swarm, feed my XML to it, and then let ffmpeg loose on the generated PNGs as per the code_swarm wiki instructions.
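
The configuration file isn't part of the 20 lines, so it isn't shown below. A minimal sketch of what one can look like, with key names as found in the sample config that ships with code_swarm (double-check against data/sample.config) and purely illustrative values:

# A config along these lines would sit next to the script as mpeg.config;
# the ../ paths account for the script running code_swarm from codeswarm/.
cat > mpeg.config <<'EOF'
InputFile=../result.xml
Width=640
Height=480
TakeSnapshots=true
SnapshotLocation=../images/twitter-#####.png
EOF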

Results

The code ended up looking a bit like this:

#!/bin/bash
function twitter_to_swarm { xmlstarlet sel --text -N a=http://www.w3.org/2005/Atom -N twitter=http://api.twitter.com/ --template --match '//a:entry' -o '<event filename="' -v 'a:author/a:name' -o 'lang:' -v 'twitter:lang' -o '"' -o ' date="' -v 'a:updated' -o '" ' -o 'author="' -v '/a:feed/a:title' -o '"/>' -n $1 | sed -e 's/ since:.*- Twitter Search//' | sed -e 's/date="\([^T.]*\)T\([^Z.]*\)Z"/date="\1 \2"/' | sed -e 's/&/&amp;/g' >> $2; }

read PAGE STATUS <<<$(echo 1 200)
for p in "$@"; do QUERY=$(perl -MURI::Escape -e 'foreach $argnum (0 .. $#ARGV) {print uri_escape($ARGV[$argnum])."+";}' $p)
    until [ $PAGE -gt 15 ]; do
        RESPONSE=$(wget --server-response -Oout.wget http://search.twitter.com/search.atom?q=$QUERY\&rpp=100\&page=$PAGE\&since=`date +%Y-%m-%d` 2>&1)
        STATUS=$(echo $RESPONSE | grep 'HTTP/' | sed -e 's/.*HTTP\/[^ ]* \([^ ]*\).*/\1/' | tail -1)
        case $STATUS in
            200) echo STATUS $STATUS for page $PAGE && twitter_to_swarm out.wget out.xml && let PAGE+=1;;
            420) SLEEP=$(echo $RESPONSE | grep 'Retry-After' | sed -e 's/.*Retry-After[^ ]* \([^ ]*\).*/\1/') && until [ $SLEEP -le 0 ]; do echo sleeping $SLEEP seconds & sleep 10 && let SLEEP-=10; done;;
            *) echo STATUS $STATUS for page $PAGE && let PAGE=17;;
        esac; done; let PAGE=1
done
echo '<file_events>' > result.xml && cat out.xml | sort -t '"' -k 3 -r >> result.xml && echo '</file_events>' >> result.xml
echo Generating images... && cd codeswarm/ && codeswarm ../mpeg.config && cd .. && echo Done.
echo Generating video... && ffmpeg -f image2 -r 24 -i images/twitter-%05d.png -sameq ./result.mov -pass 2 && echo Done. Generated result.mov
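
Saved as, say, twitterswarm.sh (the name is made up), running it with the search terms mentioned below would look something like this, with quotes keeping a multi-word query together as a single search:

chmod +x twitterswarm.sh
./twitterswarm.sh "google buzz OR buzz" haiti snow iPad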

I ran the script with parameters to search for tweets containing "google buzz" or "buzz", "haiti", "snow" and "iPad". The resulting video ended up looking like this:


The terms that were searched for float through space, and every colored dot that moves towards a search term is a tweet matching that term. It's a bit hard to see, but tweets are also colored depending on the language they're in; the following legend was used:
  • yellow for English
  • cyan for German
  • violet for French
  • red for Portuguese
  • blue for Dutch
  • purple for Spanish
  • brown for Japanese
  • pink for Italian
  • grey for Other
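
That per-language coloring is why the conversion step sneaks a lang: token into the filename attribute: code_swarm assigns colors by matching regexes against the filename, via ColorAssign rules in the config. A sketch of a few such rules, with the syntax as in code_swarm's sample config (a label, a regex and two RGB triplets; verify the details against data/sample.config):

# Appended to mpeg.config: color tweets by the lang: token in their filename.
cat >> mpeg.config <<'EOF'
ColorAssign1="English",".*lang:en.*", 255,255,0, 255,255,0
ColorAssign2="German",".*lang:de.*", 0,255,255, 0,255,255
ColorAssign3="Dutch",".*lang:nl.*", 0,0,255, 0,0,255
EOF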

Rerun dammit

I wasn't completely satisfied with the first result, since it turned out to cover only a very short timespan. That's because the Twitter API returns a maximum of 1500 tweets, so I modified the script above a bit so that I could schedule it as a cron job and run it every hour. I then let the cron job run for about 16 hours and produced the video below with the results:
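
For completeness, the hourly schedule itself is just a crontab entry along these lines (script path and log location are placeholders):

# In crontab (crontab -e): run the script at the top of every hour, logging its output.
0 * * * * /home/me/twitterswarm/twitterswarm.sh "google buzz OR buzz" haiti snow iPad >> /home/me/twitterswarm/run.log 2>&1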
