The Olympics and the French (political) blogosphere: first crawls

Following on from my previous post, the diagrams below are the results of my first topic-oriented experiments with IssueCrawler. As mentioned before, I set up nine crawls in total, three each looking for French blog posts concerning Tibet, censorship, and human rights. These crawls were queued at three different times around the Olympic Games in Beijing: once during the first week of competition, once during the second week, and once a few days after the closing ceremony.

After a flurry of activity on the IssueCrawler servers over the weekend, only one crawl of the nine remains to be completed. However, the delay does raise the question of whether the networks depicted would be different if the crawls had run on the same day as, or within a few days of, being enqueued. Of the first three crawls, the Tibet crawl (first to be enqueued) was started and completed on 19 August, censorship ran on 19-20 August, and human rights on 27-28 August. While the first two were complete before the second round of crawls was prepared, the third round had been enqueued before the first human rights crawl had even started. As such, there is a risk that material not yet published or written on 14 August may influence the network of that crawl because of the delay, and that the three human rights networks may end up looking rather similar. On the latter concern, the third crawl is still to be completed, so I’m not sure how close the fit will be, but the list of seed sites for each crawl was different.

As this is still experimental, for these crawls I did not carry over the seed sites from one crawl to another or keep a master list of sites. Instead, before each crawl I went through the top 100 results from Google Blog Search (French) using the search terms in the diagrams below, and manually included or rejected the results based on their content and whether or not they were blog posts. I will attempt another topical experiment later this month, in which I will test including previous sites in the seed list, for comparison between methods.

JO Pekin Tibet - 13/08/08

JO Pekin censure - 13/08/08

Human Rights
JO Pekin "droits de l'homme" - 14/08/08

One other concern with these crawls, as opposed to the next two rounds, is that I didn’t keep the same settings across them: censorship and human rights each have three iterations, but Tibet has only two, with a greater crawl depth than the other two crawls. As such, any comparison of these diagrams with the later results is affected by the differing settings.

Some notable aspects of the diagrams:
The large cluster of blogs to the bottom-right of the human rights diagram stands out. However, these are all blogs hosted by one service, 20 minutes, and although the page being linked to is a news article, nearly all of the 17 links to that page come from 20 minutes blogs, so this cluster may not be as significant to the issue discussion as other parts of the diagram.

The censorship network has, as its largest node, Wikio.fr – in particular, the top 100 political blogs page, with 42 in-links. This may be a result of the three iterations of the crawl, following links from blogs (such as icons showing where the particular blog is in the rankings) to that page, and later crawls with only two iterations may reduce the presence of Wikio in the network.
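
As a rough way of checking both observations above, the sketch below counts in-links per page and works out what share of the sites linking to a page belong to a single hosting service. It assumes the network is available as a plain list of (source URL, target URL) pairs; that format, the function names, and the example URLs are my own assumptions for illustration, not anything IssueCrawler itself provides.

```python
from collections import Counter
from urllib.parse import urlparse


def host_of(url):
    """Return the lowercase host name of a URL."""
    return urlparse(url).netloc.lower()


def inlink_counts(edges):
    """Count in-links per target page, ignoring duplicate (source, target) pairs."""
    unique_edges = set(edges)
    return Counter(target for _, target in unique_edges)


def host_share(edges, target, service_domain):
    """Fraction of distinct hosts linking to `target` that sit under `service_domain`.

    `service_domain` is a hypothetical parameter: the domain of a blog
    hosting service whose blogs live on subdomains of it.
    """
    linking_hosts = {host_of(src) for src, dst in edges if dst == target}
    if not linking_hosts:
        return 0.0
    matching = sum(
        1
        for host in linking_hosts
        if host == service_domain or host.endswith("." + service_domain)
    )
    return matching / len(linking_hosts)


if __name__ == "__main__":
    # Made-up edges purely for illustration.
    article = "http://news.example/article"
    edges = [
        ("http://a.hosting.example/post-1", article),
        ("http://b.hosting.example/post-2", article),
        ("http://independent.example/post", article),
    ]
    print(inlink_counts(edges).most_common(5))
    print(host_share(edges, article, "hosting.example"))
```

A share close to 1 for a page would suggest, as with the 20 minutes cluster, that its prominence comes largely from one service rather than from a spread of independent blogs.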

More notes to come, but one other concern with IssueCrawler is the legend: the colours used for particular top-level domains (.com, .fr, etc.) vary between diagrams. Although the legend is clearly noted on each diagram, it does mean that quick comparisons also have to be careful ones!

