Unshortening URLs in PHP

I had cause to unshorten a ton of links from twitter today.  There are loads of services that do it for you (and the popular shorteners have APIs) but I couldn’t see any existing PHP libraries to do it.  Below are some fairly straightforward PHP functions to unshorten URLs based on the Location: header that’s used in redirects.  The is_short function is a bit of a hack because I was only interested in a few domains, but it should be straightforward to modify for other tasks.

Requires curl support in PHP.

/**
 * Get the value of the Location: header obtained when dereferencing the
 * given URL. Returns false if there isn't one.
 */
function getLocation($url)
{
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
    curl_setopt($ch, CURLOPT_HEADER, true);

    $data = curl_exec($ch);
    curl_close($ch);

    // Split the raw response into headers and body, then scan the
    // header lines for a Location: header.
    list($headers, $body) = explode("\r\n\r\n", $data, 2);

    foreach (explode("\n", $headers) as $h) {
        if (preg_match('@Location:\s*(.*)@i', $h, $match)) {
            return trim($match[1]);
        }
    }

    return false;
}

function is_short($url)
{
    // Only matches the handful of shorteners I cared about; extend as needed.
    return preg_match('@^https?://(www\.)?(bit\.ly|t\.co|goo\.gl|dlvr\.it|tl\.gd|is\.gd)@', $url);
}

/**
 * Unshorten a short URL until it isn't short anymore (copes with URLs that
 * have been shortened multiple times, up to $limit).
 * Returns false (by virtue of getLocation()) if a short URL doesn't redirect.
 */
function unshorten($inurl, $limit = 5)
{
    $i = 0;
    $url = $inurl;
    while (is_short($url) && $i < $limit) {
        $url = getLocation($url);
        $i++;
    }

    return $url;
}
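To sanity-check the header-parsing step without making a network request, you can run the same logic that getLocation() uses over a canned HTTP response (the response below is made up for illustration):

```php
<?php
// A fabricated raw HTTP redirect response, as curl_exec() would return
// it with CURLOPT_HEADER enabled.
$raw = "HTTP/1.1 301 Moved Permanently\r\n"
     . "Location: http://example.com/full-article\r\n"
     . "Content-Type: text/html\r\n"
     . "\r\n"
     . "<html>moved</html>";

// Split headers from body, then scan each header line for Location:.
list($headers, $body) = explode("\r\n\r\n", $raw, 2);

$location = false;
foreach (explode("\n", $headers) as $h) {
    if (preg_match('@Location:\s*(.*)@i', $h, $match)) {
        $location = trim($match[1]); // trim() strips the trailing \r
        break;
    }
}

echo $location, "\n"; // http://example.com/full-article
```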

“Put your tweets on your website”

Have you seen a message like the one below when using twitter?


Wonder why twitter’s so keen for you to do that?  A quick check of the settings page gives it away:


Yep – Twitter’s encouraging website owners to embed Twitter widgets on their websites so that twitter can track the websites that its users visit.  The real travesty here is that, I suspect, the majority of people embedding the widget and the majority of twitter users have no idea that the tracking is taking place.  Not cool, Twitter.


Why I’ll vote NO to NUS in the SUSU Referendum

On December 6th, Southampton University Student Union (SUSU) is holding (another) referendum on whether we should affiliate with the NUS. 2012 marks ten years of independence for SUSU – ten successful years, I might add – and, whilst there are lots of arguments about the cost, necessity and loss of freedom that make a compelling case for saying No to NUS, I thought I’d take some time to explain my biggest objection: affiliating to the NUS fundamentally alters the role of SUSU.

The Role of an Independent SUSU

SUSU is a fundamental part of student life at Southampton. It runs everything from the shop we buy our calculators from (it has a monopoly on those) to the restaurants we eat lunch in to the bars we drink in. It holds the University to account and it helps to improve the quality of Education, Feedback and Welfare within the University. SUSU may be a legally distinct organisation from the University itself, but it is intimately coupled with it and plays an important role in the lives of Southampton students and an important role as part of the University as an Institution.

This is my first important point: SUSU is an integral part of Southampton University and an integral part of being a student here – Not being a member of SUSU would be a significant loss.

The NUS: A political organisation

The NUS, on the other hand, is an overtly political organisation. “NUS is joining with the TUC to march and rally” says the NUS website, “Sign the e-petition and email your MP here” it says just underneath. Don’t misunderstand me, I wholeheartedly support the right of students to get involved with the political process, I fully support their right to join organisations and I even support the NUS campaigns that these examples refer to, but fundamentally I also believe that students at Southampton should have a CHOICE about which political organisations they join.

By affiliating to the NUS, SUSU signs up each of us to this organisation. We could opt out, but we’d have to opt out of SUSU as a whole. We’d lose the internal representation that SUSU provides within the University, just because we objected to having the NUS speak on our behalf. If the University were our employer, that would quite probably be illegal.

An Ideal World

In an ideal world, the NUS wouldn’t require Universities to opt-in all or none of their students. It would operate like other political organisations: members would be free to choose if they agreed with the governance and aims and to opt in or out as individuals. One could speculate about WHY the NUS will only take whole Universities (perhaps they think they’d be about as popular as their own discount scheme?) – but that’s not speculation for here.

In Summary

SUSU is an integral part of the Institution that is Southampton. It represents us, Southampton students. The NUS is a political organisation, and by affiliating SUSU would become a political organisation, too. Individuals shouldn’t have to join a political organisation in order to participate fully in their University or to get the full student experience.

Southampton has shown that it is a strong and thriving institution outside of the NUS, and the supposed benefits of affiliating (not that the auditors found any) are certainly not outweighed by the fundamentally illiberal process of foisting political membership upon students in spite of their own consciences. The problem here is not that SUSU is independent, but that the NUS have a fundamentally flawed membership model.

If you’ve not already, check out the No to NUS Facebook page, follow @No2NUS and take a look at the current (unscientific) SUSU NUS opinion poll!

US “Fusion Centers” in Privacy #fail: Organisational Approaches to Privacy

Last Tuesday, the United States Senate published a report into so-called “fusion centers” that were set up post-9/11 to share intelligence between agencies to support counter-terrorism activities. The report is pretty damning on a range of issues (poor financial accountability, out of date or poor-quality intelligence, officials insisting that non-existent centres did exist) but of particular interest, to me at least, are some of the findings related to privacy.

In summary, the fusion centres each represent a geographical area (states or cities) as “a collaborative effort of 2 or more Federal, State, local, or tribal government agencies that combines resources, expertise or information with the goal of maximizing the ability of such agencies to detect, prevent, investigate, apprehend, and respond to criminal or terrorist activity.” To this end, the centres produce intelligence reports that are sent to the Department of Homeland Security (DHS, similar in scope to the UK Home Office / Ministry of Justice and created by the consolidation of a range of security-related departments).

Privacy Failures

In the USA, the Privacy Act* governs the collection, maintenance, use and dissemination of personally identifiable information (PII) by federal agencies. One finding of the Senate report was that “if published, some draft reporting could have violated the Privacy Act.” Specifically “DHS officials also nixed 40 reports filed by DHS personnel … at fusion centers after reviewers raised concerns the documents potentially endangered the civil liberties or legal privacy protections of the US persons they mentioned.”

This, to me, raises two concerns:

  1. Why, given the fundamental nature of the privacy protections in both the Privacy Act and US Constitution, were fusion centre staff not better trained to compile reports?
  2. Since the Senate report focuses on counter-terrorism efforts but acknowledges that fusion centres play a significant role in other intelligence activities, it seems possible (even likely) that other privacy-sensitive reports could have been compiled and not checked/corrected/stopped by staff at the DHS.

Both of the above look like symptoms of the fairly standard “privacy as the last thing to think about” syndrome that seems pervasive in most organisations. So, how could organisations implement privacy protection as more than just a reactive bolt-on?

* Unlike EU Data Protection rules, the US Privacy Act applies only to federal agencies (and not bodies such as courts) and has no equivalent of (eg) the UK’s ICO.

Organisational Approaches to Privacy Protection

These two issues made me think about how organisations can structure themselves to protect the privacy of the individuals they collect data about. After some thought, I have come up with three models that could be used. I expect there could be more, and in practice I expect that most organisations have a hybrid arrangement.

1. The Firewall


The first model I identified is the one that seems to be used by the fusion centres – I call it “the firewall.” Within the organisation, there is little consideration given to privacy protection, but publication and data dissemination are controlled by a “firewall” that is designed to prevent the publication or dissemination of materials that could undermine individuals’ privacy.

This is similar to the model used by some companies for PR purposes – Employees are not allowed to talk directly to the media and are expected to refer such communications to the Public Relations department.


Pros:
  • It’s probably easier (and cheaper) to train employees to send materials via the correct channel than to train them on privacy protection policies and best practice.
  • As in the case of the fusion centres, there is a failsafe in place even where employees should know better.


Cons:
  • The firewall can prevent publication or dissemination, but it’s less clear how it could be used to enforce restrictions on internal processing or storage of data.
  • In large organisations, internal firewalls might be required to properly control data, but these would certainly slow down communication and introduce a layer of bureaucracy and expense.
  • Whilst, on the face of it, the firewall looks like the most rigorous way to ensure that data dissemination and publication don’t violate privacy-protection policies, it is impractical to shut down all channels of communication, especially when the lines between organisations are blurred, as in the fusion centres.

2. The Point of Reference


The second model I identified I call the “point of reference” – This is the model that Universities use to enforce research ethics. A body within the organisation is tasked with maintaining privacy policies and advising other parts of the organisation about what they can and cannot do. The rest of the organisation needn’t understand all the intricacies of privacy protection, but needs to know enough to identify when the point of reference should be consulted.

Here at the University of Southampton, the rule for when we should contact the Ethics Committee is fairly* straightforward: Whenever we conduct research that involves humans or animals.


Pros:
  • Unlike the Firewall model, the Point of Reference can be applied equally to data collection, maintenance, storage, use and dissemination.
  • It is easier for employees to identify WHEN they need to consult the Point of Reference than to understand all of an organisation’s privacy policies.


Cons:
  • Unlike the firewall, which can provide a reasonably good failsafe (as in the case of the fusion centres – at least so far as DHS reports are concerned), the point of reference could easily be bypassed unless it also has the authority to pro-actively check activities throughout the organisation.
  • The point of reference could become a point of friction if employees do not understand enough about organisational privacy policies to understand decisions that conflict with their goals.

* I say fairly, because there are some edge-cases; does scraping twitter involve human participants?

3. Culture of Privacy


My third model is what I call the “Culture of Privacy”. In this model, each employee within an organisation has a working knowledge of the organisation’s privacy policies and privacy is seen as an integral part of the organisation’s operations. Employees are responsible for more than just knowing when to refer to a point of reference: they have a personal responsibility for protecting the privacy of data subjects in the course of their work. This model involves the most training and support, and probably also involves appropriate sanctions for employees that engage in their own “privacy counter-culture.”


Pros:
  • This model applies privacy principles to all aspects of an organisation and allows for a degree of monitoring between employees.
  • If privacy is seen as part of an organisation’s core principles or even identity, then it is less likely to be seen as a hindrance.


Cons:
  • In practice, making privacy a core value is probably a pretty difficult thing to do (especially in engineering companies [hello Google, Bing, Facebook] where “what we can do” is more of a concern than “the side effects of what we do”).
  • An internal culture of privacy is likely to depend on a wider culture that respects privacy. There seem to be differences between the EU and the US in this regard, and the motivation to create such a culture might be stronger in the EU, given the stricter Data Protection regime.
  • Even with good training, employees are likely to require additional advice and support – so this model probably doesn’t work well by itself and probably needs to be considered alongside a point of reference.

Hybrid Models

As I alluded to previously, adopting a single model to try to enforce privacy protection within an organisation is probably not a good approach. None of the models is perfect and, in the EU at least, the consequences of failing to adequately protect data subjects’ privacy are serious enough that privacy protection is worth doing properly.

Creating hybrid models of privacy protection, for instance combining a point of reference with a firewall model for any substantial inter-organisation data transfers, is probably a better way to ensure that data subjects’ privacy is respected than (as the DHS appears to have done in the case of the fusion centres) relying on a single measure to enforce privacy protection.


The case of the US Fusion Centres illustrates atrocious project management on a number of fronts – But the apparent lack of robust privacy protection measures for data subjects is perhaps among the most unsettling. I’ve briefly explained three ways in which privacy protection could be implemented in an organisation, one of which (the firewall) appears to have saved the Fusion Centres from an even more damning report. However, in reality privacy protection needs to be at the heart of what organisations, especially data-intensive ones, do; and that probably involves a hybrid approach in which failsafe procedures are combined with a supportive environment and a culture in which employees consider privacy an important part of what they and their organisation strive to be.

There are issues that I haven’t explored about how privacy needs to be re-framed from a hindrance to engineers and service designers to being an enabler for the rest of us.

The First Interdisciplinary Web Privacy Seminar @ Southampton: Thursday 1st November 2012

Thursday, November 1st, 10:00 – 15:00, Building 32 Coffee Room

Many of us within the Web Science DTC at Southampton, and beyond, have research interests related to privacy. To foster collaboration and to help develop some common understanding and direction, we’re arranging a day-long seminar on web privacy on Thursday, November 1st 2012. Refreshments will be provided by the Web Science Doctoral Training Centre.

We’d like to invite anybody who’s working on privacy to take part and we hope that all attendees will give a short presentation (5-20 minutes) about their research or interest in privacy, focusing (if possible) on some or all of the following questions:

  1. What IS privacy?
  2. Why is privacy important?
  3. What changes, if any, do you think could improve our privacy? Technical, social, legal or otherwise.

After the presentations, we’ll discuss the questions that have arisen and examine possibilities for future research.

To register for the seminar, please use the form below and do get in touch, R.Gomer (at) soton.ac.uk, if you have any questions.

Richard & Maire

On the Ethics of “Consent”

TL;DR: Consent matters when it comes to cookies that could expose sensitive personal attributes (health, income, age, sexuality, religion, ethnicity), even if you don’t mean to collect them. Collecting these things could put the subject at a small but appreciable risk. The only person in a position to decide whether a personal attribute is sensitive is the subject (and even they may have trouble). Getting consent is different to getting someone to click the “I consent” button. People are irrational, don’t pay attention and are goal-focussed – It’s not OK to exploit that in order to get a meaningless but legally-acceptable “consent” signal.

Hand-Waving in the general direction of consent

Consent is one of those ideas that seems to permeate through every level of society. At a macroscopic level we talk of citizens being governed and policed by consent, and at a smaller scale consent underlies the relationships between individuals. It is only rarely that someone can be compelled to do something without their consent at some level – Whether that’s macroscopic consent derived from their participation in a democratic society or case-by-case consent formed through contract or interpersonal agreement.

What underpins the idea of consent is that the entity giving consent (whether an individual or a group, and sometimes both) has a meaningful choice to make: do I or do I not want to enter into a particular set of rules or conditions?

Consent and Cookies

So, what does consent have to do with cookies? An advertising network that tracks my visits over multiple sites isn’t compelling me to do anything, but it is taking decisions, the right to digital self-determination, away from me. As I’ll come on to later, people deserve a choice when it comes to data about them, and when an advertising network starts covertly collecting data that choice is taken away. Secondly, the EU 2009 e-privacy directive specifically requires that

“the storing of information, or the gaining of access to information already stored … is only allowed in the event that subscriber or user concerned has given his or her consent, having been provided with clear and comprehensive information … about the purposes of the processing.”

Consent vs “I Consent”

When piloting the study I’m working on at the moment, I spoke to several people about their experiences with cookies and asked most of them about the new “consent” dialogues that have sprung up on UK websites since May*. The overwhelming response seems to be that people have seen them, but don’t really pay attention to what they say or understand the decision that has to be made. That’s not surprising, people have been ignoring warnings about security certificates for years.

Here’s the difference between actually consenting to something and clicking on a consent button (or worse, “continuing to use this website indicates your consent”). The legal basis for determining whether a user has consented seems to be rooted in the same discredited notion that underpins classical Economics: that human beings are rational and self-interested. Worse, it assumes that people will always read, understand and give proper thought to the information that they’re shown. We know that both of these things are categorically untrue. Relying on human psychology to trick users into “giving consent” whilst simultaneously pretending that such consent is in any way meaningful is ethically indefensible. What matters is not whether you can get a user to click a button (probably after a shallow heuristic evaluation rather than critical thought) but whether you can say with any certainty that users are actually happy for you to do what it is you’re doing (and you can’t assume that they’d be happy if you haven’t actually told them).

If these techniques were proposed as “nudges” (and default options can be legitimate nudges) they would be rejected on the grounds that they’re not in the interest of the subject or even of broader society.

* May is when the Information Commissioner’s Office claimed that it would start enforcing the UK’s Electronic Communications Regulations, as amended.

Why “digital self-determination” matters

By “digital self-determination” I mean the right to control data about oneself – even in situations where it would be hard (although not impossible) to link that data back to the individual it relates to. Every time data about a person is stored, there is an unknown increase in the risk of harm to the data subject. It’s not the job of Bing, DoubleClick or Facebook to make risk decisions on behalf of the data subjects – the data subject is best placed to know which personal attributes are potentially sensitive given their personal circumstances.

Why does it matter if a company collects data about the web pages I’ve visited? There’s no single answer to that question. Some people have no reason to care, but others may have several. Advertising companies know that the web pages people visit can tell you something about them – they exploit that knowledge to target adverts based on what they think you’re likely to buy. But what somebody’s likely to buy is not the only thing you can infer. Consider the following examples:

A web user searches for advice about problems with their eyesight and tremors. In the UK those web searches wouldn’t be too sensitive – Our health care is free at the point of use. In countries where people rely on private health insurance that web search could be construed as evidence of a pre-existing medical condition and preclude the data subject from appropriate care if they were later diagnosed with Multiple Sclerosis.

You could make a reasonable inference as to the sexuality of somebody who routinely visits PinkNews.co.uk. For some people that’s not a problem, but for some people such a revelation could cause family or employment difficulties.

What about the social stigma around depression and suicide that might be invoked by disclosure of visits to the Samaritans website? Or the consequences of an abusive partner finding that their victim was seeking domestic violence support? An employer that found out you’d been uploading a CV to Monster?

Shouldn’t those sites just stop using third party services that could track their visitors? Probably. But that’s not enough – Newspapers carry stories about these topics and links to those websites. Bloggers that rely on free services don’t have a choice which third parties get to track their visitors.

“We use behavioral advertisers – People can accept it or leave”

Do people have a choice of whether they take a risk with their personal information? Perhaps they do, but should people have to make a choice between risking personal data and using a website? That is surely a form of indirect discrimination.

So, what’s your point?

The current system of tracking, the paternalistic attitude that companies take towards subject data and the technology that allows companies to do tracking with no consent from users is broken. Something has to give: either data protection legislation needs to be strengthened (or just enforced – yes, ICO, looking at you), companies that make money from surreptitiously stealing people’s data need to start behaving more responsibly, or the technology needs to be tweaked to give web users a break.

Introductory Thoughts on Cookies

What’s this all about?

During my internship at MSRC, I’ve been focussing on how we can visualise cookies to help people better understand what they’re doing and how they work. But there are other issues tied into this: Privacy (what it means for privacy to be undermined, and who has the ability to determine whether an action undermines an individual’s privacy), Technical Issues (how can we guard against “abusive” tracking cookies while sparing cookies that really are needed, without breaking things?), Legal Issues (particularly around data protection, the EU privacy directive and what informed consent is) and even some Economics (how do cookies support content-providers via ad networks, and how do you balance that against user privacy or make content worth the privacy risk?).

I think there are a few issues that keep coming up, no matter which way you approach cookies: The technical insolubility of preventing an ID from being used for several purposes, knowing what an ID is being used for, balancing the needs of websites to track users internally to optimise content versus users’ right not to have their browsing history across multiple sites snaffled by advertising networks.

Why selectively blocking tracking cookies might backfire

I’d be interested to see whether one could differentiate between “types” of cookie with accuracy good enough for general use – There’s no technical distinction, so I think it would be far from easy. The P3P approach, in which cookies are delivered with a machine-readable privacy policy, seems like it might address some of the problems with categorising cookies, but (from experience of trying to implement P3P policies for websites) it’s pretty complicated and feels out of place alongside the simplicity of HTTP itself.

But if we could tell “this is a cookie that keeps you logged in” and “this cookie is just for targeted advertising”, would that help?

Sites that set multiple cookies generally seem to do so out of convenience (it’s easier for, eg, product teams to have their own cookie) – a single cookie would probably suffice technically for the overwhelming majority of sites that currently use multiple cookies. There may therefore be a downside to widespread categorisation of cookies as “authentication” and “tracking”: sites might start consolidating into fewer multiple-purpose cookies that are harder for users to control individually, removing any shred of transparency that currently exists. I suspect also that, rather than reduce the lifespan of that single cookie to reflect the often limited lifetime of the current cookies, companies would just give the single cookie the lifetime of the longest-lived cookie at the moment, which undermines privacy further.

Knowing what cookies are for: A policy problem?

There could be a policy response that insisted on a certain level of atomicity in cookie use (not using a single identifier for technically-necessary identification like authentication and non-essential uses like tracking). Implementing that seems like it would either a) have a lot of side-effects for eg companies (like Facebook) that operate advertising only within an authenticated environment (differentiating the ID of the user and the advertising recipient makes little sense) or b) have a lot of loopholes to accommodate them.

Which cookies do I even want to control?

Contexts seem to play a role in the idea of privacy – I don’t care so much that the Guardian knows which stories I’ve read, but I do care that an advertising network knows which stories I read on the Guardian and which stories I read on the Telegraph and which product I looked at on Amazon – A third-party that doesn’t respect the “natural contexts” in my browsing is more troubling to me.

Applying contexts to the Cookie Jar

I think contexts could be implemented in the web browser. Sites could, by default, operate in a “sandbox” – A cookie for Facebook set in a first-party scenario (I’m on a Facebook URL at the time) can only be seen by Facebook. A DoubleClick cookie set in a third-party context while I’m on Guardian.co.uk can only be seen by DoubleClick when I’m on the Guardian – When I’m on the Telegraph, DoubleClick sets/sees a different DoubleClick cookie. This wouldn’t interfere with analytics on the site itself, and would still allow sites to track return visits without bothering the user for all the largely-innocent cookies.
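As a rough sketch of how such a sandboxed, double-keyed cookie jar might behave (all function and variable names here are illustrative, not a real browser API):

```php
<?php
// A toy cookie jar keyed by BOTH the first-party site being visited and
// the domain that owns the cookie: DoubleClick's cookie set while the
// user is on the Guardian lives in a different "sandbox" from any
// DoubleClick cookie set while the user is on the Telegraph.
function jar_set(&$jar, $firstParty, $cookieDomain, $name, $value)
{
    $jar[$firstParty][$cookieDomain][$name] = $value;
}

function jar_get($jar, $firstParty, $cookieDomain, $name)
{
    return isset($jar[$firstParty][$cookieDomain][$name])
        ? $jar[$firstParty][$cookieDomain][$name]
        : null;
}

$jar = array();

// DoubleClick sets a tracking ID while the user is on the Guardian...
jar_set($jar, 'guardian.co.uk', 'doubleclick.net', 'id', 'abc123');

// ...which it can read back on later Guardian visits...
echo jar_get($jar, 'guardian.co.uk', 'doubleclick.net', 'id'), "\n"; // abc123

// ...but NOT while the same user is on the Telegraph: different sandbox.
var_dump(jar_get($jar, 'telegraph.co.uk', 'doubleclick.net', 'id')); // NULL
```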

HTTP cookies already have a couple of properties that can be specified at creation (to eg restrict them to HTTPS connections or prevent access from client-side scripts) – A new property could allow cookies to break the sandbox and become global, accompanied by a user confirmation, perhaps using a P3P-like policy to tell the user what the cookie is for, like you get when adding an App on Facebook.
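To make that concrete, such a property would sit alongside the existing attributes in a Set-Cookie response header – something like the following, where Secure and HttpOnly are the real attributes mentioned above and “Global” is an entirely hypothetical sandbox-breaking flag:

```
Set-Cookie: id=abc123; Secure; HttpOnly; Global
```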

“Facebook.com wants to set a tracking ID on your browser. It will be used to:
– Keep you logged in to Facebook services provided on other websites
– Track your browsing activities for the purpose of behavioural advertising
Do you want to accept this tracking ID?”

Sandboxing the browser cache in a similar manner would help to prevent some of the other tracking mechanisms, like caching a unique image and then reading that back using a javascript canvas. I think that prevents large-scale tracking of a user’s browsing across many websites, but still allows cookies for legitimate cross-domain purposes (like Facebook comments on blogs) to work. The policy response then just needs to deal with companies that misinform users about the purpose of the cookies that they request be un-sandboxed, and possibly require that sites use separate global cookies for different purposes, so that the user gets some granularity in what they allow.

A social nudge?

There’s space for a social nudge here, I think. I sometimes feel like if I don’t accept, eg, an app’s permission request I’ll miss out (possibly coupled with a strong cultural influence to avoid saying no at all costs!). “3000 people have said no today” lets people feel that rejecting this cookie/request a) is socially acceptable and b) won’t disadvantage them, at least relative to this big number of other people.

If DoubleClick wants to incentivise the user to accept a global tracking ID by giving them something in return, then great!

Credibility Judgement and Meta-Content

Most of us know that there’s a lot of rubbish on the web – content that is wrong for one reason or another, whether it’s just out of date, the author just didn’t understand or was deliberately trying to mislead. Most of us would also like to think that we can tell the difference between “good” and “bad” content and act accordingly. But is that really true? Can I really differentiate between reliable and unreliable information about, say, a particular health problem? Even if some people can tell the difference all of the time, something that I’m highly doubtful of, it’s clear that some users can’t. In some cases, maybe this doesn’t matter too much. Health, or finance, though, are areas where relying on bad information could have serious repercussions.

So what’s this got to do with meta-content? I mentioned previously the similarities between the mass publication of bad meta-content that Web 2.0 brought about and the mass publication of bad content that was facilitated by the web itself. I’m most interested, though, in how meta-content could help individual users to make better judgments about the credibility of the information that they find online.

Social bookmarking, the ability to share, classify and comment on web links is a relatively common activity, albeit not something that your average web user takes part in. Services like Delicious and StumbleUpon help users to locate information that may be of relevance or interest, but they also allow users to write comments or reviews of the resources that they bookmark. In this way, social bookmarking services effectively allow users to annotate the resources that they find with their own opinions. My hypothesis is that these comments could help users to make more accurate credibility judgments about the information that they encounter online, even in domains where they have relatively little prior knowledge or experience.

Not all meta-content is created equal, though. If some meta-content can help people to make better credibility judgements then the challenge is how to encourage the meta-content that is helpful in this respect and minimise the amount of noise. To accomplish this, I propose the use of “nudge”-like techniques within the user interface to influence users as they create meta-content.

There are a few (subtly) different ways to describe what a nudge is, but the original definition, provided by Thaler & Sunstein in their influential 2008 book “Nudge: Improving Decisions About Health, Wealth, and Happiness” is:

“… any aspect of the choice architecture that alters people’s behaviour in a predictable way without forbidding any options or significantly changing their economic incentives.”

I’m currently running a study to test out whether nudges could be useful in this way, and to keep the experiment “clean” I won’t explain the nudges that I’ve designed yet. I’d love more people to take part, though. If you’ve any interest in health, fitness or well-being then head on over to fitness.gathr.co.uk to take part!

Web Science: What I think it is and why we might not be doing it.

Web Science is not doing science on the web, it’s not about the web, and it isn’t science. My view on what Web Science is and why sometimes I think we don’t actually do it.

Reader beware: The post below is an awful mishmash of half-formed ideas and potentially contentious thinkings. That said, I’d love to hear what you think, so have a read and leave a comment!

The question “what is Web Science” is one that comes up again and again, to the point of becoming a running joke. “Web Science is whatever you want it to be” is one of the more liberal caricatures that I often hear. What’s clear, though, is that up until this point most of the definitions have been given by people that I (respectfully) refer to as “Web Science Immigrants”. So, what is Web Science to a “Web Science Native” – someone who now has “MSc Web Science” affixed to their CV for the rest of eternity (or long enough, at least, for the distinction to be irrelevant) and who (supposedly) should have a feel for what the whole thing is all about?

What seems to be quite clear, certainly to me and to some of the other people I speak to, is that some of what’s labelled “web science” isn’t really Web Science at all. Some of it’s Web Technology, and some is “science about the Web”, and neither of these is the same as Web Science, although there is evidently some overlap. There is no shame in that, and there is undoubtedly some fantastic “web science” research going on, but Web Science should be more than a catch-all term for things that combine science and the web. As Wendy Hall sometimes says: “There are two problems with the name ‘Web Science’: ‘Web’, and ‘Science’.”

The problem with ‘Web’

The first problem with the word ‘Web’ is that everybody seems to have a different idea of what ‘Web’ is. Here are just some of the definitions that I’ve come across:

  1. An abstract information concept, the idea of having interlinked resources with unique identifiers (hypertext)
  2. A set of technologies
  3. The set of interlinked HTML (etc.) documents that exist now
  4. A series of social phenomena arising from 1 or 2
  5. A subset of the interlinked documents that we have. This suggests that our “personal web” is just one web in a potentially infinite webiverse. (If an HTML document is generated but nobody bothers to read it, does it really exist?)
  6. All of the above

The second problem with the word ‘Web’ is that web science isn’t just about the Web. Even allowing a broad reading that encompasses all of the definitions above (and allowing for the cardinal sin of conflating “web” and “internet”), there are, in my opinion, genuinely Web Science questions that don’t involve the Web. In fact, I see the word “web” as shorthand for “technology and people”, although I would be prepared to strengthen that slightly to “information technology and people”, since I don’t see Web Science legitimately encompassing the impact of trains on society.

So, this leads me to rule number 1: Web Science research should consider both the technology and the people that are involved in a system. Yes, this definition excludes just studying the web graph and making statements about density or the average shortest path between two web pages. We needn’t exclude graph theory or network analysis from Web Science, though (quite the contrary: it’s clearly massively relevant). Web Science requires that, having done the maths, we can go on to say something about the people – or, conversely, that having studied some human behaviour, we can say something about the technology. It’s all about the co-constitution, after all.
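To make the distinction concrete, here’s the kind of purely structural measurement I mean – a minimal Python sketch over a made-up toy “web” (the page names and links are invented for illustration, not real data), computing exactly the density and average-shortest-path figures that, on their own, say nothing about people:

```python
# Toy example: purely structural analysis of a tiny, invented "web graph".
# These are the measurements that, by themselves, fall short of rule 1.
from collections import deque

# Adjacency list: page -> pages it links to (all names are made up).
web = {
    "home":  ["about", "blog"],
    "about": ["home"],
    "blog":  ["post1", "post2"],
    "post1": ["home"],
    "post2": ["blog"],
}

def shortest_paths(graph, start):
    """Breadth-first search: directed link-distance from one page to the rest."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in graph[page]:
            if target not in dist:
                dist[target] = dist[page] + 1
                queue.append(target)
    return dist

n = len(web)
links = sum(len(targets) for targets in web.values())
density = links / (n * (n - 1))            # 7 actual links / 20 possible

total = sum(d for page in web
              for d in shortest_paths(web, page).values())
avg_path = total / (n * (n - 1))           # mean over all ordered pairs
print(density, avg_path)
```

Plain breadth-first search suffices here because the links are unweighted; the point is that nothing in these numbers mentions a human being.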

The problem with ‘Science’

The problem with the word ‘Science’ is that it excludes disciplines that don’t see themselves as sciences, and invites the “hard” sciences to deploy all manner of interdisciplinary name-calling and stereotypes in order to “defend” “real science” from “woolly”, “rigourless”, “qualitative” “social science”.

Try to explain how the web and people influence one another without mentioning law or the humanities. You can’t do it. The law shapes the web graph as much as (if not more than) the technology itself. A court order could ban links to, or prevent access to, a website that offers illegal material; a court order can alter the web graph.

So, here’s rule number 2: Web Science research involves knowledge, methods or epistemologies from both human-centric and technology-centric disciplines and it needs to do more than just pay them lip service. In fact, to properly stick to rule 1 and comment on the relationship between the people AND the technology, it’s highly likely that there will need to be a mix of research methods.

We study the Web itself

Even if we adhere to the two rules above, there is huge scope for variation with Web Science and clearly some research will be more about the social aspects and some more about the technical. But social/technical distinctions aside (and I think a discussion about whether that’s even a distinction worth making would be genuinely useful) there are different ways to combine disciplines. We have to choose not just which disciplines to use, but whether we want to make use of knowledge, research methods or entire methodologies. We can combine disciplines, analyse the web and still not be doing Web Science. Allow me to illustrate this point:

In November of last year, a group of us visited the Tsinghua University Graduate School in Shenzhen, China, to undertake a collaborative project looking at how young people in China and the UK view other countries. We took data from fora and bulletin boards, applied natural language processing techniques to generate statistics, and then visualised those numbers.
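The statistics-generating step was, in spirit, as simple as counting mentions. A rough stdlib-only sketch of that kind of pipeline (the posts and country list below are invented for illustration, not our actual data or code):

```python
# Sketch of the kind of pipeline described above: take forum posts,
# extract simple counts, and leave the visualisation step aside.
import re
from collections import Counter

posts = [
    "I'd love to visit Japan one day, the food looks amazing",
    "UK universities are great but so expensive",
    "Japan and the UK both have excellent music scenes",
]

countries = {"japan", "uk", "china"}

mentions = Counter()
for post in posts:
    # Naive tokenisation: lowercase, letters only.
    for word in re.findall(r"[a-z]+", post.lower()):
        if word in countries:
            mentions[word] += 1

print(mentions.most_common())  # [('japan', 2), ('uk', 2)]
```

Real NLP work needs far more care (entity recognition, sentiment, demonyms like “Japanese”), but even the sophisticated version answers the same kind of question: a count, not a claim about how the web shapes the people doing the posting.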

We learnt something about attitudes (people) by using technology and even something of the state of the technology itself, but I don’t feel like we said anything about how the technology and the people interact, or how the technology and people shape one another. No, this felt to me like using web technology to answer a sociology or politics question. To me, this was not quite web science. It was science ON the web, it was not science ABOUT the Web.

“How do young people view other countries” is a sociology question, and we tackled it using data from the web and methods from computer science. It was interdisciplinary in the sense that we attempted to answer a question from one discipline with methods from another, but it still didn’t feel like we were ‘living the Web Science dream’. I think that true Web Science would instead ask “How does the web influence young people’s views of other countries?” or “How does the web expose people to other cultures?”

So, here is rule number 3: Web Science should say something about the relationship between the people and the technology. We should question how technology facilitates and alters behaviour or beliefs, how it impacts upon the economy or how laws evolve to counter new problems, how people create new technology and how social pressures impact upon its adoption and potentially translate into obstacles or social problems such as exclusion or deviant behaviour.


I don’t believe that a lot of “web science” is actually Web Science. Web Science is not necessarily about the web, nor is it necessarily science; it is the study of how technology and humanity work together, shaping one another. Maybe we should really be calling it “Information technology-and-people studies”. We may need to use any or all of the models, knowledge and methodologies that humanity has found in order to study itself, and all of the models, knowledge and methodologies that humanity has found in order to study and create technology.

I believe that, in order to be considered Web Science, research should satisfy at least the following three conditions:

  1. Web Science research should consider both the technology and the people that are involved in a system,
  2. Web Science research involves knowledge, methods or epistemologies from both human-centric and technology-centric disciplines,
  3. Web Science should say something about the relationship between the people and the technology.

Want to add something, think I’m wrong or have your own view on what Web Science is? Leave a comment and let’s work it out together!

Thoughts on Meta Content

Over the next few posts, I want to tackle some of the issues from my PhD research around “meta-content” (comments, reviews etc.). Here’s an introduction to meta-content, my research, and why I think it’s interesting.

Web 2.0 is characterised, in part, by a massive increase in user-generated content. YouTube, Flickr, Blogger, Tumblr et al let anyone publish just about anything: Videos, photos, essays, news reports. But, in addition to this new “primary” content comes a wave of user-generated opinion in the form of comments, reviews, trackbacks, discussions, video responses, flaming, trolling and rick-rolling. We now have billions of dollars worth of everybody’s two cents.

Content               Meta-content
Wikipedia article     Article talk page
YouTube video         Viewer comments
Blog post             Reader comments
Online news article   Reader comments
Website               Comments on Delicious / StumbleUpon; discussion on reddit

It’s this “other stuff” that I’m most interested in, and it’s the direction in which my PhD is heading. It’s this other stuff that I call “meta-content” – content that is about other content. The table above shows a type of online content on the left and a corresponding type of meta-content on the right.

Often, when looking for information, the content itself is what seems most useful or most interesting; but frequently the meta-content surrounding it provides a resource in itself.

It’s not hard to think of a situation where the meta-content might be more useful than the content itself. Take the Wikipedia example: The article might provide a fairly neutral account of a topic, a subset of the “facts” that everyone can agree on, but the talk page can provide a much better understanding of the discourse around an issue, of the opposing points of view or which aspects of an issue are contentious.

Similarly, the comments on a news article can provide a better idea of the debate surrounding events than the story itself. In many cases comments provide balance to biased reporting or correct inaccuracies.

Of course, meta-content is not all balanced intellectual discussion. The most obvious issue is comment spam, although there are technological solutions that do a reasonable job of stemming that. The spam problem aside, some types of meta-content have a reputation for being particularly unhelpful or unpleasant – the comments on YouTube videos are a good example – and far from contributing helpful information, much meta-content contributes nothing but anecdote and rumour.
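For a sense of how crude even a workable spam defence can be, here’s a deliberately naive sketch. Real filters (Akismet and friends) are vastly more sophisticated; the word list and thresholds here are arbitrary inventions for illustration only:

```python
# A deliberately naive illustration of automated comment-spam filtering.
# The keyword list and thresholds are arbitrary, for illustration only.
import re

SPAMMY_WORDS = {"viagra", "casino", "cheap", "free"}

def looks_like_spam(comment):
    text = comment.lower()
    links = len(re.findall(r"https?://", text))
    spammy = sum(word in SPAMMY_WORDS
                 for word in re.findall(r"[a-z]+", text))
    # Arbitrary rule of thumb: lots of links or spammy words is suspicious.
    return links >= 3 or spammy >= 2

print(looks_like_spam("Great post, thanks for sharing!"))                         # False
print(looks_like_spam("CHEAP pills, FREE shipping http://a http://b http://c"))   # True
```

The interesting point is that spam is the *easy* part of bad meta-content: it’s mechanically detectable in a way that anecdote and rumour are not.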

In many ways, the problems posed by meta-content are no different to those posed by web content in general. The move away from a publishing model where publishers and peers act as “gatekeepers” to a model where anyone can publish anything brought with it new problems with inaccurate or deliberately misleading information. As the barriers to publishing are lowered, it is almost inevitable that more bad content will follow. We still don’t really have a solution to the problem of bad content on the web (although Hypothes.is is trying), save for educating people to be a bit more critical about the information that they find.

The problem of useless or malicious meta-content might not be insurmountable, though. Meta-content is the result of social engagement with content and is, therefore, mediated in part by the social norms within the community that produces it. Online communities have their own cultures and norms and these undoubtedly arise as a result of both the people within those communities (and the cultures that they bring with them) and the online environment (design, usability, affordances) itself.

There’s some interesting research showing how the design of a website affected the thoughtfulness of user contributions, and my own research is trying to use psychological “nudges” to alter the composition of user-provided reviews in a social bookmarking context. The basic premise is that if we can find ways to shape the cultures and norms of an online community, or to promote certain types of thinking, then we potentially have the ability to start steering meta-content in the direction that we want.