Data & Design How-to's Note 3: Opening open data



Data seems more abundant these days. Governments and international institutions are publishing more, and a growing collection of initiatives around the world is making it more useful to citizens. These projects are helping realise the goals of many activists, such as demystifying how tax money is spent, seeing how elected representatives vote in parliament, how development aid money is used, and how public services perform. This is driving innovation in a wide variety of sectors, including politics, journalism, public services and anti-corruption.

For others, the opposite is true. Data remains as hard to find as it ever was on issues that are opaque, such as corporate and national security issues. In the majority of countries, not least those that are closed and repressive, freedom of information remains a far-off aim. The data that many activists have is often found, leaked to them or discovered through tenacious and risky investigations. In some cases, it comes in stacks and boxes of paper that need to be made sense of somehow.

Whether easily obtained from public sources or not, the biggest challenge for activists is still to do something useful and effective with the data they can get. In this third Note on Data and Design, we offer three complementary takes on how to do this:

  • The first section - Open data. What is it? So what? - gives an introduction to the access to information and open data movements, and what they mean in practice for activists. We draw on examples of advocacy groups and others using open data and its techniques to make things happen in what we call 'data dark zones'.
  • Making data useful online is the second section. In it we look at the basic starting points for making smart use of data, including using it to tell stories and 'hack' public services to improve them. Our key example is of an initiative that has made over 2 million of India's court cases available online, with rather disruptive consequences.
  • We wrap up this Note with a section called Accessing data when it's inaccessible. This detailed, hands-on section digs deep into two essential techniques to help you get at data locked in stacks of paper and drives of digital files. The examples we use come from groups documenting political violence in Zimbabwe, and investigating the secret world of the CIA's extraordinary rendition program.

1. Open data. What is it? So what?

How much information and data do you think the public bodies in your country create? Governments have long published at least some sorts of data, often through national statistics offices, or through various different thematic websites. However, the current scale and nature of data publication by some governments is very different from even a few years ago.

To see this, take a quick look around (but please come right back). This is the main location where the United States government publishes data created and held by its national level institutions. At the time of writing, it is a catalogue and home for nearly 400,000 datasets created by 172 US government agencies, yet it remains far from comprehensive. The website is designed to make it as easy as possible to find data, and the data itself is structured so that it can be re-used with ease. Importantly, the data comes with no restriction on who can use it and for what. It's “open” in the common sense of the word. But it is just one of many such official “open government data” sites operated by governments in other countries. This is not to mention the many sites that have been created by non-governmental groups to catalogue the public data that exists.

The information created and held by public bodies is valuable in a range of different areas, including:

  • Participation in public life: information is essential to the functioning of organisations at all levels, including decision-making, administration and financial management. When it shows the operation of government and the state, advocates argue that putting this information in the hands of citizens is a precondition to the effective scrutiny of the use of power.
  • Economic value: some of the information collected and maintained by public bodies has economic value. For example, consider the importance of geographic or meteorological information to both government and businesses. How public bodies enable others to create businesses using this information (which they already pay to collect) is becoming a key question. 

There are two complementary activist 'movements' working specifically in this area. The first is the “access to information” movement, also referred to as Freedom of Information (FOI). The second is the “open data” movement. Access to information activists put pressure on governments to enact and implement laws enabling people to ask questions of any official body that is part of or controlled by the state, and receive prompt and thorough answers. They draw on the idea that information produced using tax money is owned by the tax-paying public, and should be made available to them without restriction. As public bodies respond to people's queries and pro-actively publish the information they create, people are able to see, better understand and scrutinize the workings of the public bodies they fund. Access to information is seen as a necessity for effective participation in public life; a tool to redress one sort of imbalance between people and the powerful institutions that govern them.

Open data activists build on these ideas and concern themselves with the re-use of data and information released by public bodies. This emerges from two important changes created by the internet:

  • A collapse in the costs of sharing any kind of information, and the methods by which information can be shared and consumed; and,
  • the fact that 'digitally native' people everywhere are creating and consuming information on the internet. Many people use online forums, social media and blogs as a key part of their lives, using them to learn, form opinions and seek advice. Other, more technical groups “mash up” data – putting it online, showing it on maps, making it searchable – to try and show interesting or new things.

Open data advocates point out that these changes are far too disruptive for anyone to ignore; that they are profound and irrevocable in every aspect of life, and the public and civic sector should adapt to them. They argue that public bodies should not only release information and data with modern online habits in mind, but they should do it in a way that removes technical, financial and legal obstacles to any sort of re-use. In practice this means designing methods and standards for releasing different sorts of information in ways that anticipate but don't preclude what people might want to do with it. These include making sure information is released in digital formats that can be used in commonly used desktop tools like spreadsheets, and using common standards to enable linking between datasets.  This important technical work removes practical obstacles to the potential held within the data, helping to realise the promise that the calls for access suggest. 

Open data has risen in prominence as an idea over the past few years, particularly in countries with so-called mature democracies. Open data approaches are beginning to be experimented with across a spread of governmental and civic activities, often with interesting results. Their impact will take longer to determine, and a common objection from advocacy groups is that the availability of more data doesn't translate automatically into more effective services. Yet open data and FOI are not ends in themselves; they're far from perfect, and there's an art to using them effectively. At this early stage, open data resources may have little to offer directly on many of the contentious global issues – state-sponsored violence, conflict, human rights violations, environmental degradation, resource transparency – particularly as these play out at the global level, or in transitional or repressive parts of the world. In such places it may be impossible or even dangerous to ask a local authority or a company to release data. Yet this has not stopped activists experimenting with these methods in 'data dark zones', for example by not waiting for information to be released but instead finding it themselves, creating their own resources or working with leaked information.

Data dark zones

Even where immediately relevant data is scarce, thoughtful illumination can make it useful. Here are three examples of activists who have taken data from one place or issue and used it to inform another. In each of these cases, data resources that were already public were re-worked by advocacy groups in ways that create opportunities for further re-use and analysis by different audiences.

A. Monitoring - Uganda Development Aid Spending, by Publish What You Fund and Open Knowledge Foundation

The transparency of development aid donors is extremely important to beneficiaries of aid in their attempts to get their own governments to be more effective and accountable. While development agencies from OECD countries have long published data about how development budgets were spent, the level of detail, comprehensiveness and value was inadequate. Recently, campaigners have spread 'open data' thinking to the development aid sector, through the creation of the International Aid Transparency Initiative (IATI). IATI is a data standard and a commitment from aid donors – mostly governments - to publish detailed data about how much development aid is being spent, where and when, on an ongoing basis. 

But what good is this on its own? The Uganda Budget Explorer uses the sort of data that is now being released as standard through IATI to show both development aid and national government spending as it is divided up by area of ministerial responsibility. Using data released by a range of different foreign governments, it gives a unified overview of the areas where development aid is a significant supporter of public services and functions in Uganda. This is a useful attempt to use data – and an online visualisation – to help citizens and researchers alike get a more complete picture of government expenditure, and the role, influence and effectiveness of development aid.

B. Leaking - TuniLeaks by Nawaat de Tunis 

In November 2010, WikiLeaks started the release of a quarter of a million leaked internal memoranda (“cables”) sent between United States embassies and the State Department. The cables cover over 40 years of confidential reporting, opinion and analysis by US officials about diplomatic relations, human rights, corruption, politics and events in nearly every country of the world. Immediately after the WikiLeaks release, Nawaat de Tunis – an independent news website run by a collective of Tunisian bloggers and digital activists – started looking through the cables for what they could reveal about the Tunisian dictatorship of Zine El Abidine Ben Ali. Nawaat set up Tunileaks to pull together the cables from the US Embassy in Tunis, translate them from English into French and then spread the content widely across the Tunisian internet.

Tunileaks was put online days before a remarkable chain of events was set in motion. For years, the Tunisian regime had been successful at suppressing public dissent about its corruption and human rights abuses. In mid-December 2010, citizen-made videos and reports about protests, and about the suicide of a young man in response to the dire economic and political situation, spread across social media. These were picked up and re-broadcast on television and online by the Al Jazeera news network. In under a month, the dictatorship had fallen. Some experienced Tunisia watchers commented that Tunileaks contained little that experts didn't already know about the Tunisian dictatorship. However, further analysis stressed the empowering effect that this leaked public data had on Tunisians when they found out en masse that other, influential people also thought the Ben Ali regime was corrupt and despotic.

Tunileaks illustrates two useful ideas. First, it shows the value of keeping an eye on external resources for information that could be brought into play. Sometimes we can be too narrow in where we look for relevant information. Second, Nawaat successfully repackaged existing information to make it accessible in a timely way to audiences who would never otherwise have been able to access it.

C. Collecting - 'Big Brother Inc.' by Privacy International 

In recent years, a highly secretive industry has grown up around creating and selling technologies that can be used to intercept emails and website use, hack online user accounts and track their owners' behaviour and location through internet and mobile use. The risks that activists and journalists around the world face as a result of digital surveillance by repressive regimes have also grown, some say in lockstep with the market for these technologies.

However, it has long been difficult to gather evidence of systematic connections that would help activists exert pressure on companies and governments to comply with Human Rights standards in the export and use of these technologies. 

Researchers from Privacy International (PI), a Human Rights group based in the UK, managed to attend a number of surveillance industry conferences. By collecting many of the product marketing materials freely distributed at the ISS World conferences, they were able to identify which companies were offering what services. Privacy International and a consortium of activist and journalist groups released this information as The Spy Files (a similar set of information was also released by the Wall Street Journal as a searchable dataset called the Surveillance Catalogue).

Through further data gathering activities, PI were able to obtain lists of which companies and government agencies attended the same ISS conferences. They published this in the form of a Surveillance Who's Who, which gives leads to public agencies in over 100 countries that have shown active interest in surveillance technologies. This data has been wired in to other public data resources and services about public spending and company information.

In parallel with the emergence of these resources, there have been stories about business relationships between British surveillance companies and the Syrian regime, and between French providers and Libya's now defunct Gaddafi regime. The resources released by PI are important as they provide new windows into the operations and behaviours of repressive governments. By publishing them on an open platform, PI not only raise the issue but also make the resources available for others to analyse and investigate. Others have the opportunity to fill in the gaps in the existing dataset, improving the resource for everyone interested in the issue.

Through these examples of working in 'data dark zones' a new picture emerges of how governments, businesses, and advocacy groups could (and perhaps should) function in the internet age. These new trends in getting and using public information have been developed through a blend of re-thinking ideas about transparency and a redefinition of the methods and ways that data and technologies can be put to use in advocacy. 

They have inspired new forms of online media and approaches to news reporting, called 'data driven journalism'. They have inspired the creation of new sorts of public, digital services that enable citizens to engage in political life in ways unthinkable before. Along with social media, these new ways of publishing information present opportunities to activists and others. 

What can you do with them?

Further resources for this section

2. Making data useful online

“If you would have asked me in 2008 whether I was an activist I would say 'no'. I was a pure tech guy at that time. Now I think I have a role of providing free access to law in India.”

Dr Sushant Sinha laughs quietly when we ask him if creating IndianKanoon, a free, daily updated online search engine of 2 million of India's laws and court judgements, is activism. He sees his work as solving an annoying problem. “A lot of other Indian websites that try to provide legal information make no connection amongst the documents, so judgements don't refer to one another. You can't find a link to these judgements. As a result what happens is that people are confused by this complete jargon.”

Sinha, a software engineer with Yahoo! India in his day job, started taking an interest in law in 2005, spending time on the growing number of law blogs that appeared in India around that time. But he was unable to quickly find sources mentioned in the blogs or understand what a case was about. “The frustration was that I did not have the legal background. In 2005 or 2006 the Supreme Court of India started putting each judgement online. So I started reading them and I was like 'oh man, there is too much jargon'. But then an idea struck me. Let's suppose these people know that these sections are important, so why can't computers automatically discover it?”

To advance his own understanding of the law Sinha used his skills as a computer scientist to bring together around 30,000 judgements from the Supreme Court of India published on its official website. His computer programs 'read' through each judgement, picking out citations of sections of the legal code and references to other cases the Supreme Court had decided. They then link them all together making the legal documents dramatically easier to search, browse through and understand. 
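As a rough illustration of this kind of automatic 'reading', the Python sketch below picks out references to numbered sections from a piece of text. The pattern, the function name and the sample sentence are all assumptions made for illustration; they are not taken from Sinha's programs, which are not described in detail here.

```python
import re

# Hypothetical pattern: matches phrases such as "Section 302" or "section 34A".
SECTION_PATTERN = re.compile(r"\b[Ss]ection\s+(\d+[A-Z]?)")


def find_section_references(text):
    """Return the section numbers cited in a text, in order, without repeats."""
    found = []
    for match in SECTION_PATTERN.finditer(text):
        number = match.group(1)
        if number not in found:
            found.append(number)
    return found


# A made-up sentence standing in for the text of a judgement.
sample = (
    "The accused was charged under Section 302 read with section 34 "
    "of the code. Section 302 applies."
)
references = find_section_references(sample)
print(references)
```

Once each reference has been found, it can be turned into a link to the section it names, which is the step that makes a collection of judgements easier to browse.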

But he didn't stop there. In early 2008 he decided to put his work online as a simple to use search engine. The High Courts of India also publish the outcomes of court cases each day, so Sinha began to include them in the search engine. His programs – called 'crawlers' or 'scrapers' (which we explain later in this Note) – automatically visit these websites each day to look for new material, downloading what they find and adding it to the search engine.

As published officially online, court judgements come in lots of different shapes and sizes with few standards in the way that judgements are presented or accessed online. For legal scholars and the public trying to understand the law, this is a significant obstacle.  Predictably for a computer scientist, Sinha sees this as a small issue when set against the big achievements of having the judgements in the first place: “the big achievement here is that you get daily Orders passed by the court. The information is there for you. You have to give a lot of credit to the law ministry in this regard. There are not a lot of things that work correctly with the government but this thing is working very well in the sense that the law ministry has been, and actually the entire government has been pushing every segment of public facing organisations to be providing its data.” 

This is a key point. Thanks to India's Right to Information regulations and proactive government policies a lot of data about the legal system is already out there. What Sinha has done is to create a way to access the information that meets the needs of lawyers and the public, rather than the publication preferences of the bureaucrats running the court system. He has earned a large number of users through taking what was to him the next obvious step and focussing on what users of the information really needed.  Without this user focus it would not happen, at least not at the current time. 

Not everyone has been happy. As the court judgements in IndianKanoon are also indexed in Google and other search engines, many people involved in court cases are finding their names appearing in search engine results for the first time. Some have pleaded with Sinha to remove their names, effectively asking him to change the content of original, already public court documents. From the data supply side, Sinha notes that IndianKanoon's focus on ease of use shows how the interests and capabilities of the IT companies running the court systems get in the way of a useful, responsive service for the end users. As for civil society groups working to improve access to information about the law, he believes they don't see it as their role to work on this sort of technical project: “Civic activists in India tend to follow this route: file a public interest litigation in the concerned court about it. They know what it takes. I have no idea about how to take that battle - so whatever sort of battle I can fight, I'm fighting.”

So how do you come up with an idea about how to use public or even your own data?  The creator of IndianKanoon was a newcomer. He had no activist intent and was “scratching his own itch”. He put his work online and found that others valued it. If you want to use these techniques in your area of work here are four ways to think about the data you have or have found, and how you could help others also get the most out of it: 

  • Find a public service that really should be better, or try to create a completely new one if it doesn't exist yet: some of the first, most interesting and influential open data initiatives have been created by people frustrated by a public service that wasn't working as well as it should. Two of the best examples are The Public Whip and They Work for You websites. Together, these create an “accessible” version of the official transcript of the UK Parliament. The developers of these websites were frustrated that it was not possible to see how Members of Parliament had voted because this data was buried in strange places in the official transcript. They applied a technique called web-scraping (which we discuss later in this Note) to collect this information, and built a user-friendly website to display it. Work in this area of online parliamentary informatics has rapidly taken off globally in the last few years as you can see from the big list of sites here. MySociety, the organisation that runs They Work for You, has written a guide to what it takes to get these sorts of sites up and running.
  • 'De-fragment' an area of knowledge by pulling it all together: Many different groups collect and publish data about the same thing, but do it in different ways with different approaches, standards and technologies. For example, different governments publish information about companies in different ways: in a globalised world, this makes it difficult to track the activity of companies and the individuals associated with them. An interesting approach to solving this is OpenCorporates, which pulls together data about the registration and ownership of companies from around the world. OpenCorporates does the heavy lifting of making corporate information easier to access, meaning that others researching companies don't have to. As mentioned above, the International Aid Transparency Initiative (IATI) does something similar in creating a standard that governments and international organisations can use to publish data about development aid spending, enabling it to be aggregated and compared.
  • Find stories in data. Tell stories with data: the release of data through access to information laws, and technical innovation from the open data movement, have also affected news and investigative journalism. Initiatives like the Guardian's Datablog and ProPublica have added to the existing skills of journalists by developing better technology tools for collecting, analysing and showing data. At times, this means that the data is the story; at others, the publication of visualisations made from the data extends reader interest in a story through revealing different aspects or angles about it. Finally, data can be a place to find new stories, particularly if you can create a way of showing raw data that enables the public to help “trawl” through it for interesting things.
  • Publish information in ways that are native to the Internet: The examples above, particularly the Surveillance Who's Who show how advocacy groups are beginning to adapt to Internet-native ways of publishing information online. In this case, the idea is to encourage others to use it by making it easy to search, explore, re-use and contextualise. Rather than think of your data as a table in a report, think of it as a service to others: what else could they do with it that you can't? The wave of open data portals, such as Open Data Kenya, go even further by providing tools for mapping and quantitative analysis. The Open Knowledge Foundation has created a guide to realising this sort of technical openness, here.

Your audience may be changing their expectations and increasing their level of interest in delving directly into the data. Increasingly, online data sources are used not only by specialists, researchers, academics, and journalists, but also by concerned members of the public who have surprised data publishers with their level of curiosity and the deep dives they are willing to make into raw data in order to form their own opinions.

To learn a bit more about this way of thinking about data, have a dig into the resources below:

3. Accessing data when it's inaccessible

In many cases, data will not be as freely available and easy to re-use as it could be. Researching an issue or requesting information can result in stacks of paper or thousands of digital files. These can be overwhelming and difficult to make sense of quickly, and it can be hard to know how to proceed. Do you just start flicking through the documents with a pen and paper, or would a more systematic approach be better? What technologies could be helpful?

Activists and journalists have been collaborating with technologists on a range of potentially useful approaches to overcoming situations where the format gets in the way of the information. To wrap up this Note we're going to take 'deep dives' into two of these techniques:

  • digitization and optical character recognition; and,
  • a data conversion technique called scraping and parsing

Deep dive on scraping and parsing: 'reverse engineering' a digital document to make the data in it more useful

Most of us who use computers are comfortable enough creating documents. However, we are less familiar with unpicking the documents others have made when the structure they have used gets in the way of our own effective use of the information and data contained within it. It is possible to “reverse engineer” documents to make their content easier to work with and analyse. In this section, we explain a technique called scraping and parsing.
The materials we'll use to demonstrate this technique come from the Zimbabwe Peace Project (ZPP), a Zimbabwe-based organisation that documents political violence, Human Rights violations and politicised use of the emergency food distribution system. They have a nationwide network of field monitors who submit thousands of incident reports every month, covering both urban and rural Zimbabwe. Between 2004 and 2007, ZPP released comprehensive reports detailing the violence occurring in the country. The reports are dense PDFs and Microsoft Word documents that are digests of incidents, unique in their comprehensiveness. As documents, they are also pretty inaccessible and get in the way of trying to see what happened and how the situation changed over those years. While the data is locked inside PDFs, it is hard to do anything else with it, such as search, filter, count and plot it on maps. What can we do about this?
All documents are arranged in a particular, pre-defined way. Whether they are reports or web pages, they will have a structure that includes:
  • different types of data, such as text, numbers and dates.
  • text styles like headings, paragraphs and bullet points. 
  • a predictable layout such as a heading, a sub-heading, then two paragraphs, another heading, and so on.
Here's a single page from one of ZPP's reports about political violence in Zimbabwe in 2007 (PDF). What can you see in there?

No. | Element on the page | How does it appear in the report? | What is it really? What type of data is it?
1   | Northern Region     | Heading 1                         | Geographic area (Region)
2   | Harare Metropolitan | Heading 2                         | Geographic area (District)
3   | Budiriro            | Heading 3                         | Geographic area (Constituency)
4   | A date              | Heading 4                         | Date (of incident)
5   | Paragraph           | Text                              | Text describing an incident
4   | A date              | Heading 4                         | Date (of incident)
5   | Paragraph           | Text                              | Text describing an incident

This structure repeats itself across the full document. You can see a regular, predictable pattern in the layout if you zoom out of the report and look at 16 pages at once:

So there's lots of data there, but we can't get at it. The report is very informative, containing the details of hundreds of incidents of politically-motivated violence. However, it has some limitations. For example, without going through the report counting them yourself, it is impossible to find out what incidents happened on any specific day across Zimbabwe. This is because the information is not structured to make it easy for you to find this out. It is written in a narrative form, and is contained in a format that makes it hard to search.

To do something about this, the format of the information has to change to allow it to be searched better. Try to imagine this report as a spreadsheet:
Geographic area (Region) | Geographic area (District) | Geographic area (Constituency) | Date of incident | Incident
Northern Region | Harare Metropolitan | Budiriro | 4 September 2007 | At Budiriro 2 flats, it is alleged that TS, an MDC youth, was assaulted by four Zimbabwe National Army soldiers for supporting the opposition party.
Northern Region | Harare Metropolitan | Budiriro | 9 September 2007 | In Budiriro 2, it is alleged that three youths, SR, EM and DN, were harassed and threatened by Zanu PF youths, accused of organising an MDC meeting.
Northern Region | Harare Metropolitan | Budiriro | 11 September 2007 | Along Willowvale Road, it is alleged that AM, a youth, who was criticising the ruling party President RGM in a commuter omnibus to town, was harassed and ordered to drop at a police road block by two police officers who were in the same commuter omnibus.
A spreadsheet created in something like Open Office Calc or Microsoft Excel enables this information to be sorted, filtered and counted, which helps us explore it more easily.
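To give a sense of what filtering and counting look like once the data is in rows and columns, here is a minimal Python sketch that treats the three example incidents as a small CSV table. The shortened column names and the abbreviated incident descriptions are assumptions made to keep the example short.

```python
import csv
import io
from collections import Counter

# The three example incidents written as CSV text.
# Column names and incident descriptions are shortened for this sketch.
CSV_TEXT = """Region,District,Constituency,Date,Incident
Northern Region,Harare Metropolitan,Budiriro,4 September 2007,Assault at Budiriro 2 flats
Northern Region,Harare Metropolitan,Budiriro,9 September 2007,Harassment in Budiriro 2
Northern Region,Harare Metropolitan,Budiriro,11 September 2007,Harassment along Willowvale Road
"""

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# Filter: keep only the incidents recorded on one specific day.
on_9_september = [row for row in rows if row["Date"] == "9 September 2007"]

# Count: how many incidents were recorded in each constituency.
per_constituency = Counter(row["Constituency"] for row in rows)

print(len(on_9_september), "incident(s) on 9 September 2007")
print(per_constituency["Budiriro"], "incident(s) in Budiriro")
```

A spreadsheet program does the same job through its sort and filter menus; the point is that both need the data in a table first.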
However making this from the original ZPP reports would require lots of cutting and pasting – time that we don't have. So what can help us? If you can read it, a computer might be able to read it as well. Thankfully, documents that are created by computers can usually be “read” by computers. 
With a little technical work, a report like the one in our example can be turned from an inaccessible PDF into a searchable and sortable spreadsheet. This is a form of machine data conversion. Knowing how this works can change how you see a big pile of digital documents. The computer programs that are used to convert data in this way are called scraper-parsers. They grab data from one place (scraping) and turn it into what we want it to be through filtering (parsing).
Scraper-parsers are automatic, super-fast copying and pasting programs that follow the rules we give them. The computer program doesn't “read” the report like we do, but it looks for the structure of the document, which as we saw above is quite easy to identify. We can then tell it what to do based on the elements, styles and layouts it encounters. Using the ZPP reports, our aim is to create a spreadsheet of the violent incidents, including when and where they happened. We would give the scraper the following rules:
  • Rule 1: If you see a heading that is a) at the top of a page, b) in bold capitals you shall assume this is a Geographical area (Region) and print what you find in Column 1 of the spreadsheet.
  • Rule 2: After you have seen a Geographical area (Region) heading, you shall assume that, until you see another heading at the top of a page in bold capitals that is different from the previous one, you are looking at things that have happened in that geographical region.
  • Rule 3: Until then, whenever you see a paragraph of text that has one line on top of it, and one line beneath it, that is preceded by a date in the form “Day Month Year”, these are incidents that happened in this geographical region, so you will copy them to the column called “Incident”.
Once the rules are set, the scraper-parser can be run. Very quickly it will have gone through this 100 page document pulling out the data you have told it to. The scraper might not get it right first time, and there will be errors. The key point is that you can improve a scraper-parser, run it 100s of times and check by hand what it has put in your spreadsheet, and it will still be faster than trying to re-type out the content yourself.
Scraper-parsers have to be written especially for each document because the rules will be different, though the task is the same. However, in most cases this is not a major challenge for a programmer. The challenge is for you to understand that it is possible, and to explain clearly what you want!
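To make the three rules above more concrete, here is a minimal sketch in Python. It is not the scraper used on the ZPP reports. It assumes the report text has already been pulled out of the PDF as plain text, that a region heading is a line such as "Northern Region" (plain text does not keep bold capitals), and that each date sits alone on its own line. The function name and the sample are ours, for illustration only.

```python
import re

# Assumption: in plain text a region heading is a line such as "Northern Region".
# The real reports mark it with bold capitals, which plain text does not keep.
REGION = re.compile(r"^\w+ Region$")
# Rule 3: a date in the form "Day Month Year", alone on its line.
DATE = re.compile(r"^\d{1,2} [A-Z][a-z]+ \d{4}$")


def parse_report(text):
    """Turn report text into rows of (region, date, incident)."""
    rows = []
    region = None   # Rule 2: remembered until a different heading appears
    date = None
    incident = []   # lines of the incident currently being collected

    def flush():
        # Store the incident collected so far, if there is one.
        if date and incident:
            rows.append((region, date, " ".join(incident)))

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if REGION.match(line):      # Rule 1: a new geographical area
            flush()
            region, date, incident = line, None, []
        elif DATE.match(line):      # Rule 3: a date starts a new incident
            flush()
            date, incident = line, []
        elif date:                  # text after a date belongs to the incident
            incident.append(line)
        # Lines before the first date (such as a district name) are skipped here.
    flush()
    return rows


SAMPLE = """Northern Region
Harare Metropolitan Budiriro
11 September 2007
Along Willowvale Road, it is alleged
that AM, a youth, who was criticising
"""

for row in parse_report(SAMPLE):
    print(row)
```

Each row can then be written to a spreadsheet. A real scraper-parser would need more rules than this, for example to keep the district name, and its output would still need checking by hand.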

Dull repetitive stuff is really what computers live for, so it makes them happy 

You might think that it is not worth writing a scraper-parser for a one-off task. However, what if you have hundreds of documents, all with the same format, all containing information you want? In the Zimbabwe example, there are 38 reports produced over nearly 10 years. Each is dense, and in total they contain data on over 25,000 incidents of political violence. The format gets in the way of being able to use this data.
A scraper-parser can:
  • Go through all 38 documents you tell it to, whether on your computer, or on the Internet (scraper-parsers can browse the internet as well).
  • Pull out the data that you tell it to, based on the rules that you make for it.
  • Copy all that data in to a single spreadsheet.
Further, a scraper-parser could:
  • Check each day on the website where ZPP publishes its reports and if there is a new one, download it, then email you to let you know, before adding it to the list of reports it “reads” into your spreadsheet.
  • Include new columns for the date the report was published, and the page number where the incident was recorded in the report (so you can check the data has come across properly).
  • Change the format of every date for you, e.g. from 27 September 2004 to 27/09/2004.
  • Automatically turn the spreadsheet into an online spreadsheet (like Google Spreadsheets, which we profile here) that can be shared freely online, and update it when data from a new report becomes available.
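Two of the tasks listed above, changing the date format and copying data from several reports into a single table, can be sketched in a few lines of Python. This is only an illustration: the report names, the row layout and the function names are hypothetical, and the sketch assumes the incidents have already been pulled out of each report.

```python
import csv
import io
from datetime import datetime


def reformat_date(text):
    """Convert a date such as '27 September 2004' to '27/09/2004'."""
    # %B matches full English month names under Python's default locale.
    return datetime.strptime(text, "%d %B %Y").strftime("%d/%m/%Y")


def combine(reports, out):
    """Write the rows from several reports into one CSV table.

    `reports` maps a report name to its (region, date, incident, page) rows.
    Both the names and the row layout are hypothetical.
    """
    writer = csv.writer(out)
    writer.writerow(["report", "page", "region", "date", "incident"])
    for name, rows in reports.items():
        for region, date, incident, page in rows:
            writer.writerow([name, page, region, reformat_date(date), incident])


# Usage: two made-up reports combined into one table held in memory.
buffer = io.StringIO()
combine(
    {
        "report-a": [("Northern Region", "11 September 2007", "Incident text", 3)],
        "report-b": [("Northern Region", "27 September 2004", "Incident text", 7)],
    },
    buffer,
)
print(buffer.getvalue())
```

Keeping the report name and page number next to each row is what lets you go back to the source and check that the data has come across properly.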
In this example we have looked at data produced by a single organisation in Zimbabwe, but the ideas and techniques are applicable anywhere that a digital publication format gets in the way of using the data inside it. The ideas apply equally to getting data out of a website.
Scraping and parsing is pretty technical, but here are some further resources that can help deepen your understanding of this technique and let you give it a try yourself: 
  • Pro Publica's guide to how they used scrapers to collect data to show the connections between pharmaceutical companies and doctors in the US.
  • Dan Nguyen's Bastard's Book of Ruby is a free, online and solid (but rather oddly-named) introduction to computer programming, aimed at journalists.
  • Here's a video introducing Scraperwiki, an online platform for getting data off the internet. You can use it to learn how to code your own scrapers, look at scrapers other people have created, and find programmers to help you out. 

Deep dive on digitization: from piles of paper to drives of data

Paperwork is a fact of life whatever you are doing. Whilst this is changing, not all information that might be useful to us is 'born digital' or exists in a digital format. For a range of reasons, paper can still be a better solution for whoever was trying to capture or transfer information. A confrontation with a mountain of paper that you know or hope contains information relevant to the issue you are working on can be intimidating and discouraging. This section proposes some rules of thumb and a process to help you overcome such challenges.

Reprieve: tracking the abduction and torture of terrorist suspects 

Item 272 from the Richmor vs Sportsflight case. Investigators from Reprieve established that this invoice was for services provided to the CIA enabling the extraordinary rendition of Hassan Mustafa Osama Nasre (also known as Abu Omar) from Italy to Egypt on 18 February 2002.

In response to the terrorist attacks in New York on 11 September 2001 the US Government intensified its interrogation of foreign nationals it suspected of involvement in terrorism. To do this, the Central Intelligence Agency (CIA) set up a program of “extraordinary rendition”, through which its operatives apprehended people in one country and took them for interrogation in countries where torture was routinely used, such as Egypt. The program clearly violates a range of international Human Rights and humanitarian laws. 

A decade later, Human Rights lawyers continue to seek redress for those people abducted, detained without due process and tortured. Over the years, they have identified the planes which transported prisoners, the dates and routes of flights, as well as the companies running them. They have pulled in data from many sources including national and international aviation bodies.  

In a court in New York State in 2007, a legal battle about money broke out between Richmor Aviation, a company whose planes had been contracted for use in the rendition program, and Sportsflight Air Ltd, a small firm which had been involved in brokering some of Richmor's services for the government. The Human Rights organisation Reprieve learned of it in 2011, almost accidentally. Crofton Black, an investigator at Reprieve, says that the court transcript and discovery documentation from this case became a treasure trove of information about the extraordinary rendition program: “We were very struck by the level of detail in the documents. There was a stratum of information that hadn't really been publicly available before. What it shows you is a microcosm of the way that the program was running between 2002 and 2005. There were phone bills, lists of numbers that were called by the renditions crews during their missions.  There were catering receipts, records of ground handling services, references to many different service providers in different countries who provided the logistical framework for the missions.”

The 1,700 pages of hard copy court documentation were couriered to them. To start making sense of the material, volunteers at Reprieve first scanned them in and made a PDF out of them. They then quickly skim-read them to identify the types of documents they had, bookmarking the key blocks of information. To help pull out the topics discussed in the material, a technologist colleague used optical character recognition (OCR) on the material and created a searchable index of all the words used.

However, the most useful information contained in the invoices for services couldn't be picked up reliably using OCR, and had to be extracted by hand. Over a few weeks, Reprieve's team manually pulled this data out of the invoices into a spreadsheet. They made a first pass over the material, creating a simple data structure, which they then expanded to include more detailed information about different flights. By picking apart this paper trail, Reprieve's investigators pieced together dozens of trips, using the invoices as evidence of where the plane stopped and which companies had provided services to suspected rendition flights.

This data has served to fill holes in numerous different cases, and analysis of it has been made available to journalists and legal teams worldwide. “The million dollar question in all this stuff is which of these flights had a prisoner on it, and who was it? So, that's one thing these documents won't tell you, of course. But the spreadsheet is a fantastic analytical tool. If we hear about a prisoner who was transferred on a particular date, but they don't know where, we can look at that date and see if it matches anything in this,” says Black.

He has only one regret. “Optical character recognition is still quite poor. If these documents OCR'd properly then it would have been different ball game from day one”, he explains. Looking at a sample of what OCR produced from the scanned documents, you can understand what he means:

u 1'I::CC:.1 ... eu. (>04t Ollicc Box 179

-OIdChlJthBITI. NewYoric. 'Z130 

re:/epllane:(618) 794-9600 Nlghr: 

(518) 794-7977 FAX:(61B/794-7437 

It's often difficult to gauge the amount of time and effort it will take to bring an information dump like this into a form where its value can be seen, let alone exploited. At some point, working with the information in an ad-hoc way, by hand or using basic but well understood technologies may become impractical. It may create a diminishing return over time, for example if useful information wasn't pulled out of the source material first time. On the other hand, the alternative approaches that experiment with emerging technologies (like OCR) or use a more systematic approach can seem difficult to justify: they may add costs, or seem like overkill for just a box of paper. 

Whatever approach is taken, investigations of complex and concealed systems of Human Rights abuse are about adding layer after layer of information from different sources. This example shows the importance to investigators of being able to quickly respond to the availability of new information resources, breaking down whatever form they come in and linking it authoritatively to what is already known. Digitization is a key skill in this.

Deep-dive on digitizing printed materials

Digitization is the process of moving information from analogue to digital formats, which can then be analysed using computers. It is not a single thing but a set of steps, which we will look at in turn.

Before you begin, get prepared

  • Be clear about what you want to achieve and why: The decision to digitize reflects a balance of motives. One of the key reasons activists and investigative journalists digitize hard copy material is security. A digital archive can be duplicated and kept safe from deliberate destruction by their adversaries, or degradation due to the rigours (temperature, humidity, light) of the environment they are working in. Other drivers include concerns about the sheer scale of the materials, both as a physical storage problem and as a challenge to getting at useful information quickly. 
  • Know what you're dealing with: do some work to ascertain the scale, shape and scope of the material you have. Do you have a room of paper documenting years of work, or a folder or two? Is it a one-off initiative, or something you'll have to do every day? Create a count of the current number of physical paper sheets or images, the number of individual documents, creating a breakdown of the different sorts of documents you have. If you think that additional material will appear, try to anticipate how regularly and in what sorts of quantity before you start. 

You should also thumb through the material and identify the different sorts of information in those documents that you think is likely to be important, and look for documents that might be missing. This will help you decide which information is a priority to pull out of the material and will guide the design of your data capture processes. This scoping work is critical to designing and organising the digitization process: estimating how long it will take, how much investment in technologies and labour may be required, how much it will cost, and ultimately whether it's worth doing yourself, or at all. Digitization may be better contracted out to a specialised company, though you will have to assess whether this is both secure and affordable.

  • Test the water before leaping in: after deciding the route you want to take, design a draft process and test it out on a small sample of the material you have. Such “dry runs” expose and test your assumptions and ideas, and can help you identify problems that may be hard to correct later.

Step 1:  Digital imaging of hard copy materials

This is the technical process of moving hard copy material into a digital format. There are a range of different aspects to this:

  • Organisation: even if you only have a small amount of materials, you should create a scan plan to quantify the amount of work. This lists out the hard copy that you have, and is used to decide which materials to scan and when, and should tell you what's been done and remains to be done. Scan plans can help manage your time, and increase your confidence that you haven't forgotten anything.
  • Hardware: you will need a computer, and will have to obtain a scanner. Scanners designed for home use often don't cost much but are not designed for even moderately heavy, professional use. Ideally, the scanner you choose should have an automatic document feeder enabling it to scan loose sheets one after the other, and a duplex function so it can automatically copy both sides of a sheet of paper. Try to establish a scanner's duty cycle, which indicates how quickly it can scan and how long it can be used continually before disaster strikes. Where possible, try to test out scanners or cameras before you buy them: they may appear a good match, but may be tedious to use, or have a terrible build or usability flaw that only reveals itself during heavy use. These may include unreliable software, overheating, and badly designed feeders that jam or don't pick up sheets. 

If you have a lot of books you need to scan, then it will be painful using a flatbed scanner and may be worth making friends with someone with a book scanner. Smartphone cameras are very high quality and a number of apps have been developed to scan documents, though they don't (yet) seem geared to high-scale needs. 

  • Software: you will need driver and scanning software. Most off-the-shelf scanners are either “plug and play” or come with driver software to install on your computer to control the scanner. You will also need software to manage the scanning and processing of the resulting digital images. There are commercial options such as Adobe Acrobat, and open source alternatives like XSANE and ScanTailor. These enable you to define the scan quality, which includes amongst other things DPI (dots per inch), resolution, colour and file format. ScanTips has excellent guidance about all these issues.
  • Digital storage: After materials are made digital they will need to be stored safely. You will have to consider how to keep the files safe from corruption or unauthorised modification on the digital storage media you are using. This means having a backup plan and making sure that they are accessible only to the people who need to use them. Some media may need a large amount of storage space, so it is important to plan ahead to ensure that you don't run out of space, and that you have ample space for backups as well. Chapter 2 of our Security in a Box has a guide to storing information safely.
  • Quality assurance and 'chain of custody': after a document has been scanned, there are three things you need to do. First, check that it accurately matches the original hard copy. Second, process the scan to improve its quality and organise it in a way that fits your needs. For example, where you have scanned a double page spread in a single image, software like ScanTailor can split the image into two pages. Third, decide what to do with the original hard copy. Do you need to be able to show others that your digital versions are perfect copies of the originals? For example, in a legal process digital copies of documents may not be acceptable as evidence. In these cases, you will need to think about a digital 'chain of custody' that can be used to show how the physical and digital materials have been handled.
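One small ingredient that can support both safe storage and a digital chain of custody is a list of checksums, recorded at the moment the scans are made. The Python sketch below is an illustration under our own assumptions (the function names and folder layout are hypothetical). It records a SHA-256 checksum for each file in a folder and later reports any file whose contents have changed. On its own this does not amount to a chain of custody, which also needs records of who handled the material and when.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file name in `folder` to its checksum."""
    files = sorted(p for p in Path(folder).iterdir() if p.is_file())
    return {p.name: sha256_of(p) for p in files}


def changed_files(folder, manifest):
    """List files that are missing or whose checksum no longer matches."""
    current = build_manifest(folder)
    return [name for name, checksum in manifest.items()
            if current.get(name) != checksum]
```

A typical use would be to call build_manifest() straight after scanning, store the result somewhere separate from the scans, and run changed_files() whenever the archive is copied or restored from a backup.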

Step 2: Organising, indexing and contextualising digital files

After you scan in the hard copy, you will then have to organise and catalogue it digitally, making the material easier to find, sort and relate to other materials. 

  • Organising raw files: Scanned files will appear on your computer in image formats like .TIFF or perhaps .PDF. Most scanning software automatically gives each scan a filename, such as DSCR23453.TIFF – you should change these to a file-naming scheme that makes sense to you. Most digital files also contain technical metadata describing the size, creation date, date of last modification and so on. Some files also have special sorts of metadata. For example, pictures taken with smartphones or digital cameras with GPS devices may contain metadata about the location where the picture was created (software like ExifTool or MediaInfo can help you find this data). This automatically-created metadata is useful for organising digital files. File browsers such as Windows Explorer, or Nautilus on Linux, should be adequate for organising, filtering and searching large collections of digital files using this sort of technical metadata.
  • Cataloguing files: beyond managing the raw files themselves, you can also create new sorts of metadata that describe what is in a document or file. You have to define what sorts of metadata you think it is important to add. Whilst quite a heavy read, the Dublin Core site has a thorough description of what metadata is, and has some useful ideas you can adapt to your own needs. Metadata could include terms that  indicate who created the material, what it's about, the events that it relates to, people that are mentioned in it and so on. This sort of data is particularly important for visual material like videos and images,  which often contain a small snapshot that can't really be understood without knowing  the surrounding context. WITNESS has written a thorough guide to managing and cataloguing video and audio materials about Human Rights. 
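As a rough illustration of the two kinds of metadata described above, the Python sketch below reads a file's technical metadata and combines it with descriptive terms that you supply yourself. The function names and the file-naming scheme are hypothetical; the descriptive field names in the usage note (title, creator, subject, date) are borrowed from Dublin Core, but you should choose whichever terms suit your material.

```python
import os
from datetime import datetime, timezone


def new_name(prefix, sequence, extension):
    """Build a predictable filename such as 'courtdocs_0042.tiff'."""
    return f"{prefix}_{sequence:04d}.{extension.lower().lstrip('.')}"


def technical_metadata(path):
    """Read the metadata the computer already holds about a file."""
    info = os.stat(path)
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "filename": os.path.basename(path),
        "size_bytes": info.st_size,
        "modified": modified.isoformat(),
    }


def catalogue_record(path, **descriptive):
    """Combine technical metadata with descriptive terms you add by hand."""
    record = technical_metadata(path)
    record.update(descriptive)
    return record
```

For example, catalogue_record("courtdocs_0042.tiff", title="Invoice", creator="Unknown", subject="ground handling") would return one row that could be added to a catalogue spreadsheet.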

Step 3: Extracting content from the materials

Whilst digitisation has a wide value, for activists its purpose is often to better understand the information within the material itself. 

  • Automated content extraction: A document that has been scanned in remains an image, which means the text in it cannot easily be 'read' in the same way that a document created in a word processor can. It is possible to use Optical Character Recognition (OCR) software (such as Tesseract or some of the tools built into commercial imaging software) to find and extract text from images. However, prepare yourself for disappointment: even where OCR software is used on scans of typed material that is plainly laid out, it is fiddly to use, erratic in its output and always requires a human eye to ensure its accuracy.
  • 'Old school' content extraction: Realistically, extracting content from the materials is likely to be a manual process, which means finding, reading and hand-typing actual information contained in a digital document and entering it into something like a database or spreadsheet. We touched on tools and processes for capturing and entering data in How-to Note 2.
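If you want to try automated extraction before falling back on typing, the sketch below shows one way to run the Tesseract command-line program from Python. It assumes Tesseract is installed separately and that your scans are in an image format it can read; the function names are our own. Whatever text comes back still needs checking against the original by a person.

```python
import shutil
import subprocess


def ocr_command(image_path, language="eng"):
    """Build the Tesseract command line.

    Giving 'stdout' as the output name makes Tesseract print the text
    instead of writing it to a file.
    """
    return ["tesseract", str(image_path), "stdout", "-l", language]


def ocr_image(image_path, language="eng"):
    """Return the text Tesseract finds in an image, or None if it is not installed."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        ocr_command(image_path, language),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Running ocr_image() on a small sample of pages first is a cheap way to judge whether OCR will save time on your material or whether manual extraction is the more realistic route.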

Quick notes on digitising other media

In this section, we have focussed on the challenges of paper. However, video and audio tapes, photographs and maps all still regularly appear as resources in most kinds of investigative work. Here are some tips and links should you need to digitize these sorts of media:

  • Video and audio: Physical media, like tapes or DVDs, create three particular challenges: time-based media is generally more difficult to manage, digital versions require a lot more storage space, and preserving original physical versions over the long term is complex. When digitising, try to capture at the highest quality you can. When digitizing older or damaged media, it may be better to work with a specialised third party you can trust than to try to do this yourself. For a brief overview of digitizing video, the TAPE project has some useful guidance and resources. For a very detailed practical guide, have a read of the guide to digitizing moving images in the Guidelines to the Creation of Digital Collections by the Consortium of Academic and Research Libraries in Illinois (CARLI). 
  • Maps: Moving printed maps into a digital form requires first scanning in the map. Depending on the size of the map, you may either have to scan parts of it using a flatbed scanner and create a range of smaller tiles, or locate a wide format scanner. After creating a digital image of your map, it will need to be geo-referenced and rectified. This means finding where your map sits on an existing, accurate digital map such as the Open Street Map (read our profile, here). Finally, the scan can be uploaded to an online mapping service so it can be viewed online. MapWarper (tutorial video) is an online geo-referencing system that does this. If you don't want to upload your materials to a server, geo-referencing of maps can also be done using desktop GIS software such as QGIS (here's a basic guide to georeferencing, and a tutorial video).