Data & Design How-to's Note 2: Data basics

Introduction

... if you're interested in an industrial plant and you think that there are environmental crimes being committed there, you're going to have a very hard time turning up at the front door and knocking on the plant door and seeing if they'll let you come in and tell you whatever crimes they are doing. But what you can do is assume that if they're dealing with toxic chemicals and there's a good chance they have a bad safety record, so what you can do is go to the local fire department and ask if there are any documented incidences of a hazmat response. In other words, have there been any instances where you've been called to anything to do with hazardous waste. So you start to build up evidence around the thing that you are looking at when you can't look at it directly.
Trevor Paglen, co-author of Torture Taxi: On the Trail of the CIA's Rendition Flights.
 
Activists and journalists regularly discover and expose a world of wrongdoing. But it may have been concealed beforehand by commercial or political considerations, hidden because nobody asked the right questions at the right time, or hard to see because little bits of information about it are in different places. Activists and journalists need to find data and join the dots together themselves: only rarely will the complete picture land on their desks. They need to seize opportunities and often work with very limited resources. To store and make sense of the data they collect, they often turn to cheap, ubiquitous software such as the spreadsheets that come pre-installed on most computers.
 
For example, a group we have worked with that documents prisons uses a spreadsheet to track over 40 pieces of data about each prison, including address, capacity and security status. They make numerous updates each day. The source of each piece of data is also recorded, which doubles the number of columns. Over time this increases the amount of time and frustration accrued in managing, analysing and sharing the growing body of information. What started as a simple list, appropriate for a spreadsheet, becomes a major data collection and analysis initiative for which a single spreadsheet soon proves inadequate. 
 
Detecting, correcting and avoiding these sorts of problems are the topics of this second Data and Design How-to. From experience, we'll suggest some practical starting points for collecting and working with data and look at basic, robust technologies that can help you out.

Data basics 1: thinking it through before you start

The different ways activists and journalists use data for advocacy can be divided into three broad areas. First, data about an issue itself can be used to expose, comprehend and explain that issue. Second, data can be used operationally to organise and run campaigns or to strengthen an organisation (for example, monitoring and evaluation, fund-raising or mobilising/coordinating). Third, data can form the backbone of services provided by a group.
 
We'll begin by running through five considerations that we have found helpful in choosing the right direction to take.

1. How the data specifically fits into your aims 

This relates to what you are trying to achieve, who your audiences are and the sorts of data  products that are likely to be useful to them. 
 
What sort of data would they need and find credible, and can you collect it? How are you going to get the data to them and what do you hope they will do with it? How are you going to know that the data is useful in advancing your goals? Thinking through in advance how you will use the data will significantly affect how you collect it and in how much detail. Being clear about this from the outset will save you from having to go back and fill in holes.

2. Existing sets of data that could be re-used  

There is likely to be an 'ecosystem' of both advocacy and official organisations and groups collecting and publishing information about your issue. 
 
Could you improve how existing information is used, for example by adding value to it with a new analysis, or by getting it out to new audiences in interesting new formats or services? Could you achieve your aims by establishing a partnership with such a group? On the other hand, there might be strong reasons to go ahead independently even if others are working on the issue. These include creating critical and alternative information resources, fostering the resilience of the advocacy sector as a whole by doubling up, and building your own skills and capabilities.

3. Research methods and supporting technologies 

What sort of method are you going to use to gather the data for your project?
 
Documentary and investigatory methods common in human rights work, for example, rely on reporting about or interviewing victims and survivors of incidents that might constitute human rights violations. However, using this sort of material as the source of a meaningful analysis requires an understanding of statistical sampling, content analysis and a range of technologies.

4. Risks that you and others involved in your project will face 

As much advocacy focuses on sensitive, taboo or politically charged topics, it is essential to have a sensitivity to the kinds of risks that collecting data may involve. 
 
What measures are you going to take to protect the identities of interviewees and the substance of the material they give you? If you store information on a computer, what measures will you take to ensure it is available to the right people, and doesn't fall into the wrong hands?

5. The scope and sustainable growth of your initiative

Try to sketch out the scope, scale, geographic coverage, and comprehensiveness of the data collection initiative that you are planning. 
 
Will the scale at which you are working enable you to cover the issue usefully?  Are you allowing enough time and resources to test things out?  Are you running a 'one off' project or a longer term initiative?  If you are planning for growth, what sorts of systems and people management challenges do you think you will encounter as your initiative scales up?

Data basics 2: structuring, categorising and standardising your data

 
 

Collecting data to monitor specific activities or to document events can seem exciting at first, especially if you have a vision for how this information may contribute to a debate. In order for it to be useful, it is essential that the data is well organised and designed so that it can be pulled together, analysed and presented in a meaningful way. You will need to know about things which may at first seem challenging: how to standardise information, how to enter information so that it can be collated later and how to work with data in a group. We discuss each of these in the sections below.

Data has to be entered consistently. If it is not, then it is harder to search, count, sort and filter accurately. Here's a simple example:

Name | Sex
Jacques | m
Ali | Male
Maya | f

The problem here is easy to spot,  but it is one that is recreated daily in one form or another when data is collected. A way to reduce these errors is to standardise how data is entered. This means making choices about how it can be represented consistently. Think of all the things that you can describe about a thing; it is a process we do quite naturally and there are many different, equally plausible and accurate ways to do it. 

For example:

  • Dates and times: Last Thursday, Thursday November 1, 01 November 1976, November 1st 1976, 19761101 are all ways of representing the same date.
  • Names: The naming of people and things is very complicated and varies across different geographical areas and cultures. Do you use 'United Nations' or 'UN'? Do you put a person's first and second names in different columns or in the same column? Is there a commonly accepted naming protocol for surnames?
  • Places: Where something is located or where an event happened is commonly recorded. But how specific do you need to be when describing geographical data? You can be precise, using latitude and longitude, or more general, using a country's administrative geography (town, city, district), electoral geography (ward, constituency) or operational geographies such as the areas covered by a police station.

The key challenge of standardising data is to make a choice and then stick to it. It will save an enormous amount of time and frustration.
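As a rough illustration of "make a choice and then stick to it", the sketch below converts a few of the date styles mentioned above into one standard form (YYYY-MM-DD). The list of accepted formats, the function name `standardise_date` and the decision to read 01/11/1976 as day first are all assumptions made for this example, and month names are assumed to be in English. Anything the function cannot recognise, such as 'Last Thursday', is returned as nothing so that a person can resolve it by hand.

```python
from datetime import datetime

# Formats we expect to meet in incoming data (an assumption for this sketch).
KNOWN_FORMATS = [
    "%d %B %Y",   # 01 November 1976
    "%Y%m%d",     # 19761101
    "%d/%m/%Y",   # 01/11/1976, read as day first (our assumption)
]


def standardise_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    cleaned = text.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    return None  # leave for a person to resolve by hand


for raw in ["01 November 1976", "19761101", "01/11/1976", "Last Thursday"]:
    print(raw, "->", standardise_date(raw))
```

The same idea can be applied by hand in a spreadsheet: pick one date format, write it down, and convert everything else to it as the data comes in.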

The next challenge is to be clear about how, in practice, to apply standards to the data you are collecting. For example:

Our source of information tells us that 600 people attended a demonstration and we want to create an entry in a spreadsheet. We have categories for 'small', 'medium' and 'large'. How do we decide which term best describes the size of the demonstration?

When you look in your spreadsheet you need to be able to know that every time you see a demonstration described as 'small' it means the same thing. Design a set of rules to let everyone working on the data know that:

  • Small = between 0 and 99 participants
  • Medium = between 100 and 499 participants
  • Large = between 500 and 999 participants.

Everyone entering the data needs to follow the same rules each time.
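Rules like these can also be written down so precisely that a computer could apply them, which is a good test of whether they are clear enough for a team. The sketch below is one possible way of doing that. The function name `demonstration_size` is hypothetical, and because the rules above stop at 999 participants, the sketch labels anything larger as 'uncategorised' rather than guessing; that label is an assumption of this example, not part of the rules.

```python
def demonstration_size(participants):
    """Apply the team's size rules to a participant count."""
    if participants < 0:
        raise ValueError("participants cannot be negative")
    if participants <= 99:
        return "small"
    if participants <= 499:
        return "medium"
    if participants <= 999:
        return "large"
    # The written rules do not cover 1,000 or more (assumption: flag it).
    return "uncategorised"


print(demonstration_size(600))
```

Under these rules the demonstration of 600 people is recorded as 'large'. Writing the rules out this way also exposes gaps, such as what to do with a demonstration of 1,000 people or more.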

The challenge is tougher with certain types of data that involve evaluating something and making a judgement about it. With more complicated issues that don't break down to a set of numbers you need to find 'baskets' in which to fit a variety of different sorts of factual information. You need to be sure that simplifying them will still be useful to you later. 

For example: Field monitors for a human rights organisation have interviewed a victim of serious physical mistreatment by the police. We have categories for 'Torture', 'Inhuman and Degrading Treatment', and 'Grievous Bodily Harm'. What term best describes it? In this case, these terms have legal meanings in international and domestic laws. To increase the consistency with which the terms are applied you could develop a guide sheet that explains each term, describes the sorts of situation, information and evidence needed to make a choice, and gives examples.

If you want to compare your data with other people's, consider whether they have collected the same sorts of data and applied the same rules to it. This is covered later in Data basics 3: working together and sticking to standards and structures. It is a serious concern: projects can fail because groups in different countries collect data differently, which makes regional or global comparisons a waste of time and forces the initiative to start again from scratch.

Standardising your data against an external resource is also useful. For example, if you record place names in a form that Google Maps recognises (through an automated technique called geo-coding), creating a map in that service becomes much simpler. Deciding too late that you want to make a Google Map, and then having to go back through all your data and re-enter place names in a form that Google Maps does recognise, can be a tedious exercise.

Another form of standardisation has to do with the structure of how you record your data. For example, a good rule of thumb when entering data is to put one piece of data in one field or cell. Then, your spreadsheet can sort and filter it easily for you. Here are some short examples:

Scenario 1: A human rights organisation documents where people were harassed by police in Phnom Penh, Cambodia

Problematic:

Incidents Recorded in Phnom Penh

Date and Time | Place
01/11/1976 at four thirty in the morning | At the lake in Phnom Penh

Better:

Date | Time | Town | Specific Location
01/11/76 | 0430 | Phnom Penh | Lake Boeung Kak

Scenario 2: A research organisation documents the gender of detainees at prisons

Problematic:

Facility Name | Demographics
Alcatraz Island | Adult males (DAM), adult females (DAF)

Better:

Facility Name | Demographic 1 | Demographic 2
Alcatraz Island | Adult males (DAM) | Adult females (DAF)

Adding another column of data, as in the 'Better' table, does not fully solve the problem in Scenario 2. It would be better still to create a new unique category called “DAMF”, to be used when a facility imprisons both adult males and adult females.

The structure of your data also affects your ability to count different things about your data. An issue often experienced by users of spreadsheets is that they structure their data around the wrong thing. For example:

Customer | Order
Ahmed | Humous, tabbouleh
Dima | Falafel, tabbouleh

Here there is more than one piece of data in each cell. This may allow you to count how many customers you had, but it makes it harder to tell how many portions of tabbouleh you sold, or how many orders were placed. There are only two entries here; what if there were thousands each week?

A better way to organise this would be:

Customer | Order
Ahmed | Humous
Ahmed | Tabbouleh
Dima | Falafel
Dima | Tabbouleh

This makes things easier as it enables you to use the spreadsheet to count and rearrange the data. Initially it means more work and is less readable, but it will allow you to do proper analysis later. Even so, you are still unable to answer some important questions, such as the overall number of orders that were made.
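To see why one piece of data per cell helps, here is a minimal sketch that counts portions from the reorganised table. It uses Python purely for illustration; the same counts can be produced with a spreadsheet's own counting or pivot features. The variable names are assumptions of this example.

```python
from collections import Counter

# One row per customer and dish, as in the reorganised table.
orders = [
    ("Ahmed", "Humous"),
    ("Ahmed", "Tabbouleh"),
    ("Dima", "Falafel"),
    ("Dima", "Tabbouleh"),
]

# Count how many portions of each dish were sold.
portions = Counter(dish for _customer, dish in orders)

# Count how many different customers appear.
customers = {customer for customer, _dish in orders}

print(portions["Tabbouleh"], "portions of tabbouleh")
print(len(customers), "customers")
```

Note what is still missing: nothing in these rows says which dishes belonged to the same order, so the number of orders cannot be counted without adding more information.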

In geeky language, these sorts of issues are all about a concept called normalisation. They are very common and reflect the difficulty of trying to squash quite complicated information into a single table and keep it usable. Where there are big problems of 'normalisation' it might be time to move the data to a different sort of tool, like a database. We discuss this further in the section called Data basics 4: growing out of spreadsheets.

Thinking through the structure and the standards of the information before you start will be of great benefit later on. By standardising the way you enter data, you have a better chance of spotting where connections are made and where relationships and patterns exist. By structuring your data in this way you ensure you are not missing opportunities for useful analysis.


Data basics 3: working together and sticking to standards and structures

Working in a team to manage data can increase a group's ability to take on a project which may otherwise be too unwieldy or too time consuming. It can also increase the value of the data by putting it in the hands of more people. However, working in a group can add to the complexity of the work and can increase data errors. It also has an effect on privacy and  confidentiality of information, requiring you to consider who has access and how to safely transfer files. Here are some tips about where errors can occur and some ideas for detecting and mitigating them.

A. Tracking data entry errors in teams

Everyone makes errors (even NASA), and there are hundreds of ways that errors can be made in spreadsheets. Data management can be mundane and repetitive. The more people that enter or use data on a spreadsheet, the more chance there is for error. You can create simple processes that identify simple errors in data entry. Here are some examples:

  • If one of your fields has dates in it, sort it to show the earliest dates to check if there are dates listed in the distant past (for example the year 201 instead of the year 2011)
  • Where you are using a set of standard terms in a cell, like country names, people working on the data may not enter them consistently. For example, a user might make a typing error, entering 'cambodia' rather than 'Cambodia'. Most spreadsheets can reveal this by listing the unique values contained in a column: the two spellings are treated as different values, so you can see that an error has been made.
  • If every row of data should have a piece of information in every cell, an empty cell may be an indication that someone has forgotten to enter data. You can ask the spreadsheet to count the empty cells in each row, and to highlight the row in red if it finds any.
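The three checks above can be sketched in a few lines. This is only an illustration in Python, with made-up rows and hypothetical function names; a spreadsheet's sorting, filtering and conditional formatting features can do the same job. The cut-off year of 1900 is an assumption chosen for this example.

```python
from collections import Counter
from datetime import date

# Made-up rows for illustration only.
rows = [
    {"country": "Cambodia", "date": date(2011, 3, 4)},
    {"country": "cambodia", "date": date(201, 3, 9)},   # year mistyped
    {"country": "", "date": date(2011, 4, 1)},          # country missing
]


def suspicious_dates(rows, earliest=date(1900, 1, 1)):
    """Rows dated before a plausible earliest date (our assumption)."""
    return [row for row in rows if row["date"] < earliest]


def unique_values(rows, column):
    """How often each distinct value appears in a column."""
    return Counter(row[column] for row in rows if row[column])


def rows_with_empty_cells(rows):
    """Positions of rows in which at least one cell is empty."""
    return [
        index
        for index, row in enumerate(rows)
        if any(value in ("", None) for value in row.values())
    ]


print(suspicious_dates(rows))
print(unique_values(rows, "country"))
print(rows_with_empty_cells(rows))
```

Running checks like these on a regular schedule, rather than once at the end, makes errors cheaper to trace back to their source.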

B. Maintaining consistency of data entry

As mentioned above, some data is interpretative: it represents a judgement made by the person entering the data. For example, two people may interpret differently the guidance on whether someone is 'Happy' or 'Very Happy' about something. It could also be something more serious, such as whether a human rights violation involves 'Moderate violence' or 'Severe violence'. Do you have clear guidance to help people make these choices, and do you have processes to check that all the people entering data are applying it in the same way? You can address this by having:

  • a 'double entry' system, where the same data is entered twice by two people, and where differences arise, the data is flagged as problematic. 
  • regular 'levelling' and meeting of people working on the data, to discuss different data and how they should enter it.
  • a single person entering a particular field of data, where it requires some specialised knowledge.
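A 'double entry' check can be sketched as a comparison of two sets of records keyed by the same identifier. The record numbers, field names and the function `compare_entries` below are all hypothetical; the point is only to show how disagreements between two people's entries can be listed for review.

```python
def compare_entries(first, second):
    """List (record, field, first value, second value) where entries differ."""
    flagged = []
    for record_id in sorted(first.keys() | second.keys()):
        a = first.get(record_id, {})
        b = second.get(record_id, {})
        for field in sorted(a.keys() | b.keys()):
            if a.get(field) != b.get(field):
                flagged.append((record_id, field, a.get(field), b.get(field)))
    return flagged


# The same two incidents, entered independently by two people.
entry_a = {
    101: {"town": "Phnom Penh", "violence": "Moderate violence"},
    102: {"town": "Phnom Penh", "violence": "Severe violence"},
}
entry_b = {
    101: {"town": "Phnom Penh", "violence": "Severe violence"},
    102: {"town": "Phnom Penh", "violence": "Severe violence"},
}

for problem in compare_entries(entry_a, entry_b):
    print(problem)
```

Records that are flagged this way are good material for the 'levelling' meetings, since they show exactly where the guidance is being read differently.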

C. Keeping track of who changed what 

Spreadsheets tend to get passed around, or worked on at different times by different people in different places. 'Version control' is a helpful concept when thinking about how to manage collaborative working. This issue may be made easier to manage on a low budget through a change in the tools that you use. For example, keeping track of who has access to data and what data has changed may be far easier on an online platform like Google Spreadsheets (which we have profiled here).

Where security and privacy are a concern, some database tools such as Martus (which is in our Waiting Room) are designed specifically with security in mind, and they enable users to exercise a high degree of control over who has access to data contained within a Martus network. More generally, keeping information secure, whatever the tools, requires a physical and digital security plan, which we touch on in our Be Safe, Be Smart section.

The Land Matrix: using data to investigate the new rush for land

Kerstin Nolte, a researcher at the German Institute of Global and Area Studies (GIGA), and Mathieu Boche from the Centre de coopération Internationale en Recherche Agronomique pour le Développement (CIRAD), work on the Land Matrix. This is a partnership of different research, advocacy and development groups from around the world. They pool and improve the data collected about large land deals which change land use from smallholding to commercial use, with potentially devastating consequences for poor rural populations. These so-called “land grabs” hit the headlines in 2008 as global food prices soared. NGOs, researchers and governments wondered what was driving them, whether they were anything new and what sorts of policy responses should be made to them. Getting behind quite incendiary headlines was not easy. As Mathieu says, “there is no transparency, no real information. We know that there are a lot of investors, but nobody really knows who they are, and where they come from.” More evidence was needed.

“So many reports and rumours are going on about 'land grabs' or land acquisition and nobody knows what's going on. There's just  no structured information,” says Kerstin. Mathieu adds: “There are some case studies popping up about instances all over the world but it was really difficult to have a global overview of the phenomenon.” Responding to this, a range of different research and advocacy groups had started spreadsheets documenting these transactions, but each in different ways, making the data difficult to compare.

“The International Land Coalition (ILC) had this amazing database - but it was a bit unstructured. They had different Excel sheets for every single country and what we at GIGA had was a database that had a better structure but not as many cases. So we merged these two databases,” says Kerstin.  This became a common standard for documenting land transactions. During this merger, they created fields and codes for capturing over 100 types of data about the dates and size of the transaction, the amount of land, the investors and organisations involved, the eventual use of the land and the consequences of the land transaction. 

In their spreadsheet, Land Matrix have now documented over 2,300 large-scale acquisitions. An initial analysis of this data, published in 2009, was able to show that in the last decade over 203 million hectares of land (over 8 times the size of the United Kingdom) may have changed hands and usage type from smallholder farming land to large-scale commercial use. With this data, Land Matrix and its partners have been able to broaden the argument about who and what was driving the issue, tying it to prevailing economic trends that dispossessed the rural poor rather than simply the specific events of 2008's global food price crisis. However, getting to the point of having usable evidence was far from easy.

There was a lot of time-consuming data entry to bring together the two datasets because of differing ways that data had been entered. Mathieu notes that it was hard coming to a common understanding of what to include in the database. For example: “Do we only consider the project of land acquisition by foreign investors in one particular country, or do we include in the database a guy from a local elite trying to acquire 300 hectares to grow some crops. Is it land grab, or is it just concentration of land for some powerful elite?” Land Matrix also needs to keep track of changes that are relevant to the accuracy of the data, such as whether a transaction actually resulted in a change of land use. To do this, they work with a network of organisations around the world to cross-check data on each land transaction based on a common standard.

Most of the data management work is done using spreadsheets, and the analysis is done using STATA, a statistical analysis tool. Technology has been important but secondary to concerns about data quality, standards and accommodating the different interests partners have in the Land Matrix data. Later this year, however, the data will be moved to a secure online database to enable a greater degree of collaboration between different partners. Parts of the data have also been made public on the coalition's website.

Data basics 4: growing out of spreadsheets

It makes perfect sense that a  large number of activists use spreadsheets to organise data. But far fewer consider using databases when the problems they face using spreadsheets become more apparent.

Five signs you might have grown out of your spreadsheet:

  • You start colour coding things in the spreadsheet and have created little 'hacks' (like adding 'AAA' or  '!!!!' to a row of data to ensure it appears at the top) to find data.
  • You scroll around a lot to find and edit information or perhaps you have bought a bigger computer monitor so you can see more data on screen. 
  • Different people need to enter data into the spreadsheet so you spend time emailing it around and copy-pasting data into a 'master' spreadsheet.
  • You regularly have to reformat to fit the needs of different tools to make charts, maps or graphs. 
  • You create multiple spreadsheets to keep count of data in other spreadsheets.
If you are doing any of the above, it is time to start thinking of a different type of tool.

Spreadsheets are a great 'Do It Yourself' data tool, widely used to record, analyse and create simple visualisations of data. They were designed as digital ledgers for book-keeping and accounting, but the grid format and the ability to re-arrange data simply have made them irresistible for countless other uses. Nearly everyone who can use a computer can 'sketch' with the simple and intuitive interface, piecing together columns and rows to create a basic model of some issue or thing they want to record data about. Spreadsheets don't require much technical knowledge to get started and come ready installed on most computers, so you can get up and running quickly.

A large appeal – and perhaps a downfall – of using a spreadsheet is that it can be made to look like a written document. A spreadsheet can be given a beginning, an end, a title, some authoring information and a date of publication. It can be constructed like a narrative, containing a mix of numbers and text, having elements of place, time, protagonists, locations, costs, consequences and outcomes. Structuring data in this way – by intuitive, narrative and visual logic – can work well for simple uses, but if your initiative grows or you want to use data in different ways, problems will soon emerge.

In this sense a spreadsheet is a compromise tool: the method of storing information is the same as the means of looking at and working with the information.  At some point, these two needs can't be reconciled, and one gets in the way of the other. The need to make data legible to the eye in a spreadsheet means making it far less useful analytically; the reverse makes the data largely unreadable and hence, less useful. 

A database, however, separates the two: the way data is stored has far less influence over how it can be displayed. In fact, the way data is stored is often completely hidden from the user, enabling abstract, complex ways of storing data that give the user more power over it. A key benefit of a database is the ability to hold multiple tables of data and the technology to stitch them together to answer specific questions.

This is how data might look in a database rather than a spreadsheet:

Database Table 1: Customers

Customer Name | Customer ID
Ahmed | 1
Dima | 2

Database Table 2: Dishes

Dish | Dish Code
Humous | A
Tabbouleh | B
Falafel | C

Database Table 3: Waiter

We also wanted to know who took the order:

Staff | Staff Code
Benito | B
Charlie | C
Eldrich | E

Database Table 4: Booth

And in what booth:

Booth Number
10
11
12

Behind the scenes, we can tell the database how these sorts of information are related. We can then ask it to create another table that combines data from 'Customers', 'Dishes', 'Waiter' and 'Booth' , in addition to other information we need to know about an order:

Table 5: Orders

Customer ID | Dish Code | Waiter | Booth | Time | Order Number
1 | A | B | 11 | 16:00:00 | 431
1 | B | B | 11 | 16:00:00 | 431
2 | C | E | 10 | 16:06:00 | 432
2 | B | E | 10 | 16:06:00 | 432
 
Because this table no longer needs to be legible to the eye, its complexity has increased dramatically, and it is clearly very hard to keep track of all the different sorts of data in it. In this example, there are still only two customers, sitting in two different booths, ordering three different dishes. Imagine trying to manage this data when there were hundreds of customers every day, ordering from a large menu. Databases are much better at storing and retrieving data. Using something called Structured Query Language (SQL) we can ask the database questions. For example, we can ask it to tell us:
  • How many orders Benito, or any waiter, took.
  • How many orders were placed in a day, or week or month.
  • How many portions of each dish were sold, and during which parts of the day.
  • The average size of each dining party.
The database enables a flexibility in how we can use the data that is harder to achieve with the spreadsheet. This flexibility creates opportunities for you and your audiences to use, present, publish and access data in ways that can serve your campaigning aims.
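To give a feel for what asking a database questions looks like, the sketch below loads the restaurant example into SQLite, a small database that ships with Python, and asks three of the questions listed above. The table and column names are our own choices for this illustration. It also assumes that the waiter codes in the Orders table are the waiters' initials, so that B refers to Benito and E to Eldrich.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a throwaway database held in memory
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dishes (dish_code TEXT PRIMARY KEY, dish TEXT);
CREATE TABLE waiters (waiter_code TEXT PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    customer_id INTEGER, dish_code TEXT, waiter_code TEXT,
    booth INTEGER, time TEXT, order_number INTEGER
);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Ahmed"), (2, "Dima")])
conn.executemany("INSERT INTO dishes VALUES (?, ?)",
                 [("A", "Humous"), ("B", "Tabbouleh"), ("C", "Falafel")])
# Assumption: waiter codes are initials (B = Benito, E = Eldrich).
conn.executemany("INSERT INTO waiters VALUES (?, ?)",
                 [("B", "Benito"), ("C", "Charlie"), ("E", "Eldrich")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "A", "B", 11, "16:00:00", 431),
    (1, "B", "B", 11, "16:00:00", 431),
    (2, "C", "E", 10, "16:06:00", 432),
    (2, "B", "E", 10, "16:06:00", 432),
])

# How many portions of each dish were sold?
portions_sold = dict(conn.execute("""
    SELECT d.dish, COUNT(*)
    FROM orders o JOIN dishes d ON d.dish_code = o.dish_code
    GROUP BY d.dish
""").fetchall())

# How many orders did each waiter take?
orders_per_waiter = dict(conn.execute("""
    SELECT w.name, COUNT(DISTINCT o.order_number)
    FROM orders o JOIN waiters w ON w.waiter_code = o.waiter_code
    GROUP BY w.name
""").fetchall())

# How many orders were placed in total?
total_orders = conn.execute(
    "SELECT COUNT(DISTINCT order_number) FROM orders"
).fetchone()[0]

print(portions_sold, orders_per_waiter, total_orders)
```

Each question is a short query against the same stored data, and nothing has to be rearranged or re-entered to answer a new one. This is the separation between storing data and looking at it that a spreadsheet struggles to provide.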
 
Moving work from a spreadsheet to a database is a leap at many levels: 
  • Conceptually, bringing databases into your work requires some acknowledgement of the challenges of working with data, some of which we have outlined in this Note. 
  • The world of databases is also full of different technical choices (database platforms, programming languages, interface types and so on) that are very intimidating for newcomers.
  • It is unlikely you will be able to do it yourself. You will need to work with technical people like information architects, programmers and interaction designers. Knowing who to trust, and what to expect from these people is hard, and getting it wrong can be costly.
  • As we reflected in Note 1, using any technology is a managerial challenge too. You need to ensure a database solves the right problem in your work and that you have the resources to use it sustainably. 
But perhaps most importantly, moving from a spreadsheet to a database represents a huge attitudinal shift away from making do with the things in front of you, and towards challenging established ways of working. For activists and journalists who make data central to their work, the question is: how much do the limitations of the tool waste your time, under-use the data or hold you back?
 
If you are interested in this area and want to move your ideas forward, the resources below contain more discussion about databases, how they are created and what it takes for an organisation to commission them: