Data for the first batch of completed diaries now available!

by ral104 moderator, scientist

Operation War Diary has now registered 218 completed diaries - that's thousands of tags registered by our Citizen Historians.

Zooniverse have developed a simple interface, called War Diaries Data Digger (WD3), to allow us to share the data with you, so you can see just what it is you're generating. You can find WD3 here: http://wd3.herokuapp.com/public

When you follow the link, you will be directed to a full list of the completed diaries. Below each diary you'll find two options. The first lets you download all tagged data as a .csv file (comma-separated values, a format you can easily load into a spreadsheet programme like Excel). The second takes you to the National Archives site, where you can download their copy of the scanned diary. Please remember that there may be a charge for this, although many institutions, such as universities and public libraries, have a subscription which will allow you to download items for free on their premises.

The files you can download do not include every individual tag generated by our Citizen Historians. Rather, they are consensus data - each page was tagged by at least five volunteers, which allows us to get a greater level of accuracy for each tag. Using consensus data allows us to smooth out natural inconsistencies which creep into the tagging process as a result of error, bad scans, awful handwriting, etc. If you're interested in the algorithm behind the consensus data, I'll be putting together a blog post about it in the near future.
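
To give a feel for the general idea of consensus, here is a minimal sketch of a simple majority vote over the values several volunteers entered for the same tag. This is not the algorithm used by Operation War Diary; the function name, the min_votes threshold and the sample values are all hypothetical and chosen purely for illustration.

```python
from collections import Counter

def simple_consensus(values, min_votes=2):
    """Return (value, votes) for the most common value, or None if too few agree.

    Illustrative majority vote only; NOT the project's actual consensus algorithm.
    """
    if not values:
        return None
    value, votes = Counter(values).most_common(1)[0]
    return (value, votes) if votes >= min_votes else None

# Hypothetical place-name tags from five volunteers for the same spot on a page.
tags = ["Ypres", "Ypres", "Ypers", "Ypres", "Ieper"]
result = simple_consensus(tags)
print(result)
```

In this toy version, a misspelling or misreading entered by a single volunteer is outvoted by the value the majority agreed on, and a tag with no agreement at all is dropped.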

There are 14 data columns in each file. I'll list them below, along with a brief description of what they contain:

Order - the order in which the tag appears on the page. Where numbers are missing, this is likely to be because individual tags were rejected by the consensus data algorithm (but don't worry, we still store all individual tags as well)

Page - this is just the system page reference and won't be of much interest in analysis

Page type - the type of document being tagged (diary page, signal pad, etc.)

Page number - the actual page number in the diary (we haven't included things like cover pages)

Count - the number of volunteers who have tagged this data item. The higher the count, the greater the level of certainty in the information.

DateTime - The long format date and time identified by volunteers for each data item.

Date - a more user-friendly date format

Place - Place name, where identified. This field may contain multiple values, where it has not been possible to establish consensus across volunteer tags.

Lat/Lon - The latitude and longitude generated by the map when tagging place. Where it was not possible to identify the place on the map, this field will just contain zeroes.

Time - Time, where identified.

Type - This is the type of tag, e.g. a date, a place, a person, etc.

Label - The detail contained within each tag. For example, where a person was tagged, this field will contain their name and rank. This is the headline data for a tag.

Data - this shows the full set of data associated with each tag. So where a person has been identified, you'll see all facts associated with them here, not just their name and rank.

Geonames - this is the full set of mapping data generated when place names were assigned to the map during tagging.
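
If you'd rather work with a file in code than in a spreadsheet, the sketch below shows one possible way to read it with Python's standard csv module, pick out tags of one type that several volunteers agreed on, and skip rows where the place could not be found on the map. The exact header spellings and the format of the Lat/Lon field are assumptions based on the descriptions above, and the sample rows are invented, so check them against a real download before relying on this.

```python
import csv
import io

# Invented sample rows. The header names and the "lat,lon" value format are
# ASSUMED from the column descriptions and may differ in a real WD3 file.
SAMPLE = """Order,Page number,Count,Date,Place,Lat/Lon,Type,Label
1,3,5,1 Sep 1914,Example Village,"50.1,3.2",place,Example Village
2,3,2,1 Sep 1914,,"0,0",person,Capt. A. Example
3,4,4,2 Sep 1914,,"0,0",person,Lt. B. Sample
"""

def load_rows(handle):
    """Read a consensus CSV into a list of dicts keyed by column header."""
    return list(csv.DictReader(handle))

def filter_tags(rows, tag_type, min_count=1):
    """Keep rows of one tag type that at least min_count volunteers tagged."""
    return [
        r for r in rows
        if r["Type"] == tag_type and int(r["Count"]) >= min_count
    ]

def has_location(row):
    """Return False when Lat/Lon is empty, unparseable, or contains only zeroes."""
    try:
        values = [float(p) for p in row["Lat/Lon"].split(",")]
    except ValueError:
        return False
    return any(v != 0.0 for v in values)

rows = load_rows(io.StringIO(SAMPLE))
people = filter_tags(rows, "person", min_count=3)
located = [r for r in rows if has_location(r)]

for r in people:
    print(r["Date"], r["Label"], "tagged by", r["Count"], "volunteers")
```

To try this on a downloaded file, replace the io.StringIO(SAMPLE) handle with something like open("diary.csv", newline="", encoding="utf-8"), where diary.csv is a placeholder for whatever you named your download.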

There is a wealth of ways in which these data can be used. The project team itself plans on making them available to the National Archives to help enrich their catalogue descriptions and on feeding them into Lives of the First World War, IWM's permanent digital memorial, to provide evidence of the experience of named individuals. Academics will also be making use of them as a rich and accurate resource that will help them develop their understanding of how the war was fought. We'd love to hear from you about the questions you think they might help you answer, along with any ideas you might have about visualisation tools they could be fed into (one example of this is CartoDB, an online mapping tool, which Zooniverse has already used to produce visualisations of troop movements and casualties over time).

Posted May 28, 2014 3:09 PM
by HeatherC moderator

It's fascinating to download one you've tagged and look at it in Excel to see what data comes out of it. Thanks for the link Rob. I can't wait to see this all eventually fed into Lives so that information about individuals can be searched for there.

I've just flipped through 1 Glos for Aug-Dec 1914 and all those names of Officers and ORs in the appendices, which took me HOURS to tag, are all neatly listed. Makes it feel worthwhile!

Posted May 28, 2014 5:37 PM
by alanlw

Great stuff, makes it wonderfully easy to find people and places. Well done all of you.

Is it possible to submit corrections and if so, how?

Posted May 29, 2014 8:28 PM
by Maria64

Well done, everyone! Very user friendly and a brilliant way to see what we've generated. Thank you. 😃
Best wishes,
Maria.

Posted June 1, 2014 10:45 AM
by ral104 moderator, scientist

Glad to hear it's useful! What sort of corrections do you mean @alanlw ? The data isn't meant to be 'correct' per se, in the sense that we're not aiming to provide an accurate transcription of the diaries. Rather, we've focussed on producing a structured data set, with the consensus mechanism in place to root out any really substantive issues. Certainly when researching named individuals or places, it's likely that some manual intervention will be necessary where no clear consensus can be reached.

Posted June 3, 2014 9:42 AM