1. Introduction
For more than a century Geoscience Australia, and its predecessors (the Australian Geological Survey Organisation (AGSO) and the Bureau of Mineral Resources, Geology and Geophysics (BMR)), have been collecting rock samples from around the world, with a particular focus on Australia and the surrounding region. Many samples in the collection are irreplaceable and come from locations that are now inaccessible and in some cases no longer exist. From these samples over 250,000 thin section microscope slides (also called petrographic sections or thin sections) have been produced.
Thin sections have been utilised in the geosciences since around 1850. Traditional thin sections are made using a diamond saw to cut a thin sliver of rock, which is mounted on a glass slide and ground down until the sample is only 30 μm thick. The section can then be examined by a variety of techniques including plane-polarised petrographic microscopes, reflection microscopes, electron microscopes and electron microprobes. The finished microscope slide is a useful aid in determining the mineralogy of the parent rock sample. Thin sections are one of the key ways of determining not just the mineralogical composition of a rock but the relationship between minerals within the rock. They are an aid in defining the history of a rock with respect to how the rock initially formed and subsequent events such as deformation and metamorphism. Minerals also control the chemical and physical properties of rocks (density, reflectance, magnetisation, etc.) and hence knowing the mineralogy of a rock can provide valuable insights into the interpretation of other geophysical and geochemical datasets. The collection also includes microscope slides of microfossils, mud smears and nannofossils.
Today, by far the greatest volume of geoscientific data and information is likely to be derived from high volume remotely sensed data sets (e.g., airborne data, satellite data, drones, etc.). Such collections tend to measure proxies of the real world (e.g., satellites can measure infrared radiation but the data needs to be mathematically manipulated to give temperature; airborne radiometric surveys measure gamma-rays produced by the radioactive decay of potassium, uranium and thorium, which are then used to estimate concentrations of those elements). Hence, position-located, real world physical samples have a valuable role to play in the calibration of modern remotely sensed data sets, particularly those that can provide information on the minerals present at each site and their formation. It would be prohibitively expensive to collect the required samples from scratch. Increasingly, researchers are turning to historic collections of physical samples and thin sections in support of the calibration and enhancement of remotely sensed data sets. Indeed, one of the drivers for establishing this project to ‘rescue’ the historic Geoscience Australia collection was to demonstrate the potential usefulness of this collection to modern initiatives including the Australian Federal Government’s current Exploring for the Future minerals, groundwater and energy programme to increase the attractiveness of Australia’s North to investment.
However, the value and potential of this collection were not fully realised and the management system for this collection, started in 1901 in parallel with the collection of the samples, remained largely hardcopy based (Figure 1), with no online presence. Proposals to digitise the metadata records that show the distribution of the samples and (where available) sample descriptions have suffered from a lack of prioritisation, focus and available funding. While the collection is technically publicly available, the unstructured state of the hard copy management system has acted as an inhibitor to the wider discoverability, accessibility and reusability of this public asset. For example, it was impossible to use valuable mineral identifications to aid in the interpretation of very high cost geophysical and geochemical data sets. The opening up of this collection to electronic discovery is in line with Geoscience Australia’s desire to maximise data potential and to improve access to collections. The opening up of the data through the project has broadened the Geoscience Australia stakeholder base and should enhance the relationship with those stakeholders.
This paper describes the ‘rescue’ of a subset of 40,000 items from this collection. It covers:
- the trial of a method that made extensive use of a team of non-specialist citizen scientists to transcribe, letter for letter and number for number, a series of handwritten cards into digital form;
- the incorporation of these data into a pre-existing data system; and
- the ways in which these data are now accessible online, and other potential uses.
The paper also describes future plans. Throughout this paper we will use the term “thin section slides” to describe microscope slides that are prepared from individual rock samples.
2. The Initial State
Most of the 250,000+ thin section slides are recorded only via hard copy records (Figure 1). These records stretch back into the first quarter of the 20th century and include important historic items such as microfossil slides made from samples taken by Sir Douglas Mawson during his 1911–14 expedition to the Antarctic. In undertaking the project it was decided to concentrate on the rock thin sections, i.e., thin sections made from rock samples.
Given the size and breadth of the collection, reproducing the thin section slide collection today would cost hundreds of millions of dollars. Some localities no longer exist (e.g., sites that have been mined out), whilst for some others, land access issues inhibit the potential to recollect samples.
The primary challenge for the project was to convert thousands of handwritten paper entries, from multiple generations of authors, to meaningful machine-readable, structured metadata/data suitable for online consumption in the modern digital age. The labour-intensive nature of the traditional manual transcription process meant that staffing costs would be prohibitive for a standard approach; hence, if the cards were to be transcribed, an alternative approach would need to be found.
Due to the different media on which the relevant data had been collected over decades, the variability and quality of the handwritten source material (Figure 1), and the inconsistent and variable length of content originally provided (Figures 3, 4, 5, 6), automated digitisation was not viable. It was clear that comprehensive human validation of the digitised material would be essential and that the project would support the use of a crowdsourced solution. Some initial hesitancy existed about the transcription being undertaken by non-experts. However, it was agreed that if the transcription was done in a letter for letter, numeral for numeral manner, then non-expert resources could be crowdsourced for the initial transcription phase, with a small number of subject matter experts available if and when required.
In the search for an approach, DigiVol was identified as a possible crowdsourced alternative to traditional transcription. DigiVol was developed by the Australian Museum in collaboration with the Atlas of Living Australia as part of the Australian Federal Government-funded National Collaborative Research Infrastructure Strategy (NCRIS) initiative. While initially developed for transcribing records of non-geological content, it is fundamentally a crowdsourcing platform that facilitates the digitising of data including the transcription of historic scientific records. DigiVol has multiple capabilities, including visual analysis of data from camera traps and the transcription of hard copy material such as old journals and metadata cards. The resulting data are stored on the DigiVol platform with suitable provenance metadata and are readily retrievable.
3. The Rescue Process
This section describes the processes employed to identify the relevant records, to establish and use the DigiVol platform, and to capture and retain the interest of volunteers. It also outlines a method of accessing the resulting metadata records and where this might lead in the future.
3.1. Identifying thin section slides within the area of interest
Given the scale of the collection it was decided to focus the initial rescue effort on a geographic region of current interest to Geoscience Australia, to test the methodology of using citizen science to transcribe technical scientific data from handwritten cards into a digital database.
The area of interest for the project was loosely defined as Mt Isa (Queensland) to Tennant Creek (Northern Territory) and north (see Figure 2). In total it amounted to an area of approximately 400,000 km², with the bulk of the samples in these areas collected during mapping projects conducted during the 1960s and 1970s that were based on national map sheets (1:250,000 equivalent).
3.2. Metadata
Prior to having a thin section made during these projects, a sample submission card had to be completed by hand (Figure 1) that provided minimal metadata including sample number, grid reference, map sheet, location description, and some basic information on the nature of the sample (e.g., rock type, texture, colour, mineralogy). Once the thin section was available to the scientist, more detailed information, including mineralogy, was often filled in on the reverse side of the card (Figure 1). Unfortunately most of this additional information, which contained very valuable scientific detail, was handwritten, and at times this handwriting was not very easy to decipher. It was definitely not of sufficient quality to enable scanning and automated transcription.
3.2.1. Uniquely identifying each sample
Critical to the project was the assumption that each sample in the collection chosen was uniquely identified. Fortunately, each sample in the area was uniquely identified due to a field site numbering system started in Geoscience Australia during the 1960s (and still used today). An 8- to 10-digit code, e.g., 67846023, was assigned where:
- 67 = the year that the field site was mapped or the sample was collected.
- 84 = the two-digit identifier number of a mapping/sampling project, which could be related to the geographical region the sample was collected from. The region typically covered multiple map sheets.
- 6 = an identifier of the geologist that collected the sample/field site. Each field worker was usually assigned their own number by the project.
- 023 = a 3-digit, usually sequential sample number; these often had additional letter and/or number suffixes to create a unique identifier for the sample or field site, e.g., 023A, 023B.
This 8-digit code was later increased to 10 by prefixing 2 digits to enable a 4 digit year to be included.
The systematic use of this numbering system throughout the area of interest proved to be very fortunate for the rescue effort as, for the most part, it did create a unique sample numbering system (although it was not always fail-safe and some duplicate sample numbers did occur). Whilst the project identifier number only allowed the general location of the sample site to be known (e.g., Project 20 ‘Cloncurry, Qld’, ‘Qld. Mt Isa’, or Project 10 ‘McArthur Basin, NT [Northern Territory]’, or ‘McArthur Basin, Mount Isa Inlier’), there was sufficient information encoded in this sample numbering system to make discovery of additional metadata and/or identification of the original collector of the sample much easier.
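The structure of the sample number lends itself to a simple parser. The following is a minimal sketch only, not the method used in the project: the function and field names are hypothetical, suffixes are assumed to begin with a letter (numeric suffixes are not handled), and two-digit years are assumed to fall in the 1900s.

```python
import re
from typing import NamedTuple, Optional

# Pattern for the 8- to 10-digit field site number described above.
# Assumption: any suffix starts with a letter (e.g. "A", "B", "A2").
_SAMPLE_RE = re.compile(
    r"^(?P<year>\d{4}|\d{2})"      # 4-digit year (later form) or 2-digit year
    r"(?P<project>\d{2})"          # mapping/sampling project identifier
    r"(?P<collector>\d)"           # identifier of the geologist
    r"(?P<sequence>\d{3})"         # usually sequential sample number
    r"(?P<suffix>[A-Za-z][A-Za-z0-9]*)?$"
)


class SampleNumber(NamedTuple):
    year: int
    project: str
    collector: str
    sequence: str
    suffix: str


def parse_sample_number(raw: str) -> Optional[SampleNumber]:
    """Split a sample number into its parts, or return None if it does not fit."""
    match = _SAMPLE_RE.match(raw.strip())
    if match is None:
        return None  # leave unparseable values for expert review
    year_text = match.group("year")
    # Assumption: two-digit years belong to the 1900s.
    year = int(year_text) if len(year_text) == 4 else 1900 + int(year_text)
    return SampleNumber(
        year=year,
        project=match.group("project"),
        collector=match.group("collector"),
        sequence=match.group("sequence"),
        suffix=match.group("suffix") or "",
    )
```

Returning `None` rather than raising an error keeps non-conforming or duplicate-prone numbers visible for later review instead of silently forcing them into the scheme.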
3.3. Imaging sample submission cards and registers
The sample submission cards (Figures 3, 4, 5) were imaged by a commercial provider. Where data was also present on the reverse side (Figure 4), or within attached pages (Figure 5), these were concatenated to form a single image suitable for use within the DigiVol platform (Table 1). The generated images have also been used to fulfil archive requirements under the Australian Archives Act as described in the National Archives of Australia Scanning Specifications (as amended 22 Aug 2013).
Criteria | Requirement (DigiVol) |
---|---|
Format | JPEG |
File Size | 1–2 MB (2 MB max) |
Number of images per task | One |
3.3.1. Examples
3.4. Establishing a DigiVol ‘Expedition’
Each activity or project within the DigiVol platform is referred to as an ‘Expedition’. An Expedition comprises a quantity of ‘like activities’ all based around a common data capture template. The slide collection expeditions were based on transcriptions but other format expeditions exist within DigiVol. During the course of the transcription process six sample submission card-based expeditions and two expeditions based on bound registers (Figure 6) were conducted. The table below (Table 2) indicates the size of each expedition that was undertaken.
Item Transcribed (Register/Card) | Number of pages (Registers) | Number of Sample Submission Cards |
---|---|---|
Rock Register 2 | 471 | |
Rock Register 1 | 505 | |
Sample Cards #1 | | 901 |
Sample Cards #2 | | 959 |
Sample Cards #3 (Long format/multi page) | | 390 |
Sample Cards #4 | | 990 |
Sample Cards #5 | | 1409 |
Sample Cards #6 | | 1208 |
As of 6 January 2019, DigiVol has 3,758 volunteers working on 13 ‘Expeditions’ (current DigiVol landing page). Since DigiVol was launched in 2011, a total of 1,246,277 tasks (the DigiVol term for a record) have been completed.
3.5. Running an expedition
DigiVol allows the Administrator of an expedition to define the template by:
- Selecting an underlying structure from a number of predefined formats (e.g., number of columns, spreadsheet format, camera trap/questionnaire, journal etc.);
- Adding fields of default size (labelled to suit the specific expedition);
- Amending the priority of each field to balance the layout and structure that the volunteer will see. Fields cannot be moved within the template other than by changing the priority;
- Loading of pick lists to be associated with fields to assist in ensuring more consistent results.
Once the template was defined, images of the sample cards and register pages to be transcribed were uploaded as .JPG files in batches on to the DigiVol platform. Some larger files were reduced in resolution to meet size restrictions. Each file was automatically allocated a sequence number to allow later sorting.
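A pre-upload check against the image requirements in Table 1 can be sketched as follows. This is an illustration under stated assumptions rather than part of the DigiVol platform: the function name is hypothetical, and ‘2 MB’ is interpreted here as 2 × 1024 × 1024 bytes.

```python
from pathlib import PurePath
from typing import List

# Assumption: the 2 MB ceiling in Table 1 is read as 2 * 1024 * 1024 bytes.
MAX_IMAGE_BYTES = 2 * 1024 * 1024
JPEG_SUFFIXES = {".jpg", ".jpeg"}


def image_problems(filename: str, size_bytes: int) -> List[str]:
    """Return the reasons an image would not meet Table 1 (empty list if none)."""
    problems = []
    if PurePath(filename).suffix.lower() not in JPEG_SUFFIXES:
        problems.append("not a JPEG file")
    if size_bytes > MAX_IMAGE_BYTES:
        problems.append("larger than 2 MB; reduce resolution before upload")
    return problems
```

Running such a check over a batch before upload identifies the larger files that need to be reduced in resolution.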
3.6. Transcription process
The actual transcription approach within DigiVol is a two-pass process: firstly the initial transcription, followed by a separate validation process.
3.6.1. Initial transcription
The first pass allows all registered DigiVol users to select and transcribe a document. The user is presented with an image for transcription and fields in which to enter the data (Figure 7). The user is provided with instructions on what is required via ‘Tutorials’. If needed, the user is able to seek specific or general advice on a topic via the expedition forum. Questions and comments entered by users can be addressed by either the expedition administrator or other users.
A transcriber is also able to make notes on a task to indicate areas of issue that could not be resolved (e.g., difficult handwriting). These notes are available to the subsequent validator and the expedition administrator and remain with the transcribed information.
3.6.2. Second pass
Once a task (DigiVol term for a record) has been transcribed it is then available for the second pass, ‘Validation’. This allows a selected group of users for that expedition to review the transcribed documents and to:
- validate the data;
- modify; or,
- reject the transcription.
In a similar manner to the initial transcriber, the validator is also able to provide notes on a task to provide feedback to the expedition owner.
3.7. Social aspects – capturing and keeping interest
Working with volunteers requires some adjustment of approach compared with managing paid employees in a conventional workplace: capturing and keeping their interest is critical to the success of an ‘expedition’. By and large the volunteers are participating in a DigiVol expedition out of interest, and while they remain interested they are likely to keep contributing. The DigiVol platform comes with an existing user base, currently 3,758 volunteers. Although this created a ready ‘market’ to which the expeditions could be ‘sold’, attention had to be paid to capturing and retaining their interest. For a community used to capturing information on biological specimens, rocks were quite different.
3.7.1. Capturing interested volunteers
For this project the initial capture of the volunteers’ interest was achieved by presenting the Geoscience expeditions as interesting, both in a visual sense (using related images that are visually appealing; Figure 8) and in a story sense (giving a sense of the history and background as to why the collection is important).
3.7.2. Maintaining the interest of the volunteers
Given the uniqueness of the task, once volunteers were ‘captured’, maintaining the interest, and through that, the continued active support of volunteers, was achieved by:
- Platform-based support. The DigiVol platform provides a forum environment that allows transcribers to ask questions of the administrator, expedition author and each other. This virtual community in addition to addressing questions also provides positive feedback when transcribers reach milestones (e.g., 100 expedition transcriptions). The mutual support offered by the forum partially ameliorates the impact of time zones on the ability of the expedition author to respond to issues raised;
- Timely response to questions. We endeavoured to address questions raised via the expedition forum within 24 hours. Anecdotal feedback from volunteers indicates that they found that this allowed them to comfortably proceed more rapidly and they also indicated that the responsiveness demonstrated the expedition administrators’ respect for contributors;
- Additional related information. Where a volunteer would ask for further information on a tangential topic we would develop a tutorial both to address the immediate question, and also try to provide a more complete background on the topic. These tutorials would then be stored and referenced within the DigiVol expedition and contributed to building up a Frequently Asked Questions resource. An example of this is enabling users to understand how ‘Air Photos’ were used in determining spatial information in the pre-GPS era;
- Additional items of interest. Periodically, to help encourage continued support of the expeditions we would release a ‘tutorial’ that raised the profile of the expeditions. These tutorials did not directly relate to the transcription but drew upon the work of the volunteers to provide examples of transcribed minerals, background on the making of thin sections, and even the meaning of words through the use of well-known board games; and
- Acknowledgement of contribution milestones. On the basis that we all seek acknowledgement for our contributions we would note, on the expedition forums, when volunteers had achieved significant milestones (e.g., multiples of 100 transcribed tasks for the expedition). This is in addition to the virtual rewards system built into the DigiVol platform where volunteers are rewarded with virtual tokens for achieving milestones. Other volunteers would often respond to these posts with congratulations of their own.
3.8. Download, review and prepare
Once all the cards within an expedition had been transcribed and subsequently validated, the information was downloaded from the DigiVol platform as a CSV file. This file was then reviewed by Geoscience Australia employees for consistency (e.g., ‘Queensland’, ‘QUEENSLAND’ or ‘QLD’). Such inconsistencies were made more evident by the column and row nature of the data presentation.
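A consistency review of this kind can be partly supported by a small script. The sketch below is illustrative only and does not represent the tooling used by Geoscience Australia: the column name, the alias table and the function names are hypothetical. Values that do not match a known alias are left unchanged so that they can be reviewed by a person.

```python
import csv
import io
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical alias table; a real review would build this from the data.
STATE_ALIASES = {
    "queensland": "QLD",
    "qld": "QLD",
    "qld.": "QLD",
    "northern territory": "NT",
    "nt": "NT",
    "n.t.": "NT",
}


def normalise_state(value: str) -> str:
    """Map known spellings to one code; return unknown values unchanged."""
    key = " ".join(value.split()).lower()  # collapse whitespace, ignore case
    return STATE_ALIASES.get(key, value.strip())


def normalise_column(csv_text: str, column: str) -> Tuple[List[Dict[str, str]], Counter]:
    """Normalise one column and count the raw variants that were seen."""
    rows = []
    variants = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        variants[row[column]] += 1       # record the original spelling
        row[column] = normalise_state(row[column])
        rows.append(row)
    return rows, variants
```

The variant counts give a quick view of how many different spellings occur in a column, which helps decide where manual checking is most needed.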
In order for the resource information to be used further, the actual location of the sample (and its uncertainty) needs to be known. Due to the variations in the manner in which spatial data was recorded for samples over the decades, volunteers were asked to simply transcribe exactly what was written letter by letter, number by number. The spatial information provided for each sample varied in:
- Level of detail. Some geologists recorded positions with all the accuracy possible, whilst others simply noted the map sheet with a general reference to being near or in direction X from town/homestead Y.
- Manner of recording (Figure 10). When a position was given it might have been given as grid coordinates, Latitude/Longitude, points on a particular air photo or a description of the location. Due to the considerable effort involved in determining locations based on air photos it was decided that these samples would use the centroid of the map sheet provided as the sampling location. To preserve the integrity of the digital database, the accuracy of the location of the observation was also recorded.
- Projection used. Even when a point location is given via coordinates (e.g., Lat/Long or grid reference), the mapping projections and datums in use have changed over the time during which the samples were collected. Determining the correct projection involved additional investigation during preparation of the data for upload. Some of the earliest submissions predate any modern geodetic datum (e.g., AGD66, AGD84, etc.) and were instead defined using latitudes and longitudes on the Clarke 1858 spheroid.
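The map sheet centroid approach, together with a recorded location accuracy, can be sketched as below. This is a simplified illustration, not the procedure used in the project: the names are hypothetical, the sheet bounds are assumed to be supplied in decimal degrees, the Earth is treated as a sphere of radius 6371 km, and no datum conversion is attempted.

```python
import math
from typing import NamedTuple, Tuple

EARTH_RADIUS_KM = 6371.0  # spherical approximation (assumption)


class SheetBounds(NamedTuple):
    west: float   # decimal degrees
    south: float
    east: float
    north: float


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points on a sphere, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def sheet_centroid_with_uncertainty(bounds: SheetBounds) -> Tuple[float, float, float]:
    """Return (lat, lon, uncertainty_km) for a sample known only by map sheet.

    The uncertainty is the distance from the centroid to the farthest corner,
    i.e. the largest possible error if the sample lies anywhere on the sheet.
    """
    lat = (bounds.south + bounds.north) / 2
    lon = (bounds.west + bounds.east) / 2
    corners = [
        (bounds.south, bounds.west),
        (bounds.south, bounds.east),
        (bounds.north, bounds.west),
        (bounds.north, bounds.east),
    ]
    uncertainty = max(haversine_km(lat, lon, c_lat, c_lon) for c_lat, c_lon in corners)
    return lat, lon, uncertainty
```

Storing the uncertainty alongside the coordinates lets later users distinguish a surveyed position from one that is only known to the nearest map sheet.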
3.9. Access and discovering the newly digitised data
Once the data from the cards had been transcribed, validated and downloaded from the DigiVol platform, it was then uploaded into the relevant Geoscience Australia databases. In addition, each parent sample, as well as the derivative thin section slide, was assigned an International Geo Sample Number (IGSN). The IGSN is a global identifier system designed to provide an unambiguous, globally unique persistent identifier for physical samples and to facilitate the location, identification and citation of physical samples, as well as the linking of any sample to other data or publications derived from that sample. Assigning a unique IGSN to both the ‘parent’ rock and the ‘child’ thin section slide means that other derivative sample preparations (e.g., mineral separates, rock powders, etc.) and derived data sets (e.g., geochemistry, physical rock properties, etc.) can also have unique persistent identifiers assigned to them.
The use of the existing internal Geoscience Australia databases and the minting of IGSNs for each thin section slide and its parent sample mean that a wealth of tools already developed for these can now be applied to the rescued samples. For example, within the Geoscience Australia databases, data and metadata for the 40,000 thin sections rescued in this project were integrated seamlessly with the existing records of thin sections. This combined resource is progressively being made available via Open Geospatial Consortium (OGC) web services that enable users to search spatially for thin section slides within their area of interest. Once the desired samples have been located via the web service, they can be requested for viewing or borrowing.
Since commencing the work Geoscience Australia has received external requests for thin section slides to support PhD research and internal investigations. Provided the thin section slides are among the 67,000+ catalogued in Geoscience Australia, initial retrieval of a thin section slide takes only 10–15 minutes. This is a significant efficiency gain for internal collection managers and results in faster delivery to clients (collection managers have suggested that retrieval could otherwise take some days).
The use of OGC web services also allows access to the thin section metadata via any tool capable of consuming OGC data services. Hence the metadata can be utilised by other external tools and data systems and the user is not dependent on using only those tools provided by Geoscience Australia on its website.
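As an illustration of how an external tool might query such a service, the sketch below builds an OGC Web Feature Service (WFS) 2.0 `GetFeature` request for a bounding box. The endpoint and feature type name are placeholders, not the actual Geoscience Australia service details, and the handling of coordinate reference systems and axis order varies between servers, so the `bbox` form shown should be treated as an assumption to be checked against the service's capabilities document.

```python
from typing import Tuple
from urllib.parse import urlencode

# Assumption: the server accepts a CRS84 (longitude, latitude) bounding box.
CRS84 = "urn:ogc:def:crs:OGC:1.3:CRS84"


def build_wfs_getfeature_url(
    endpoint: str,
    type_name: str,
    bbox: Tuple[float, float, float, float],
    max_features: int = 100,
) -> str:
    """Build a WFS 2.0 GetFeature URL for features inside a bounding box.

    bbox is (west, south, east, north) in decimal degrees.
    """
    west, south, east, north = bbox
    params = {
        "service": "WFS",
        "version": "2.0.0",
        "request": "GetFeature",
        "typeNames": type_name,                       # placeholder feature type
        "bbox": f"{west},{south},{east},{north},{CRS84}",
        "count": max_features,                        # limit the response size
    }
    return f"{endpoint}?{urlencode(params)}"


# Hypothetical endpoint and feature type, for illustration only.
example_url = build_wfs_getfeature_url(
    "https://example.org/wfs",
    "example:ThinSection",
    (139.5, -21.0, 141.0, -20.0),
)
```

The resulting URL can be passed to any HTTP client or GIS package that consumes OGC services, which is the sense in which users are not tied to the tools on the Geoscience Australia website.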
Industry interest in the slide collection to date has been limited. This is believed to be in large part due to a lack of knowledge of the slide collection and its inaccessibility. With the move of the handwritten metadata to an electronic format, followed by the minting of an IGSN for each thin section slide and its parent sample (which in turn will enable linking to other derivative data such as geochemistry and petrophysics), the value of the collection to industry will only increase.
Figure 11 indicates the Australian-based Thin Sections that can now be accessed.
3.10. Future developments
With the provision of web services, users will be able to access the sample metadata and, where available, descriptions of that sample. To access the actual thin section slides, they need to be physically shipped to the requester or viewed at Geoscience Australia. This creates a number of risks and issues, relating to delivery timeframes and the risk to the thin section slides themselves, including breakage and loss. Many of the older thin section slides are extremely fragile, with some of the material used in their manufacture deteriorating over time (e.g., the resin used for attaching labels, ‘Canada balsam’, which is obtained from the Balsam Fir tree). For this reason, some slides are unavailable for shipping. An alternative is to ‘deliver’ quality digital images of the thin section slides for screening and potential analysis purposes to help reduce the numbers that need to be actually shipped. One example of this approach is the British Geological Survey’s Britrocks web application (Figure 12). Britrocks allows users to find thin sections and examine them visually both in plane and cross polarised light.
It is worth noting that in the Scottish component of the British Geological Survey project, some 100,000 thin sections were photographed, in both plane and cross polarised light, by volunteers over a period of approximately 15 months. Geoscience Australia could follow a similar path and progressively provide images of its thin section slides.
4. Discussion
This thin section data rescue project was an undertaking that would not have been possible from a financial or resource availability perspective without the active participation of volunteers, who were supported by access to current subject matter experts from within Geoscience Australia.
The volunteers consisted of two main groups:
- Online: These volunteers (49) were mostly Australia-based and undertook both the initial transcription of records and the subsequent validation through the web-based DigiVol platform.
- Onsite: This small number of volunteers (3) provided detailed subject matter expertise in the areas of geology, contemporary processes and cartography. These were all past employees of Geoscience Australia and its predecessor and had participated in the collection of some of the samples being transcribed.
4.1. Working with the DigiVol volunteers
4.1.1. What sort of person is volunteering for DigiVol and the Rock Expedition?
Figure 13 shows examples of the volunteers that worked on the Geoscience Australia expeditions.
Lang, in a report for the Australian Museum, ‘DigiVol online volunteer evaluation report’, noted as an initial observation that DigiVol volunteers tend to be mature (54%; retirees represent the single largest group), female (63%), working from their home computer (64%) and most likely to hold a higher education qualification (Figures 14, 15, 16).
The evidence from the Thin Section Slides Project supports Lang’s findings with the top 2 transcribers (both female retirees) transcribing almost 44% of the total tasks (Figure 19).
4.1.2. Dedication and drivers
Table 3 indicates the estimated time taken to transcribe each style of sample document.
Type of card | Length of time to Transcribe | Number of items | Estimate of effort by volunteers |
---|---|---|---|
Rock Register page | 45 min | 976 | 658.8 hours |
Sample Card (simple) | 5 min | 2850 | 614.9 hours |
Sample Card (long) | 20 min | 390 | 130.0 hours |
Total | | | 1143.7 hours |
While these numbers are only indicative, they do suggest that the volunteers expended significant effort in transcribing the data for no tangible return to themselves.
4.1.3. What drives people to expend considerable amounts of their own time?
Discussions with some of the volunteers involved with the Slide Based collections expeditions indicated a variety of reasons for participating. These include:
- They like doing the work and seeing the outputs of their efforts while gaining more knowledge.
- Keeping mentally active is a key concern.
- Sense of Satisfaction. Volunteers indicated that participating in a DigiVol expedition provided them with a sense of ‘job’ satisfaction that was also contributing to society.
- ‘Discovering’ new places. Volunteers indicated that their transcriptions had informed their subsequent travels within Australia. This, they indicated, added a level of ‘reality’ to the sample information that they were transcribing on their return.
- Friendly rivalry. With the DigiVol platform providing statistics and a ranking system volunteers often indicate a competitive attitude. One volunteer indicated that, as the register transcriptions were taking longer than normal tasks, they were doing additional transcriptions from other (shorter) expeditions to ‘keep up’ their statistics.
- Personal statistics and virtual prizes. The platform provides individual users with information on the number of transcriptions and validations that they have done. The individual is also able to earn virtual achievement awards (Figure 9).
- Formation of an online ‘community’. Access to the platform’s forum facilities encouraged the forming of an online community and a sense of belonging.
Figures 17 and 18 indicate that most transcriptions occurred during what would be standard business hours for Australia (particularly in eastern Australia).
Figure 19 indicates that approximately 64% of all the transcriptions done as part of the Geoscience Australia Slide Based collections project were undertaken by 3 transcribers.
As mentioned, the project also made use of a much smaller group of onsite volunteers (3). These onsite volunteers were all past employees of Geoscience Australia, or the organisation’s predecessors (AGSO and BMR), in the early 1960s and had participated in many of the field programs that collected the samples, and/or were involved in the curation of the samples and the various generations of that curation process. Access to this subject matter expertise and experience aided greatly in determining the practices and processes of the period from which the samples came. These individual volunteers did not participate full time but acted in a consultancy style capacity.
4.1.4. Relative ‘costs’
DigiVol showed that valuable information could be captured by letter for letter, number for number transcription of ageing pre-digital data stored in varying hard copy formats. The DigiVol volunteers came with an ability to decipher often appalling handwriting, and it could be suggested that with respect to interpreting handwriting they were ‘subject matter experts’. This transcription process showed that the transcribers did not need firsthand geological knowledge, although the provision of a lexicon was useful.
Subject matter experts were a scarce resource and the ability to focus their attention to specific issues made more effective use of this limited resource. They were also invaluable in the development and review of the DigiVol tutorials that in turn educated and helped many of the DigiVol volunteers.
The project would not have been possible without either group of volunteers.
5. Conclusions
The use of volunteers in the rescue of this valuable data resource has proved beneficial to Geoscience Australia in terms of the availability of the data and the ability to access physical samples that would otherwise have continued to languish. The volunteers have indicated that they also found the work beneficial in the form of the mental stimulation, the sense of achievement and the social interaction.
Access to the collection will potentially help industry and academic researchers to conduct virtual preliminary geological surveys of areas and refine their planned field surveys without the expense of field time.
The project opened up an old collection to modern access methods and had the added benefit of raising the profile of Geoscience Australia, and of geology more generally, with a new segment of the community.