Librarian, LKS ASE
Librarian, Trust Library Services, Manchester University NHS Foundation Trust
This article responds to a number of questions about repositories received from colleagues. The replies usually contain the same information, which might be more usefully summarised in a public forum, for example in an article for the HLG Newsletter! Almost none of the information here is of a technical nature. A sound technical appraisal of any repository solution is essential, but will need to be sourced elsewhere.
How much you spend on a repository is determined by too many factors to be specific about figures. However, the costs are greater than just a subscription to the software. Included below are the budget headings that you should consider in order to understand the full cost of your repository. If you have the opportunity to apply for additional funding, you might usefully include these to get the full costs funded.
- The repository
- Marketing and design
- Time for planning and testing
- Staff training
- Data entry back catalogue to date
- Review and data quality
- Data entry ongoing
The repository costs come in two, or possibly three, parts. Part one is a setup fee in year one, and part two is a subscription fee for year one. You only pay the setup fee in the first year, so if you are working on a three-year or longer plan, the costs in years two and three will only be the subscription to the software. Support may come bundled in with this, but a third cost may be an additional support package.
What is the point of the repository? Well, clearly we could have a long discussion about that. However, the main purpose is to showcase the research of your institution. This means infiltrating the repository into as many places as you can, and enlisting the support of corporate communications and PR, as the repository will answer one question for them: “What research is our trust publishing?” To engage them and your end users it needs to be well presented and competently marketed. In practice, this means help with look and feel, colour schemes, logos, names, marketing materials and much more. You could imagine a figure such as £2.5K. It is a lot, but if you don’t ask/plan, you will not get.
The name of your repository
Although we love thinking up acronyms, the first iteration of amber was called NERD, a mistake on many levels. Nevertheless, naming is important. Every repository has a short and a long name. The best name for amber was taken by another institution, and I cannot say who! Anyway, the point is that you need a snappy short name that people can remember, as well as a long name/strapline that explains the purpose of the repository, which they will not remember but which looks good in print. The addition of a strapline, e.g. “the home of ambulance research”, is optional. However, you may need a strapline if the name of your repository does not actually stand for anything. Get this right and people will remember your repository, even if they have no idea what it is.
The structure of your repository
Repository software is both flexible and rigid. You have many options for how you structure your repository, but re-engineering it if you change your mind will waste valuable time and money. In other words, once you start down the road it is hard to turn back. The best structures follow the structure of your institution, as people who work in the same place often research the same topics. From a marketing point of view, it is also easier to point to the work of Department X, or link to your repository from the Department’s website. Should you choose to depart from this structure, think long and hard about what you want, and how you will sustain it over the long term. Remember also that there are other ways of grouping the material in your repository, around topics and projects, if you build the appropriate thesaural terms and tags into the metadata. This is also the first thing you will need to plan, as it is integral to the set-up of the repository and the data entry into it.
Who will input material into your repository?
Just to save you the suspense: you will be either organising or doing the data entry. You might have guessed, when we talk about data entry, that this is a library task. Any idea that your potential users and contributors are going to add their own research is for the birds. Universities have two levers that might help them here: one is mandating contribution to the repository; the other is participation in the Research Excellence Framework (REF). I am not sure that even these can leverage the most recalcitrant researcher!
Where is the stuff?
The stuff is everywhere. But there are a few things you can do to help you understand how much stuff you actually have. Firstly, decide when your repository will start: amber started in 2006, the date of the implementation of the recommendations from the Bradley Review, and a major reorganisation of ambulance services in England. So for you, the date your Trust started, or merged with another trust might also apply. You might also consider more than one start date, and plan to work through the backlog in the future when time and resources allow.
Once you have that, I would suggest a few preliminary database searches to understand how much material there is. Assume it takes 20 minutes to add each item; multiplying that up and adding 20% will give you an idea of the time commitment, and therefore the cost, of the data entry. This presupposes that you have decided to include only material in major, indexed journals. This might be a good starting point; other material, such as posters and conference presentations, can be added in the future, but we would suggest that as a second tier of data entry.
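That back-of-envelope calculation can be sketched in a few lines. The 20-minute-per-item figure and the 20% contingency come from the suggestion above; the 500-record count is purely illustrative, standing in for whatever your preliminary searches turn up:

```python
# Rough estimate of data-entry time: minutes per item times item
# count, plus a contingency margin. Figures are assumptions.
MINUTES_PER_ITEM = 20
CONTINGENCY = 0.20  # the 20% suggested above

def estimate_hours(n_items: int) -> float:
    """Return estimated data-entry hours for n_items records."""
    minutes = n_items * MINUTES_PER_ITEM * (1 + CONTINGENCY)
    return minutes / 60

# e.g. 500 records found in preliminary database searches:
print(f"{estimate_hours(500):.0f} hours")  # 200 hours
```

Multiply the hours by an hourly staff cost and you have a defensible figure to put in a funding bid.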
There is probably a hierarchy of places you could look, and these would include any local databases and lists that already exist, departmental webpages with lists of research and personal webpages of known prolific researchers. Just asking people who are likely to have material to contribute for publication lists is also a useful tactic.
Ingesting your data from an existing database
Typically, the process of transferring data is in three stages:
- extracting data from your in-house/existing system;
- transforming your data into a form that it can be ingested into your new repository solution;
- ingesting data into your new system and troubleshooting any issues or inconsistencies in the data.
You will need to have the ability to extract your data from your own system. Generally, the preferred format will be a CSV (comma-delimited) file.
It is likely that you will have to map your data to an existing data standard. Several data standards exist; however, the one you are most likely to encounter is the Dublin Core Metadata Initiative, also known as Dublin Core. A key piece of data that you also need is the Digital Object Identifier (DOI) for each record. It is possible that you could supply a list of DOIs and ingest the data using these. If you have the option to go down this route, you should allow time to edit and prepare your data. There will also be a technical learning curve if, for example, you have to populate a proforma spreadsheet of data to give to your repository provider.
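As a rough sketch of the transform stage, the following maps a CSV export to Dublin Core element names. The left-hand column names here are hypothetical, standing in for whatever your in-house system exports; check your repository supplier's ingest template for the exact fields it actually expects:

```python
import csv

# Hypothetical mapping from an in-house export's column headings to
# Dublin Core element names. Both sides are illustrative assumptions;
# substitute your own export headings and your supplier's template.
FIELD_MAP = {
    "Title": "dc.title",
    "Author": "dc.creator",
    "Year": "dc.date",
    "Journal": "dc.source",
    "DOI": "dc.identifier",
}

def to_dublin_core(row: dict) -> dict:
    """Map one exported record to Dublin Core field names,
    trimming stray whitespace as a minimal clean-up step."""
    return {dc: row.get(local, "").strip()
            for local, dc in FIELD_MAP.items()}

def convert(in_path: str, out_path: str) -> None:
    """Read the in-house export and write an ingest-ready CSV."""
    with open(in_path, newline="", encoding="utf-8") as f_in, \
         open(out_path, "w", newline="", encoding="utf-8") as f_out:
        reader = csv.DictReader(f_in)
        writer = csv.DictWriter(f_out, fieldnames=list(FIELD_MAP.values()))
        writer.writeheader()
        for row in reader:
            writer.writerow(to_dublin_core(row))
```

Even if your supplier does the ingest for you, being able to reshape the data yourself makes the proforma-spreadsheet stage far less painful.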
What is the point of my repository?
This question relates to marketing and communicating the purpose of your repository to end users. In essence, you need a one- or two-sentence summary to get across why your repository exists and who it is for. So we have, “amber – the home of ambulance service research” and “amber contains records of published research authored by NHS staff working in Ambulance Services in England.” It could be improved, but it gets across the basic aim of amber. Underlying this are some other important questions you need to answer before you get started. Firstly, can you say clearly and concisely who is eligible to be included in your repository? If you have a complicated organisational history, you might have to think about how organisational affiliation will affect contributors. Additionally, what material will you include? Would you consider unpublished material such as posters, conference proceedings, presentations and reports? Will the authors, or a nominated researcher or clinician, manage the quality control? Are you going to be a Green Open Access route for your researchers and staff? Are you on top of current debates around open access trends in publishing? Could you explain them, or be an advocate for them, if necessary? Are you aiming to have full text in your repository? This is a great draw for users, and a percentage of full-text content is a requirement for some international aggregators. Do you have a list of the top five points on the general benefits of a repository? Questions, questions and still no repository in sight, and there are certainly more questions that you could or should ask yourself.
Policies are good for a number of reasons. Writing policies does make you think about the right questions and articulate the answers. You only have to write a policy once, but it can answer the same question many times. Policies are useful to keep you on the straight and narrow if you find your repository is drifting away from its original purpose, and they can also provide you with some legal protection. They can communicate something to your intended contributors without the need to endlessly repeat the same points, and are a convenient start to a conversation about your repository. Luckily, those wonderful people at JISC have a basic policy that you can just download and use, and amber has an additional note directed only towards its contributors and users. The OpenDOAR policy tool can be found here, and the amber advisory note – not quite a policy – can be found here. A related and useful website you should also be aware of is Sherpa/ROMEO, the database of publishers’ open access policies that will guide you as to what can be included in the repository, and when.
A related policy area is version control: it is traditional not to delete material from a repository, but if there are a number of versions of the same item, you may have to decide which one has primacy. This is a really arcane area, but one that repository software is actually quite good at handling. It is almost certainly a bridge you will only need to cross if you ever get there.
Thesaurus, subject terms and authority control
If you have ever tried to maintain a thesaurus then you will already know it is a road to hell. However special you think the research in your institution is (and by definition it will be at the cutting edge of medical research, and possibly hard to describe), use MeSH. The other forms of authority control you need to be aware of are for names: in most cases, this will be ORCID, although other forms of name authority control are available. These things can get a bit obscure, but they directly affect the data quality of the repository and you will need to be on top of them.
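One small mercy is that ORCID iDs can be validated automatically: the final character is an ISO 7064 MOD 11-2 check digit, computed by the algorithm ORCID publishes, so mistyped iDs can be caught at data entry. A minimal sketch:

```python
def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the
    first 15 digits of an ORCID iD."""
    total = 0
    for d in base_digits:
        total = (total + int(d)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)

def is_valid_orcid(orcid: str) -> bool:
    """True if a 0000-0000-0000-0000 style iD has a correct checksum."""
    digits = orcid.replace("-", "")
    if len(digits) != 16 or not digits[:15].isdigit():
        return False
    return orcid_check_digit(digits[:15]) == digits[15]

# A well-known sample iD published by ORCID:
print(is_valid_orcid("0000-0002-1825-0097"))  # True
```

A checksum only proves an iD is well formed, not that it belongs to the right person, so it complements rather than replaces checking the record on the ORCID registry itself.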
Once Matt had organised all of the planning and behind-the-scenes arrangements of the amber repository, it then came to ourselves at MFT to start thinking about the data. A few meetings and phone calls with Matt allowed him to impart to us what he envisaged, and from that, we could begin to plan for the practicalities. Within the team at MFT, we had staff with the experience of building a Trust-based repository, however, because of other service commitments, they were not able to lead on this project. Their experience did allow us to build some initial ‘lessons learned’ style advice, and to answer some quick questions as we got started! In a continuation from the scoping searches that Matt had already undertaken, we collaboratively agreed on the list of databases that should be covered, and began collecting the references from searches. This resulted in a simple search-strings Word document (always good practice for repeating searches to ensure a consistent approach), featuring affiliation searches. In this document, we also started defining some basic rules, such as ‘don’t use that field/use this’, and where to/where not to include punctuation.
The plan had been to use a piece of Reference Management software to keep track of search results and rule out duplicates, but as the project grew, we found this to be more cumbersome than beneficial. Should we go back and build these in retrospect? I am not sure, but also have not deleted the files just yet. Instead, we relied upon a locally-built additions form that would not only act as a prompt for ensuring that each field was completed correctly, but would also act as a form of communication between the inputter and author, as well as a long-term paper trail to list why entry decisions were made at that specific time. We referred to this as the flow paperwork, as it enabled us to track the process, and will be referred back to in the future.
For staffing, we looked at our collective skill sets and opted for a good knowledge of journals and citations (mostly gained from sourcing inter-library loans) to help with correct data entry, plus cataloguing skills for the ability to identify and refine the appropriate thesaurus terms and author index entries. From our team, this was a combination of an experienced Library Assistant, Senior Library Assistant and Librarian. We started the process with four individual stages, each step carried out by a different member of staff:
- sourcing and preparing the articles from existing database search results;
- passing the details on for addition to the repository, according to the flow paperwork;
- cross-checking the entry for errors and authority field entries, cross-checking keywords, authorising and mapping across collections where needed;
- filing flow paperwork in alphabetical/chronological file for easy cross reference.
Once we were all comfortable with the process, we realised that we could combine the first two steps, as it was easier if the person adding the item had seen the article from the outset. However, while honing the steps of entry addition, it was better to have the headspace to find the best approach and to become comfortable with the repository fields. With busy library workloads it is always tempting to collapse everything into one step, but a second set of eyes for a cross-check will always help eliminate easily-missed errors. The paperwork filing is possibly overkill, but is always useful as a historical backup when using third-party software, regardless of the number of Excel downloads that we take!
We opted for manual data entry as it seemed to be the easiest process for our entry level on this project. We did visit a team that maintains another NHS repository, who used the same software, to observe an alternative approach of the preparation and importing of files. This allowed me to see the possible advantages of approaching the task in this manner (who does not love a neat Excel sheet?!), but we decided to continue with manual input, as it suited the teamwork approach that we had already established, and did not seem to reduce the need for cross-checking. However, if we chose to work on another repository, this is something that I would definitely look at again.
The nature of the make-up of the Ambulance Service meant that it made sense to have collections organised by geographic region and matching ambulance Trust. This meant that where there were authors from more than one Trust, we would add the initial entry and then subsequently map it to other collections. The software allowed this to be a relatively easy process, and the flow paperwork would prompt the inputter each time to ensure that this step was checked, and actioned where necessary.
As expected, the two main areas that need to be kept uniform and pertinent are the indexes for keywords and author entries. The data entry team all have experience of working in hospitals with emergency care, but obviously, prehospital care has a slightly different language and set of associated terms. For keywords, we agreed to try to align with PubMed MeSH as far as possible, but also decided to include UK-based phrases and terminology when appropriate. There was an urge at the beginning of the process to start with a top-100-style list of preferred terms, but it had to grow organically as the project progressed. As Matt mentioned, the author format was cross-checked against ORCID for ambulance authors. Once the historical entry is completed, we plan to cross-check the indexes on a monthly basis, to ensure accuracy for retrieval.
It is possible to bulk edit your data should it be required. This happened in two instances with amber. The first was where the DOI ended up in a field that the repository software could not convert to a live link. The second was where the chosen publication-type field could not be read by the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Both were easily solved by downloading the entire repository, moving the data to the expected fields, re-uploading it, and then modifying data entry practice accordingly. The learning point is to set aside time after the first few weeks or months of data entry to review the data and see if there are any inconsistencies in data presentation or processing that are attributable to the choices you made about data entry fields. The OAI-PMH issue came to light when amber was integrated into the National EBSCO Discovery Service. As this might be the ambition of all NHS repositories, a check against the expected OAI-PMH/Dublin Core fields at the start is an easy quality win.
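The fix described above (download everything, move the stray values, re-upload) can be scripted rather than done by hand. A hedged sketch, assuming the export is CSV and using illustrative column names; your repository software's actual field names will differ:

```python
import csv

def move_doi(rows, wrong_col="dc.description", right_col="dc.identifier"):
    """Move DOI values that landed in the wrong column into the column
    the repository software renders as a live link. Column names are
    illustrative assumptions, not a real repository schema."""
    fixed = []
    for row in rows:
        row = dict(row)  # leave the input untouched
        value = row.get(wrong_col, "")
        if value.startswith("10."):  # every DOI begins "10."
            row[right_col] = value
            row[wrong_col] = ""
        fixed.append(row)
    return fixed

def bulk_fix(in_path: str, out_path: str) -> None:
    """Read the full repository export, fix it, write it for re-upload."""
    with open(in_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows, fields = list(reader), reader.fieldnames
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(move_doi(rows))
```

Whatever the exact fields, the pattern is the same: export, transform with a script you can re-run and check, re-import, and then change the data entry guidance so the problem does not recur.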