October 27, 2022

TREEHOOSE: Developing a cloud-based platform for securely analysing sensitive data

In the fifth post in our blog series showcasing the DARE UK Sprint Exemplar Projects, Dr Christian Cole and Dr Simon Li from the University of Dundee discuss the TREEHOOSE project and the development of an open-source platform to securely analyse sensitive data.

Data used by researchers can contain personal and sensitive information, which needs to be managed securely and safely to avoid uncontrolled access or leaks. The data is needed for research and analysis for the benefit of the public, so it cannot simply be locked away. Special computational facilities called trusted research environments (TREs) have been established for many years as places where approved users can access sensitive or personal data for research and analysis. TREs have secure procedures to manage data, people, processes and computing hardware, and these come at a significant cost to set up and maintain.

There are no standard ways to run a TRE, and many different incompatible solutions have been developed over the years. Each one uses a variety of hardware resources and different software implementations, and often requires special training to use. They can be expensive to buy, require regular maintenance and are usually proprietary.

The TREEHOOSE project aimed to capture the 10+ years’ experience within the University of Dundee’s Health Informatics Centre (HIC) of running a TRE for health data research, and to make it available to the wider research community. We achieved this by releasing the code for our recently developed cloud-based TRE (one accessed over a secure internet connection) as open source, freely available to anyone.

The TREEHOOSE team was made up of infrastructure specialists from HIC and our cloud partner, Amazon Web Services (AWS). We worked closely together to deliver a TRE codebase, incorporating HIC’s experience of working with health informatics researchers into AWS’ existing open-source codebase, Service Workbench.

The practical needs of a TRE for health data research are:

  • Suitable data – the data available in the TRE must be de-identified, yet still be able to answer research questions.
  • Usability – the computing environment needs to match the expectations of researchers.
  • Streamlined controls – the TRE managers need systems to easily control data and data access.

Overall, it is one thing to provide a secure environment and another to provide a usable environment. The TREEHOOSE open-source TRE is now being considered by other organisations and academic projects to speed up adoption of secure environments.

Engaging the public

An important part of the re-use of health data is to be open and transparent about it. The public recognise the importance of using patient data to improve health outcomes, and to maintain public confidence we need to show what we do and include them in the development of future research.

We ran workshops where we spoke with the public to understand how they view health data research in the cloud, and to guide ourselves and others on how future projects can better inform the public. We prepared a report covering what we found in the workshops, but the overall messages were:

  • Re-use of patient data should be done for research in the public benefit.
  • Data should be freely available and not ‘sold’, even if the money were to be given to the patients from whom the data originates.
  • There is little difference between cloud and on-premise (on-site) TREs in terms of security and acceptance for health data research.

With TREEHOOSE, we have taken a step forward to aid the wider adoption of TREs to support safe use of sensitive data within academic, public sector and private sector organisations.

Find out more about the TREEHOOSE project and access the project outputs.