Alberto Labarga
With a background in Biomedical Engineering and Bioinformatics, I have spent the last fifteen years working on biomedical data science projects and systems administration, both at universities and research centers and, professionally, at several start-up companies that I helped to create.
Currently I lead the Health Data Unit at Barcelona Supercomputing Center, where we develop tools and best practices for secure health data management and analysis within an open science environment: FAIR data, federated learning, virtual research environments, etc. Before joining BSC, I led the Data Engineering team at IOMED, a health tech startup applying artificial intelligence and natural language processing to electronic health records. We processed thousands of clinical records daily, applying the latest NLP technologies to extract information in a cloud-neutral, containerized environment using Python, SQL, PostgreSQL, MongoDB, Docker, Kubernetes, Airflow, etc.
Session
Data engineering has experienced enormous growth in recent years, allowing for rapid progress and innovation as more people than ever are thinking about data resources and how to better leverage them. In this tutorial, we will build an end-to-end modern data platform for the analysis of medical data using open-source tools and libraries.
We will start with an overview of the platform components, including data warehousing, data integration, data transformation, data orchestration, and data visualization. We will then dive into each component, exploring the technologies and tools that make up the platform.
We will review Python-based tools such as dbt, Apache Airflow, OpenMetadata, and Querybook to build the platform. We will walk through the process step by step, from creating a data lake to integrating data from multiple sources, transforming the data, orchestrating data workflows, and visualizing the data.
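To give a rough feel for the stages listed above, here is a minimal sketch using only the Python standard library. It is not taken from the workshop materials: the in-memory SQLite database stands in for a data warehouse, the SQL in `transform` stands in for a transformation model, and the plain function calls in `run_pipeline` stand in for an orchestrator. All table, column, and function names, as well as the sample records, are hypothetical.

```python
import sqlite3


def integrate(sources):
    """Data integration: merge records from several sources into one list."""
    return [row for source in sources for row in source]


def load(conn, rows):
    """Load the raw records into a staging table (a stand-in for a warehouse)."""
    conn.execute("CREATE TABLE raw_visits (patient_id TEXT, heart_rate INTEGER)")
    conn.executemany("INSERT INTO raw_visits VALUES (?, ?)", rows)


def transform(conn):
    """Data transformation: derive a per-patient summary table with SQL."""
    conn.execute(
        "CREATE TABLE patient_summary AS "
        "SELECT patient_id, AVG(heart_rate) AS mean_heart_rate, "
        "COUNT(*) AS n_visits "
        "FROM raw_visits GROUP BY patient_id"
    )


def run_pipeline(sources):
    """Orchestration: run the steps in dependency order and return the result."""
    conn = sqlite3.connect(":memory:")
    load(conn, integrate(sources))
    transform(conn)
    rows = conn.execute(
        "SELECT patient_id, mean_heart_rate, n_visits "
        "FROM patient_summary ORDER BY patient_id"
    ).fetchall()
    conn.close()
    return rows


# Hypothetical toy records from two sources: (patient_id, heart_rate).
hospital_a = [("p1", 70), ("p2", 80)]
hospital_b = [("p1", 90)]

summary = run_pipeline([hospital_a, hospital_b])
for patient_id, mean_heart_rate, n_visits in summary:
    print(patient_id, mean_heart_rate, n_visits)
```

In a real platform each function would typically be replaced by a dedicated tool, with the orchestrator scheduling and monitoring the steps rather than calling them inline.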
Attendees will benefit from this tutorial if they are interested in learning how to build an end-to-end modern data platform for biomedical data using Python-based tools. They will also benefit from learning about the open-source tools and libraries used in the tutorial, which they can then apply to their own data engineering projects.
Time breakdown:
Introduction and overview (20 minutes)
Data integration (20 minutes)
Data transformation (20 minutes)
Data visualization (20 minutes)
Q&A (10 minutes)
Check the workshop materials at https://github.com/bsc-health-data/pycones-23-modern-data-stack and install the requirements in advance for a better experience.