Dicom Read Library (Apache Spark Third-Party Contribution)
Article by Nirali Gandhi, Big Data & Cloud Lead Developer
Dicom Data Source for Apache Spark
This project provides a scalable, Spark-based mechanism for efficiently reading DICOM images into a Spark SQL DataFrame.
A year ago, we came across a problem in the healthcare domain that involved reading a huge number of DICOM images for analysis on a Spark-Hadoop cluster. Although Spark provides APIs for reading many input formats, such as CSV, JSON, Parquet, and even JPEG/PNG images, it had no direct API for reading DICOM images.
So we, the Big Data engineers at Abzooba, built a Spark-Scala library that parses DICOM images into a Spark DataFrame. It is intended to improve both developer productivity and performance whenever DICOM images are analyzed in Spark.
This project is built with Spark 2.4, Scala 2.11, and the Java-based open-source library “dcm4che”.
Introduction to DICOM Images
DICOM (Digital Imaging and Communications in Medicine) is a standard protocol for the management and transmission of medical images and related data and is used in many healthcare facilities.
DICOM is the international standard to communicate and manage medical images and data. Its mission is to ensure the interoperability of systems used to produce, store, share, display, send, query, process, retrieve and print medical images, as well as to manage related workflows.
Vendors who manufacture imaging equipment (e.g., MRI scanners), imaging information systems (e.g., PACS), and related equipment often observe DICOM standards, according to NEMA.
These standards can apply to any field of medicine where medical imaging technology is predominantly used, such as radiology, cardiology, oncology, obstetrics, and dentistry.
A DICOM file consists of the image itself and its metadata (information associated with the image, such as patient and device details).
Sample Use Case with Architecture Diagram
In the medical domain, a common requirement is to search all available DICOM images by some metadata value.
For example: get all DICOM images where the patient's age is between 20 and 25 and the patient's gender is female. There could be a billion images in the data source system.
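As a rough sketch of how such a query might look once the metadata is available as a JSON string column in a Spark DataFrame, the example below filters on age and gender using Spark's built-in `get_json_object`. The flat JSON keys `PatientAge` and `PatientSex`, the sample rows, and the object name `MetadataSearchSketch` are assumptions made for illustration; the JSON layout actually produced by the library may differ. The age parsing relies on the DICOM convention that ages are encoded as strings such as `023Y`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, get_json_object, regexp_extract}

object MetadataSearchSketch {

  // Returns the file paths of images whose metadata matches the example query.
  def run(spark: SparkSession): Seq[String] = {
    import spark.implicits._

    // Stand-in rows; in practice this would be the DataFrame produced by the library.
    // The flat keys "PatientAge" and "PatientSex" are assumed, not confirmed.
    val dcmdf = Seq(
      ("/data/a.dcm", """{"PatientAge":"023Y","PatientSex":"F"}"""),
      ("/data/b.dcm", """{"PatientAge":"041Y","PatientSex":"F"}"""),
      ("/data/c.dcm", """{"PatientAge":"022Y","PatientSex":"M"}""")
    ).toDF("origin", "metadata")

    dcmdf
      // DICOM age strings look like "023Y"; keep only ages expressed in years.
      // Non-matching values become null after the cast and are filtered out.
      .withColumn("age",
        regexp_extract(get_json_object(col("metadata"), "$.PatientAge"), "^(\\d+)Y$", 1).cast("int"))
      .withColumn("sex", get_json_object(col("metadata"), "$.PatientSex"))
      .filter(col("sex") === "F" && col("age").between(20, 25))
      .select("origin")
      .as[String]
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dicom-metadata-search")
      .getOrCreate()
    run(spark).foreach(println)
    spark.stop()
  }
}
```

With a billion images, a filter like this would more likely run against metadata that has already been extracted and stored, rather than against the raw files on every query.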
In the traditional approach, the search team has to check every image manually and then prepare the matches for further analysis. With this library, the search can be automated: a Spark batch job extracts the metadata, and the metadata values are then stored together with a reference to the image in a data store such as HBase, where they can be queried directly. Alternatively, the metadata JSON can be fed directly to a search engine such as Solr or Elasticsearch. The sample architecture diagram for this case would be as follows.
Reading DICOM Images in Spark
This library reads DICOM images from the specified location and generates a DataFrame with the following schema:
- origin contains the file path of the DICOM file.
- metadata contains the associated information (patient, device, etc.) as JSON.
- pixel data contains the pixel data as an array of bytes.
It also returns the files that could not be read in a separate DataFrame.
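For readers who think in terms of Spark types, the schema described above could be written roughly as the following `StructType`. This is only a sketch: the column name `pixeldata`, the choice of `StringType` for the JSON metadata, and the nullability flags are assumptions, and the object name `DicomSchemaSketch` is hypothetical. The schema the library actually emits should be checked with `printSchema()` on the returned DataFrame.

```scala
import org.apache.spark.sql.types.{BinaryType, StringType, StructField, StructType}

object DicomSchemaSketch {

  // Assumed shape of the DICOM DataFrame; names and nullability are illustrative only.
  val dicomSchema: StructType = StructType(Seq(
    StructField("origin", StringType, nullable = false),   // file path of the DICOM file
    StructField("metadata", StringType, nullable = true),  // patient/device information as a JSON string
    StructField("pixeldata", BinaryType, nullable = true)  // raw pixel data as an array of bytes
  ))

  def main(args: Array[String]): Unit = {
    // Prints the schema as an indented tree, similar to DataFrame.printSchema().
    println(dicomSchema.treeString)
  }
}
```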
In this section, we look at how to load DICOM images into a Spark DataFrame. The complete project is available on GitHub.
import dicomread
val (dcmdf, cdf) = dicomread.readDicom(path, sparksession, numpartitions)
Here, dcmdf is the DICOM DataFrame and cdf is the exception DataFrame, which lists the corrupt DICOM files and non-DICOM files that were found.