DICOM Read Library (Apache Spark Third-Party Contribution)

Article by Nirali Gandhi, Big Data & Cloud Lead Developer

DICOM Data Source for Apache Spark

A year ago, we encountered a problem in the healthcare domain that involved reading a huge number of DICOM images for analysis on a Spark-Hadoop cluster. Although Spark provides APIs to read many input formats such as CSV, JSON, Parquet, and even JPEG/PNG images, it has no direct API for reading DICOM images.

So we, the big data engineers at Abzooba, built a Spark-Scala library that parses DICOM images into a Spark DataFrame, improving both developer productivity and performance whenever DICOM images are analyzed in Spark.

This project is built with Spark 2.4, Scala 2.11, and dcm4che, a Java-based open-source library.

Introduction to DICOM Images

DICOM is the international standard to communicate and manage medical images and data. Its mission is to ensure the interoperability of systems used to produce, store, share, display, send, query, process, retrieve and print medical images, as well as to manage related workflows.

According to NEMA, vendors that manufacture imaging equipment (e.g., MRI machines), imaging information systems (e.g., PACS), and related equipment generally follow the DICOM standard.

These standards can apply to any field of medicine where medical imaging technology is predominately used, such as radiology, cardiology, oncology, obstetrics, and dentistry.

A DICOM file comprises both an image and metadata (associated information about the patient, the device, and so on).
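To make this concrete, a single file can be parsed with the dcm4che library mentioned above. The sketch below is a minimal, hypothetical example (class and method names from dcm4che 3.x; the file name is a placeholder) showing that both the metadata and the pixel data live in the same tagged dataset:

```scala
// Minimal sketch of reading one DICOM file with dcm4che 3.x
// (verify class/method names against the dcm4che version you use).
import java.io.File
import org.dcm4che3.data.{Attributes, Tag}
import org.dcm4che3.io.DicomInputStream

object ReadOneDicom {
  def main(args: Array[String]): Unit = {
    val din = new DicomInputStream(new File("sample.dcm")) // placeholder file
    try {
      val attrs: Attributes = din.readDataset(-1, -1) // read all attributes
      // Metadata: patient and device information are tagged attributes
      println(attrs.getString(Tag.PatientName))
      println(attrs.getString(Tag.Modality))
      // Image: the raw pixel data is just another tagged attribute
      val pixels = attrs.getBytes(Tag.PixelData)
      println(s"pixel bytes: ${pixels.length}")
    } finally din.close()
  }
}
```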

Sample Dicom Image

Sample Use Case with Architecture Diagram

For example: get all DICOM images where the patient's age is between 20 and 25 and the patient's gender is female. The source system could hold a billion images.

In the traditional approach, the search team has to check all the images manually before preparing them for further analysis. With this library, the search can be automated: a Spark batch job extracts the metadata, which is then stored together with an image reference in a data store such as HBase, where it can be queried directly; alternatively, the metadata JSON can be fed straight into a search engine such as Solr or Elasticsearch. A sample architecture diagram for this case follows.

Architecture diagram to process Dicom Images
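The extraction stage of this architecture can be sketched as a short Spark batch job. This is an illustrative sketch, assuming the library's readDicom entry point described below; the application name, HDFS paths, and partition count are placeholders:

```scala
// Sketch of the metadata-extraction batch job from the architecture above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DicomMetadataIndexing") // illustrative name
  .getOrCreate()

// Batch-extract metadata from all images under a placeholder HDFS path
val (dcmdf, cdf) = dicomread.readDicom("hdfs:///data/dicom", spark, 200)

// Keep only the image reference and its metadata JSON; each row can become
// an HBase cell or a document indexed by Solr/Elasticsearch.
dcmdf.select("origin", "metadata")
  .write
  .json("hdfs:///data/dicom-metadata") // placeholder output path
```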

Reading DICOM Images in Spark

In this section, we look at how to load DICOM images into a Spark DataFrame. The complete project is available on GitHub.

The resulting DICOM DataFrame has three columns:

  1. origin contains the file path of the DICOM file.
  2. metadata contains the associated information (patient, device, etc.) as JSON.
  3. pixeldata contains the pixel data as an array of bytes.

The API also returns corrupt records in a separate DataFrame.

import dicomread

val (dcmdf, cdf) = dicomread.readDicom(path, sparksession, numpartitions)

Here, dcmdf is the DICOM DataFrame and cdf is the exception DataFrame (it lists the corrupt or non-DICOM files).
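With the DICOM DataFrame in hand, the earlier use case (female patients aged 20 to 25) becomes an ordinary Spark SQL query over the metadata column. This is a hypothetical sketch: the JSON paths assume top-level PatientSex and PatientAge fields and the DICOM-style age string format (e.g., "023Y"), so adjust them to the actual layout of the metadata JSON:

```scala
// Illustrative query over the metadata column; field names and the age
// string format are assumptions, not the library's documented schema.
import org.apache.spark.sql.functions.get_json_object

val matches = dcmdf.filter(
  get_json_object(dcmdf("metadata"), "$.PatientSex") === "F" &&
  get_json_object(dcmdf("metadata"), "$.PatientAge").between("020Y", "025Y")
)

// The origin column gives back the matching image file paths
matches.select("origin").show(false)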

DICOM DataFrame

BigData & Cloud Practice

Abzooba is an AI and data company. The BD&C Practice is one of the fastest-growing groups in Abzooba, helping several Fortune 500 clients in their cognitive journey.