GRAM: Graph-based Attention Model for Healthcare Representation Learning

Reason to Read: We have all seen use cases for deep learning in medicine, especially in tasks where image processing is involved:

Deep Learning Algorithm for Detection of Diabetic Retinopathy

Deep learning algorithm does as well as dermatologists in identifying skin cancer

Caution: Do Machines Actually Beat Doctors?

But we don’t really see much progress in combining textual data sources (i.e., electronic health records (EHR)), which seems strange since we generate these records on every single clinical visit. Some of this can be attributed to privacy concerns, but lately we are seeing a surge in large, publicly available clinical databases. A few of my colleagues and my old professor have been steadily making progress using deep learning and clinical records to draw out insights that can drastically aid the field. Implementation details are below, along with public access to the datasets.

TL;DR: Data insufficiency and interpretability are two common issues when using deep architectures, and for many healthcare-related tasks these issues cannot be overlooked. This paper introduces the GRaph-based Attention Model (GRAM), which uses a hierarchical attention mechanism to represent medical concepts.

Detailed Notes:

  • The objective of the models discussed in this paper is to take clinical event information from a patient’s health record and predict the events that will occur in the next visit. These events could be many different things, but a common type is the set of International Classification of Diseases (ICD) codes recorded at each clinical visit.
  • Data insufficiency in electronic health records (EHR) takes the form of limited samples for a particular rare occurrence (disease, condition, etc.). Without enough data, it is hard to learn useful representations for these events and make accurate predictions.
  • Interpretability is the other issue, especially for health-related tasks, because it is very important to know why predictions come out the way they do. The representations these deep models create should be highly consistent with our medical understanding of the inputs.
  • The paper uses medical ontologies (hierarchies of medical concepts, where an ICD code can be both a parent of some nodes and a child of others) to encode the inputs. Incorporating medical ontology information mitigates the data insufficiency issue by reducing the number of input features that need to be assessed. If you think of model training as learning to classify many different combinations of inputs (say, for a particular disease), a disease with little data makes this a difficult task. But by using the hierarchical information for the disease, we are able to retrieve more information about the disease and its parents. These parents are also parents of other diseases, which gives us useful context for encoding our rare diseases/events.
  • The GRAM method infuses the medical ontologies into the encoding process by using an attentional interface. I don’t think I can say it better than the paper can for how they use this information to create representations:

“Considering the frequency of a medical concept in the EHR data and its ancestors in the ontology, GRAM decides the representation of the medical concept by adaptively combining its ancestors via attention mechanism.”

The intuition is that when a medical event lacks enough data points, more weight is given to the ancestors of that event (parent nodes), since they can be learned better (data exists for them via other child events). This addresses both data insufficiency and interpretability.
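To make the ancestor-weighting concrete, here is a minimal NumPy sketch of the idea. The mini-ontology, code names, and attention scores below are all hypothetical; in GRAM the scores come from a learned compatibility function (an MLP over pairs of embeddings), not hand-set values.

```python
import numpy as np

# Hypothetical mini-ontology: each node maps to its parent (None = root).
# The disease names here are made up for illustration.
parents = {"root": None, "circulatory": "root",
           "hypertension": "circulatory", "rare_vasculitis": "circulatory"}

def ancestors(code):
    """Return the code itself plus all of its ancestors up to the root."""
    chain = [code]
    while parents[chain[-1]] is not None:
        chain.append(parents[chain[-1]])
    return chain

rng = np.random.default_rng(0)
dim = 4
# Basic embedding e_i for every node in the ontology (learned in practice).
basic = {c: rng.standard_normal(dim) for c in parents}

def gram_embedding(code, scores):
    """Combine a leaf's basic embedding with its ancestors' embeddings via
    softmax attention; `scores` stand in for the learned compatibilities."""
    chain = ancestors(code)
    s = np.array([scores[a] for a in chain])
    alpha = np.exp(s) / np.exp(s).sum()      # attention weights, sum to 1
    return sum(a * basic[c] for a, c in zip(alpha, chain))

# A rare leaf should lean on its ancestors: give the ancestors high scores.
g = gram_embedding("rare_vasculitis", {"rare_vasculitis": 0.1,
                                       "circulatory": 2.0, "root": 0.5})
print(g.shape)  # the final representation has the same dimension as e_i
```

With the high ancestor scores above, the resulting vector is dominated by the "circulatory" parent, which is exactly the behavior the quote describes for rarely observed codes.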

Representations via GRAM:

  • Now we will discuss the details of how the authors decided to embed each patient’s information.


  • The codes (e.g., ICD codes) are all represented by a set C, and the clinical records of each patient are represented by V (for T visits), where each visit V_t is described by x_t, a binary indicator for each code (whether or not it was assigned during that clinical visit).

[Diagram 1 and Equation 1 from the paper]
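A toy example of this notation, with hypothetical codes and visits (the ICD-style labels below are made up for illustration):

```python
import numpy as np

# Hypothetical code set C (the ICD-style labels are made up for illustration).
codes = ["I10", "E11", "J45", "N18"]          # |C| = 4
idx = {c: i for i, c in enumerate(codes)}

# One patient's record V: T = 3 visits, each a set of assigned codes.
visits = [{"I10"}, {"I10", "E11"}, {"E11", "N18"}]

def multi_hot(visit):
    """x_t in {0,1}^|C|: 1 where the code was assigned during the visit."""
    x = np.zeros(len(codes))
    for c in visit:
        x[idx[c]] = 1.0
    return x

X = np.stack([multi_hot(v) for v in visits])  # shape (T, |C|)
print(X)
```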

  • Now we need a way of embedding the codes for each visit. The paper uses GRAM to incorporate the medical ontology information into the embedding procedure.


  • We use the embedding matrix G to embed each visit V_t into v_t, which feeds the end-to-end architecture. The embedding is a matrix multiplication of G with each x_t, followed by a tanh nonlinearity. Each v_t (an embedded visit) is then fed into an RNN whose output passes through a softmax operation. The predicted y (yhat) has |C| entries, where each entry is the probability of that code appearing in the next visit. We train with a binary cross-entropy loss (since the ground truth also has |C| entries, but with binary values). This is repeated for each t across the T visits (e.g., V_1 is used to predict V_2, V_1 + V_2 to predict V_3, up to V_1 + … + V_{T-1} to predict V_T).
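The end-to-end flow above can be sketched roughly as follows. Every parameter here is a random stand-in and the RNN is a bare tanh cell; in the paper, G’s columns are the attention-derived code embeddings from GRAM and everything is trained jointly.

```python
import numpy as np

rng = np.random.default_rng(0)
C, m, h = 4, 8, 6      # |C| codes, embedding size m, RNN hidden size h
T = 3                  # number of visits

# Hypothetical random parameters (trained jointly in the real model).
G = rng.standard_normal((m, C))
Wh = rng.standard_normal((h, h)) * 0.1
Wv = rng.standard_normal((h, m)) * 0.1
Wo = rng.standard_normal((C, h)) * 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

X = rng.integers(0, 2, size=(T, C)).astype(float)   # multi-hot visits x_t

hidden = np.zeros(h)
losses = []
for t in range(T - 1):                        # V_1..V_t predict V_{t+1}
    v_t = np.tanh(G @ X[t])                   # embed the visit: tanh(G x_t)
    hidden = np.tanh(Wh @ hidden + Wv @ v_t)  # simple RNN step
    y_hat = softmax(Wo @ hidden)              # per-code scores for next visit
    y = X[t + 1]                              # ground-truth multi-hot vector
    # binary cross-entropy summed over the |C| codes
    losses.append(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)).sum())

print(sum(losses) / len(losses))              # average loss over predictions
```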

Unique Points:

  • The paper uses publicly available data (e.g., the MIMIC-III dataset), which allows us to test the results. However, be wary of the size of the data: it contains ICU events for ~7,500 patients over an 11-year period. I recommend playing with Scala to understand the data a bit before implementing anything. I might upload some of my old MIMIC scripts to help with understanding the data’s context.


  • GRAM was tested against recurrent architectures and different initialization schemes for representing the medical ontology. The results were quite impressive, even for accuracy@5 and accuracy@20, considering the number of clinical code groups.
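Accuracy@k is defined in more than one way in the literature; one plausible reading (the fraction of true next-visit codes recovered among the model’s top-k predictions) can be sketched as:

```python
import numpy as np

def accuracy_at_k(y_true, y_score, k):
    """Fraction of true codes found among the top-k scored codes,
    pooled over visits (one common reading of accuracy@k)."""
    hits, total = 0, 0
    for y, s in zip(y_true, y_score):
        topk = set(np.argsort(s)[-k:])        # indices of the k highest scores
        true = set(np.flatnonzero(y))         # indices of the true codes
        hits += len(true & topk)
        total += len(true)
    return hits / total

# Tiny made-up example: two visits, five candidate codes.
y_true = np.array([[1, 0, 1, 0, 0],
                   [0, 1, 0, 0, 1]])
y_score = np.array([[0.9, 0.1, 0.2, 0.8, 0.1],
                    [0.1, 0.7, 0.2, 0.1, 0.6]])
print(accuracy_at_k(y_true, y_score, k=2))  # 3 of 4 true codes → 0.75
```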


  • The paper also plots scatterplots of the final representations to see whether they are consistent with the medical ontologies. You can see that the attention mechanism was able to incorporate the ontologies well, as similar classes of clinical codes form distinct clusters.


  • The authors also studied the attention behavior for ICD codes that have many associated events and for ones that do not. We expect that when data is scarce for a particular clinical event, more attention will be given to its parents, since that allows us to better represent the clinical code. Figure 4 confirms this hypothesis by showing the amount of attention the leaf and parent nodes receive: in the diagram, (a)’s leaf is rarely observed, (b) has a few more events available, and leaves (c) and (d) have plenty of clinical events.


  • I think the GRAM interface proposed by the authors has a lot of potential. Data insufficiency is a major deterrent when it comes to using health records to garner insights. Using medical ontologies to mitigate this issue, while incorporating valuable clinical information that aligns with medical expectations, offers a powerful representation. Though we still have quite a long way to go before clinical records give direct, actionable insights to professionals, artificial intelligence is helping make leaps toward that goal.


2 thoughts on “GRAM: Graph-based Attention Model for Healthcare Representation Learning”

  1. Tom M. SchaulE says:

    Haven’t checked out the paper yet, but are there any other large datasets available? And I’d very much like the MIMIC scripts you mentioned. I like your posts, but I’d like to see a bit more on the experiments and some comparisons with similar studies. With that being said, I think it should be a strong requirement for every publication to have one of your “neural perspectives” 🙂 I also really like when DeepMind publishes some supporting blogs for very detailed papers.


    • gokumohandas says:

      Appreciate the comment, I’ll try to keep them coming! And yes, I think it would be very helpful to offer some context on recent papers working on the same task. Usually papers have a long, detailed section about previous/related work, but this specific prediction task has received very little attention (apart from this group itself, JHU, and a few isolated others). Hopefully, in the future we’ll see more!

