Web Scraping and Data Visualization of PubMed Biomedical Articles

Hongwei Liu

Source Code


People are paying more and more attention to health-related problems. When I studied Food Science & Nutrition as an undergraduate, I learned about widespread concern over the health and nutrition of aging baby boomers. Close to a fifth of U.S. GDP is spent on health care, with a large share of that spending going to the elderly. Researchers are now applying advanced scientific methods in fields such as Health Informatics, Biomedical Informatics and Bioinformatics to create innovative treatment and diagnostic techniques and to improve the healthcare system.

One popular research area is applying NLP and deep learning methods to Electronic Health Records. To find out how universities and research institutes collaborate in this area, I scraped the PubMed database for articles matching the search term "clinical natural language processing" and recorded the information for all 1600+ returned articles.
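For readers who want to reproduce the search step, here is a minimal sketch. It is not the scraper used for this post: it swaps in NCBI's public E-utilities ESearch endpoint instead of scraping result pages. The function names `build_esearch_url` and `fetch_pubmed_ids` are hypothetical helpers introduced only for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities ESearch endpoint. Assumption: the original project
# scraped PubMed pages; this API is used here as a stand-in.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=100, retstart=0):
    """Build an ESearch URL that returns matching PubMed IDs as JSON."""
    params = {
        "db": "pubmed",        # search the PubMed database
        "term": term,          # free-text query
        "retmode": "json",     # ask for a JSON response
        "retmax": retmax,      # page size
        "retstart": retstart,  # offset of the first returned ID
    }
    return ESEARCH_URL + "?" + urlencode(params)


def fetch_pubmed_ids(term, retmax=100):
    """Fetch one page of PubMed IDs for a query (requires network access)."""
    with urlopen(build_esearch_url(term, retmax)) as response:
        payload = json.load(response)
    return payload["esearchresult"]["idlist"]


# Example call (hits the network, so it is left commented out):
# ids = fetch_pubmed_ids("clinical natural language processing", retmax=20)
```

Paging through all results would mean calling `build_esearch_url` repeatedly with an increasing `retstart`, with a pause between requests to stay within NCBI's usage limits.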



Number of Published Papers Each Year:

The number of papers tagged with keywords like 'Clinical NLP' has clearly grown rapidly. The 2017 count is incomplete because the article information was collected in August, so it should keep growing through the end of the year.
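The yearly counts behind a chart like this can be produced with a simple tally. The sketch below assumes each scraped article is a dictionary with a `year` field; the sample records and the `papers_per_year` helper are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical sample records; the real data set has 1600+ articles.
articles = [
    {"pmid": "1", "year": 2015},
    {"pmid": "2", "year": 2016},
    {"pmid": "3", "year": 2016},
    {"pmid": "4", "year": 2017},
]


def papers_per_year(records):
    """Return a dict mapping year to paper count, sorted by year."""
    counts = Counter(record["year"] for record in records)
    return dict(sorted(counts.items()))


yearly_counts = papers_per_year(articles)
print(yearly_counts)
```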

Institution Information:

The institution information is hard to summarize because there are many departments, centers and institutes both inside and outside each university. I therefore regrouped organizations by the city they are located in. I also put a lot of effort into finding collaborations between different universities (represented as different cities). To keep the relationship graph readable, I included only papers published in 2016 and 2017 (about 300 in total) and visualized their collaborations. You can zoom in for a better look.
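The regrouping by city and the edges of the collaboration graph can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: the keyword lookup `KEYWORD_TO_CITY`, the helpers `city_of` and `city_pairs`, and the sample affiliations are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical lookup from an affiliation keyword to a city.
KEYWORD_TO_CITY = {
    "Harvard Medical School": "Boston",
    "Brigham and Women's Hospital": "Boston",
    "Columbia University": "New York",
    "University of Utah": "Salt Lake City",
}


def city_of(affiliation):
    """Return the city for an affiliation string, or None if unknown."""
    for keyword, city in KEYWORD_TO_CITY.items():
        if keyword in affiliation:
            return city
    return None


def city_pairs(papers):
    """Count how many papers link each unordered pair of distinct cities."""
    edges = Counter()
    for affiliations in papers:
        # Collapse all institutions of one paper to a sorted set of cities.
        cities = sorted({c for c in map(city_of, affiliations) if c})
        edges.update(combinations(cities, 2))
    return edges


# Each paper is the list of its authors' affiliation strings (made up here).
papers = [
    ["Dept. of Biomedical Informatics, Harvard Medical School",
     "Brigham and Women's Hospital"],
    ["Harvard Medical School", "Columbia University, New York"],
    ["University of Utah, Salt Lake City", "Columbia University",
     "Harvard Medical School"],
]
edges = city_pairs(papers)
```

Each key in `edges` is a pair of cities and each value is the number of shared papers, which maps directly onto edge weights in a graph. The first sample paper adds no edge because both of its institutions are in Boston.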

In the graph above, three cities have the most publications: Boston, Salt Lake City and New York. Boston is well known for its healthcare industry, with many highly reputed universities, hospitals and research institutes such as Harvard Medical School and Brigham and Women's Hospital. Similarly, in New York, institutions like Columbia University, Mount Sinai School of Medicine and Memorial Sloan Kettering Cancer Center contribute to Clinical NLP research. In Salt Lake City, the University of Utah is the main contributor.

Because collaboration between universities and institutes within the same city is quite common, I also reprocessed the data to see which cities collaborate most with other cities. The top three are Boston, New York and Seattle, all cities with favorable locations and active research environments.
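One way to separate cross-city collaboration from same-city collaboration is to count, for each city, only the papers that also involve at least one other city. The sketch below assumes each paper has already been reduced to the set of cities among its affiliations; the sample data and the `external_collaborations` helper are hypothetical.

```python
from collections import Counter

# Hypothetical input: one set of cities per paper.
papers = [
    {"Boston"},                                  # single-city paper, ignored
    {"Boston", "New York"},
    {"Boston", "Seattle"},
    {"New York", "Seattle", "Salt Lake City"},
]


def external_collaborations(city_sets):
    """Count, per city, the papers written together with another city."""
    counts = Counter()
    for cities in city_sets:
        if len(cities) > 1:  # skip papers whose authors share one city
            counts.update(cities)
    return counts


counts = external_collaborations(papers)
ranking = counts.most_common()  # cities sorted by cross-city paper count
```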

Country Information:

The pie chart above shows that most papers are published by universities and research institutes in the U.S., which suggests that the U.S. is taking the lead in Clinical NLP research, and in healthcare research more broadly.

More information about authors and journals can be found in the PubMed Jupyter Notebook under Source Code.