Sometimes I am curious about who is tweeting about certain topics. One day in May 2017 I decided to collect tweets by tracking hashtags such as #healthcare and #health. I ended up with more than 5,000 tweets and ran some basic analysis on them. Of course, it took a lot of time and effort to clean and reorganize the messy data.
- Data Collection & Storage: Use the Twitter Streaming API to collect tweets with hashtags such as #healthcare and #health, and store the JSON data in MongoDB.
- Data Extraction & Preprocessing: Extract the tweet data from MongoDB and use natural language processing methods to clean and stem the text and to filter for the desired words.
- Data Transformation: Extract geolocation from the tweets and transform it to GeoJSON format for mapping; use Pandas to compute summary statistics and build a dataframe for plotting.
- Data Analysis & Visualization: Conduct descriptive analysis, create graphs, and plot the tweets on a map.
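The GeoJSON transformation step can be sketched roughly as follows. This is a minimal illustration, not the code used for this project: it assumes each stored tweet is a dict whose `coordinates` field is either `None` or a GeoJSON-style point (`[longitude, latitude]`), as in Twitter's classic tweet JSON. The function name `tweets_to_geojson` and the sample tweets are hypothetical.

```python
import json


def tweets_to_geojson(tweets):
    """Build a GeoJSON FeatureCollection from tweets that carry coordinates."""
    features = []
    for tweet in tweets:
        point = tweet.get("coordinates")
        if not point:
            # Most tweets have no exact location; skip them.
            continue
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": point["coordinates"]},
            "properties": {"text": tweet.get("text", "")},
        })
    return {"type": "FeatureCollection", "features": features}


# Hypothetical sample data standing in for documents read from MongoDB.
sample_tweets = [
    {"text": "We are hiring nurses #healthcare",
     "coordinates": {"type": "Point", "coordinates": [-87.63, 41.88]}},
    {"text": "No location on this one #health", "coordinates": None},
]

collection = tweets_to_geojson(sample_tweets)
print(json.dumps(collection, indent=2))
```

Only tweets with an exact point are kept here; tweets that carry just a place name or bounding box would need separate handling.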
I built a word cloud to give a general view of the high-frequency words.
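The word frequencies behind a word cloud can be computed along these lines. This is a simplified sketch using only the standard library: the stop-word list is a tiny illustrative set, stemming is left out, and the function name `word_frequencies` and the sample texts are assumptions rather than the original code.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in",
              "for", "is", "are", "on", "at", "we", "rt"}


def word_frequencies(texts):
    """Count cleaned, lowercased words across a list of tweet texts."""
    counts = Counter()
    for text in texts:
        # Drop URLs first so fragments of links are not counted as words.
        text = re.sub(r"https?://\S+", " ", text.lower())
        words = re.findall(r"[a-z]+", text)
        counts.update(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return counts


# Hypothetical sample texts.
sample_texts = [
    "We are hiring nurses! #healthcare #jobs https://t.co/abc",
    "Big data in #healthcare: hiring now",
]

freqs = word_frequencies(sample_texts)
print(freqs.most_common(5))
```

The resulting counts can then be handed to whichever plotting or word-cloud library is being used.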
A few interesting words stand out in the word cloud. It seems that when people post tweets with healthcare hashtags, they tend to talk about "hiring" and "jobs", which suggests either that healthcare workers are in high demand or that many companies are trying to recruit people on Twitter. Tags like "bigdata" and "healthtech" indicate that innovative technologies such as big data and machine learning are being deployed in the healthcare industry. We can also find "trumpcare", "obamacare", "ahca" (the American Health Care Act of 2017), and "cboscore" (the Congressional Budget Office score for the Republican health-care plan). This shows that people are paying attention to the recent changes in healthcare policy, especially after House Republicans voted to repeal Obamacare and replace it with Trumpcare.
To get a better sense of the quantitative distribution of those hashtags, the top 12 hashtags have been collected and plotted below.
In the pie chart, "hiring" and "job" account for 70% of the total hashtags, while healthcare-policy-related tags take up about 10%.
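Counting the top hashtags and their shares can be sketched as below. The sketch assumes each tweet dict follows Twitter's classic JSON layout, with hashtags under `entities.hashtags` as a list of objects that have a `text` key. The helper names `top_hashtags` and `hashtag_shares` and the sample tweets are hypothetical, and the standard library's `Counter` stands in for the Pandas-based counting mentioned earlier.

```python
from collections import Counter


def top_hashtags(tweets, n=12):
    """Return the n most common hashtags (case-insensitive) with their counts."""
    counts = Counter()
    for tweet in tweets:
        for tag in tweet.get("entities", {}).get("hashtags", []):
            counts[tag["text"].lower()] += 1
    return counts.most_common(n)


def hashtag_shares(top):
    """Convert (tag, count) pairs into each tag's fraction of the total."""
    total = sum(count for _, count in top)
    return {tag: count / total for tag, count in top}


# Hypothetical sample tweets.
sample_tweets = [
    {"entities": {"hashtags": [{"text": "healthcare"}, {"text": "Hiring"}]}},
    {"entities": {"hashtags": [{"text": "healthcare"}, {"text": "job"}]}},
    {"entities": {"hashtags": [{"text": "Healthcare"}]}},
]

top = top_hashtags(sample_tweets)
shares = hashtag_shares(top)
for tag, count in top:
    print(f"{tag}: {count} ({shares[tag]:.0%})")
```

The shares computed this way are the values a pie chart would display; note that they are fractions of the top-n tags only, not of every hashtag collected.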
Let's compare the healthcare tags with the others.
The histogram shows more about the quantitative relationship between healthcare and the other keywords. For every 5,000 healthcare tags, there are about 100 tweets about each of the healthcare-reform-related topics.
Below is the geolocation of all the tweets.
The majority of the tweets were posted inside the United States. Apart from the big cities on the East and West Coasts, it is surprising to see many red dots in the Midwest. One reason could be that there are more elderly people in the Midwest, while many skilled healthcare workers move to the East and West Coasts. It might also be due to the burgeoning telemedicine industry in the Midwest, helped by advances in data science and computer science.