Machine Learning Times

2 years ago
Mind Your Own Text: Public Data for Political Insights


Who is your favorite president? Can Data Science aid in evaluating presidents and their policies?

From the suburbs of Washington, D.C., our research group at George Mason University (Abhishek Kamath, Suyameendra Wadki, and Abhishek Madhusudhan) has developed a text-mining system to help you answer these questions!

Governments around the world are racing to advance open government initiatives. The US federal government, for instance, is one of the main collectors and distributors of data. Some American governmental data are private or confidential, while many other parts are public. Federal agencies post some of their public data online; nevertheless, most government data reside on private servers and personal computers, saved in unstructured formats.

The Open Data Initiative has pushed most agencies to pull their datasets together and share them with the public. However, a recent study showed that the majority of shared data is not machine-readable (i.e., it is stored in formats that are not readily usable for analysis, such as PDF files). The quandary with those formats is that they prevent agencies from extracting insights or sharing them on public outlets. To achieve the required data science dynamism in government, agencies need to address the immediate need for increased data science expertise.

US federal agencies should invest in data sooner rather than later, or they will miss the unstoppable train of data science that most organizations (including other governments) are already riding. It is not necessary for all federal analysts to become data science experts, but they have to become power users of data tools. Federal employees should embed such tools in the culture of their divisions and be data science champions at their agencies. To accomplish that, federal analysts should be provided with advanced data tools enriched with content relevant to their governmental tasks.

Based on that notion, we decided to address this problem by building a big-data engine that mines bodies of governmental data (focusing on textual data). Agencies have massive amounts of public textual data from political campaigns, debates, and State-of-the-Union addresses.

The goal of the engine is to collect political datasets, structure them, provide insights into politicians’ views, and observe trends throughout the years on the issues that Americans care about the most. Using streamed data and live data visualizations, the system offers predictions, correlations, and quick conclusions about policies and their effect on the country.

In this article, the Governmental Text Engine (GTE) is presented through an example of mining all State-of-the-Union addresses throughout the years (from the days of George Washington to the current presidency). Figure 1 shows the process applied through Hadoop’s MapReduce. The first step was to upload the text of all State-of-the-Union addresses into a Hadoop cluster. A Python scraper program (using the library “Beautiful Soup”) was written to scrape textual data from the website:

Afterwards, text-mining algorithms were deployed to extract political insights from presidential addresses (since 1790). GTE intends to find correlations between the most frequent words that appear in the presidential addresses and the trends in the issues facing the country. A word-count process was applied by splitting the text, mapping the words, shuffling, and then reducing to produce counts.

Figure 1: Map-Reduce Word Mining and Counting Process
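The split, map, shuffle, and count steps described above can be sketched in plain Python. This is only an illustration of the general word-count pattern, not the GTE code itself: the function names, the chunk size, and the sample text are all assumptions made for this example, and a real Hadoop job would distribute these steps across a cluster rather than run them in one process.

```python
import re
from collections import defaultdict
from itertools import chain


def split_into_chunks(text, lines_per_chunk=2):
    """Split step: break the input into small chunks of lines."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + lines_per_chunk])
            for i in range(0, len(lines), lines_per_chunk)]


def map_words(chunk):
    """Map step: emit a (word, 1) pair for every word in a chunk."""
    return [(word, 1) for word in re.findall(r"[a-z']+", chunk.lower())]


def shuffle(pairs):
    """Shuffle step: group all emitted values under their word key."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the grouped values into one count per word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(text):
    """Run the four steps in sequence on a single block of text."""
    chunks = split_into_chunks(text)
    mapped = chain.from_iterable(map_words(chunk) for chunk in chunks)
    return reduce_counts(shuffle(mapped))


# Made-up sample text, not taken from any real address.
sample = "Jobs and the economy\nThe economy needs jobs\nJobs for the states"
counts = word_count(sample)
print(sorted(counts.items(), key=lambda item: -item[1])[:3])
```

Keeping each step as its own function mirrors the stages in Figure 1, so any one of them could be swapped for a distributed equivalent without touching the others.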

Once the data is filtered and structured, important words and trends are derived and displayed using ZingChart.js & Chart.js dashboards. Figure 2 shows a word cloud built to summarize the main trends during a certain presidential period; Figure 3 shows a group of pie charts that break down the most common words by year during the Obama administration. It is evident, for example, that during that presidency the words ‘jobs’ and ‘healthcare’ were the two words most uttered by that president.

Figure 2: Word-Cloud (President Jackson’s most used words)

Figure 3: Insights on One Presidency
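As a rough sketch of how per-year word counts might be filtered and reduced to the handful of words a pie chart would display, consider the following. The stop-word list, the function name, and all of the numbers are invented for illustration; they are not GTE's data or its actual filtering rules.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real system would use a
# much larger one.
STOP_WORDS = {"the", "and", "of", "to", "a", "in", "we", "our", "for"}


def top_words_by_year(counts_by_year, n=2):
    """Drop stop words, then keep the n most frequent words for each year."""
    result = {}
    for year, counts in counts_by_year.items():
        filtered = Counter(
            {word: count for word, count in counts.items()
             if word not in STOP_WORDS}
        )
        result[year] = filtered.most_common(n)
    return result


# Invented counts, used only to show the shape of the input and output.
counts_by_year = {
    2010: {"the": 50, "jobs": 12, "healthcare": 9, "energy": 4},
    2011: {"and": 40, "jobs": 15, "education": 7, "healthcare": 6},
}
top = top_words_by_year(counts_by_year)
print(top)
```

The resulting list of (word, count) pairs per year is the kind of structure a charting library such as Chart.js could consume once serialized to JSON.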

Figure 4 shows that President Lincoln’s most uttered words were ‘states’ and ‘congress’, as well as ‘unified’ (given the Civil War during that period), while President Eisenhower’s (Figure 4 as well) most uttered word was ‘world’ (reflecting the post-World War II era). The power of the GTE system is illustrated in the quick insights derived from the textual data. Figure 5, for example, shows how two presidents are compared to each other in terms of text and addresses.

Figure 4: President Lincoln’s Doughnut and Eisenhower’s Radar

Additionally, the GTE system has term-wise analysis, which helps in comparing two consecutive presidents. For example, the second term of President George W. Bush is compared with President Obama’s first term (transitional data analysis); refer to Figure 5.

Figure 5: Comparing Two Presidents by Issues (Obama and Trump)
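One simple way such a term-by-term comparison could be computed is to normalize each term's word counts by its total and look at a fixed list of issue words side by side. This is an assumption about a plausible approach, not a description of how GTE does it; the function names, the issue list, and the counts below are all made up.

```python
def relative_frequencies(counts):
    """Convert raw word counts into shares of the total word count."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


def compare_terms(counts_a, counts_b, issues):
    """Return, for each issue word, its share in term A and in term B."""
    freq_a = relative_frequencies(counts_a)
    freq_b = relative_frequencies(counts_b)
    return {
        issue: (freq_a.get(issue, 0.0), freq_b.get(issue, 0.0))
        for issue in issues
    }


# Invented counts for two hypothetical consecutive terms.
term_a = {"security": 30, "economy": 10, "jobs": 10}
term_b = {"security": 10, "economy": 20, "jobs": 20}
comparison = compare_terms(term_a, term_b, ["security", "jobs", "energy"])
for issue, (share_a, share_b) in comparison.items():
    print(f"{issue}: {share_a:.2f} vs {share_b:.2f}")
```

Normalizing by the total keeps the comparison fair when one set of addresses is much longer than the other, which raw counts alone would not.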

The GTE system, its code, the user interface, and the data are published on GitHub (we are big believers in Open Science and Open Source!). We encourage you to explore the website, and see which president(s) you prefer:

The GTE system illustrates that for data science in government, the future is now. Tools similar to the GTE system help agencies share more data with the public. The eventual goal is to increase government accountability, mend transparency issues, and improve national discussions regarding policy making.

See these for further reading:

  1. Open for whom? An overview of for government file formats:
  2. Batarseh, F., and Yang, R., “Federal Data Science: Transforming Government and Agricultural Policy Using Artificial Intelligence”, 1st Edition, Published by Elsevier’s Academic Press. ISBN: 9780128124444.

About the Author

Feras A. Batarseh is a Research Assistant Professor with the College of Science at George Mason University (GMU), in Fairfax, VA, USA. His research spans the areas of Data Science, Artificial Intelligence, and Context-Aware Software Systems. For more information, please refer to:

His three students, Abhishek Kamath, Suyameendra Wadki, and Abhishek Madhusudhan, developed the system as part of their MSc work at GMU.
