Agent Bayes is an algorithm that replicates the automated classification process (sometimes called ‘predictive analytics’) used to process open source intelligence (OSINT) in Next Generation Information Access (NGIA) systems. NGIA systems automate the collection and analysis of data and produce reports. OSINT is intelligence collected from publicly available sources and may account for as much as 80% of the intelligence database. Agent Bayes evaluates Twitter posts and labels each one as either suspicious or not suspicious with regard to national security threats. CSIA agents, or visitors to the site, can compare the algorithm’s decision with the original tweet and its associated metadata, and are invited to agree or disagree with that decision. This feedback will be integrated into the algorithm to improve its accuracy. Agent Bayes is named after Bayes’ theorem, the most common formula used by intelligence agencies for establishing the probability of an event.
Supervised machine learning procedure for Agent Bayes
Agent Bayes uses a supervised machine learning classification algorithm modeled on the data collection processes described in technical reports and documents that have been leaked or released through Freedom of Information Act requests. Creating Agent Bayes began with manually assembling a corpus of social media posts, each labeled either suspicious or not suspicious. Ambiguous posts were discarded. Next, the posts were broken down into their ‘feature vectors’, the sets of individual measurable properties that describe each post. In this case, the features are the individual words that make up the posts. Stop words, or commonly used words such as ‘and’, ‘the’, and ‘is’, were removed. The feature vectors were then processed using Bayes’ theorem to find the patterns of speech that occur most frequently within each category: suspicious and not suspicious. These patterns form models against which new social media posts can be compared. The algorithm then labels each new post based on its statistical similarity to the models. Visitors to the site can agree or disagree with Agent Bayes’ predictions, and their feedback will be integrated into the classifier to improve its performance.
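The procedure above can be sketched in a few lines of Python. This is a minimal illustration of a naive Bayes word classifier, not the code behind Agent Bayes: the stop-word list, the toy corpus, the add-one smoothing, and every name in it (`NaiveBayesSketch`, `TOY_CORPUS`, `build_classifier`) are assumptions made for the example. In particular, the source does not say how visitor feedback is integrated; the sketch simply treats each piece of feedback as one more labeled post.

```python
import math
import re
from collections import Counter

# Assumed stop-word list; the text names only 'and', 'the', and 'is'.
STOP_WORDS = {"and", "the", "is", "a", "to", "of", "in"}
LABELS = ("suspicious", "not suspicious")

# Invented toy corpus for illustration only; not data from the project.
TOY_CORPUS = [
    ("plan to attack the building tonight", "suspicious"),
    ("the attack is planned for tonight", "suspicious"),
    ("great coffee at the cafe this morning", "not suspicious"),
    ("the weather is lovely this morning", "not suspicious"),
]


def tokenize(post):
    """Lowercase a post, split it into words, and drop stop words."""
    words = re.findall(r"[a-z']+", post.lower())
    return [w for w in words if w not in STOP_WORDS]


class NaiveBayesSketch:
    def __init__(self):
        self.word_counts = {label: Counter() for label in LABELS}
        self.post_counts = Counter()

    def train(self, post, label):
        """Add one labeled post to the model (also used for feedback)."""
        self.post_counts[label] += 1
        self.word_counts[label].update(tokenize(post))

    def log_scores(self, post):
        """Return log(prior * product of word likelihoods) for each label."""
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_posts = sum(self.post_counts.values())
        scores = {}
        for label in LABELS:
            # Add-one smoothed prior, so a label with no posts scores low
            # instead of raising a math domain error.
            prior = (self.post_counts[label] + 1) / (total_posts + len(LABELS))
            score = math.log(prior)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(post):
                if word not in vocab:
                    continue  # words never seen in training carry no evidence
                # Add-one (Laplace) smoothed likelihood of the word.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocab)))
            scores[label] = score
        return scores

    def classify(self, post):
        """Return the label whose model the post most resembles."""
        scores = self.log_scores(post)
        return max(scores, key=scores.get)


def build_classifier(corpus=TOY_CORPUS):
    """Train a fresh classifier on a list of (post, label) pairs."""
    clf = NaiveBayesSketch()
    for post, label in corpus:
        clf.train(post, label)
    return clf
```

A new post is labeled with `build_classifier().classify("some new post")`. Under the assumption made here, a visitor who disagrees with a prediction would trigger a call to `train` with the same post and the corrected label, which shifts the word counts for later predictions.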
Participant reviewing social media posts at Science Gallery Dublin.
The Crowd-Sourced Classifier is a supervised machine learning classifier trained on Twitter posts that were manually labeled at Science Gallery Dublin during the SECRET exhibition. From August 8 until November 1, 2015, a CSIA terminal was available for visitors to evaluate Twitter posts. The Crowd-Sourced Classifier is built from the more than 14,000 posts that were labeled during that time. Its decisions reflect the aggregated decisions made by visitors to the exhibition.