Data Collection and Content Classification.
Our database of Media profiles has 2 distinct jobs: collecting intangible data (like revenue, ownership, years online…) and classifying content, both for our taxonomy and for how sites are “spotted as” (like “fake news”, “junk science”…).
Data Collection is a multi-reference, cross-checking and evolution-watching crawling exercise when…
Content Classification is all about Machine Learning.
And all about “bags of words”. For every classification job, we build datasets made of words, and the frequency of occurrence of those words is used to train a classifier.
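To make the “bag of words” idea concrete, here is a minimal sketch of a classifier trained on word frequencies. It is an illustration only: the choice of multinomial Naive Bayes, the class name BagOfWordsClassifier, the tokenizer and the tiny training set are all assumptions made for this example, not a description of the classifier TrustedOut actually runs.

```python
import math
from collections import Counter, defaultdict

PUNCTUATION = ".,;:!?\"'()"


def tokenize(text):
    """Lowercase the text and split it into words, dropping basic punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


class BagOfWordsClassifier:
    """Multinomial Naive Bayes over word counts (an illustrative choice)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of training texts
        self.vocabulary = set()

    def train(self, labeled_texts):
        """Accumulate word frequencies for each (text, label) pair."""
        for text, label in labeled_texts:
            words = tokenize(text)
            self.word_counts[label].update(words)
            self.doc_counts[label] += 1
            self.vocabulary.update(words)

    def classify(self, text):
        """Return the label with the highest log-probability for the text."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word in words:
                # Add-one (Laplace) smoothing so unseen words do not zero out a label.
                score += math.log((counts[word] + 1) / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy dataset, for illustration only.
training = [
    ("new smartphone chip software launch", "technology"),
    ("startup releases software update for phones", "technology"),
    ("team wins championship game final", "sports"),
    ("player scores goal in the game", "sports"),
]
clf = BagOfWordsClassifier()
clf.train(training)
print(clf.classify("software update for the new chip"))
```

The same structure works for any label set: only the training texts and their labels change, which is why one approach can serve both taxonomy and “spotted as” classification.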
As mentioned above, we have 2 types of Classification: Taxonomy and “spotted as”.
As in the graphic above, every article is matched against our taxonomy datasets so we can classify each and every article. This gives us a clear picture of a feed and, thus, of the whole media.
This, of course, adds up to a (big) lot of operations: 75,000 per article. Yes, that is 75 billion operations per million articles daily.
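As a quick check of that arithmetic, the short sketch below multiplies the per-article figure by a daily volume. The constant and function names are made up for this example; only the 75,000 figure comes from the text.

```python
# Figure quoted in the text: operations needed to classify one article.
OPS_PER_ARTICLE = 75_000


def daily_operations(articles_per_day):
    """Total classification operations for a given daily article volume."""
    return OPS_PER_ARTICLE * articles_per_day


# One million articles a day gives 75,000,000,000 operations (75 billion).
print(f"{daily_operations(1_000_000):,}")
```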
Taxonomy fun facts (as of today!)
Below is the visualization of the DNA of the New York Times Tech section.
Sensitiveness and depth customization. Tailor-made for the analyst.
The datasets used to classify articles can use a customized time buffer, which controls how sensitive to daily news the taxonomy will be. In addition, cliffs can be customized to select a depth of expertise, from “dedicated” to “covered” or even “all sounds”. Both combined, plus the “always up-to-date” factor, make our taxonomy perfectly tailor-made for the job the analyst wants to run. This is why we use “Corpus Intelligence” as our tagline.
We can also link our taxonomy to our Enterprise Client’s taxonomy, so Corpus Intelligence can use the client’s business environment. (We’ll cover this in a dedicated post later. If you can’t wait, ask using the form below.)
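One way to picture these two settings is sketched below. Everything in it is an assumption made for illustration: the time buffer is modeled as a simple cutoff on publication date, and the cliffs are modeled as minimum shares of a feed's articles falling in a category. The function names, the document fields and the threshold values are hypothetical and do not come from TrustedOut.

```python
from datetime import date, timedelta


def within_time_buffer(documents, today, buffer_days):
    """Keep only documents published within the last `buffer_days` days."""
    cutoff = today - timedelta(days=buffer_days)
    return [doc for doc in documents if doc["published"] >= cutoff]


# Hypothetical cliffs: minimum share of a feed's articles in a category
# needed to reach each depth level. The values are illustrative only.
CLIFFS = [("dedicated", 0.50), ("covered", 0.10), ("all sounds", 0.0)]


def depth_of_expertise(category_share, cliffs=CLIFFS):
    """Return the first (deepest) level whose cliff the share reaches."""
    for label, minimum_share in cliffs:
        if category_share >= minimum_share:
            return label
    return None


# Hypothetical documents, for illustration only.
documents = [
    {"title": "recent article", "published": date(2024, 1, 30)},
    {"title": "older article", "published": date(2024, 1, 20)},
]
recent = within_time_buffer(documents, today=date(2024, 1, 31), buffer_days=7)
print([doc["title"] for doc in recent])
print(depth_of_expertise(0.6), depth_of_expertise(0.2), depth_of_expertise(0.01))
```

Under these assumptions, a shorter buffer makes the datasets follow daily news more closely, while higher cliffs restrict results to feeds that focus heavily on a category.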
“Spotted as” Classification.
The point of being AI-operated is that we do not have any emotion or opinion. Everything is made for our clients to define what they truly need and trust for content.
TrustedOut does not score or judge anything or anyone. In addition, notions like “fake news” are not as crystal clear as people may think. The “Media, Trust and Democracy report” says it perfectly in its introduction: “Concern about “fake news” is high, but we can’t agree on what that means.”
A vivid picture of how a Media is “spotted as”.
As TrustedOut profiles Media and their brand values, we have developed a sophisticated way to classify how a Media is “spotted”. In other words, we do not score or judge; we tell you if a Media is “spotted as” a fake news publication, for example.
In addition, the way a Media is “spotted as” varies over time. Some are getting worse, some are just revivals of previously shut-down ones, and some are, of course, fixed and improved. This is why it’s mandatory to keep the classification always updated and, consequently, to have your Corpus of documents always up-to-date.
Works with any terms. Bad or good.
“Fake news” is always the first to come to mind, followed by other toxic or suspicious terms like “Extreme bias”, “Junk Science”… but it also works perfectly for neutral or positive terms, like “Visionary”, “Optimistic”… This opens doors to Enterprise-wide personalization.