Tom Tague isn't content to let an article just be an article. "How do I
take a chunk of text," he asked, "and turn it into a chunk of data?"
He
was speaking Thursday night at a panel
discussion hosted by Hacks/Hackers, a San Francisco-based group that
bridges the worlds of journalism and engineering. Coinciding with the 2010 Semantic
Technology Conference, Thursday's presentation dealt with the Web's
evolution from a tangle of text to a database capable of understanding
its own content.
Tague,
vice president for platform strategy with Thomson Reuters, was joined
by New York Times Semantic Technologist Evan Sandhaus, allVoices CEO Amra Tareen, and Read It Later
creator Nate Weiner. The
semantic Web is already here, they explained, and it's getting smarter.
Make news worth more
Simply
put, the semantic Web is a strategy for enabling communication between
independent databases on the Web.
For example, Sandhaus said,
there's a wealth of priceless data in databases at Amazon, the
Environmental Protection Agency, the Census Bureau, Twitter and
Wikipedia. "But they don't know anything about one another," he said, so
there's no way to answer questions like, "What is the impact of
pollution on population?" or "What do people tweet about on smoggy
days?" (Sandhaus said he did not do his presentation as a representative
of the Times.)
This is a particular problem for news publishers,
said Tague. Publishers need to monetize content, engage with users and
launch new products; since news articles lie in a "sweet spot" between
fleeting tweets and durable scientific journals, they have the most
potential to grab and retain readers.
In other words, it's
possible for publishers to improve the value and shelf life of news. All
that's required is rich metadata.
Metadata, Tague said, improves
reader engagement by linking together related media. For readers, that
means more context on each story and a more personalized experience. And
for advertisers, it means better demographic data than ever before.
But
there's a problem: Currently, the economics of online news doesn't
support the manual creation of metadata.
Let algorithms curate
Tague's
solution to the Internet's overwhelming volume of news is OpenCalais, a Thomson Reuters tool
that can examine any news article, understand what it's about, and
connect it to related media.
This is more than a simple keyword
search. OpenCalais extracts "named entities," analyzing sentence
structure to determine the topic of the article. It is able to
understand facts and events. For example, when fed a
short article about a hurricane forming near Mexico, an OpenCalais demo tool
recognized locations like Acapulco, facilities like The National
Hurricane Center and even occupations like "hurricane specialist." It
also understood facts, synthesizing a subject-verb-object phrase to
express that a hurricane center had predicted a hurricane.
OpenCalais
has already been put to work at a wide range of news organizations,
including The Nation, The New Republic, Slate, and Al Jazeera. Each
site's implementation is unique; for example, DailyMe uses semantic data to monitor
each user's reading habits, presenting the user with personalized
reading suggestions.
Both The Nation and The New Republic saw
immediate benefits to the use of OpenCalais, Tague said; its adoption
coincided with significant gains in time-on-site, and it automatically
generates pages dedicated to a single topic, which had been a
labor-intensive process for editors.
Overcome overwhelming content
Just as OpenCalais frees
editors from the minutiae of searching for complementary stories, Nate
Weiner's software simplifies the gathering of reading material. Read It Later integrates with
browsers and RSS readers; when users see something that they want to
read later, they simply flag the page and the application gathers it for
later consumption.
Unfortunately, users can sometimes wind up
with an overwhelming, disorganized collection of articles. So Weiner
decided to teach the application how to group similar items, making them
easier to skim and select.
Initial experiments with manual
tagging didn't work out, since users weren't interested in taking the
time to add tags to every article they collected. So Weiner turned to
semantic applications that could automatically analyze each article and
organize related topics. His tool of choice: OpenCalais, which turned
Read It Later's "Digest" view from an unwieldy list into a magazine-like
layout.
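The grouping step can be sketched in a few lines of Python. This is only an illustration, not Read It Later's code: it assumes some tagging service has already attached topic labels to each saved article, and the function name and input shape are hypothetical.

```python
from collections import defaultdict

def group_by_topic(articles):
    """Group saved articles under each topic label they carry.

    `articles` is a list of (title, topics) pairs; the topic labels are
    assumed to come from an automatic tagger (hypothetical input shape).
    """
    groups = defaultdict(list)
    for title, topics in articles:
        for topic in topics:
            groups[topic].append(title)
    # Return a plain dict so callers do not create keys by accident.
    return dict(groups)
```

An article tagged with two topics shows up under both, which is closer to a magazine's sections than to a single flat list.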
Organize the organizing
Sandhaus
described the alchemy of the semantic Web as "graphs of triples," which
drew furrowed brows from his audience. But it turned out not to be as
complicated as it sounds; the "triples" are just simple
subject-verb-object sentences, chained together. For example, if a tool
detects "Barack Obama" in an article, it will scan nearby words to
create a relationship like "Barack Obama is the President." Then it can
build on its knowledge of "the President" to branch further out: "The
President lives in the White House," "The White House was burned in
1814," and so on.
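For readers who think in code, the chaining can be sketched roughly as follows. This is a minimal Python illustration built on the three example sentences above; the data layout and function names are assumptions, not anything Sandhaus showed.

```python
from collections import defaultdict

# The example sentences above, written as (subject, verb, object) triples.
TRIPLES = [
    ("Barack Obama", "is", "the President"),
    ("the President", "lives in", "the White House"),
    ("the White House", "was burned in", "1814"),
]

def build_graph(triples):
    """Index triples by subject so each entity points at what is known about it."""
    graph = defaultdict(list)
    for subject, verb, obj in triples:
        graph[subject].append((verb, obj))
    return graph

def follow(graph, start):
    """Branch outward from `start`, collecting every triple reached along the way."""
    seen = set()
    queue = [start]
    chain = []
    while queue:
        node = queue.pop(0)
        if node in seen:
            continue  # avoid looping forever on circular relationships
        seen.add(node)
        for verb, obj in graph.get(node, []):
            chain.append((node, verb, obj))
            queue.append(obj)  # the object becomes the next subject to explore
    return chain

for subject, verb, obj in follow(build_graph(TRIPLES), "Barack Obama"):
    print(subject, verb, obj)
```

Starting from "Barack Obama," the walk reaches the White House and its 1814 fire because the object of one triple is the subject of the next.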
These relationships are derived from massive
databases that grow larger and larger by the day. For example, DBpedia has turned Wikipedia into a
database of 2.6 million entities; Freebase
is a database of databases with 11 million topics; GeoNames tracks 8 million place
names; and MusicBrainz can
recognize 9 million songs.
But the real magic happens when the
databases come together, such as when the BBC wanted to create a comprehensive resource for
information about bands. By merging its own information with entries
from Wikipedia and MusicBrainz, the BBC created a website that seems to
know everything about music.
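In the simplest case, that kind of merge amounts to joining records that share an identifier. The Python sketch below is an assumption about the general idea, not the BBC's system; the identifier, field names and values are all placeholders.

```python
# Hypothetical records keyed by a shared artist identifier.
# The identifier and every field value are placeholders, not real data.
in_house = {"artist-001": {"name": "Example Band", "sessions": 3}}
encyclopedia = {"artist-001": {"biography": "Placeholder biography text."}}
discography = {"artist-001": {"albums": ["Placeholder Album"]}}

def merge_records(*sources):
    """Combine the fields each source holds for the same identifier."""
    merged = {}
    for source in sources:
        for key, fields in source.items():
            # Later sources add to, or override, fields from earlier ones.
            merged.setdefault(key, {}).update(fields)
    return merged

profile = merge_records(in_house, encyclopedia, discography)["artist-001"]
```

The hard part in practice is agreeing on the shared identifier; once every source uses the same key, the merge itself is straightforward.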
Trust algorithms, but trust humans more
As smart as the
semantic Web can be, it's still not as smart as a human editor. "Our
algorithms can never be perfect," said allVoices CEO Amra Tareen. Her
company provides citizen journalists with their own news platform,
incentivizing high-quality reporting with payments based on page views.
Since
its launch in 2008, allVoices has scanned articles to generate what
Tareen called a "bag of words" that connects each story to complementary
reporting. Depending on a reporter's algorithmically calculated
reputation and users' engagement with the story, the story can work its
way up from a local section to national or even global focus on the
site.
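A "bag of words" comparison can be sketched in Python as word counts scored by cosine similarity. This is a generic textbook approach offered for illustration; Tareen did not describe allVoices' actual algorithm, and the stopword list and function names here are assumptions.

```python
import math
import re
from collections import Counter

# A tiny, illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "to", "is", "near"}

def bag_of_words(text):
    """Count the words in a story, ignoring order, case and common filler words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def cosine_similarity(bag_a, bag_b):
    """Score two bags from 0.0 (no shared words) to 1.0 (same word proportions)."""
    shared = set(bag_a) & set(bag_b)
    dot = sum(bag_a[w] * bag_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_related(story, candidates):
    """Return the candidate whose bag of words is closest to the story's."""
    story_bag = bag_of_words(story)
    return max(candidates, key=lambda c: cosine_similarity(story_bag, bag_of_words(c)))
```

Because word order is discarded, the approach is cheap but blunt, which is one reason a human editor still adds value on top of it.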
Tareen estimates that the curating of news on the site is
about 20 percent human and 80 percent algorithmic.
Expect to see more semantic Web tools
Expect
to see more semantic Web technology -- lots more, and soon. "There's
growing momentum in this space," said Sandhaus, gesturing to a slide
showing exponential growth of connected databases. "The more that you
put yourself out there and people point back to you, the easier you are
to find."
Fortunately for journalists, the semantic Web will work
for humans, not the other way around. "We don't want to get in the way
of the journalistic process," said OpenCalais' Tague. That's welcome
news to any reporter who has been frustrated by a clunky content
management system, a labyrinthine tagging and categorization system or
manual photo management.
Semantic Web developers' goal, Tague
said, is to free journalists to report, rather than sentencing them to
generate endless metadata for the sake of SEO. "I hate the idea of
journalists writing for searchability," he said. "That's a problem we
should solve on the tech side."
Weiner of Read It Later agreed. Speaking on behalf of developers, he
advised journalists, "Keep doing what you're doing. We'll try to adapt."