Semantic XBRL Data Search Using SPARQL
From HitachiDataInteractive.com. Written by Ashu Bhatnagar
We all use Google for Web searches on a daily basis and admire the simplicity of its front-end user interface. It’s nice, clean, fast, and simple. Behind this simplicity lie sophisticated index databases and advanced search technologies, but we as users don’t need to know or understand these. All we need to know are smart keywords that help direct our searches from hundreds of billions of marked-up HTML pages scattered across the global Internet.
When we try to search using regular SQL database search technologies, though, we run into difficulties. Why? Because most of this web content sits in distributed HTML flat files and isn’t organized in any centralized database with well-defined data structures and schemas. It’s like a world full of roads with no roadmaps. Go discover!
Search engines like Google, Ask, and others find the content that matches with our queries by building and employing centralized databases that contain metadata, where every keyword acts as a tag and has fast and efficient links to corresponding websites. In other words, a search engine acts like a very knowledgeable guide for us, responding to our queries with found/not found answers based on the Internet roads it has access to and has crawled before.
Why not use such a powerful search front-end to query financial research data? In my experience working with both sell-side and buy-side research analysts, there has been a long-standing request to build such a tool, but until recently the short answer to this request has been “No!”
No, because it’s technically too difficult or it’s too expensive.
No, because Google deals with text and not data, which has both context and meaning. Data is far more challenging to search, because even when it’s on the Web, it is marked up with HTML as text, not as data, thereby losing its context for meaningful search.
No, because there are no generally accepted standard financial dictionaries, or taxonomies, that define terms such as revenue, sales, or net income and identify which of them are synonyms.
Until recently this list of No’s has been long. The good news is that the list is now shrinking quickly with the increasing adoption of XBRL and EDGAR standard taxonomies and the release of several XBRL tools.
All that is needed to accomplish powerful search of financial research data is to subscribe to the SEC’s XBRL filings as free RSS feeds, extract the XBRL data into our own relational or Google-like index databases, and use SQL to find answers to our queries. As an alternative, we could subscribe to third-party data services firms like Bloomberg, Thomson Reuters, FactSet, and others that would add XBRL data to their current aggregate data and continue to offer this as a service.
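The feed-to-database-to-SQL pipeline described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the embedded feed snippet, the example.com links, the `filings` table, and the function names are all hypothetical, and a real SEC feed would be downloaded over HTTP and would carry additional elements that this sketch ignores. Only the standard RSS elements `item`, `title`, `link`, and `pubDate` are relied on.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical RSS snippet standing in for a downloaded feed of XBRL filings.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example XBRL filings feed</title>
    <item>
      <title>Example Corp (10-K)</title>
      <link>http://example.com/filings/example-corp-10k.xml</link>
      <pubDate>Mon, 01 Mar 2010 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Sample Inc (10-Q)</title>
      <link>http://example.com/filings/sample-inc-10q.xml</link>
      <pubDate>Tue, 02 Mar 2010 00:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_feed(feed_xml):
    """Return one (title, link, published) tuple per RSS <item>."""
    root = ET.fromstring(feed_xml)
    return [
        (item.findtext("title"), item.findtext("link"), item.findtext("pubDate"))
        for item in root.iter("item")
    ]


def load_filings(conn, rows):
    """Store feed rows in a (hypothetical) filings table, skipping duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS filings "
        "(title TEXT, link TEXT PRIMARY KEY, published TEXT)"
    )
    conn.executemany("INSERT OR IGNORE INTO filings VALUES (?, ?, ?)", rows)
    conn.commit()


def find_filings(conn, keyword):
    """Answer a keyword query with plain SQL."""
    cur = conn.execute(
        "SELECT title, link FROM filings WHERE title LIKE ? ORDER BY title",
        (f"%{keyword}%",),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
load_filings(conn, parse_feed(SAMPLE_FEED))
print(find_filings(conn, "10-K"))
```

An in-memory SQLite database keeps the sketch self-contained; the same three steps would apply to any relational store.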
The news gets even better when we add SPARQL, a W3C-specified query language for RDF, to XBRL and Linked Data.
Jim Rapoza, Chief Technology Analyst of eWeek, explains:
Called SPARQL (pronounced "sparkle"), this standard brings about a standardized SQL-like query language for the Semantic Web. And, like most Semantic Web standards, it is heavily based on RDF (Resource Description Framework), although it also makes use of many Web services standards, such as WSDL (Web Services Description Language).
SPARQL essentially consists of a standard query language, a data access protocol and a data model (which is basically RDF).
Some people out there are probably thinking, So what? Sounds like just another search tool—big whoop. But there’s a big difference between blindly searching the entire Web and querying actual data models.
The ability of database queries to pull data from giant databases is pretty much the basis of a large number of enterprise applications. No one argues about the value of being able to write a query in an application that can pull relevant customer and product data.
Now, imagine writing a similarly small application that does the same thing—only with data stored across the entire World Wide Web.
That would include all the companies who not only file in XBRL but also, in conformance to SEC requirements, will be posting XBRL data on their own company websites.
In essence, with SPARQL, we can choose to build centralized databases to query XBRL data, but we don’t have to. We can simply point our queries to so-called SPARQL endpoints. Unlike traditional database requests, which must stay under one administrative control, these queries can span thousands of company websites with XBRL data across the Web and return results as if they came from one centralized database. Imagine the cost savings in not having to build and maintain a huge and growing centralized database.
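To make the idea of pointing a query at an endpoint more concrete, here is a rough sketch in Python using only the standard library. The endpoint URLs and the `ex:` vocabulary are invented for illustration and do not refer to any real service. The sketch relies on two things defined by the W3C specifications: the SPARQL protocol's `query` parameter on an HTTP GET request, and the `SERVICE` keyword from SPARQL 1.1 Federated Query, which asks one endpoint to fetch part of the answer from another.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoints and vocabulary; no real service is implied.
ENDPOINT = "http://example.org/sparql"
REMOTE_ENDPOINT = "http://example.com/company/sparql"

# SERVICE delegates the inner pattern to a second endpoint.
QUERY = """
PREFIX ex: <http://example.org/xbrl#>
SELECT ?company ?revenue
WHERE {
  ?filing ex:company ?company .
  SERVICE <%s> {
    ?filing ex:revenue ?revenue .
  }
}
""" % REMOTE_ENDPOINT


def build_request(endpoint, query):
    """Build a SPARQL protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(endpoint, query, timeout=30):
    """Send the query and return the list of result bindings.

    Needs a live endpoint, so it is defined here but not called.
    """
    with urllib.request.urlopen(build_request(endpoint, query), timeout=timeout) as resp:
        data = json.load(resp)
    return data["results"]["bindings"]


request = build_request(ENDPOINT, QUERY)
print(request.full_url)
```

Calling `run_query(ENDPOINT, QUERY)` against an endpoint that supports federated queries would return one binding per matching filing, without the caller maintaining any database of its own.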
Applications for publishing XBRL as Linked Open Data are limited at this time, but they are emerging. As one example, Roberto García and Rosa Gil describe work undertaken by a research group at the Universitat de Lleida, Spain, which extracted 1.34 million triples from 612 XBRL filings. (Triples are semantic data elements in RDF format.) The extraction process is automated and transforms XBRL data into Semantic Web formatted RDF data.
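For readers unfamiliar with triples, the following Python sketch shows how a single XBRL fact might be written out as RDF statements in N-Triples syntax, each consisting of a subject, a predicate, and an object. It is not the method used by the Lleida group: the namespaces, property names, and sample fact are assumptions chosen purely for illustration.

```python
# Hypothetical namespaces; the cited project's actual vocabulary is not given here.
FILING_NS = "http://example.org/filing/"
XBRL_NS = "http://example.org/xbrl#"
XSD_NS = "http://www.w3.org/2001/XMLSchema#"


def fact_to_triples(filing_id, fact_id, concept, value, unit, period_end):
    """Return N-Triples lines describing one reported fact."""
    subject = f"<{FILING_NS}{filing_id}/fact/{fact_id}>"
    return [
        # Which taxonomy concept the fact reports.
        f"{subject} <{XBRL_NS}concept> <{XBRL_NS}{concept}> .",
        # The reported number, typed as a decimal literal.
        f'{subject} <{XBRL_NS}value> "{value}"^^<{XSD_NS}decimal> .',
        # The unit of measure as a plain literal.
        f'{subject} <{XBRL_NS}unit> "{unit}" .',
        # The end of the reporting period, typed as a date literal.
        f'{subject} <{XBRL_NS}periodEnd> "{period_end}"^^<{XSD_NS}date> .',
    ]


# Made-up fact: revenues of 1,000,000 USD for the period ending 2009-12-31.
triples = fact_to_triples(
    "example-corp-10k", "f1", "Revenues", "1000000", "USD", "2009-12-31"
)
for line in triples:
    print(line)
```

One fact yields four triples in this sketch, which gives a feel for how a few hundred filings can expand into more than a million triples.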
In addition, sufficient examples in the current Web exist to give us insight into how the user experience might look when Semantic XBRL applications go into production use. Next time you search for the best flight for your air travel on sites such as Orbitz, Kayak, or FareCompare, take a pause and observe that the flight schedules, prices and airline details are being pulled not from any one centralized database but from a variety of airline databases, in real time, to match your exact itinerary requirements, thanks to some very specialized and complex technologies.
In summary, SPARQL makes Semantic XBRL searches possible on demand across a distributed web space while simplifying front-end design and keeping the complexity of the technology hidden from end users.
A Google-like experience of searchable financial research data is coming. The future looks bright.
Ashu Bhatnagar is CEO of Good Morning Research, a Softpark company that specializes in building Semantic XBRL technology. The GoodMorningResearch.com machine automates XBRL tagging of Excel data in RDF format with one-click Save As XBRL functionality. Mr. Bhatnagar also moderates the Semantic XBRL group on LinkedIn.


