How to parse SPARQL results?

I am using Twinkle (a SPARQL query tool). I ran a SPARQL query over an RDF file and got a results file like the one below. Since it doesn't seem to be a typical file format like CSV, do you know of a library that can parse this format? Any programming language is fine.
---------------------------------------------------------------------
| name |
=====================================================================
| "Egypt"^^<http://www.w3.org/2001/XMLSchema#string> |
| "Iraq"^^<http://www.w3.org/2001/XMLSchema#string> |
| "Jordan"^^<http://www.w3.org/2001/XMLSchema#string> |
| "Kuwait"^^<http://www.w3.org/2001/XMLSchema#string> |
| "Libya"^^<http://www.w3.org/2001/XMLSchema#string> |
| "Mauritania"^^<http://www.w3.org/2001/XMLSchema#string> |
| "Somalia"^^<http://www.w3.org/2001/XMLSchema#string> |
| "Sudan"^^<http://www.w3.org/2001/XMLSchema#string> |
| "Syrian Arab Republic"^^<http://www.w3.org/2001/XMLSchema#string> |
| "Tunisia"^^<http://www.w3.org/2001/XMLSchema#string> |
| "United Arab Emirates"^^<http://www.w3.org/2001/XMLSchema#string> |
| "Yemen"^^<http://www.w3.org/2001/XMLSchema#string> |
---------------------------------------------------------------------

That's not a standard format, so you'd have to write a parser for it by hand; it looks like the default CLI output of a database's query command (which one, I wonder?).
The CLI query command probably has an option to emit one of the standard SPARQL results formats, such as SPARQL/XML or SPARQL/JSON. You can then parse the results with any standard RDF library, such as Jena or Sesame if you are working in Java. That is the best way to accomplish what you're attempting.
Generally, you should not interface programmatically with CLI output; use the APIs provided with the database instead.
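As a rough illustration of the advice above, here is a minimal Python sketch that reads a SPARQL/JSON result using only the standard library. The sample document is hypothetical (two rows made up to mirror the question's output) and the function name parse_sparql_json is my own; the layout follows the standard SPARQL JSON results structure ("head"/"vars" and "results"/"bindings").

```python
import json

# Hypothetical sample in the SPARQL JSON results layout; a real file would
# come from the query tool's JSON output option.
sample = """
{
  "head": {"vars": ["name"]},
  "results": {"bindings": [
    {"name": {"type": "literal",
              "datatype": "http://www.w3.org/2001/XMLSchema#string",
              "value": "Egypt"}},
    {"name": {"type": "literal",
              "datatype": "http://www.w3.org/2001/XMLSchema#string",
              "value": "Iraq"}}
  ]}
}
"""


def parse_sparql_json(text):
    """Return one dict per result row, mapping variable name to its value."""
    doc = json.loads(text)
    variables = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        # A variable may be unbound in a given row, so skip missing keys.
        rows.append({var: binding[var]["value"]
                     for var in variables if var in binding})
    return rows


rows = parse_sparql_json(sample)
for row in rows:
    print(row["name"])
```

The datatype is kept in the JSON next to each value, so nothing has to be stripped from the string itself.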

That looks like it could be Jena output.
The ResultSetFormatter class contains ways to format results in all the standard formats (XML, JSON, TSV, CSV) as well as this display format in text.
ResultSetFormatter.outputAsXML
ResultSetFormatter.outputAsJSON
ResultSetFormatter.outputAsTSV
ResultSetFormatter.outputAsCSV
The text format is not for parsing - more for simple display and debugging.
The command line has args to set the results format e.g. --results json
And the query form in Fuseki allows you to choose the output format.

What you see is a typed RDF literal. The URI http://www.w3.org/2001/XMLSchema#string is the XSD datatype "string", saying that your value is just a string (it could be an "int", etc.). If you just want the value, you can omit the URI after "^^" or use the STR function in your SPARQL query.
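If all you have is the text table from the question, dropping the part after "^^" can be sketched in Python like this. This is only a sketch under assumptions: it handles a single-column table, it does not handle escaped quotes inside a value, and the names LITERAL_ROW and parse_row are my own. The text display format is not designed to be parsed, so treat this as a stopgap.

```python
import re

# One table row: | "lexical value"^^<datatype URI> |
# The datatype part is optional so plain literals also match.
LITERAL_ROW = re.compile(
    r'^\|\s*"(?P<value>.*)"(?:\^\^<(?P<datatype>[^>]*)>)?\s*\|$'
)


def parse_row(line):
    """Return (value, datatype) for a literal row, or None for other lines."""
    match = LITERAL_ROW.match(line.strip())
    if match is None:
        return None  # header, separator, or anything unexpected
    return match.group("value"), match.group("datatype")


lines = [
    "--------------------",
    "| name |",
    "====================",
    '| "Egypt"^^<http://www.w3.org/2001/XMLSchema#string> |',
    '| "Syrian Arab Republic"^^<http://www.w3.org/2001/XMLSchema#string> |',
    "--------------------",
]

# Keep only the lexical values of the rows that parsed.
names = [parsed[0] for parsed in map(parse_row, lines) if parsed is not None]
print(names)
```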

Related

How to add the vertical pipes we see in Examples feature of the Scenario Outline in SpecFlow

I would like to add the vertical pipes so that I can put my data table under the "Examples" section in SpecFlow. Can anyone give me a tip on how to do it? My scenario outline looks like:
#mytag
Scenario Outline: Check Noun Words
Given Access the AnalyzeStateless URL
And language code
And content of <Sentence>
And the Expected KeyWord <Expected KeyWords>
And the Expected Family ID <Expected FID>
And the index <Index>
When return the XML response
Then the keyword should contain <FamilyID>
Examples:
| Index | Sentence | Expected KeyWords | Expected FID |
| 1 | I need a personal credit card | personal | 92289 |
The "Examples" table was entered manually in the case above. I have about a thousand rows in an Excel file; is there an appropriate way to bring in all of the values in one go?
Have you looked at Specflow.Excel, which allows you to keep your examples in your Excel files?

query for passive hosts to be removed?

Can someone please help me remove passive hosts in Splunk? The query I am using is:
| metadata type=hosts
| sort recentTime
| convert ctime(recentTime) as Latest
You should compare recentTime with the current time, work out the difference, and compare that difference with a threshold to identify those hosts.
Example query:
| metadata type=hosts | eval diff=now()-recentTime | eval threshold=3600 | where diff>threshold
Note: query not tested but you should get the idea

How do you find the list of wikidata (or freebase or DBpedia) topics that a text is about?

I am looking for a solution to extract the list of concepts that a text (or html) document is about. I'd like the concepts to be wikidata topics (or freebase or DBpedia).
For example "Bad is a song by Mikael Jackson" should return Michael Jackson (the artist, wikidata Q2831) and Bad (the song, wikidata Q275422). As this example shows, the system should be robust to spelling mistakes (Mikael) and ambiguity (Bad).
Ideally the system should work across multiple languages, it should work both on short texts and long texts, and when it is unsure it should return multiple topics (eg. Bad song + Bad album). Also, it should ideally be open source and have a python API.
Yes, that sounds like a list for Santa Claus. Any ideas?
Edit
I checked out a few solutions, but no silver bullet so far.
NLTK parses text and extracts "named entities" (AFAIU, a part of a sentence that refers to a name), but it does not return Wikidata topics, just plain text. This means that it will likely not understand that "I shot the sheriff" is the name of a song by Bob Marley; it will instead treat this as a sentence.
OpenNLP does roughly the same.
Wikidata has a search API, but it's just one term at a time, and it does not handle disambiguation.
There are a few commercial services (OpenCalais, AlchemyAPI, CogitoAPI...) but none really shines, IMHO.
You can use spaCy to retrieve named entities, then link them to Wikidata using the search API.
For the rest of the sentence that spaCy does not match as a named entity, you can build a list of n-grams from the sentence and, starting with the longest n-gram, use the Wikidata search API to look up Wikidata topics.
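The n-gram step above could be sketched in Python as follows. The function names are my own, splitting on whitespace is a simplifying assumption (a real tokenizer would handle punctuation), and the sketch only builds the lookup URL for Wikidata's wbsearchentities action rather than sending a request.

```python
from urllib.parse import urlencode


def ngrams_longest_first(sentence):
    """List every word n-gram of the sentence, longest ones first."""
    words = sentence.split()  # assumption: plain whitespace tokenization
    result = []
    for size in range(len(words), 0, -1):
        for start in range(len(words) - size + 1):
            result.append(" ".join(words[start:start + size]))
    return result


def wikidata_search_url(term, language="en"):
    """Build (but do not send) a Wikidata entity search request URL."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "format": "json",
    }
    return "https://www.wikidata.org/w/api.php?" + urlencode(params)


candidates = ngrams_longest_first("I shot the sheriff")
for candidate in candidates[:3]:
    print(candidate, "->", wikidata_search_url(candidate))
```

Trying the longest n-gram first means a full title such as "I shot the sheriff" gets a chance to match before its individual words do.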
POS tagging can be put to good use; that said, syntactic parse information is more powerful, since it tells you the relations between the words. For instance, given the following output from link-grammar:
Found 8 linkages (8 had no P.P. violations)
Linkage 1, cost vector = (UNUSED=0 DIS= 0.15 LEN=9)
+-------------------------Xp-------------------------+
+----------->WV---------->+ |
+-------Wd------+ +---------Osn--------+ |
| +---G---+----Ss---+----Os----+ | |
| | | | | | |
LEFT-WALL Bob.m Marley[!] wrote.v-d Natural[!] Mystic[!] .
You can tell that the subject is “Bob Marley” because:
“wrote” is connected to “Marley” with an S link, which connects subject nouns to finite verbs.
“Marley” is connected to “Bob” with a G link, which connects proper nouns together.
So “Bob Marley” is a good candidate for an entity (both words are also capitalized).
Given the above parse "tree", it is difficult to tell whether “Natural” and “Mystic” are related, even though they are on the same side of the sentence.
The second parse provided by link-grammar has the same cost vector and links “Natural” and “Mystic” together, again with a G.
Here it is:
Linkage 2, cost vector = (UNUSED=0 DIS= 0.15 LEN=9)
+-------------------------Xp-------------------------+
+----------->WV---------->+ |
+-------Wd------+ +---------Os---------+ |
| +---G---+----Ss---+ +----G----+ |
| | | | | | |
LEFT-WALL Bob.m Marley[!] wrote.v-d Natural[!] Mystic[!] .
So in my opinion “Bob Marley” and “Natural Mystic” are good candidates for a Wikidata search.
That was the easy case, where grammar and spelling are correct.
Here is one parse out of 11 of the same sentence in lower case:
Linkage 1, cost vector = (UNUSED=1 DIS= 0.15 LEN=14)
+------------------------Xp------------------------+
+----------------------Wa---------------------+ |
| +------------------AN-----------------+ |
| | +-------------AN-------------+ |
| | | +----AN---+ |
| | | | | |
LEFT-WALL Bob.m marley[?].n [wrote] natural.n mystic.n .
LG doesn't even recognize the verb.

Splitting examples in Given and Then for SpecFlow Scenario Outline

I am writing a SpecFlow scenario with multiple input and output parameters (about 4-5 each). When using a scenario outline, I need to write a wide table with both input and output columns in the same row. Is there a way to specify the examples separately for the step definitions? This is for improved readability.
Current state
Given - State of the data
When I trigger action with parameters <input1> and <input2> and ...
Then my output should contain <output1> and <output2> ...
Examples:
| input1 | input2 |... | output1 | output2 |...
Can I do this?
Given - State of the data
When I trigger action with parameters <input1> and <input2> and ...
Examples of input
Then my output should contain <output1> and <output2> ...
Examples of output
No, unfortunately that (or anything similar) is not possible.
You could make your inputs and outputs more abstract and possibly merge a few columns. Example: instead of Country | PostalCode | City | Street | House | Firstname | Lastname | etc. you should have | Address | Job title | with values like "EU", "US, missing postal code", "HQ" for the address.
You can't have multiple Example tables for scenario outline but you can pass in data tables for regular scenarios.
The data table will be accessible only to the step that uses it, however you could save it in Scenario Context for subsequent steps.
Not sure if this will work for you if your scenario is complex and spans multiple lines but I thought I'd mention it.
Scenario: Checking outputs for inputs
Given - State of the data
When I trigger action with the following parameters
| input1 | input2 | input3 |
| data | data | data |
Then my output should contain the following outputs
| output1 | output2 | output3 |
| data | data | data |

YQL | why I'm getting a syntax error?

I'm trying to run a simple Google search through YQL, but it doesn't seem to be working. Here's my exact query:
http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%27https%3A%2F%2Fwww.google.com/search?q=Google+Guice&ie=utf-8%27%0A&format=json
And the error is
{"error":{"lang":"en-US","description":"Query syntax error(s) [line 1:74 mismatched character ' ' expecting ''']"}}
The error is pointing to line 1:74 which is near 20where. This is also the encoded version of the URL, and it is difficult for me to exactly understand where the error is.
Here's your url:
http://query.yahooapis.com/v1/public/yql?
q=select%20*%20from%20html%20where%20url%3D%27https%3A%2F%2F
www.google.com/search?q=Google+Guice&ie=utf-8%27%0A&format=json
The URL query parts are separated into the following (separated by &):
+--------+---------------------------------------------------+
| q | select%20*%20from%20html%20where%20url%3D%27https |
| | %3A%2F%2Fwww.google.com/search?q=Google+Guice |
+--------+---------------------------------------------------+
| ie | utf-8%27%0A |
+--------+---------------------------------------------------+
| format | json |
+--------+---------------------------------------------------+
As you can see, YQL is not receiving the full query string as you wanted it to. This is because the & character that should be part of the query string has not been url-encoded to %26.
The URL should look like …Guice%26ie=utf….
Aside: There are a few other issues that you are going to face. The first is that the Google search URL embedded into the query is malformed since it will contain a literal space character between Google and Guice, which Google does not accept. Secondly, the URL is restricted by Google's robots.txt so even if the URL is fixed, you won't be able to get any results from there.
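To see the split described above and one way to avoid it, here is a small Python sketch using only urllib.parse. The variable names are my own, and writing the inner space as %20 inside the Google URL is an assumption made to sidestep the malformed-URL issue from the aside; it does nothing about the robots.txt restriction.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

# The URL from the question, unchanged.
broken_url = (
    "http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html"
    "%20where%20url%3D%27https%3A%2F%2Fwww.google.com/search?q=Google+Guice"
    "&ie=utf-8%27%0A&format=json"
)

# The query string is split on "&", so the q parameter is cut short.
broken_params = parse_qs(urlsplit(broken_url).query)
print(sorted(broken_params))   # three parameters: format, ie, q
print(broken_params["q"][0])   # the statement stops before the closing quote

# Write the statement as plain text, then encode it as one single value.
# Assumption: the space in the Google URL is written as %20.
statement = (
    "select * from html where "
    "url='https://www.google.com/search?q=Google%20Guice&ie=utf-8'"
)
fixed_url = "http://query.yahooapis.com/v1/public/yql?" + urlencode(
    {"q": statement, "format": "json"}, quote_via=quote
)

# Decoding the fixed URL gives back the whole statement in q.
fixed_params = parse_qs(urlsplit(fixed_url).query)
print(fixed_params["q"][0] == statement)
```

Letting a library encode each parameter value is less error-prone than hand-editing percent escapes, since every reserved character inside the value (including &) gets escaped.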
