XML Schema - Allow Invalid Dates - xml-parsing

Hi, I am using BizTalk's FlatFile parser (using an XML schema) to parse a CSV file. The CSV file sometimes contains an invalid date, 1/1/1900. Currently the schema validation for the flat file fails because of the invalid date. Is there any setting that I can use to allow the date to be used?
I don't want to read the date as a string, though I might be forced to if there is no other way.

You could change it to a valid XML date time (e.g., 1900-01-01T00:00:00Z) using a custom pipeline component (see examples here). Or you can just treat it as a string in your schema and deal with converting it later in a map, in an orchestration, or in a downstream system.
Here is a C# snippet that you could put into a scripting functoid inside a BizTalk map to convert the string to an xs:dateTime, though you'll need to do some more work if you want to handle the potential for bad input data:
public string ConvertStringDateToDateTime(string inputDate)
{
    // "s" is the sortable pattern (e.g. 1900-01-01T00:00:00), which is a valid xs:dateTime.
    return DateTime.Parse(inputDate).ToString("s", System.Globalization.DateTimeFormatInfo.InvariantInfo);
}
Also see this blog post if you're looking to do that in multiple places in a single map.
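The "more work" for bad input mentioned above could look roughly like the sketch below. It is written in Python purely for illustration (a scripting functoid would need the equivalent C#), the function name to_xs_datetime is hypothetical, and it assumes the CSV dates are in month/day/year order, which the question does not state.

```python
from datetime import datetime


def to_xs_datetime(raw, fmt="%m/%d/%Y"):
    """Return an xs:dateTime-style string, or None if the input is not a date.

    The month/day/year format is an assumption about the CSV data.
    """
    try:
        return datetime.strptime(raw.strip(), fmt).isoformat()
    except ValueError:
        # Unparseable input: let the caller decide (default value, reject row, ...).
        return None


print(to_xs_datetime("1/1/1900"))
print(to_xs_datetime("not a date"))
```

Returning None rather than raising keeps the decision about bad rows with the caller, which matches the suggestion to deal with conversion later in a map or orchestration.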

Related

Use Annotation tool configuration / Automatic annotation service from brat

I'd like to use a personal API for named entity recognition (NER) and use brat for visualisation. It seems brat offers an automatic annotation tool, but documentation about its configuration is sparse.
Are there any working examples of this feature?
Could someone explain what the format of the API response should be?
I finally managed to understand how it works, thanks to this topic on the brat-users Google Group mailing list:
https://groups.google.com/g/brat-users/c/shX1T2hqzgI
The text is sent to the automatic annotator API as a byte string in the body of a POST request, and the response BRAT requires from this API is a dictionary of dictionaries, namely:
{
    "T1": {
        "type": "WhatEverYouWantString", # must be defined in the annotation.conf file
        "offsets": [(0, 2), (10, 12)], # list of tuples of integers that correspond to the start and end position of each span in the text
        "texts": ["to", "go"]
    },
    "T2": {
        "type": "SomeString",
        "offsets": [(start1, stop1), (start2, stop2), ...],
        "texts": ["string[start1:stop1]", "string[start2:stop2]", ...]
    },
    "T3": ....
}
Then you serialize this dictionary to JSON and send it back to BRAT.
Notes:
"T1", "T2", ... are mandatory keys (they correspond to the term index in the .ann file that BRAT generates during manual annotation)
the keys "type", "offsets" and "texts" are mandatory; otherwise you get errors in the BRAT log (you can consult this log as explained in the Google Group thread linked above)
the value formats are strict ("type" takes a string, "offsets" takes a list of tuples (or lists) of integers, "texts" takes a list of strings); otherwise you get BRAT errors
I suppose that the strings in "texts" must correspond to the "offsets"; otherwise there should be an error, or at least a problem with the display of tags (this is already the case if you generate the .ann files from an automatic detection algorithm and the start and stop positions differ from the associated text)
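As a minimal sketch of the format described above, the response body can be built with the standard library alone. The helper name build_brat_response and the "Person" type are hypothetical (the type would have to exist in your annotation.conf), and the web framework layer is deliberately left out. Offsets are emitted as lists because JSON has no tuple type.

```python
import json


def build_brat_response(text, entities):
    """Build the JSON body described above.

    `entities` is a list of (type, [(start, end), ...]) pairs; the "texts"
    values are sliced from `text` so they always agree with the offsets.
    """
    response = {}
    for index, (entity_type, spans) in enumerate(entities, start=1):
        response[f"T{index}"] = {
            "type": entity_type,
            "offsets": [[start, end] for start, end in spans],
            "texts": [text[start:end] for start, end in spans],
        }
    return json.dumps(response)


body = build_brat_response(
    "Alice met Bob",
    [("Person", [(0, 5)]), ("Person", [(10, 13)])],
)
print(body)
```

Deriving "texts" from the offsets, rather than passing both in, avoids the mismatch between spans and text mentioned in the last note.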
I hope it helps. I managed to build the API using Flask this morning, but I needed to construct a flask.Response object to get the correct output format. Also, the incoming data from BRAT could not be read in the Flask API until I used the flask.request object with its request.get_data() method.
Also, I have to mention that I was not able to use the examples given in the BRAT GitHub repository:
https://github.com/nlplab/brat/blob/master/tools/tokenservice.py
https://github.com/nlplab/brat/blob/master/tools/randomtaggerservice.py
I mean I could not get them working, but I'm not at all familiar with API and HTTP packages in Python. At least I figured out the correct format for the API response.
Finally, I have no idea how to format relations among entities (i.e. BRAT arrows) in the API response, though
https://github.com/nlplab/brat/blob/master/tools/restoataggerservice.py
seems to work with such things.
The GoogleGroup discussion
https://groups.google.com/g/brat-users/c/lzmd2Nyyezw/m/CMe9FenZAAAJ
seems to mention that it is not possible to send relations between entities back from the Automatic Annotation API and make them work with BRAT.
I may try it later :-)

Convert JSON to XML format through Azure Logic app

Scenario 1 - I have some XML files stored on an FTP server. Those files are fetched by the FTP connector in an Azure Logic App. I then read the files by parsing them into JSON and store those objects in string variables for my operation. After my processing, I want to convert that JSON back to XML for the output.
Scenario 2 - I am merging multiple XML files (all with the same structure) into a single one. After merging I can get the output in JSON format, but then I want to convert it into XML format.
So please suggest how I can convert JSON to XML using only a Logic App and an Azure Function.
Try the 'xml' function.
screenshot of xml function example in Logic App
Make sure that your JSON input is structured suitably for conversion to XML. For example, you should have only a single element at the top level, which will form your XML root element.

Is a url query parameter valid if it has no value?

Is a url like http://example.com/foo?bar valid?
I'm looking for a link to something official that says one way or the other. A simple yes/no answer or anecdotal evidence won't cut it.
Valid according to the URI RFC
Likely acceptable to your server-side framework/code
The URI RFC doesn't mandate a format for the query string. Although it is recognized that the query string will often carry name-value pairs, it is not required to (e.g. it will often contain another URI).
3.4. Query
The query component contains non-hierarchical data that, along with
data in the path component (Section 3.3), serves to identify a
resource within the scope of the URI's scheme and naming authority
(if any). ...
... However, as query components
are often used to carry identifying information in the form of
"key=value" pairs and one frequently used value is a reference to
another URI, ...
HTML establishes that a form submitted via HTTP GET should encode the form values as name-value pairs in the form "?key1=value1&key2=value2..." (properly encoded). Parsing of the query string is up to the server-side code (e.g. Java servlet engine).
You don't identify what server-side framework you use, if any, but it is possible that your server-side framework assumes the query string will always be in name-value pairs and chokes on a query string that is not in that format (e.g. ?bar). If it's your own custom code parsing the query string, you simply have to ensure you handle that query string format. If it's a framework, you'll need to consult your documentation or simply test it to see how it is handled.
They're perfectly valid. You could consider them to be the equivalent of the big muscled guy standing silently behind the mob messenger. The guy doesn't have a name and doesn't speak, but his mere presence conveys information.
"The "http" scheme is used to locate network resources via the HTTP protocol. This section defines the scheme-specific syntax and semantics for http URLs." http://www.w3.org/Protocols/rfc2616/rfc2616.html
http_URL = "http:" "//" host [ ":" port ] [ abs_path [ "?" query ]]
So yes, anything is valid after a question mark. Your server may interpret it differently, but anecdotally, some languages treat it as a boolean value that is true if the parameter is listed.
Yes, it is valid.
If one simply wants to check whether the parameter exists or not, this is one way to do so.
URI Spec
The only relevant part of the URI spec is to know everything between the first ? and the first # fits the spec's definition of a query. It can include any characters such as [:/.?]. This means that a query string such as ?bar, or ?ten+green+apples is valid.
Find the RFC 3986 here
HTML Spec
isindex is not a meaningful part of HTML5.
It is provided, deprecated, for use only as the first element in a form, and is submitted without a name.
If the entry's name is "isindex", its type is "text", and this is the first entry in the form data set, then append the value to result and skip the rest of the substeps for this entry, moving on to the next entry, if any, or the next step in the overall algorithm otherwise.
The isindex flag is for legacy use only. Forms in conforming HTML documents will not generate payloads that need to be decoded with this flag set.
The last time isindex was supported was HTML 3. Its use in HTML5 is to provide easier backwards compatibility.
Support in libraries
Support in libraries for this format of URI varies; however, some libraries do provide legacy support to ease the use of isindex.
Perl URI.pm (special support)
Some libraries, like Perl's URI, provide methods for parsing these kinds of structures:
$uri->query_keywords
$uri->query_keywords( $keywords, ... )
$uri->query_keywords( \@keywords )
Sets and returns query components that use the keywords separated by "+" format.
Node.js url (no special support)
As another, far more common example, Node.js takes the normal route and exposes the query as either
A string
or an object of keys and values (using parseQueryString)
Most other URI-parsing APIs follow something similar to this.
PHP's parse_url follows a similar implementation but only returns the string for the query. Parsing it into an array of k=>v requires parse_str().
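Python's standard library behaves much like the "no special support" cases above, and illustrates why testing your own stack matters. In this sketch the helper name has_query_flag is hypothetical; urlsplit and parse_qs are the real urllib.parse functions. By default parse_qs silently drops a valueless parameter, so keep_blank_values=True is needed to see it.

```python
from urllib.parse import parse_qs, urlsplit


def has_query_flag(url, name):
    """Return True if `name` appears in the query, with or without a value."""
    query = urlsplit(url).query
    # keep_blank_values=True keeps "bar" (no "=") as {"bar": [""]};
    # without it, parse_qs drops the parameter entirely.
    params = parse_qs(query, keep_blank_values=True)
    return name in params


print(urlsplit("http://example.com/foo?bar").query)
print(parse_qs("bar"))
print(parse_qs("bar", keep_blank_values=True))
print(has_query_flag("http://example.com/foo?bar", "bar"))
```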
It is valid: see Wikipedia, RFC 1738 (3.3. HTTP), RFC 3986 (3. Syntax Components).
isindex deprecated magic name from HTML5
This deprecated feature allows a form submission to generate such a URL, providing further evidence that it is valid for HTML. E.g.:
<form action="#isindex" class="border" id="isindex" method="get">
<input type="text" name="isindex" value="bar"/>
<button type="submit">Submit</button>
</form>
generates a URL of the form:
?bar
Standard: https://www.w3.org/TR/html5/forms.html#naming-form-controls:-the-name-attribute
isindex is however deprecated as mentioned at: https://stackoverflow.com/a/41689431/895245
As all the other answers describe, it's perfectly valid, especially for checking boolean-style flags.
Here is a simple function to get a query string parameter by name:
function getParameterByName(name, url) {
    if (!url) {
        url = window.location.href;
    }
    name = name.replace(/[\[\]]/g, "\\$&");
    var regex = new RegExp("[?&]" + name + "(=([^&#]*)|&|#|$)"),
        results = regex.exec(url);
    if (!results) return null;
    if (!results[2]) return '';
    return decodeURIComponent(results[2].replace(/\+/g, " "));
}
Now, if you want to check whether the query string parameter you are looking for exists, you can simply do:
var exampleQueryString = (getParameterByName('exampleQueryString') != null);
exampleQueryString will be false if the function can't find the parameter, and true otherwise.
The correct resource for this is RFC 6570. Please refer to section 3.2.9, where the examples present an empty parameter as shown below.
Example Template Expansion
{&x,y,empty} &x=1024&y=768&empty=

mvc.net DateTime with Time part in URI

I have a set of actions that return time-series data within ranges that can be specified to the minute.
They work fine with querystrings,
i.e.
/mycontroller/myaction?from=20091201 10:31&to=20091202 10:34
with or without URL encoded colons, but I thought it would be nice to have a pretty URL
/mycontroller/myaction/from-20091201 10:31/to-20091202 10:34
but this now strikes fear into the heart of IIS, as it doesn't like colons in the URI, so I get 'Bad Request' responses.
My question then, is what's a recommended/standard course of action to ensure I can keep the time in there?
Do I need to write a custom ModelBinder to parse my own datetime format? Should the actions just take strings for from and to and parse them with a custom format, e.g. "YYYYMMDD-HHmm"? Can I specify a custom format somewhere? If so, where? Or should I just give this up as folly and stick with querystring parameters?
Oh, and I see a lot of people go on about RESTful URLs; from what I've read there's nothing that says query strings aren't RESTful - it's more about appropriate use of existing HTTP action types.
You're right: a URL doesn't stop being RESTful just because it isn't expressed as a folder structure.
The path structure is there to describe the resource. Querystrings can still be used to describe a filtered subset of such a resource. A date range fully qualifies as a filter criteria and should thus be perfectly RESTful being passed in as a querystring.
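If the path-style URL were kept anyway, the colon-free format floated in the question ("YYYYMMDD-HHmm") would sidestep the IIS problem. The sketch below only illustrates that round trip; it is written in Python rather than as an ASP.NET MVC model binder, and the function names are hypothetical.

```python
from datetime import datetime

# Colon-free format from the question, expressed as a strptime pattern.
URL_DATETIME_FORMAT = "%Y%m%d-%H%M"


def parse_url_datetime(segment):
    """Parse a path segment such as '20091201-1031' into a datetime."""
    return datetime.strptime(segment, URL_DATETIME_FORMAT)


def format_url_datetime(value):
    """Format a datetime so that it is safe to place in a URL path."""
    return value.strftime(URL_DATETIME_FORMAT)


start = parse_url_datetime("20091201-1031")
print(start.isoformat())
print(format_url_datetime(start))
```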

how to obtain URLs from Dmoz ODP

I want to use a database of the URLs present in DMOZ ODP for my application (an array of URL strings or a file containing the same). Is there any way of obtaining it, other than manual copy-paste?
EDIT :
Is there any script or code to parse the RDF file?
Take a look at http://rdf.dmoz.org/, you'll need to find a way to parse the RDF into your database.
I did this the other day using the odp2db scripts from Steve's Software. They're old, but the format hasn't changed significantly so they work fine.
I found I didn't need to do the iconv and xmlclean.pl steps suggested in the readme, just uncompressed the dumps and ran the structure2db.pl and content2db.pl scripts. You'll need to create the database tables manually (see the SQL at top of script for that) and modify the connection details in the scripts before you start.
With the mid-January 2009 dump I used, there are 756,962 categories and 4,436,796 websites. It took a while to run through them all, but not excessively long, though I did dispense with the site descriptions as I didn't need them. Also, it may be worth adding database indices after creating the tables to speed up access later. The raw structure and content files were 75MB and 300MB compressed, and 848MB and 2GB uncompressed, respectively.
I've actually done this in Java. I just used the SAX API to read through the RDF files. It was pretty straightforward. In my case I wanted to pull out every URL that was in a topic with "Weblogs" in the topic name.
Basically, what I did was implement an org.xml.sax.helpers.DefaultHandler.
Then, to set up the code, you do:
InputSource is = new InputSource(new FileInputStream("filename.rdf"));
XMLReader r = XMLReaderFactory.createXMLReader();
r.setContentHandler(new MyHandlerClass());
r.parse(is);
and that's pretty much it. In my handler class I had to implement:
startElement(String uri, String localName, String qName, Attributes attributes), where I had an if statement to see if it was an "ExternalPage" tag, in which case I went to another state to look for "topic", "Title" and "Description".
characters(char[] ch, int start, int length), where I read in the topic, title, and description text depending on which one had been most recently sent to startElement.
endElement(String uri, String localName, String qName), where I checked to see which element was ending, and if it was ExternalPage, that meant the end of the current element.
The whole thing was 80-90 lines of code for the basic parsing, so pretty easy to write. It was able to chew through the multi-gigabyte files in... I don't remember, maybe a minute or two? If you just want to query out some specific data, it might be easier to write the code to do that in your handler, rather than trying to load it into a DB.
If you find a tool that works well, that's obviously better than writing your own code. But writing your own code isn't hard! RDF is just an XML format, and it's not nested or anything. A simple SAX parser is easily doable in a day or so.
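The handler described above can be sketched with Python's built-in xml.sax module. This is not the answerer's Java code: the class and function names are hypothetical, the sample document is made up, and reading the URL from an "about" attribute on ExternalPage is an assumption about the dump format that the answer does not confirm.

```python
import xml.sax


class ExternalPageHandler(xml.sax.ContentHandler):
    """Collect URLs of ExternalPage entries whose topic contains a keyword."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword
        self.urls = []
        self._url = None
        self._in_topic = False
        self._topic = []

    def startElement(self, name, attrs):
        if name == "ExternalPage":
            # Assumption: the page URL is stored in the "about" attribute.
            self._url = attrs.get("about")
            self._topic = []
        elif name == "topic" and self._url is not None:
            self._in_topic = True

    def characters(self, content):
        # characters() may be called several times per element, so accumulate.
        if self._in_topic:
            self._topic.append(content)

    def endElement(self, name):
        if name == "topic":
            self._in_topic = False
        elif name == "ExternalPage":
            if self._url and self.keyword in "".join(self._topic):
                self.urls.append(self._url)
            self._url = None


def urls_for_topic(rdf_bytes, keyword):
    handler = ExternalPageHandler(keyword)
    xml.sax.parseString(rdf_bytes, handler)
    return handler.urls


# Made-up sample in the shape the answer describes.
SAMPLE = b"""<RDF xmlns:d="http://purl.org/dc/elements/1.0/">
  <ExternalPage about="http://example.com/blog">
    <d:Title>Example blog</d:Title>
    <topic>Top/Computers/Internet/Weblogs</topic>
  </ExternalPage>
  <ExternalPage about="http://example.org/news">
    <d:Title>Example news</d:Title>
    <topic>Top/News</topic>
  </ExternalPage>
</RDF>"""

print(urls_for_topic(SAMPLE, "Weblogs"))
```

For the real multi-gigabyte dumps, xml.sax.parse with a file path would stream the input instead of loading it into memory first.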
You could always pay one of the corrupt editors there and they will help you out :)

Resources