What kinds of errors should be checked by a validator for biological file formats like GFF and FASTA?

I'm working on a project to create a library (in Java) that can validate various biological file formats like GFF, FASTA, OBO, etc.
But as I'm not from this field, I'm a little confused about what kind of validation the validator program should perform.
There are some online tools like Genome Tools that can validate the GFF format, so can anyone help me understand what kind of validation rules should be applied to each of these files?
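For FASTA, a reasonable starting point is record-level well-formedness: every record starts with a '>' header line, the header text is not empty, each header is followed by at least one sequence line, and sequence lines contain only characters from the expected alphabet. (For GFF3 the analogous checks would be nine tab-separated columns per feature line, numeric start/end with start <= end, a valid strand field, and so on.) Below is a minimal, illustrative sketch of such checks; the class name, the chosen alphabet, and the exact rule set are placeholders, not an established standard.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.Reader;
    import java.util.ArrayList;
    import java.util.List;

    // Illustrative only: a few record-level checks a FASTA validator might run.
    public class FastaChecks {

        // IUPAC nucleotide codes plus gap characters; a protein validator would use a different alphabet.
        private static final String ALLOWED = "ACGTURYKMSWBDHVN-.";

        public static List<String> validate(Reader input) throws IOException {
            List<String> problems = new ArrayList<>();
            BufferedReader reader = new BufferedReader(input);
            String line;
            int lineNo = 0;
            boolean inRecord = false;          // have we seen a '>' header yet?
            boolean recordHasSequence = false; // does the current record have sequence lines?

            while ((line = reader.readLine()) != null) {
                lineNo++;
                String trimmed = line.trim();
                if (trimmed.startsWith(">")) {
                    if (inRecord && !recordHasSequence) {
                        problems.add("line " + lineNo + ": previous record has a header but no sequence");
                    }
                    if (trimmed.substring(1).trim().isEmpty()) {
                        problems.add("line " + lineNo + ": empty description after '>'");
                    }
                    inRecord = true;
                    recordHasSequence = false;
                } else if (!trimmed.isEmpty()) {
                    if (!inRecord) {
                        problems.add("line " + lineNo + ": sequence data before any '>' header");
                    }
                    for (char c : trimmed.toUpperCase().toCharArray()) {
                        if (ALLOWED.indexOf(c) < 0) {
                            problems.add("line " + lineNo + ": unexpected character '" + c + "'");
                            break;
                        }
                    }
                    recordHasSequence = true;
                }
            }
            if (inRecord && !recordHasSequence) {
                problems.add("last record has a header but no sequence");
            }
            return problems;
        }
    }

A real library would additionally distinguish nucleotide from protein alphabets, report positions as structured objects rather than strings, and let callers decide which findings are errors versus warnings.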

Related

Saxon XSD or Schema parser in java

Is there any way we can parse a schema or XSD file using Saxon? I need to display all possible XPaths for a given XSD.
I found a way with org.apache.xerces but wanted to implement the logic in Saxon, as it supports XSLT 3.0 (we want to use the same library for XSLT-related functionality as well).
Thanks in advance.
Saxon-EE of course includes an XSD processor that parses schema documents. I think your question is not about the low-level process of parsing the documents, but about the higher-level process of querying the schemas once they have been parsed.
Saxon-EE offers several ways to access the components of a compiled schema programmatically.
You can export the compiled schema as an SCM file in XML format. This format isn't well documented but its structure corresponds very closely to the schema component model defined in the W3C specifications.
You can access the compiled schema from XPath using extension functions such as saxon:schema() and saxon:schema - see http://www.saxonica.com/documentation/index.html#!functions/saxon/schema
You can also access the schema at the Java level: the methods are documented in the Javadoc, but they are really designed for internal use, rather than for the convenience of this kind of application.
Of course, getting access to the compiled schema doesn't by itself solve your problem of displaying all valid paths. Firstly, the set of all valid paths is in general infinite (because types can be recursive, and because of wildcards). Secondly, features such as substitution groups and types derived by extension make it challenging even when the result is finite. But in principle, the information is there: from an element name with a global declaration, you can find its type, and from its type you can find the set of valid child elements, and so on recursively.
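To make that last paragraph concrete, here is the shape of that recursive walk, written against a deliberately made-up minimal interface rather than Saxon's real classes (Saxon's own schema API differs); the point is only to show the recursion and why a guard is needed, since recursive types and wildcards make the full path set unbounded.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical, minimal view of a schema component model (not Saxon's API).
    interface ElementDecl {
        String name();
        List<ElementDecl> childElements(); // derived from the element's type definition
    }

    public class PathEnumerator {

        // Collects element paths up to a depth limit; the full set of paths may be infinite.
        public static List<String> paths(ElementDecl root, int maxDepth) {
            List<String> result = new ArrayList<>();
            walk(root, "/" + root.name(), maxDepth, result);
            return result;
        }

        private static void walk(ElementDecl decl, String path, int depthLeft, List<String> out) {
            out.add(path);
            if (depthLeft == 0) {
                return; // guard against recursive types and wildcards
            }
            for (ElementDecl child : decl.childElements()) {
                walk(child, path + "/" + child.name(), depthLeft - 1, out);
            }
        }
    }

Substitution groups and types derived by extension would add further branches at the childElements() step, as noted above.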

How to output structured error information in bazel rule?

Compiler error messages usually include lots of human-parseable information about the underlying error. I have custom rules in which I would like to additionally expose this information in a machine-parseable manner. This would allow things like integration with my editor, showing me the locations that need to be fixed.
What is the recommended way of doing this? The best I can come up with is to define a fairly simple structure that meshes well with the human-readable part, include it in stdout/stderr, and parse that. But that seems much more error-prone than emitting a dedicated machine-parseable output. And since actions fail in a binary fashion, no output files are available on failure, so I cannot think of any other mechanism to get the data out.
Take a look at the Build Event Protocol. Consuming "Progress" messages could be useful here.
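One way to consume it: run the build with --build_event_json_file so that Bazel writes the event stream as JSON, one event per line, and then pick out the "progress" events, whose stderr field carries the compiler output. The sketch below assumes that newline-delimited layout and uses Gson for the parsing; the file name bep.json is arbitrary.

    import com.google.gson.JsonObject;
    import com.google.gson.JsonParser;

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Sketch: read the output of
    //   bazel build //... --build_event_json_file=bep.json
    // and print the stderr carried by "progress" events, which is where
    // compiler diagnostics usually end up.
    public class BepProgressReader {
        public static void main(String[] args) throws IOException {
            for (String line : Files.readAllLines(Paths.get("bep.json"))) {
                if (line.isEmpty()) {
                    continue;
                }
                JsonObject event = JsonParser.parseString(line).getAsJsonObject();
                if (event.has("progress")) {
                    JsonObject progress = event.getAsJsonObject("progress");
                    if (progress.has("stderr")) {
                        System.out.print(progress.get("stderr").getAsString());
                    }
                }
            }
        }
    }

The same stream is also available in binary protobuf form via --build_event_binary_file if you prefer a typed contract over scraping JSON text.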

Adding structure to HDF5 files - Equivalent of NetCDF "Conventions" for HDF5

NetCDF4 has the "Conventions" mechanism for adding structure to NetCDF files. I'm looking for the analogous thing, but for HDF5 specifically.
My general aim is to add structure to my HDF5 files in a standard way. I want to do something like what HDF5 does with images to define a type, using attributes on groups and datasets, like:
CLASS: IMAGE
IMAGE_VERSION: 1.2
IMAGE_SUBCLASS: IMAGE_TRUECOLOR
...
But as far as I can tell, that image specification stands alone. Maybe I should just reuse the NetCDF "conventions"?
Update:
I'm aware NetCDF4 is implemented on top of HDF5. In this case, we have data from turbulence simulations and experiments, not geo data. This data is usually limited to <= 4D. We already use HDF5 for storing this data, but we have no developed standards. Pseudo-standard formats have just sort of developed organically within the organization.
NetCDF4 files are actually stored using the HDF5 format (http://www.unidata.ucar.edu/publications/factsheets/current/factsheet_netcdf.pdf); however, they use netCDF4 conventions for attributes, dimensions, etc. Files are self-describing, which is a big plus. HDF5 without netCDF4 allows for much more liberty in defining your data. Is there a specific reason that you would like to use HDF5 instead of netCDF4?
I would say that if you don't have any specific constraints (like a model or visualisation software that chokes on netCDF4 files), you'd be better off using netCDF. netCDF4 can be used by the NCO/CDO operators, NCL (which also accepts HDF5), IDL, the netCDF4 Python module, Ferret, etc. Personally, I find netCDF4 to be very convenient for storing climate or meteorological data. There are a lot of operators already written for it, and you don't have to go through the trouble of developing a standard for your own data - it's already done for you. CMOR (http://cmip-pcmdi.llnl.gov/cmip5/output_req.html) can be used to write CF-compliant climate data. It was used for the most recent climate model comparison project.
On the other hand, HDF5 might be worth it if you have another type of data and are looking for some very specific functionality for which you need a more customised file format. Would you mind specifying your needs a little better in the comments?
Update:
Unfortunately, the standards for variable and field names are a little less clear and well-organised for HDF5 files than for netCDF, since netCDF was the format of choice for big climate modelling projects like CMIP or CORDEX. The problem essentially boils down to using EOSDIS or CF conventions, but finding currently maintained libraries that implement these standards for HDF5 files and have clear documentation isn't exactly easy (if it were, you probably wouldn't have posted the question).
If you really just want a standard, NASA explains all the different possible metadata standards in painful detail here: http://gcmd.nasa.gov/add/standards/index.html.
For information, HDF-EOS and HDF5 aren't exactly the same format (HDF-EOS already contains cartography data and is standardised for earth science data), so I don't know if this format would be too restrictive for you. The tools for working with this format are described here: http://hdfeos.net/software/tool.php and summarized here: http://hdfeos.org/help/reference/HTIC_Brochure_Examples.pdf.
If you still prefer to use HDF5, your best bet would probably be to download an HDF5-formatted file from NASA for similar data and use it as a basis to create your own tools in the language of your choice. Here's a list of comprehensive examples using the HDF5, HDF4 and HDF-EOS formats, with scripts for data treatment and visualisation in Python, MATLAB, IDL and NCL: http://hdfeos.net/zoo/index_openLAADS_Examples.php#MODIS
Essentially the problem is that NASA makes tools available so that you can work with their data, but not necessarily so you can re-create similarly structured data in your own lab setting.
Here's some more specs/information about HDF5 for earth science data from NASA:
MERRA product
https://gmao.gsfc.nasa.gov/products/documents/MERRA_File_Specification.pdf
GrADS compatible HDF5 information
http://disc.sci.gsfc.nasa.gov/recipes/?q=recipes/How-to-Read-Data-in-HDF-5-Format-with-GrADS
HDF data manipulation tools on NASA's Atmospheric Science Data Center:
https://eosweb.larc.nasa.gov/HBDOCS/hdf_data_manipulation.html
Hope this helps a little.
The best choice of standard really depends on the kind of data you want to store. The CF conventions are most useful for measurement data that is georeferenced, for instance data measured by a satellite. It would be helpful to know what your data consists of.
Assuming you have georeferenced data, I think you have two options:
1. Reuse the CF conventions in HDF like you suggested. There are more people looking into this; a quick Google search gave me this.
2. HDF-EOS (disclaimer: I have never used it). It stores data in HDF files using a certain structure, but seems to require an extension library to use. I did not find a specification of the structure, only an API. Also, there does not seem to be a vibrant community outside NASA.
Therefore I would probably go with option 1: use the CF conventions in your HDF files and see if a 3rd party tool, such as Panoply, can make use of it.
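If option 1 is acceptable and you don't mind writing through the netCDF-4 layer (which stores as HDF5 underneath), the attributes look roughly like this with the NetCDF-Java library's NetcdfFileWriter API. Treat this as a sketch only: it uses the older writer interface, which may differ in your version; writing the netCDF-4 variant needs the native netCDF-C library; and the variable and attribute values are placeholders for your turbulence fields.

    import ucar.ma2.DataType;
    import ucar.nc2.Attribute;
    import ucar.nc2.NetcdfFileWriter;
    import ucar.nc2.Variable;

    import java.io.IOException;

    // Sketch: CF-style metadata written through NetCDF-Java; the resulting file is HDF5 underneath.
    public class CfStyleWriter {
        public static void main(String[] args) throws IOException {
            // netcdf4 output needs the native netCDF-C library; netcdf3 works without it.
            NetcdfFileWriter writer =
                    NetcdfFileWriter.createNew(NetcdfFileWriter.Version.netcdf4, "velocity.nc");

            // Global attributes: declare which convention the file follows.
            writer.addGroupAttribute(null, new Attribute("Conventions", "CF-1.8"));
            writer.addGroupAttribute(null, new Attribute("title", "channel flow snapshot (placeholder)"));

            // A dimension plus a variable carrying per-variable CF attributes.
            writer.addDimension(null, "x", 64);
            Variable u = writer.addVariable(null, "u", DataType.FLOAT, "x");
            writer.addVariableAttribute(u, new Attribute("units", "m s-1"));
            writer.addVariableAttribute(u, new Attribute("long_name", "streamwise velocity"));

            writer.create(); // writes the header/metadata; data writes would follow here
            writer.close();
        }
    }

Tools like Panoply or ncdump can then be pointed at the result to check whether the attributes are picked up the way you expect.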

xml merging with Lua script

I have one task... of course I am not expecting you people to give me a ready-made solution, but some outline would be very helpful. Please help, as Lua is a new language for me.
So the task is:
I have three XML files. All of them store data about the same objects, say equipment. Apart from the name of the equipment, the parameters the XMLs store are different.
Now I want to make a generic XML file which carries all the data (all parameters) about the equipment.
Please note that the name will be unique and thus will act as a key parameter.
I want to achieve this task with a Lua script.
Lua does not do XML "by default". It is a language designed to be "embedded" into other systems, so it may be that the system you have it embedded in is able to parse the XML files and pass them on to Lua. If that's the case, translate the XMLs to Lua tables on the host system, then give them to Lua, manipulate them in Lua, and return the resulting Lua table so that the host can transform it back to XML.
Another option, if available, would be installing a binary library for parsing XML, such as luaxml. If you are able to install it on your system, you should be able to manipulate the XML files more or less easily directly from Lua. But this possibility depends on the system you have embedded Lua into; a lot of systems don't allow installation of additional libraries.
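Whichever way the XML gets parsed, the merge itself is just: index every record by the unique equipment name and copy each file's parameters into one combined record per name. Here is that idea sketched with plain Java maps, purely to show the shape of the logic; with Lua tables keyed by equipment name the structure is the same (a table of tables, filled file by file), and the names used here are invented for illustration.

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of the merge step: one combined record per unique equipment name.
    public class EquipmentMerge {

        // Each input is "equipment name -> parameter map", one map per parsed XML file.
        public static Map<String, Map<String, String>> merge(
                List<Map<String, Map<String, String>>> inputs) {
            Map<String, Map<String, String>> merged = new LinkedHashMap<>();
            for (Map<String, Map<String, String>> file : inputs) {
                for (Map.Entry<String, Map<String, String>> equipment : file.entrySet()) {
                    merged.computeIfAbsent(equipment.getKey(), k -> new LinkedHashMap<>())
                          .putAll(equipment.getValue());
                }
            }
            return merged; // serialize this back to XML on the host side
        }
    }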

Framework for building structured binary data parsers?

I have some experience with Pragmatic-Programmer-type code generation: specifying a data structure in a platform-neutral format and writing templates for a code generator that consume these data structure files and produce code that pulls raw bytes into language-specific data structures, does scaling on the numeric data, prints out the data, etc. The nice pragmatic(TM) ideas are that (a) I can change data structures by modifying my specification file and regenerating the source (which is DRY and all that) and (b) I can add additional functions that can be generated for all of my structures just by modifying my templates.
What I had used was a Perl script called Jeeves which worked, but it's general purpose, and any functions I wanted to write to manipulate my data I was writing from the ground up.
Are there any frameworks that are well-suited for creating parsers for structured binary data? What I've read of Antlr suggests that it's overkill. My current target languages of interest are C#, C++, and Java, if it matters.
Thanks as always.
Edit: I'll put a bounty on this question. If there are any areas I should be looking at (keywords to search on) or other ways of attacking this problem that you've developed yourself, I'd love to hear about them.
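For context on what such generators emit, here is the kind of record class you would otherwise write by hand for every structure; the layout (a big-endian uint16 id plus a scaled int16 temperature) is invented for illustration. A spec-driven framework produces this boilerplate, plus the printing and scaling helpers, from the specification file instead.

    import java.io.DataInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    // Hand-written example of what such generators typically emit: raw bytes -> typed record.
    public class SampleRecord {
        public final int sensorId;        // uint16 in the stream
        public final double temperature;  // int16 raw counts, scaled to degrees C

        private SampleRecord(int sensorId, double temperature) {
            this.sensorId = sensorId;
            this.temperature = temperature;
        }

        // Reads one big-endian record: uint16 sensor id, int16 temperature in 0.01 C counts.
        public static SampleRecord read(InputStream in) throws IOException {
            DataInputStream data = new DataInputStream(in);
            int sensorId = data.readUnsignedShort();
            double temperature = data.readShort() * 0.01; // the scaling step mentioned above
            return new SampleRecord(sensorId, temperature);
        }

        @Override
        public String toString() {
            return "SampleRecord{sensorId=" + sensorId + ", temperature=" + temperature + "}";
        }
    }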
You may also look at a relatively new project, Kaitai Struct, which provides a language for that purpose and also has a good IDE:
Kaitai.io
You might find ASN.1 interesting, as it provides an abstract way to describe the data you might be processing. If you use ASN.1 to describe the data abstractly, you need a way to map that abstract data to concrete binary streams, for which ECN (Encoding Control Notation) is likely the right choice.
The New Jersey Machine Toolkit is actually focused on binary data streams corresponding to instruction sets, but I think that's a superset of just binary streams. It has very nice facilities for defining fields in terms of bit strings, and automatically generating accessors and generators of such. This might be particularly useful if your binary data structures contain pointers to other parts of the data stream.
