Which file formats can be previewed in the CKAN Data Preview tool?

I am working on CKAN and would like to know which file formats can be previewed in CKAN. I could not find any information on this online, so I am starting this topic in the hope of gathering responses that will be useful to CKAN developers in future. Here is the list of file formats I have gathered by experimenting with my own CKAN instance and looking through other CKAN instances, such as those from the UK and Australia.
Can be previewed:
CSV (Comma separated values)
XLS (Microsoft Excel Binary File Format)
HTML (HyperText Markup Language)
JSON (JavaScript Object Notation)
PDF (Portable Document Format)
RSS (Really Simple Syndication)
TXT (Plain Text)
WMS (Web Map Service)
XML (eXtensible Markup Language)
Cannot be previewed:
DOC (Microsoft Word)
RDF (Resource Description Framework)
HTML (HyperText Markup Language)
KML (Keyhole Markup Language)
SHP (Shapefile)
WFS (Web Feature Service)
XLSX (Microsoft Excel Open XML Document)
ZIP (archive)
Please add to my list and correct me if any of the above is wrong, and I will update the list. Thanks! ;)

The data viewer's functionality may differ between CKAN releases.
Refer to the DataViewer section in the documentation for the CKAN version that you are using.
http://docs.ckan.org/en/latest/maintaining/data-viewer.html

With some simple tweaks to the config file, XLSX files can be previewed, as can tab-separated text files (tsv format/extension).
Edit the config.ini file to include
ckan.datapusher.formats = csv xls xlsx tsv application/csv application/vnd.ms-excel application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
HTML and RDF are also previewable:
refer to the CKAN documentation http://docs.ckan.org/en/latest/maintaining/configuration.html?highlight=preview#ckan-preview-loadable
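
To double-check which formats a config file actually lists, a small Python sketch along these lines may help. It assumes the setting lives in the [app:main] section, as CKAN ini files normally have it; the function name datapusher_formats and the SAMPLE text are made up for illustration.

```python
import configparser


def datapusher_formats(ini_text: str) -> list[str]:
    """Return the entries of ckan.datapusher.formats, lowercased."""
    # Interpolation is disabled so values such as %(here)s do not raise errors.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    raw = parser.get("app:main", "ckan.datapusher.formats", fallback="")
    return [item.lower() for item in raw.split()]


# Hypothetical config fragment, copied from the setting quoted above.
SAMPLE = """
[app:main]
ckan.datapusher.formats = csv xls xlsx tsv application/csv application/vnd.ms-excel application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
"""

formats = datapusher_formats(SAMPLE)
print("xlsx enabled:", "xlsx" in formats)
print("tsv enabled:", "tsv" in formats)
```

This only reads the setting; whether a preview actually appears still depends on the CKAN version and the view plugins that are enabled.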

Related

OpenXml: get the page number to which each paragraph belongs in a .docx file

I have a Word docx file and I want to retrieve all the paragraphs using OpenXml with C#.
I need to know:
1. The number of pages in the document.
2. The page number to which each paragraph belongs.
Can you show an example where the paragraphs of the document are read?
Unfortunately, as the answers to "Why only some page numbers stored in XML of docx file?" explain, a docx file does not contain reliable page numbers. The XML files carry no page numbers; Microsoft Word computes them dynamically when it opens and renders the document. Reading the Open XML documentation, for example https://learn.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.pagenumber?view=openxml-2.8.1 , does not change that.
You can unzip a few docx files and search for "page" or "pg" to see this for yourself. I did this on different kinds of docx files and they all told me the same thing. Glad if this helps.
A few months ago I reworked a Python package called docx2python to do something similar: I reproduced a structured (with levels) XML file from a docx file. As far as I know, a paragraph contains several runs, and each run contains a single piece of text. You can read this document to see how to do it: https://learn.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.paragraph?view=openxml-2.8.1 . Plain paragraphs are not hard to read. Glad if this helps.
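
The question asks for C#, but the idea is easier to show with the Python standard library, so here is a minimal sketch in Python rather than the Open XML SDK. It unzips the package, walks the w:p elements in word/document.xml, and joins the w:t text of each paragraph. It also counts w:lastRenderedPageBreak markers, which Word may cache when it saves a file; as noted above, these are not a reliable page count. The helper names and the tiny in-memory sample are illustrative only, and the sample is a stand-in that Word itself would not open.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML main namespace used by word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_paragraphs(docx_file):
    """Return a list of (text, cached_page_breaks) for each w:p element."""
    with zipfile.ZipFile(docx_file) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    result = []
    for para in root.iter(f"{{{W}}}p"):
        # A paragraph holds runs (w:r); each run holds text elements (w:t).
        text = "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))
        # Cached hint from the last save, absent in many files.
        breaks = len(list(para.iter(f"{{{W}}}lastRenderedPageBreak")))
        result.append((text, breaks))
    return result


def make_sample_docx() -> io.BytesIO:
    """Build a stripped-down package containing only word/document.xml."""
    body = (
        f'<w:document xmlns:w="{W}"><w:body>'
        "<w:p><w:r><w:t>First </w:t></w:r><w:r><w:t>paragraph</w:t></w:r></w:p>"
        "<w:p><w:r><w:lastRenderedPageBreak/><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", body)
    buffer.seek(0)
    return buffer


paragraphs = read_paragraphs(make_sample_docx())
for text, breaks in paragraphs:
    print(repr(text), "cached page breaks:", breaks)
```

For a real file, pass its path to read_paragraphs instead of the in-memory sample.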

Does QAF support Excel files with the .xlsx suffix?

I use an Excel file with the ".xlsx" suffix. QAF uses CSVUtil to parse this file and throws an exception.
If I use a ".xls" file, it works well.
As of QAF version 2.1.15, Excel files with the xlsx extension are not supported. It is a lower-priority feature, perhaps because people prefer the xml/json/csv formats over xls, and because custom data providers are available as well.
Another alternative is to use CSV files (with the .csv extension), which can be opened in Excel for editing and can also be viewed and edited outside Excel. For example, with git as the repository, one can review and edit a pull request if the file is csv but not if it is xlsx.

Node.js - Download .docx file exported as html from onedrive using microsoftgraph api call

When making a call like this example from here
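const fs = require('fs');

// `client` is an already initialized Microsoft Graph client, as in the linked example.
client
  .api('/me/drive/root/children/Doc.docx/content')
  .getStream((err, downloadStream) => {
    if (err) return console.log(err);
    const writeStream = fs.createWriteStream('Mydoc.docx');
    downloadStream.pipe(writeStream).on('error', console.log);
  });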
It works as expected. What I want is to get the .docx file as HTML. Is there any way to download it in HTML format, or do I have to save the file and then try to export it to HTML? Thanks
Word Documents (.docx) do not use HTML, they use Office Open XML (OOXML). Technically they are a zipped package that contains several elements along with the raw OOXML of the document.
OneDrive itself does not provide any document conversion tools, it is just the cloud storage the document is stored in.
In order to convert a document from one format to another (OOXML to HTML, for example), you'll need to use a 3rd party tool or service for that purpose. I'd suggest taking a look at Aspose. They offer a slew of file format conversion tools, including one for Word. I've had a number of developers report good results using their Aspose Cloud services as well.
You can add the query parameter format=html to download in html format but supposedly you have to use the beta endpoint.
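
As a rough illustration of that last suggestion, the sketch below only builds the request URL; it makes no network call and needs no token. It assumes the beta base URL https://graph.microsoft.com/beta and path-based addressing of the item; the helper name content_url is made up. Whether the service will really convert a given .docx to HTML is not confirmed here, so treat the result as something to try, not a guarantee.

```python
from typing import Optional
from urllib.parse import quote, urlencode

# Assumption: the beta endpoint, as the answer above suggests.
GRAPH_BETA = "https://graph.microsoft.com/beta"


def content_url(item_path: str, target_format: Optional[str] = None) -> str:
    """Build the download URL for a file in the signed-in user's drive."""
    url = f"{GRAPH_BETA}/me/drive/root:/{quote(item_path)}:/content"
    if target_format:
        # The format query parameter asks the service for a converted copy.
        url += "?" + urlencode({"format": target_format})
    return url


print(content_url("Doc.docx"))
print(content_url("Doc.docx", "html"))
```

The request itself would still need an Authorization header with a valid access token, exactly as for the unconverted download.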

Convert doc to pdf programmatically without using Word / third-party tools

Is it possible to convert a doc file to a pdf file programmatically, without using the Word application or third-party tools? Preferably in Delphi XE4. If so, how?
Yes, you can convert .doc/.docx files to .pdf without Word and without third-party controls. The specifications are publicly available - [simply] read and parse the .doc/.docx file according to the specification and generate the content according to the .pdf specification.
Here is the specification for MS-DOC (.doc) file format :
MS-DOC Specification (622 pages) -- Word97 through 2007
MS-DOCX Extensions Specification (105 pages) -- Word2010 through 2013
See also - Open Document and OpenXML Format
And the specification for the .pdf format :
PDF Reference (1310 pages)
Really, I think you'll find you probably want to use a third party component...

Grabbing text from various document formats in Ruby on Rails

I'm new to Rails but am developing a web app that requires taking text from a large collection of text files and displaying the text in HTML. The files are in .doc, .docx, .wps, and .pages formats, and are currently just sitting on a hard drive. There are few enough .wps and .pages files that I could convert these to .doc manually, but the question remains: how do I get to the text inside a .doc or .docx file so that I can save it into a sqlite database for later use?
Thanks!
Take a look at Yomu. It's a gem which acts as a wrapper for Apache Tika, and it supports a variety of document formats, including the following:
Microsoft Office OLE 2 and Office Open XML Formats (.doc, .docx, .xls, .xlsx, .ppt, .pptx)
OpenOffice.org OpenDocument Formats (.odt, .ods, .odp)
Apple iWorks Formats
Rich Text Format (.rtf)
Portable Document Format (.pdf)
It's a long, roundabout way, but OpenOffice can convert files, and there are programmatic ways to do that: http://railstech.com/2010/08/convert-open-office-document-to-another-open-office-format/
That may not be the best way, but maybe it will grease the wheels a bit.
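
To make the "get the text, then save it to sqlite" step concrete, here is a minimal sketch in Python rather than Ruby, using only the standard library. It handles .docx only (binary .doc files need a converter such as the tools mentioned above). The table name documents, the helper names, and the in-memory sample file are all assumptions for illustration.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace, in ElementTree's {uri}tag form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(docx_file) -> str:
    """Join the text of every paragraph in word/document.xml with newlines."""
    with zipfile.ZipFile(docx_file) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    return "\n".join(
        "".join(t.text or "" for t in para.iter(W + "t"))
        for para in root.iter(W + "p")
    )


def store(conn: sqlite3.Connection, name: str, body: str) -> None:
    """Insert or update one document row (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents (name TEXT PRIMARY KEY, body TEXT)"
    )
    conn.execute("INSERT OR REPLACE INTO documents VALUES (?, ?)", (name, body))
    conn.commit()


# Stand-in .docx with two paragraphs, built in memory for the demo.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as package:
    package.writestr(
        "word/document.xml",
        '<w:document xmlns:w="' + W[1:-1] + '"><w:body>'
        "<w:p><w:r><w:t>Hello</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>World</w:t></w:r></w:p>"
        "</w:body></w:document>",
    )
sample.seek(0)

conn = sqlite3.connect(":memory:")
store(conn, "sample.docx", docx_text(sample))
stored = conn.execute("SELECT body FROM documents WHERE name = ?", ("sample.docx",)).fetchone()[0]
print(stored)
```

In a Rails app the same two steps would apply: extract the text once (for example with Yomu), then save it through a model backed by the sqlite database.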
