Extract data from PDF using Web Harvest - webharvest

How can I extract data from PDFs using Web Harvest? I can collect the URLs of all the relevant PDFs on a page, but I have not been able to extract data from those PDFs. I am using Web Harvest version 2.0 to extract the PDF URLs. Please help.
How would I incorporate a PDF command into Web Harvest to get the text? Is there any way to do this without running a batch file?

I think Web Harvest alone is not sufficient for this. You should use Wget and PDFBox to get your result. First download all the PDFs from your URLs into a folder, using Wget or Web Harvest itself. Then run the PDFBox command-line tool to extract the text from the PDFs. The PDFBox command-line documentation is at http://pdfbox.apache.org/commandline/. You can also create a batch file to run these steps in order.
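If you would rather not maintain a batch file, the second step can be driven from a small script. Below is a minimal Node.js sketch, assuming the PDFs are already downloaded into one folder, Java is on the PATH, and you have the PDFBox 2.x standalone app jar (whose text extraction subcommand is ExtractText). The jar name and folder name are placeholders, and the helper names are made up for illustration. Other PDFBox versions may name the subcommand differently, so check the command-line page for your version.

```javascript
const fs = require('fs');
const path = require('path');
const { execFileSync } = require('child_process');

// Build the argument list for: java -jar <jar> ExtractText <in.pdf> <out.txt>
// (PDFBox 2.x command-line syntax; the output file sits next to the PDF).
function extractTextArgs(jarPath, pdfPath) {
  const txtPath = pdfPath.replace(/\.pdf$/i, '.txt');
  return ['-jar', jarPath, 'ExtractText', pdfPath, txtPath];
}

// Run PDFBox once per PDF in `dir` and return the paths of the text files.
function extractAll(dir, jarPath) {
  return fs
    .readdirSync(dir)
    .filter((name) => name.toLowerCase().endsWith('.pdf'))
    .map((name) => {
      const args = extractTextArgs(jarPath, path.join(dir, name));
      execFileSync('java', args, { stdio: 'inherit' });
      return args[args.length - 1];
    });
}
```

Calling extractAll('pdfs', 'pdfbox-app.jar') would then write one .txt file per PDF into the same folder; both arguments here are placeholder names.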

Related

How to Upload Video to Rails Server from App Inventor

My goal is essentially to upload a video file, like this, but with Ruby on Rails instead of PHP. I can successfully send JSON data to my server, but haven't been able to get file uploads working. The end objective is to have the file be in the server's /tmp directory, just like files uploaded via a webpage file_field_tag.
The image (not reproduced here) shows what I've tried so far. The result is that on the server, the parameter list is empty, unlike if you had used a file_field_tag. In the PHP example, they are able to get the contents of the file from the input stream... maybe there is something similar in Rails?
I know my API works, as I was able to successfully make a request using a JavaScript XMLHttpRequest, so I'm led to believe the solution involves working around what App Inventor offers for HTTP requests.
Edit: Removed the unsupported header, since the PHP example doesn't use headers anyway.

Node.js - Download .docx file exported as html from onedrive using microsoftgraph api call

When making a call like this example:
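const fs = require('fs');

// `client` is an authenticated Microsoft Graph client created elsewhere.
client
  .api('/me/drive/root/children/Doc.docx/content')
  .getStream((err, downloadStream) => {
    if (err) return console.log(err); // stop if the request itself failed
    const writeStream = fs.createWriteStream('Mydoc.docx');
    downloadStream.pipe(writeStream).on('error', console.log);
  });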
It works as expected. What I want is to get the .docx file as HTML. Is there any way to download it in HTML format? Or do I have to save the file and then try to export it to HTML? Thanks
Word documents (.docx) do not use HTML; they use Office Open XML (OOXML). Technically, a .docx file is a zipped package that contains several parts, including the raw OOXML of the document.
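You can see the package nature of a .docx for yourself: a ZIP archive with at least one entry begins with the four-byte local file header signature "PK" 0x03 0x04. The following is a small Node.js sketch that checks for that signature. The function names are made up for illustration, and passing this check only shows that the file starts like a ZIP archive, not that it is a valid Word document.

```javascript
const fs = require('fs');

// ZIP local file header signature: the bytes "PK" followed by 0x03 0x04.
const ZIP_SIGNATURE = Buffer.from([0x50, 0x4b, 0x03, 0x04]);

// True when the buffer starts like a ZIP archive, as a .docx package does.
function looksLikeZip(buffer) {
  return buffer.length >= 4 && buffer.subarray(0, 4).equals(ZIP_SIGNATURE);
}

// Read only the first four bytes of a file and check them.
function fileLooksLikeZip(filePath) {
  const fd = fs.openSync(filePath, 'r');
  try {
    const head = Buffer.alloc(4);
    const bytesRead = fs.readSync(fd, head, 0, 4, 0);
    return looksLikeZip(head.subarray(0, bytesRead));
  } finally {
    fs.closeSync(fd);
  }
}
```

For example, fileLooksLikeZip('Mydoc.docx') could be run on the file saved by the download code in the question.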
OneDrive itself does not provide any document conversion tools, it is just the cloud storage the document is stored in.
To convert a document from one format to another (OOXML to HTML, for example), you'll need to use a third-party tool or service. I'd suggest taking a look at Aspose. They offer a range of file format conversion tools, including one for Word. I've had a number of developers report good results using their Aspose Cloud services as well.
Alternatively, you can add the query parameter format=html to the content request to download the file as HTML, but reportedly this only works on the beta endpoint.
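To make that concrete, here is a minimal sketch of how such a request URL might be put together, reusing the item path from the question. The helper name is hypothetical, and whether the service actually converts a .docx to HTML is an assumption taken from the answer above; which source and target formats are supported should be checked against the Graph documentation.

```javascript
// Build a Graph URL that asks the content endpoint for a converted copy.
// `itemPath` is the drive item path without the trailing /content segment.
function convertedContentUrl(itemPath, format, version = 'beta') {
  const query = new URLSearchParams({ format }).toString();
  return `https://graph.microsoft.com/${version}${itemPath}/content?${query}`;
}

const url = convertedContentUrl('/me/drive/root/children/Doc.docx', 'html');
console.log(url);
```

The resulting URL would be requested with the same bearer token as any other Graph call, and the response body streamed to a file as in the question.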

How to read pdf and extract text from pdf in symfony1.1?

I am working on Symfony-1.1 in an existing project. How can I read pdf files and extract text from them?
It's not really a Symfony 1.1 question; it's a PHP one. There are several libraries for handling PDFs in PHP. Here are some suggestions:
https://github.com/smalot/pdfparser
http://pastebin.com/dvwySU1a
http://www.pdflib.com/
If you just need to parse the PDF in any way and then process the text in PHP, you can also consider using a Java library such as the following:
http://pdfbox.apache.org/ (Is there a PDF parser for PHP?)

Extract metadata without uploading

I have a Ruby on Rails application that works with video playlists. Now I would like to extract timecode information from a video file without uploading it to the server (which takes too long). Is this possible?
If it is not possible, is there a way to export the metadata locally and upload it (as XML files, for example) to the server?
Thanks in advance
AFAIK, this is not possible in browsers that do not support the File API. Take a look at jQuery File Upload and the way it lets the user identify the file extension before upload.
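In browsers that do support the File API, basic information about a selected file is available before anything is sent to the server. Here is a minimal sketch that reads only the standard File properties (name, size, type, lastModified). The function name and the element id in the usage comment are made up for illustration. Timecode data lives inside the video container, so extracting it would additionally mean reading and parsing bytes of the file in the browser, which is not shown here.

```javascript
// Summarise a File object (for example from <input type="file">) without uploading it.
function describeFile(file) {
  const dot = file.name.lastIndexOf('.');
  return {
    name: file.name,
    extension: dot > 0 ? file.name.slice(dot + 1).toLowerCase() : '',
    sizeBytes: file.size,
    mimeType: file.type,
    lastModified: new Date(file.lastModified).toISOString(),
  };
}

// Browser usage (hypothetical element id "video-input"):
// document.getElementById('video-input').addEventListener('change', (e) => {
//   console.log(describeFile(e.target.files[0]));
// });
```

The returned object is small enough to post to the Rails server as JSON instead of the video itself.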

Adding server-side script and RSS feed to Sharepoint 2007?

I am investigating if the functionality of some CGI scripts written in Perl that we run on a web server can be migrated to our Sharepoint 2007 server (MOSS).
The CGI scripts are not complicated. Basically, they display and process the contents of files that reside in the network file system.
For instance one script just displays the contents of small text files that are being added to a specific folder.
These files are part of a production process and cannot be moved into a Sharepoint document archive.
The CGI scripts are being used to give an overview on what is "new in the queue" for this production process.
When the production process has finished, it removes the files from the folder. But new files may arrive in the folder at any time.
I have done some investigations and found that using a "Data View" web part would give possibilities of displaying the data in a good way.
The files would need to be transformed from text to XML before some XSLT could make them look good in a Data View web part. I guess that could be done by some kind of server-side script?
But how and where do I add such a script to Sharepoint?
Would it be a good idea to implement this as an RSS feed instead? But an RSS feed would also require a server-side script, wouldn't it?
I am new to Sharepoint development and would appreciate any useful advice.
Why not just write a Custom WebPart that reads the content of those text files and displays it? This way you won't be making any changes to the text files.
Note: The link to Custom WebPart points to my blog. There are tons of other articles on the net :)
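If the Data View route from the question is preferred instead, the text-to-XML step is simple enough to run as a script outside SharePoint. Below is a minimal Node.js sketch, assuming the queue folder holds UTF-8 .txt files. The element names (queue, file) and function names are invented for illustration, and where SharePoint would pick up the resulting XML is left open.

```javascript
const fs = require('fs');
const path = require('path');

// Escape the five characters that are special in XML text and attributes.
function escapeXml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Turn a list of { name, text } entries into one XML document string.
function queueToXml(files) {
  const items = files.map(
    (f) => `  <file name="${escapeXml(f.name)}">${escapeXml(f.text)}</file>`
  );
  return ['<queue>', ...items, '</queue>'].join('\n');
}

// Read every .txt file in the queue folder without modifying it.
function readQueue(dir) {
  return fs
    .readdirSync(dir)
    .filter((name) => name.toLowerCase().endsWith('.txt'))
    .map((name) => ({
      name,
      text: fs.readFileSync(path.join(dir, name), 'utf8'),
    }));
}
```

Running queueToXml(readQueue(folder)) on a schedule would give a snapshot of what is new in the queue while leaving the production files untouched.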

Resources