I have been using the Microsoft Graph API to download files from OneDrive successfully.
I was looking for a way to read only the text content (for indexing purposes in my application) using the Graph API, for different types of files (pdf, xls, zip, images, etc.), instead of the conventional approach of downloading the complete file, extracting the text with some text-extraction API, and then indexing it, which would be time-consuming. I am aware the Graph API has its own search features, but it lacks the ability to do complicated searches such as regular-expression search (please correct me if I am wrong). I am sure OneDrive does its own indexing of each file, which is what lets a user do a basic search.
So, is there any way I can get the text content of the documents using the Graph API?
I don't believe getting a 'preview' of text-based documents is currently available through the API. You will need to make a GET request to fetch the content. If you don't want the full document, you can request a partial range of bytes that you believe will be enough. In addition, to make it easier to handle different file types, the API currently supports converting common file formats to PDF, so you can standardize your file-parsing logic.
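For reference, a minimal sketch of both approaches in Python with the requests library; the token, item ID, and byte range are placeholders, and the endpoint shapes follow the Graph v1.0 documentation:

import requests

TOKEN = "..."       # placeholder OAuth bearer token
ITEM = "{item-id}"  # placeholder DriveItem ID
BASE = "https://graph.microsoft.com/v1.0/me/drive/items"
headers = {"Authorization": f"Bearer {TOKEN}"}

# Fetch only the first 64 KB of the file instead of the whole thing.
partial = requests.get(f"{BASE}/{ITEM}/content",
                       headers={**headers, "Range": "bytes=0-65535"})
print(partial.status_code)  # 206 Partial Content on success

# Or ask the service to convert a supported format (docx, xlsx, ...) to PDF.
pdf = requests.get(f"{BASE}/{ITEM}/content?format=pdf", headers=headers)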
Related
I need to search for files with the .docx extension in my OneDrive. This bit is simple and works using the OneDrive search API. The piece that does not work is that in the response, with each DriveItem, I also need the custom properties we created under ListItem.Fields associated with that DriveItem. These custom properties contain information I need to create a report.
Expanding ListItem seems to work on the root/children resource without any search, but that does not solve my problem: I need the files with the .docx extension in their filenames, and those files can be under the root or any subfolder beneath it.
So this request returns the CustomProperty with the response:
/_api/v2.0/drives/[drive id]/root/children?select=*%2cwebDavUrl%2csharepointIds&expand=listItem(select%3dfields%3bexpand%3dfields(select%3dCustomProperty))
But when I try to expand ListItem on the DriveItems returned from the search query, as below:
/_api/v2.0/drives/[drive id]/root/search(q='docx')?select=*%2cwebDavUrl%2csharepointIds&expand=listItem(select%3dfields%3bexpand%3dfields(select%3dCustomProperty))
I get an error:
Error: {"error":{"code":"notSupported","message":"The request is not supported by the system."}}
Is expanding ListItem.Fields on a DriveItem not supported in the OneDrive search API?
If it is not, is there another way for me to achieve what I want here? I am not trying to search on the CustomProperty; I just want to retrieve its value as part of the response with its associated DriveItem.
As expected, I get the same or a similar error if I run this through the Microsoft Graph search API instead of the OneDrive API.
One workaround I could do is to first search for .docx files without the expand keyword; the search will recursively find and return all .docx files in my OneDrive. Then I could make individual calls to request these items again, one by one, using their DriveItem.Id and the expanded ListItem.Fields property. That would be a terrible workaround, though, because instead of achieving what I need in a single request, I would have to make thousands of individual I/O requests (one per .docx file) to get the expanded ListItem properties.
This is a known issue with the /search endpoint. Unfortunately, there isn't a good workaround available at the moment either. In order to retrieve the ListItem resources, you will need to retrieve each DriveItem from your search results directly:
/drives/{drive-id}/items/{item-id}?$expand=listItem
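If you do go the one-request-per-item route, Graph's JSON batching endpoint can at least cut the number of round trips (up to 20 sub-requests per call). A rough sketch in Python with the requests library; the token, drive ID, and item IDs are placeholders:

import requests

TOKEN = "..."              # placeholder bearer token
DRIVE = "{drive-id}"       # placeholder drive ID
item_ids = ["...", "..."]  # DriveItem IDs from the search response

# Bundle up to 20 GET sub-requests into one $batch call.
body = {"requests": [
    {"id": str(i), "method": "GET",
     "url": f"/drives/{DRIVE}/items/{item_id}?$expand=listItem"}
    for i, item_id in enumerate(item_ids, start=1)
]}
resp = requests.post("https://graph.microsoft.com/v1.0/$batch",
                     headers={"Authorization": f"Bearer {TOKEN}"},
                     json=body)
for r in resp.json()["responses"]:
    print(r["id"], r["status"])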
I tried using Feedjira to assist with content analysis of news feeds, but it appears that RSS feeds now often link to the content rather than including it in the feed, as I found out in "Feedjira not adding content and author". I plan to use Feedjira to get the URL for each article, then use Nokogiri to scrape the article and pick out the relevant parts.
The problem is that each media outlet will have a different format for its pages, and I need to know the best way for Nokogiri to take the URL from the database (supplied by Feedjira) and, depending on the associated feed title (also in the database from the Feedjira sync), scrape the page in a specific way and save it to a separate table in the database. Anyone got any suggestions?
I don't know your specific use case, but I'm also doing content analysis using news feeds.
Maybe have a look at Readability, which provides a generic content scraper.
The problem you've encountered is that every feed generator does it a bit differently, just as with HTML generators. You can assume certain fields will be in place in an RDF, RSS or ATOM feed; however, the author of the feed could use optional tags that you would find very useful, so you have to write code to look for them.
I wrote several feed aggregators in the past, including one that handled well over 1,000 feeds daily. By sniffing out the feed type, ATOM vs. RSS vs. RDF, I could make sensible checks for the fields that were interesting for that format and extract the data if it was available.
Pre-canned parsers get it wrong too often, either grabbing data you don't want and making a mess of the output, or skipping data you do want leaving gaps in the output, so be prepared to write code if you want it done correctly.
You'll probably want to take advantage of a backing database too, to keep track of what you looked at last and when you're supposed to look again; that's part of being a good network citizen. You'll also want to keep track of whether a feed was down the last n times you checked so you can trim out dead sites.
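To make the sniffing idea concrete, here is a rough sketch in Python with the feedparser library (the question's stack is Ruby/Feedjira, so treat this as illustrative pseudocode for the approach; the feed URL is hypothetical):

import feedparser

d = feedparser.parse("https://example.com/feed")  # hypothetical feed URL

# Sniff the feed type before deciding which optional fields to probe for.
kind = d.version  # e.g. 'rss20', 'atom10', or 'rss10' (RDF)

for entry in d.entries:
    title = entry.get("title", "")
    # Optional tags vary by generator; look for them rather than assume.
    body = (entry.get("content", [{}])[0].get("value")
            or entry.get("summary", ""))
    author = entry.get("author", "unknown")
    print(kind, title, author, len(body))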
I am trying to help an associate (also a non-developer) scope out the possibilities for a tool that will gather metadata on the documents in a Google Docs account. One of the types of information they want to access is metadata about the styles used in a document (e.g., Title and Heading styles). This data could be used to create a master Table of Contents for a particular collection.
I found the API Reference section on the Google Drive SDK site, but it does not seem to contain any information about styles or headings. Does this capability exist in some other form, or somewhere else? If not, does anyone have any suggestions on a more brute force way to get this information?
Thanks.
This seems like something that could be done with the newer Google Drive API and the indexableText attribute:
https://developers.google.com/drive/v2/reference/files#resource
Your script would need to parse the document content, determine the indexable text (headings, section titles, etc.), and put those strings in the indexableText attribute so that Google Drive search can locate the files based on this data.
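A minimal sketch of setting that attribute with the v2 Python client; the credentials, file ID, and extracted headings are placeholders, and your own parser would have to produce the text:

from googleapiclient.discovery import build
from google.oauth2.credentials import Credentials

creds = Credentials(token="...")  # placeholder; obtain via your OAuth flow
service = build("drive", "v2", credentials=creds)

file_id = "..."  # the document's Drive file ID
extracted = "Overview\nMethods\nResults"  # headings your parser pulled out

# Write the extracted headings into the file's indexableText attribute.
service.files().update(
    fileId=file_id,
    body={"indexableText": {"text": extracted}},
).execute()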
I'm trying to write an iOS application that will get data from a web server and display it the way I want. I want to use JSON for this purpose. But as I'm absolutely new to web apps, I've got no idea how to get the URL for a certain feed. Here are the two big questions:
How do I find the URL for a feed provided by a web service? Is there a standard way, or is it handed out publicly or exclusively to the web service's subscribers?
Is the format they provide the data in up to them (like XML or JSON)? I mean, do I choose my parsing method according to the format the web service returns, so that if the feed is in XML format, using the NSJSONSerialization class makes no sense?
The URL to use depends on the web service and is usually well described in its documentation.
The type of data returned and its structure are also usually well described in the documentation.
The common bits you'll need to know are how to reach the web service (NSURLRequest/NSURLConnection, or any of the many open-source asynchronous wrappers you can find with a bit of searching), and how to deal with the returned data, whether it's JSON (NSJSONSerialization, JSONKit) or XML (NSXMLParser, libxml, or any of the many open-source implementations).
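The pattern itself is language-agnostic; here is a small Python sketch of it (in the question's Objective-C world, NSURLConnection would fetch the bytes and NSJSONSerialization or NSXMLParser would play the parser roles; the endpoint URL is hypothetical):

import json
import urllib.request
import xml.etree.ElementTree as ET

url = "https://example.com/api/feed"  # hypothetical endpoint from the docs

with urllib.request.urlopen(url) as resp:
    content_type = resp.headers.get("Content-Type", "")
    raw = resp.read()

# Pick the parser to match the format the service actually returns.
if "json" in content_type:
    data = json.loads(raw)
elif "xml" in content_type:
    data = ET.fromstring(raw)
else:
    raise ValueError(f"unexpected format: {content_type}")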
I want to know if there is a better way of extracting info from a web page than parsing the HTML for what I'm searching for, e.g., extracting the movie rating from imdb.com.
I'm currently using the Indy HTTP components to get the page, and I'm using StrUtils to parse the text, but the content is limited.
I found plain simple regexes to be highly intuitive and simple when dealing with good websites, and IMDB is a good website.
For example, the movie rating on IMDB's movie HTML page is in a <DIV> with class="star-box-giga-star". That's VERY easy to extract using a regular expression. The following regular expression extracts the movie rating from the raw HTML into capture group 1:
star-box-giga-star[^>]*>([^<]*)<
It's not pretty, but it does the job. The regex looks for the "star-box-giga-star" class name, then looks for the > that terminates the DIV, and then captures everything until the following <. To create a new regex like this, you should use a web browser that allows inspecting elements (for example Chrome or Opera). With Chrome you can simply look at the web page, right-click on the element you want to capture, choose Inspect element, then look around for easily identifiable elements that can be used to create a good regex. In this case the "star-box-giga-star" class is obviously easy to identify! You'll usually have no problem finding such identifiable elements on good websites, because good websites use CSS, and CSS requires IDs or classes to style elements properly.
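Applied to the fetched HTML, the extraction is a one-liner (shown in Python for brevity; in Delphi XE and later, TRegEx in the System.RegularExpressions unit works the same way; the sample HTML below is made up):

import re

html = '<div class="star-box-giga-star"> 8.7 </div>'  # fetched page HTML
m = re.search(r'star-box-giga-star[^>]*>([^<]*)<', html)
if m:
    print(m.group(1).strip())  # -> 8.7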
Processing an RSS feed is more comfortable.
As of the time of posting, the only RSS feeds available on the site are:
Born on this Date
Died on this Date
Daily Poll
Still, you can ask for a new one to be added by getting in touch with the help desk.
Resources on RSS feed processing:
Relevant post here on SO.
Super Object
Wikipedia.
When scraping websites, you cannot rely on the availability of the information. IMDB may detect your scraping and attempt to block you, or it may change the page format frequently to make scraping more difficult.
Therefore, you should always try to use a supported API or RSS feed, or at least get permission from the website to aggregate its data, and ensure that you're abiding by its terms. Often you will have to pay for this type of access. Scraping a website without permission may open you up to liability on a couple of legal fronts (denial of service and intellectual property).
Here's IMDB's statement:
You may not use data mining, robots, screen scraping, or similar
online data gathering and extraction tools on our website.
To answer your question, the better way is to use a method provided by the website. For non-commercial use, if you abide by their terms, you can download the IMDB database directly and use the data from there instead of scraping their site. Simply update your database frequently, and it's a better solution than scraping. You could even wrap your own web API around it. Ratings are available as a standalone table.
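As a sketch of that approach in Python (the URL points at IMDb's current TSV snapshots on datasets.imdbws.com, which may differ from the files available when this was written):

import csv
import gzip
import io
import urllib.request

URL = "https://datasets.imdbws.com/title.ratings.tsv.gz"

with urllib.request.urlopen(URL) as resp:
    with gzip.open(io.BytesIO(resp.read()), mode="rt", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            # columns: tconst, averageRating, numVotes
            if row["tconst"] == "tt0111161":  # The Shawshank Redemption
                print(row["averageRating"])
                break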
Use HTML Tidy to convert any HTML to valid XML, and then use an XML parser, perhaps with XPath, or develop your own code (which is what I do).
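A compact Python illustration of the same tidy-then-query idea: lxml.html tolerates broken HTML (playing the Tidy role) and exposes XPath directly; the sample markup is made up:

from lxml import html

doc = html.fromstring('<div class="star-box-giga-star"> 8.7 </div>')
rating = doc.xpath('//div[contains(@class, "star-box-giga-star")]/text()')
print(rating[0].strip() if rating else "N/A")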
All the answers posted cover your generic question well. I usually follow a strategy similar to the one detailed by Cosmin. I use WinInet and regexes for most of my web-extraction needs.
But let me add my two cents on the specific subquestion of extracting the IMDB rating. IMDBAPI.COM provides a query interface that returns JSON, which is very handy for this type of search.
So a very simple command-line program for getting an IMDB rating would be...
program imdbrating;

{$APPTYPE CONSOLE}

uses
  htmlutils; // third-party unit providing HttpGet and UrlEncode

// Crude extraction of a "Parm":"value" pair from the flat JSON that
// imdbapi.com returns; good enough for this quick command-line tool.
function ExtractJsonParm(Parm, h: string): string;
var
  r: Integer;
begin
  r := Pos('"' + Parm + '":', h);
  if r <> 0 then
    // Skip past '"Parm":"' and copy up to the closing quote before the comma
    Result := Copy(h, r + Length(Parm) + 4,
      Pos(',', Copy(h, r + Length(Parm) + 4, Length(h))) - 2)
  else
    Result := 'N/A';
end;

var
  h: string;
begin
  // The movie title is passed as the first command-line parameter
  h := HttpGet('http://www.imdbapi.com/?t=' + UrlEncode(ParamStr(1)));
  Writeln(ExtractJsonParm('Rating', h));
end.
If the page you are crawling is valid XML, I use SimpleXML to extract info. It works pretty well.
Resource:
Download link.