Use Apache Tika to convert documents to "sections" - apache-tika

Problem
I am looking to convert documents (primarily pdf and html) to something I can reliably search over in a search database (e.g. Elasticsearch).
To best search over long passages of text, it is often recommended to break up longer text into chunks that can be searched over instead. This gives a better chance of providing relevant results.
A common structure for this is to use hierarchical sections. An example based on Stack Overflow's Wikipedia page looks like this:
{
"sections": {
"title": "Stack Overflow",
"sections": [
{
"text": "Stack Overflow is a question and answer website for professional and enthusiast programmers. It is the flagship site of the Stack Exchange Network."
},
{
"title": "History",
"sections": [
{
"text": "The website was created by Jeff Atwood and Joel Spolsky in 2008"
},
{
"title": "Security breach",
"sections": [
{
"text": "In early May 2019, an update was deployed to Stack Overflow's development version."
}
]
}
]
}
]
}
}
What I've learnt
I stumbled upon Apache Tika after seeing it referenced in the documentation of popular cloud-based search engines (e.g. Algolia's).
Tika seems capable of translating PDF to HTML, so I can initially focus on converting HTML to "sections".
My rudimentary understanding of Tika is that converting in this fashion requires some understanding of the structure of the document. However, in my case I don't necessarily have a predictable document structure.
One area I have seen this done well is in "reader view" type plugins (e.g. Mozilla's). They tend to attempt to format html documents in predictable ways using heuristics.
Question
What is the best way to use Tika to convert generic html documents to hierarchical sections (as shown above) similar to how "reader view" works?
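To make the target structure concrete, here is a minimal sketch of the HTML to "sections" step. It does not use Tika; it uses Python's standard-library html.parser and assumes that heading tags (h1 to h6) are the only signal for hierarchy, which is a much weaker heuristic than what "reader view" plugins apply. The names SectionBuilder and html_to_sections are made up for illustration.

```python
from html.parser import HTMLParser

HEADINGS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}
BLOCKS = {"p", "div", "li", "blockquote", "td", "section", "article"}
SKIPPED = {"script", "style"}


class SectionBuilder(HTMLParser):
    """Groups text under the nearest preceding heading (assumed heuristic)."""

    def __init__(self):
        super().__init__()
        self.root = {"sections": []}
        self.stack = [(0, self.root)]  # (heading level, node) from root to current
        self.buffer = []               # text collected since the last flush
        self.heading_level = None      # set while inside an <h1>..<h6> element
        self.skip_depth = 0            # > 0 while inside <script> or <style>

    def _take_text(self):
        # Collapse whitespace and reset the buffer.
        text = " ".join("".join(self.buffer).split())
        self.buffer = []
        return text

    def flush_text(self):
        text = self._take_text()
        if text:
            self.stack[-1][1]["sections"].append({"text": text})

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif tag in HEADINGS:
            self.flush_text()
            self.heading_level = HEADINGS[tag]
        elif tag in BLOCKS and self.heading_level is None:
            self.flush_text()

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in HEADINGS and self.heading_level is not None:
            level, self.heading_level = self.heading_level, None
            node = {"title": self._take_text(), "sections": []}
            # Close every open section at the same or a deeper level.
            while self.stack[-1][0] >= level:
                self.stack.pop()
            self.stack[-1][1]["sections"].append(node)
            self.stack.append((level, node))
        elif tag in BLOCKS and self.heading_level is None:
            self.flush_text()

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.buffer.append(data)


def html_to_sections(html):
    builder = SectionBuilder()
    builder.feed(html)
    builder.close()
    builder.flush_text()  # trailing text that is not wrapped in a block element
    return {"sections": builder.root["sections"]}
```

Calling html_to_sections on Tika's HTML output would give a tree shaped like the example above, but only for documents that use headings consistently; pages without heading markup would come out as one flat list of text chunks.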

Related

Change Eve Python DOMAIN collection schema dynamically during runtime

I am using the Eve Python API Framework for MongoDB. I am writing a feature that allows my users to edit metadata sections for the documents that they are writing.
Example JSON:
{
"metadata": {
"document_type": ["story"],
"keywords": ["fantasy", "thriller"]
}
}
We have a CMS for the document editor that lets admins do things like add new metadata fields, so that authors (normal users) can add more information about their posts. For example, the site admin may want to add a field called "additional_authors", which is a list of strings. If they add this field on the frontend, we would like some way to add it to the Eve schema on the backend in real time. It is very important that this happens without a code change in Eve and without restarting Eve.
Our current hard-coded metadata schema looks like this for the document collection:
{
"metadata": {
"type": "dict",
"schema": {
"document_type": {"type": "list", "required": True},
"keywords": {"type": "list", "required": True}
}
}
}
I understand that we could take a non-strict approach and declare "metadata" as a dict whose contents are not validated. From my understanding, though, this means we would not be able to use "projection" properly: if I only wanted to return "metadata.additional_authors" for all documents through my API call, I would not be able to do so. It would also mean handling the required check ourselves in hooks instead of relying on the built-in Eve schema check.
Is there any way around this issue, essentially a dynamic schema based on a MongoDB document in which we store the entire collection configuration dict, and which takes effect without restarting the server? Even if this means adding a hook to the new schema_dict collection and calling some internal Eve function, I am all ears.
Thank you ahead of time for your help.
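To illustrate the schema-building half of this idea, here is a minimal sketch in plain Python. It assumes the admin-defined fields are stored as simple records (the keys "name", "type", "item_type" and "required" are hypothetical, as is the function name build_metadata_schema) and merges them into the hard-coded schema shown above. How the resulting dict is handed to a running Eve instance is deliberately left out, since that is the open part of the question.

```python
import copy

# Hard-coded part of the metadata schema, copied from the question.
BASE_METADATA_SCHEMA = {
    "document_type": {"type": "list", "required": True},
    "keywords": {"type": "list", "required": True},
}

# Assumption: only these schema types may be created through the CMS.
ALLOWED_TYPES = {"string", "list", "integer", "boolean", "dict"}


def build_metadata_schema(extra_fields):
    """Merge admin-defined field records into the base schema.

    extra_fields is a list of dicts such as
    {"name": "additional_authors", "type": "list", "item_type": "string"},
    for example loaded from the schema_dict collection (hypothetical layout).
    """
    schema = copy.deepcopy(BASE_METADATA_SCHEMA)
    for field in extra_fields:
        name = field["name"]
        if name in BASE_METADATA_SCHEMA:
            raise ValueError("cannot override built-in field: %s" % name)
        if field["type"] not in ALLOWED_TYPES:
            raise ValueError("unsupported type: %s" % field["type"])
        rule = {
            "type": field["type"],
            "required": bool(field.get("required", False)),
        }
        # For lists, optionally constrain the type of each item.
        if field["type"] == "list" and "item_type" in field:
            rule["schema"] = {"type": field["item_type"]}
        schema[name] = rule
    return {"metadata": {"type": "dict", "schema": schema}}
```

Because the result keeps "metadata" as a dict with explicit sub-fields, a field such as "metadata.additional_authors" stays a named part of the schema rather than an unvalidated blob.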

Alexa Smart Home Skills IoT Alexa.PlaybackController

I have an IoT device that needs to support various operations, one of which is next from the Alexa.PlaybackController. My device is a multimedia device and requires many of the other Controllers as well. I'm including the Alexa.PlaybackController in the discovery response for my devices like so:
{
"type": "AlexaInterface",
"interface": "Alexa.PlaybackController",
"version": "3",
"supportedOperations": ["Next"]
}
I've also tried:
{
"type": "AlexaInterface",
"interface": "Alexa.PlaybackController",
"version": "3",
"properties": {
"supported": [
{"name": "next"}
]
}
}
but neither works. I get a schema error on CloudWatch:
is not valid under any of the given schemas
Looking below at the schema, I see that PlaybackController indeed is not included inside the schema. However, all of the documentation makes it seem like this should be trivial. I'm wondering if I need to include something else to indicate that playback is something that I need.
Is PlaybackController special in some way and unable to be included in combination with other Controllers? I've tried googling about this schema error but it's too vague and nothing's coming up.
Any help would be much appreciated!
__
EDIT:
I see now that video devices seem to get a different set of available Controllers, but there is still reference to using PlaybackController in a lot of places around the regular Smart Home Skill for entertainment devices. Really hope that it's possible!
I should probably have figured this out sooner. I'm using the Python validation class provided by Amazon. It turns out that the schema from the same repo simply doesn't include any reference to Alexa.PlaybackController. Therefore, the validation fails every time with the error about mismatching schemas. Maybe they've added some controllers recently and forgot to update the schema.
I submitted an issue to the Smart Home repo here: https://github.com/alexa/alexa-smarthome/issues/62
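A quick way to confirm this kind of gap is to search the loaded schema for the interface name before blaming the discovery response. The helper below is a sketch that makes no assumption about the schema's layout: it walks any nested JSON-like structure and reports whether a given string appears as a key or a value. The function name schema_mentions is made up for illustration.

```python
def schema_mentions(schema, needle):
    """Return True if `needle` occurs as a dict key or string value anywhere
    inside a nested structure of dicts, lists and scalars."""
    if isinstance(schema, str):
        return schema == needle
    if isinstance(schema, dict):
        return any(
            key == needle or schema_mentions(value, needle)
            for key, value in schema.items()
        )
    if isinstance(schema, (list, tuple)):
        return any(schema_mentions(item, needle) for item in schema)
    return False
```

With the schema file parsed by json.load, schema_mentions(schema, "Alexa.PlaybackController") returning False would point at the schema rather than at the response being validated.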

How do you format data so that it can be easily consumed by MV* frameworks like Angular, Backbone, Ember, etc.?

I'm asking this question to fill a hole in my knowledge, myself having historically been primarily a front-end developer with little concern for server-side code for the longest time. I basically need some way to structure my data so that all relevant information from multiple tables in my database all exist in one place. So, let's say I have a user profile page for a Rails-based site that will use Angular.js on the client. My Angular code might expect a data model like this:
var user = {
"first_name": "Arkady",
"last_name": "Dracul",
"courses": [
{
"name": "Intro to Chemistry",
"id": "CH101"
},
{
"name": "Intro to Computer Science",
"id": "CS101"
},
{
"name": "Intro to Whatever",
"id": "W101"
}
],
"clubs": [
{
"name": "Salsa",
"id": "SDA"
},
{
"name": "Tango",
"id": "TDA"
}
]
}
How on earth do I actually get the data from the various tables in my database to come out structured like this? Mind you, I'm guessing (!) that I may need to have different data models for different views but am uncertain as to whether that would be a good practice if two views are mostly similar. Really, I'm not sure how to go about structuring data for consumption by the front end. Apart from any answers you provide here, are there any books that provide useful beginner-/intermediate-level information for someone like myself?
I'm not sure what you are asking, but if you are on Rails, you can use someUser.to_json to convert your database object to JSON.
Besides this, if you are trying to implement an API with Rails, I strongly recommend grape https://github.com/intridea/grape. It is in active development, and I love it!
You can also build JSON views along with grape using Rabl https://github.com/nesquena/rabl or json_builder https://github.com/dewski/json_builder
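Independent of the framework, the underlying step is the same: query each related table for the user and assemble the rows into one nested structure before serializing it. The sketch below shows that step in Python with the standard-library sqlite3 module rather than Rails. The table and column names (users, courses, enrollments, clubs, memberships) are assumptions made for illustration, not taken from the question.

```python
import json
import sqlite3


def load_user_profile(conn, user_id):
    """Build the nested profile structure for one user, or None if not found."""
    row = conn.execute(
        "SELECT first_name, last_name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    if row is None:
        return None
    first_name, last_name = row

    # Courses reach the user through a join table (many-to-many).
    courses = conn.execute(
        "SELECT c.name, c.id FROM courses c "
        "JOIN enrollments e ON e.course_id = c.id "
        "WHERE e.user_id = ? ORDER BY c.id",
        (user_id,),
    ).fetchall()

    # Clubs follow the same pattern with their own join table.
    clubs = conn.execute(
        "SELECT c.name, c.id FROM clubs c "
        "JOIN memberships m ON m.club_id = c.id "
        "WHERE m.user_id = ? ORDER BY c.id",
        (user_id,),
    ).fetchall()

    return {
        "first_name": first_name,
        "last_name": last_name,
        "courses": [{"name": name, "id": cid} for name, cid in courses],
        "clubs": [{"name": name, "id": cid} for name, cid in clubs],
    }


def profile_json(conn, user_id):
    # Serialize the nested structure for the client.
    return json.dumps(load_user_profile(conn, user_id))
```

A view that needs less data can reuse the same function and drop keys before serializing, which is one way to keep two similar views on a single data model.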

how to view epub documents as pages according to printed counterpart?

In popular desktop ebook viewers like calibre, FBreader or Cool Reader, I'm missing a feature to show ebooks in the same pagination as their printed counterparts. Some people (here, too) claim that epub does not have a page concept (e.g. at how to implement 'page break' in epub reader).
But this is not true. From http://www.idpf.org/accessibility/guidelines/content/xhtml/pagenum.php:
"If an ebook is produced from the same workflow as a print document, print pagination markers should be retained in the document. These markers benefit readers in mixed print/digital environments, such as a classroom, as the page numbers allow a common point of reference between the two editions." and from its FAQ-section: "Do page numbers really matter anymore? - Yes. Despite the assertions of the futurists and technophiles, print still reigns supreme. As a result, anyone in a mixed print/digital environment — using an assistive technology or not — needs a way synchronize electronic and print content."
I tested several ebook viewers with two different documents that contain page break tags, but they did not break up into pages (or I'm missing a preference option). Any help or information is highly appreciated.
You can force a page break with CSS using the property page-break-after: always, but it is not the best page layout for an ebook. For example, add a class with that style to your epub:type="pagebreak" element.
<span epub:type="pagebreak" id="page23" class="pagenumber">23</span>
.pagenumber { /* other styles */ page-break-after: always !important; }
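If you want to check which print page markers a content document actually contains, independent of any viewer, a short script can list them. This is a sketch using Python's standard-library html.parser; it assumes the markers are elements carrying epub:type="pagebreak" whose page label is either the element text or a title attribute. The names PageBreakFinder and find_page_breaks are made up for illustration.

```python
from html.parser import HTMLParser


class PageBreakFinder(HTMLParser):
    """Collects the id and label of every epub:type="pagebreak" element."""

    def __init__(self):
        super().__init__()
        self.pages = []
        self._open_tag = None  # tag name of the marker currently being read
        self._title = None     # title attribute of that marker, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # epub:type may hold several space-separated values.
        if "pagebreak" in (attrs.get("epub:type") or "").split():
            self.pages.append({"id": attrs.get("id"), "label": ""})
            self._open_tag = tag
            self._title = attrs.get("title")

    def handle_data(self, data):
        if self._open_tag is not None:
            self.pages[-1]["label"] += data

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            label = self.pages[-1]["label"].strip()
            # Fall back to the title attribute for empty marker elements.
            self.pages[-1]["label"] = label or (self._title or "")
            self._open_tag = None


def find_page_breaks(xhtml):
    finder = PageBreakFinder()
    finder.feed(xhtml)
    finder.close()
    return finder.pages
```

Running find_page_breaks over each content document of an unpacked epub shows whether the print pagination markers are present at all, which separates a problem with the file from a limitation of the viewer.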

IOS download multiple files

On the server side, there are many text files and images that I want to download.
I do not want to fetch them separately. I would like to pack them into a single file so that it is easier to use pause-and-resume capability.
Should I zip them on the server side and unzip the file on the mobile side?
Is there any API that I can use to unzip on the mobile side?
I am not sure whether my idea is correct. Is this the common way of doing this sort of thing? Thank you.
Regarding the images, frequently we download them individually, and merely update the UI as stuff is downloaded, rather than forcing the user to wait for everything to download. Or lazy downloads are nice (such as afforded by UIImageView category, like SDWebImage's UIImageView+WebCache), effectively downloading the images as they're needed, that way the user can use the app without needing to wait for everything to download.
Regarding the text, unless the files are huge, you can download them together. Rather than zipping a bunch of text files, though, I think most people would be inclined towards some JSON-formatted response that returns the text in a format your app can easily consume. There's a point (say, when the combined JSON is more than 100k), though, where combining the text files hits diminishing returns, in which case you might want to pursue individual downloads.
We need more background on what the app does and the breakdown of data (how big are the individual text files, how many, etc.) before providing meaningful counsel. But unless there is absolutely no way for the app to function at all until everything is downloaded, I'd be inclined towards separate downloads for images, and some nice JSON-format single feed for the text.
Let's assume for a second that you have some text description of a series of products, and an image associated with each. Then you might have a JSON that looks like:
[
{
"id": 1,
"description": "This product is ...",
"image_url": "http://www.yourserver.com/images/img001.png"
},
{
"id": 2,
"description": "This different product is ...",
"image_url": "http://www.yourserver.com/images/img002.png"
},
{
"id": 3,
"description": "This third product is ...",
"image_url": "http://www.yourserver.com/images/img003.png"
},
{
"id": 4,
"description": "This last product is ...",
"image_url": "http://www.yourserver.com/images/img004.png"
}
]
Your app might download that JSON (in which case it would have all of the text strings and a bunch of image URLs), and then present that in the UI, and where it needs to display the image, use the UIImageView category of SDWebImage to update the image of the UIImageView like so:
[cell.imageView setImageWithURL:[NSURL URLWithString:imageUrlString]
placeholderImage:[UIImage imageNamed:@"placeholder.png"]];
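The pattern described above (text available as soon as the feed arrives, images fetched only when first needed and then cached) can be sketched outside iOS as well. The following Python sketch is not how SDWebImage is implemented; it only illustrates the idea. The class name ProductFeed and the fetch_image callable are hypothetical, and the feed fields are the ones from the JSON example above.

```python
import json


class ProductFeed:
    """Holds the text from a product feed and loads images on demand."""

    def __init__(self, feed_json, fetch_image):
        products = json.loads(feed_json)
        # All text is usable immediately after the single feed download.
        self._descriptions = {p["id"]: p["description"] for p in products}
        self._image_urls = {p["id"]: p["image_url"] for p in products}
        self._fetch_image = fetch_image  # callable(url) -> image data (assumed)
        self._image_cache = {}

    def descriptions(self):
        return dict(self._descriptions)

    def image_for(self, product_id):
        # Download the image the first time it is requested, then reuse it.
        if product_id not in self._image_cache:
            url = self._image_urls[product_id]
            self._image_cache[product_id] = self._fetch_image(url)
        return self._image_cache[product_id]
```

The point of the design is that nothing is downloaded for products the user never scrolls to, while the descriptions are ready after one request.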

Resources