Epub: Start block on next page if it can't fit on current page

Is there a way in an epub file to say that a block should not be split across multiple pages, so that if the current page doesn't have enough room to show the block, the block starts on the next page instead of starting on the current page and spilling over to the next?
If this is not possible in general, a solution which works in iBooks and doesn't cause problems in other readers would also be useful :-}

I haven't played with them myself, but you might want to take a look at the CSS widows and orphans properties. They are supported by the 2.0.1 spec (scroll down to the paged media section). Setting the number high enough on your paragraph style might do the trick.

Use page-break-inside: avoid;. However, as others mentioned, you should not expect this to work on every platform. The only absolutely fool-proof way to get something to start on a new page on all readers is to place it in a separate XHTML file.

Related

HTML parsing: How to find the image in the document which is surrounded by the most text?

I am writing a news scraper, which has to determine the main image (thumbnail), given an HTML document of a news article.
In other words, it's basically the same challenge: How does Facebook determine which images to show as thumbnails when posting a link?
There are many useful techniques (preferring higher dimensions, smaller ratio, etc.), but sometimes after parsing a web page the program ends up with a list of similar size images (half of which are ads) and it needs to pick just one, which illustrates the story described in the document.
Visually, when you open a random news article, the main picture is almost always at the top and surrounded by text. How do I implement an HTML parser (for example, using XPath / Nokogiri) that finds such an image?
There is no good way to determine this from code unless you have pre-knowledge about the site's layout.
HTML and DHTML allow you to position elements all over the page, using either CSS or JavaScript, and can do so after the page has loaded, which Nokogiri cannot see.
You might be able to do it using one of the Watir APIs after the page has fully loaded, however, again, you really need to know what layout a site uses. Ads can be anywhere in the HTML stream and moved around the page after loading, and the real content can be loaded dynamically and its location and size can be changed on the fly. As a result, you can't count on the position of the content in the HTML being significant, nor can you count on the content being in the HTML. JavaScript or CSS are NOT your friends in this.
When I wrote spiders and crawlers for site analytics, I had to deal with the same problem. Because I knew what sites I was going to look at, I'd do a quick pre-scan and find my landmark tags, then write some CSS or XPath accessors for those. Save those with the URLs in a database, and you can quickly fly through the pages, accurately grabbing what you want.
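To illustrate that pre-scan idea, here is a minimal sketch in Python with lxml (not the original spider code); the per-site XPath "landmarks" are hypothetical and would normally live in a database next to the URLs:

    import lxml.html

    # Hypothetical pre-scanned "landmark" selectors, keyed by site.
    # In practice these would be stored in a database alongside the URLs.
    LANDMARK_XPATHS = {
        "news.example.com": "//div[@id='article-body']//img[1]/@src",
        "other.example.org": "//figure[contains(@class, 'lead')]//img/@src",
    }

    def main_image_src(site, html):
        xpath = LANDMARK_XPATHS.get(site)
        if xpath is None:
            return None  # unknown layout: fall back to a generic heuristic
        matches = lxml.html.fromstring(html).xpath(xpath)
        return matches[0] if matches else None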
Without some idea of the page layout, your code is completely at the mercy of the page-layout people and of anything that modifies the locations of the page's elements.
Basically, you need to implement the wet-ware inside your brain in code, along with the ability to render the page graphically so your code can analyze it. When you, as a user, view a page in your browser, you are using visual and contextual clues to locate the significant content. All that contextual information is what's missing and what you'll need to write.
If I understand you correctly, your problem lies less with parsing the page than with implementing logic that successfully decides which image to select.
The first step, I think, is to decide which images are news images and which are not (ads, for example).
You can find that out by reading the image URL (the src attribute of the img tag) and checking its host against the article's host: the middle part ("nytimes" in your example) should be the same.
The second step is to decide which of these is the most important one. For that you can use the image's size in the article, its position on the page, etc. For step two you will have to experiment: tweak your algorithm until it produces the best results for most news sites.
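As a rough illustration of those two steps, here is a sketch in Python with lxml (rather than Nokogiri); the host check and the ranking by position and declared size are just the heuristics described above, with an arbitrary ordering:

    from urllib.parse import urljoin, urlparse
    import lxml.html

    def _declared(img, attr):
        """Best-effort integer from a width/height attribute, which may be missing."""
        try:
            return int(img.get(attr) or 0)
        except ValueError:
            return 0

    def candidate_images(article_url, html):
        """Step 1: drop images served from third-party hosts; step 2: rank the rest."""
        doc = lxml.html.fromstring(html)
        host_parts = urlparse(article_url).netloc.split(".")
        # the "middle part" of the host, e.g. "nytimes" in "www.nytimes.com"
        site_name = host_parts[-2] if len(host_parts) >= 2 else host_parts[0]
        scored = []
        for position, img in enumerate(doc.xpath("//img[@src]")):
            src = urljoin(article_url, img.get("src"))
            if site_name not in urlparse(src).netloc:
                continue  # likely an ad or widget served from another host
            area = _declared(img, "width") * _declared(img, "height")
            # rank primarily by position in the document, then prefer larger declared size
            scored.append((position, -area, src))
        scored.sort()
        return [src for _, _, src in scored]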
Hope this helps

Extracting ePub Excerpt

I've read about the ePub format, standard, structure, readers, tools and available developer techniques to manipulate/convert/create ePubs, but there is no such thing (so far) as a magical function to extract a given number of characters to create an excerpt of the book. And that's precisely what I'm looking for: a way to extract the first X words of an ePub.
The first approach I'm considering (not my favorite, btw) is creating a parser to read all the ePub metadata and start parsing the XML files in the right order until I have enough words to create the excerpt of a given ePub (I would appreciate some feedback in this direction).
The second way (which I can't find so far) is an existing tool/function or parser (in any language) which returns (hopefully) the plain text of the ePub, so I can collect the first X words in order to create my excerpt.
Do you know about any tool which can help me achieve the second option?
You should have a look at Apache Tika: http://tika.apache.org/
You can use it from the command line, as a Java library, or even in server mode to extract text from an ePub.
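For example, here is a minimal sketch that shells out to the Tika app jar from Python; the jar filename/path is an assumption, so adjust it to your install:

    import subprocess

    TIKA_JAR = "tika-app.jar"  # assumed location of the downloaded Tika app jar

    def epub_to_text(epub_path):
        # --text asks Tika to print the plain text it extracts from the file
        result = subprocess.run(
            ["java", "-jar", TIKA_JAR, "--text", epub_path],
            capture_output=True, text=True, check=True)
        return result.stdout

    def excerpt(epub_path, max_words=200):
        return " ".join(epub_to_text(epub_path).split()[:max_words])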
Hope this will help,
F.
Jose,
I'm not aware of any tool to do what you want. Let me comment on your first approach, though. If you do find a tool I hope these comments allow you to evaluate it.
I think your approach is fine and, if you want to do a good job of creating an excerpt, you may want to own this step anyway. I would suggest the following (a rough Python sketch of these steps appears further down):
grab the OPF file and look for a GUIDE section. If a GUIDE section exists, check the types that are given. Some are probably not relevant for an excerpt (cover, title-page, copyright-page). Many books will not have the types explicitly stated, but this should help where they do.
now go through the files in sequence in the SPINE section, excluding anything that is irrelevant, and read through enough XHTML files to get your excerpt.
while in the OPF file grab a bunch of metadata if this is relevant for the excerpt (title, creator, date are mandatory, I think, and some authors will also put in a whole bunch of other metadata such as keywords).
If you are creating a mini-EPUB with this excerpt you will need to pick up any CSS, Audio, Video, Image and Custom Font files that get referenced in the XHTML files used to make your excerpt. You may even choose to use the original cover file for the cover file of your excerpt epub.
If you are working with fixed-layout books with fun stuff like Read Aloud AND you want to create a mini-EPUB as an excerpt, you may be better off going with a page count rather than a word count. Don't forget to include any SMIL files in your excerpt, and to make it look nice: (i) don't split a two-page spread, and (ii) make sure that the first page is odd-numbered if it was odd in the original, or even-numbered if it was even; to do this you may need to add a blank filler page (get the odd/even wrong and subsequent two-page spreads won't be facing each other).
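A very rough Python sketch of the OPF/spine walk described above (EPUB 2 style, skipping the GUIDE filtering and any mini-EPUB packaging, and using crude tag stripping instead of real XHTML parsing):

    import re
    import zipfile
    import xml.etree.ElementTree as ET

    NS = {
        "cn": "urn:oasis:names:tc:opendocument:xmlns:container",
        "opf": "http://www.idpf.org/2007/opf",
    }

    def epub_excerpt(path, max_words=200):
        """Walk the spine in order and collect text until max_words is reached."""
        words = []
        with zipfile.ZipFile(path) as z:
            # META-INF/container.xml points at the OPF package file
            container = ET.fromstring(z.read("META-INF/container.xml"))
            opf_path = container.find(".//cn:rootfile", NS).get("full-path")
            opf = ET.fromstring(z.read(opf_path))
            opf_dir = opf_path.rsplit("/", 1)[0] + "/" if "/" in opf_path else ""
            # the manifest maps ids to file names; the spine gives the reading order
            items = {i.get("id"): i.get("href")
                     for i in opf.findall(".//opf:manifest/opf:item", NS)}
            for itemref in opf.findall(".//opf:spine/opf:itemref", NS):
                href = items.get(itemref.get("idref"))
                if not href:
                    continue
                xhtml = z.read(opf_dir + href).decode("utf-8", errors="ignore")
                # crude: strip tags; a real version would parse the XHTML properly
                words.extend(re.sub(r"<[^>]+>", " ", xhtml).split())
                if len(words) >= max_words:
                    break
        return " ".join(words[:max_words])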
I hope that helps.

LaTeX rendering needs too much disk space in MediaWiki?

Not totally sure that Stackoverflow is the best place to ask this question, but since I see a slew of other MediaWiki questions that have already been posted, I suppose my question is appropriate.
My understanding is that MediaWiki, in addition to storing a copy of all revisions of all images, will also store all revisions of all rendered LaTeX. This means that as I am editing a page and clicking "preview" to view my changes, each change of the embedded LaTeX will produce its own separate file even though I am only saving the page once!
This is from reading
MediaWiki Manual: TeX Temporary Files
My question is this: how can people host a reasonably sized MediaWiki that supports LaTeX without producing an enormous number of files and losing significant disk space?
The above link suggests the following, inelegant solution:
The images can be manually deleted, since the wiki can regenerate them, but if you do you'll want to fix the database as well:
• Clear the affected entries in the math table, or the wiki will think it's already rendered those bits
• If using file caching, do one of the following to invalidate the cached pages or visits by anon users won't trigger regeneration of the images:
•• remove all (affected) pages from the cache (consider grep)
•• Update cur_touched fields to present time for affected entries (check for "<math>" in cur_text)
•• Update the global $wgCacheEpoch timestamp in LocalSettings, forcing all cached pages to be regenerated without going to the bother of deleting anything.
The third suggestion to change $wgCacheEpoch seems the most straightforward but also the least elegant.
Failing an elegant solution, would anyone be able to clarify how on Earth I can accomplish this? Is there not a PHP script in the maintenance directory that can do it?
You might want to try http://www.mediawiki.org/wiki/Extension:MathJax (client-side JavaScript solution) instead of the default server-side approach.
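If you do stay with the default server-side rendering and go the manual clean-up route quoted in the question, it can be scripted. A minimal sketch, assuming an old texvc-style setup where rendered formulas are PNG files under images/math and the table is literally named math (both assumptions - check your install):

    import glob
    import os

    # Assumed default output directory for texvc-rendered formulas; adjust to your wiki.
    MATH_DIR = "/var/www/mediawiki/images/math"

    removed = 0
    for png in glob.glob(os.path.join(MATH_DIR, "*.png")):
        os.remove(png)
        removed += 1
    print("removed %d rendered formulas" % removed)

    # After deleting the files, clear the corresponding rows so the wiki re-renders them:
    #     TRUNCATE TABLE math;
    # and, if file caching is on, bump $wgCacheEpoch in LocalSettings.php so cached
    # pages are regenerated on the next visit.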

TeX: Add blank page after every content page

I'm currently writing my bachelor thesis and my university wants a one-sided print. The printing and binding will be done by a professional print company. They only accept two-sided manuscripts.
Because of that I need to add a blank page after every page of content. I don't want to do this manually using \newpage or \clearpage because there are too many pages. Is there any, maybe low-level, TeX command or package to do this? Or can you suggest another tool that does this without breaking the PDF?
Thanks for your help!
One option you might look into is to use a double sided layout that allows separate formatting for the even vs. odd pages: e.g. the book class allows this. Then you will need to define the even pages to be blank (presumably you don't want headers printed, or the page count to increment).
An alternative (if you can't get this to look correct for what you need) would be to do the layout in single-sided mode (so that page numbering, etc. is all taken care of), then have a separate LaTeX document which includes the pages one at a time (pdfpages may be a good package to do this properly) and inserts blank pages (with no headers, etc.) in between. This may end up being more work, but if you have trouble with formatting, it may be the easier way to go.
I suspect that you'd be better off doing this by manipulating the output PDF, rather than changing the LaTeX.
For example, if you're able to print to a file on your platform, there might be options in the print dialogue to tweak this. Your PDF viewer may be able to arrange this, if only by inserting blanks every second page. Or there may be a GUI or command-line tool to do the reshuffling for you.
Having said that, I've no specific recommendations for what tool you could use. A quick look around suggests strongly that the pstops tool might be able to do something along these lines, but that only helps if you're generating your PDF from postscript.
So no recipe, I'm afraid, but this'll probably be a better direction to look.
(or, meta answer: find a different print shop, or phone again and hope you get someone who gives you a different answer!)
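One way to do that reshuffling outside of LaTeX is a small script over the finished PDF. Here is a minimal sketch using the pypdf library (the filenames are placeholders), which appends a blank page after every page:

    from pypdf import PdfReader, PdfWriter

    def interleave_blank_pages(src, dst):
        """Write dst as a copy of src with a blank page inserted after every page."""
        reader = PdfReader(src)
        writer = PdfWriter()
        for page in reader.pages:
            writer.add_page(page)
            # with no explicit size, add_blank_page reuses the size of the last page
            writer.add_blank_page()
        with open(dst, "wb") as out:
            writer.write(out)

    interleave_blank_pages("thesis.pdf", "thesis_two_sided.pdf")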

Setting up help for a Delphi app

What's the best way to set up help (specifically HTML Help) for a Delphi application? I can see several options, all of which have disadvantages. Specifically:
I could set HelpContext in the forms designer wherever appropriate, but then I'm stuck having to track numbers instead of symbolic constants.
I could set HelpContext programmatically. Then I can use symbolic constants, but I'd have more code to keep up with, and I couldn't easily check the text DFMs to see which forms still need help.
I could set HelpKeyword, but since that does a keyword lookup (like Application.HelpKeyword) rather than a topic jump (like Application.HelpJump), I'd have to make sure that each of my help pages has a unique, non-changing, top-level keyword; this seems like extra work. (And there are known HelpKeyword-related VCL bugs.)
I could set HelpKeyword, set an Application.OnHelp handler to convert HelpKeyword requests to HelpJump requests so that I can assign help by topic ID instead of keyword lookup, and add code such as my own help viewer (based on HelpScribble's code) that fixes the VCL bugs and lets HelpJump work with anchors. By this point, though, I feel like I'm working against the VCL rather than with it.
Which approach did you choose for your app?
When I first started researching how to do this several years ago, I first got the "All About help files in Borland Delphi" tutorial from: http://www.ec-software.com/support_tutorials.html
In that document, see the section "Preparing a help file for context sensitive help" (which in my version of the document starts on page 28). It describes a nice numbering scheme you can use to organize your numbers into sections, e.g. starting with 100000 for your main form and continuing with 101000 or 110000 for each secondary form, etc.
But then I wanted to use descriptive string IDs instead of numbers for my Help topics. I started using THelpRouter, which is part of EC Software's free Help Suite at: http://www.ec-software.com/downloads_delphi.html
But then I settled on a Help tool that supported string IDs directly for topics (I use Dr. Explain: http://www.drexplain.com/), so now I simply use HelpJump, e.g.:
Application.HelpJump('UGQuickStart');
I hope that helps.
We use symbolic constants. Yes, it is a bit more work, but it pays off. Especially because some of our dialogs are dynamically built and sometimes require different help IDs.
I create the help file, which defines the help topic IDs, and then go around the forms and set their HelpContext values to match. Since the level of maintenance needed is very low - the form is unlikely to change its help context unless something major happens - this works just fine.
We use Help&Manual - it's a wonderful tool, outputting almost any format you could want (DOC, RTF, HTML, PDF), all from the same source. It will even read in (or paste from) RTF, e.g. from MS Word. It uses topic IDs (strings), which I just keep a list of, and I manually put each one into a form (or class) as it suits me. Sounds difficult, but trust me: you'll spend far longer hating the wrong authoring tool. I spent years finding it!
Brian
