How to download linked PDF files from a website? - pdf-scraping

I want to download hundreds of PDF documents from a site. I have tried tools such as SiteSucker, but they do not work, because there appears to be some "separation" between the files and the page that links to them. I don't know how to describe this better, since I don't know much about website programming or scraping. Any advice on what this could be and how to circumvent it?
More specifically, I am trying to download PDFs of UN resolutions, stored on pages like this one: http://www.un.org/depts/dhl/resguide/r53_en.shtml
There appears to be a built-in "search function" on the UN site, which keeps naive scrapers like SiteSucker from working as intended.
Are there other tools that I can use?

Clicking a link on the page you mentioned redirects to an HTML page composed of two frames. The first one is the "header", and the second one loads a page that generates the PDF file and embeds it. The URL of the PDF file is hard to guess. I don't know of a free tool that could scrape this type of page.
Here is an example of the URL in the second frame that leads to the PDF file:
http://daccess-dds-ny.un.org/doc/UNDOC/GEN/N99/774/43/PDF/N9977443.pdf?OpenElement
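Since each link leads to a frameset rather than to the PDF itself, a script would have to read the frame sources and follow them. Below is a minimal Python sketch of that one step, using only the standard library. The sample markup, the class name, and the helper names are hypothetical; the real UN pages may be structured differently and may need cookies or extra requests.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndFrameCollector(HTMLParser):
    """Collects href values of <a> tags and src values of <frame>/<iframe> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.frames = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag in ("frame", "iframe") and attrs.get("src"):
            self.frames.append(urljoin(self.base_url, attrs["src"]))


def collect(html, base_url):
    """Return (links, frames) found in html, resolved against base_url."""
    parser = LinkAndFrameCollector(base_url)
    parser.feed(html)
    return parser.links, parser.frames


def looks_like_pdf(url):
    # Ignore the query string (e.g. "?OpenElement") when checking the extension.
    return url.split("?", 1)[0].lower().endswith(".pdf")


# Hypothetical frameset markup, for illustration only.
SAMPLE_FRAMESET = """
<frameset rows="100,*">
  <frame src="header.html">
  <frame src="http://daccess-dds-ny.un.org/doc/UNDOC/GEN/N99/774/43/PDF/N9977443.pdf?OpenElement">
</frameset>
"""

links, frames = collect(SAMPLE_FRAMESET, "http://www.un.org/x/")
pdf_urls = [url for url in frames if looks_like_pdf(url)]
```

In a real run, each page would be fetched (for example with urllib.request.urlopen), passed through collect, and the frame URLs followed until one points at a PDF. This has not been tried against the live site.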

Related

Does ePub restrict HTML to only some subset?

I was thinking about creating an ePub reader. All the ePub files I have seen so far seemed very simple: just text paragraphs with some big font for the title, and some rectangular illustration images. So, I thought ePub provides only simple ways to describe the text content.
But it seems that an ePub file contains lots of HTML and CSS. I opened a sample ePub and it contained text in <p> with the class attribute. Does that mean it can basically be like a website archive? Can the author use any advanced formatting/layout feature that is used when creating an HTML website? If so, I would have to implement a whole web browser to create an ePub reader.
Or is the HTML allowed in ePub somehow restricted to certain HTML tags and attributes, like the HTML that is allowed when writing on an online forum?
PS: I did some research of my own after posting this, and my conclusion is that it is the former. I tried some well-known ePub apps on the Android market, and they all seem odd in terms of GUI (meaning probably non-native). While there does not seem to be a definitive way to know whether an app is native or a web app, one trick was enabling layout boundaries: those apps show no boundaries inside the ePub view itself, which means it is probably a web-view.
I searched GitHub for ePub viewers, and they all seem to be using JavaScript or a web-view, including this Android ePub viewer.
So those ePub apps are probably just parsing the metadata files in the ePub format, and for the rendering of the book itself they delegate to the web-view and use some sort of JavaScript framework to add a UI on top of it.
If someone knows better, please correct me.
My understanding of previous ePub specs is that it is a web archive of sorts. A compressed archive consisting of metadata, fonts, images, and content.
It used to be that this content was only in a specially-flavored XHTML format, but it looks like they've also added SVG content documents. I've admittedly lost track of the ePub spec changes (I didn't realize they had merged efforts with the W3C), but hopefully the spec links above can give an idea of what's different between a standard html5 web page and what epub expects.
EDIT: I should also mention that a lot of the readers I worked with back in the day had the bad habit of stripping out formatting and just presenting text (not even text with embedded fonts -- a big no-no for non-English texts). Not sure if this was the reader software being "robust" and acting against ePub formatting that would break their app, or something else.
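To make the "web archive of sorts" description concrete, here is a minimal Python sketch that opens an ePub as a ZIP file, follows META-INF/container.xml to the package document, and lists what the manifest declares. The function name is hypothetical, and the sketch assumes a well-formed, unencrypted file; it does no rendering at all.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def list_manifest_items(epub_file):
    """Return (path, media_type) pairs from the package manifest of an ePub."""
    with zipfile.ZipFile(epub_file) as archive:
        # container.xml points at the package (.opf) document.
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        opf_path = rootfile.get("full-path")

        # The manifest lists every content document, stylesheet, image, font.
        package = ET.fromstring(archive.read(opf_path))
        base = posixpath.dirname(opf_path)
        items = []
        for item in package.findall("opf:manifest/opf:item", OPF_NS):
            path = posixpath.join(base, item.get("href"))
            items.append((path, item.get("media-type")))
        return items
```

Calling list_manifest_items("book.epub") on a typical file would show XHTML content documents next to CSS, images, and fonts, which is why readers tend to hand the actual rendering to a web-view.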

PDF JS - Lazy load?

It seems pdf.js itself issues byte-range requests for the whole PDF file. Instead, is it possible to request only 5 pages when the PDF loads, and then load another set of 5 pages on scroll, and so on? Is there a way to achieve this using pdf.js?
Long story short - No.
PDF is not a contiguous storage format. If the PDF file is formatted for fast web view then you can get it to show page 1 whilst other pages are still streaming in, but you can't ask to start at a specific page or page range. Internally pdf uses a bunch of sections, links/pointers between them and digests. Think of them as wooden blocks with bits of string between them. You can't render anything until you have 'enough' of the file to provide the parts you need, but the organisation of the internal sections is pretty much random as far as your question is concerned.
The only way to get specific pages would be to have a server-side component split them out of the PDF file for you and make a new PDF file containing just those parts, but paging on to page 6 would mean opening a new document, etc.
Edit: There are startup params for Acrobat viewer that could allow you to set the first page to be displayed, and other viewers may offer this feature, but unless you have some very smart client-server interaction this would still require the entire PDF document to be present in the client first.
Edit 2: As per comment from #async5, PDF.js 'may' be able to do page-range loading. See this section of the PDF.js docs. But note that there are requirements on the web server that is serving the PDF file.
As described in an issue here, old versions of PDF.js did not handle linearized PDF files properly (as Peter described in a comment, when you try to load page 1000 it loads pages 1-1000).
It seems the problem was resolved at some point (I don't know the specific version), and now, when you set those params correctly (namely disableAutoFetch and disableStream both to true) and load page 1000, it loads only page 1000.
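The server-side requirement mentioned above is, as far as I understand it, support for HTTP range requests: the client asks for a slice of the file with a Range header and the server answers 206 Partial Content. The Python sketch below shows that exchange over an in-memory file. It is not PDF.js code, the function name is hypothetical, and it handles only a single range per request.

```python
def serve_range(data, range_header=None):
    """Answer a single-range HTTP request over an in-memory file.

    Returns (status, headers, body). Multi-range requests and malformed
    headers fall back to sending the whole file with status 200.
    """
    size = len(data)
    base = {"Accept-Ranges": "bytes"}
    full = (200, {**base, "Content-Length": str(size)}, data)
    if not range_header:
        return full

    unit, _, spec = range_header.partition("=")
    if unit.strip() != "bytes" or "," in spec:
        return full
    start_text, _, end_text = spec.strip().partition("-")
    try:
        if start_text == "":
            # Suffix form "bytes=-N": the last N bytes of the file.
            start = max(size - int(end_text), 0)
            end = size - 1
        else:
            start = int(start_text)
            end = int(end_text) if end_text else size - 1
    except ValueError:
        return full

    end = min(end, size - 1)
    if start > end or start >= size:
        # Nothing in the file matches the requested range.
        return 416, {**base, "Content-Range": f"bytes */{size}"}, b""

    body = data[start:end + 1]
    headers = {
        **base,
        "Content-Range": f"bytes {start}-{end}/{size}",
        "Content-Length": str(len(body)),
    }
    return 206, headers, body
```

A server that always returns the full file with status 200 is still valid HTTP, but then the viewer has no way to fetch individual parts of the document.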

What is the best way to implement an interactive book in C#?

I can upload a large document as a PDF file to a web page with no problem, but I want to use arrows to navigate the book's pages rather than load the whole book at once, as that may take a long time.
Can anyone help with how to do this in an MVC app, with or without a database? If a database is necessary, would MongoDB be a better choice? I do not want people to download the book; they should just read it online.
First, you cannot prevent people from downloading your content if you visually display it, BUT you can discourage them by making it difficult to do so.
That being said, you don't need a database to do what you want. You can use one, but it's not necessary. You can simply find a library that handles PDF, such as iTextSharp, and use it to cut the book into one PDF per page when it gets uploaded, so you end up with a bunch of small files.
Then the trick is simple: you ask the PDF library to load the file Page1.PDF (arbitrary name), extract the text, and output it nicely as HTML. When the person clicks the link for page 2, reload the page with the new PDF as the source for display.
Doing so prevents the user from seeing or having access to the PDF file itself, and if they want to download it all they will have to copy and paste every single page manually, or by code if they are a dev. Most users won't copy and paste 300 pages manually, out of laziness.
What I would personally do is create a folder named after the book for each uploaded file and name the files 1.pdf, 2.pdf, ... per page. That way, if I list the directories I get the list of all books, and if I count the files in one I know the total number of pages. That would allow me to run all of this without a database.
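The folder layout described above can be sketched in a few lines. The example is in Python for brevity rather than C#, and the function names and the library root are hypothetical; the per-page PDFs are assumed to have been split out already.

```python
from pathlib import Path


def list_books(library_root):
    """Every sub-folder of the library root is one book."""
    return sorted(p.name for p in Path(library_root).iterdir() if p.is_dir())


def page_count(library_root, book):
    """Count files named <number>.pdf inside the book's folder."""
    folder = Path(library_root) / book
    return sum(1 for p in folder.glob("*.pdf") if p.stem.isdigit())


def page_path(library_root, book, page):
    """Return the file to serve for a page, rejecting out-of-range numbers."""
    total = page_count(library_root, book)
    if not 1 <= page <= total:
        raise ValueError(f"page {page} is outside 1..{total}")
    return Path(library_root) / book / f"{page}.pdf"
```

The next and previous arrows then only need the current page number and page_count to know when to stop. In an MVC app the same lookups could be done with the directory and file listing calls in System.IO.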

How to implement an interactive PDF in iOS

I'm hitting my head against the wall here.
My client provided a PDF with buttons (when the user taps a button, it loads the next page, the previous page, etc.).
These buttons work only when the PDF is opened in Adobe Reader.
I tried the QLpreviewview and quickview, but they do not work; all I can do is load the PDF in the webview.
Can anyone please help me with how to load an interactive PDF in iOS?
Thanks in advance.
Have a look at PSPDFKit, it is the most advanced framework I've found for PDFs in iOS. They have an impressive list of customers as well.
It is a bit pricey, though, but you have the option to get the source code too if you need to modify anything. It could be worth it if your client needs that kind of performance and other features as well.
(I am not in any way affiliated with PSPDFKit)
The limitations are due to the capabilities (or non-capabilities) of the PDF viewer used.
Currently the leading PDF viewer on iDevices is PDFExpert by Readdle. Adobe Reader for iDevices is weaker, but can deal to some extent with form elements.
For page navigation etc. you might use links instead of button fields (as far as you can live with the capabilities of links, and not use JavaScript). Links are said to be handled properly with many PDF viewers.
You may have to require certain PDF viewers at the instructional level, because you have no control over the viewer the actual user has. And, as you noticed, many PDF viewers are simply too dumb to deal with active elements.
Another approach would be looking at PDF-to-HTML5 converters, and serve HTML5 from a server.

iOS display partial download of pdf, only first page

Is it possible to display a pdf from a partial download?
I need only the first page of a PDF for my app. The problem is that all the PDFs online are 25 MB or more in size. Optimizing for the app is not an option :(
The entire PDF will need to be downloaded to display and save it, but I want to show a preview first.
A similar question, but for android:
How to Display first page of PDF before downloading is completed
I do understand downloading data in iOS, but how can I tell where in the PDF's data the page ends, so that I can display just that?
Yes, you can do this, but the PDF needs to be pre-constructed in a linearized format. This is part of the PDF specification and is sometimes known as fast web view.
Linearized PDF is the same as normal PDF but the objects in the document are ordered in a particular way and with certain extra information which makes it possible to work with partial data.
In particular the objects for the first page are included at the start of the file specifically so that the first page can be displayed quickly.
So I see no reason you shouldn't download the objects at the start of the PDF and use those to display the first page. You could use the hint tables for fast access to selected other pages but that would be quite complicated.
However the essence is that you need to pick up the group-one objects for the first page. These should run from the "%PDF" header through to the first "%%EOF". I'm not sure whether your environment will complain about the missing (but not required) objects but if it does you will need to blank them out on a binary level so that you have an internally consistent page one PDF.
For full details on PDF linearization see the Adobe PDF Specification.
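The "%PDF" through first "%%EOF" idea above can be sketched as a byte search over whatever has been downloaded so far. The example is in Python rather than iOS code, and the function names are hypothetical. Whether a given viewer accepts the truncated chunk is not guaranteed, as the answer itself notes.

```python
def looks_linearized(data):
    """Linearized files declare a /Linearized key near the start of the file."""
    return b"/Linearized" in data[:1024]


def first_page_chunk(data):
    """Return bytes from the %PDF header through the first %%EOF.

    Returns None while the download has not yet reached the first %%EOF.
    """
    start = data.find(b"%PDF")
    if start == -1:
        return None
    eof = data.find(b"%%EOF", start)
    if eof == -1:
        return None
    return data[start:eof + len(b"%%EOF")]
```

During a streaming download, first_page_chunk can be called on the accumulated buffer after each received block; once it returns something other than None, that chunk is the candidate to hand to the PDF renderer while the rest keeps downloading.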
My answers may feature concepts based around ABCpdf .NET. It's what I work on. It's what I know. :-)
