I am building an iPad application that displays PDFs, and I'd like to display the table of contents and let the user navigate to the relevant pages.
I have invested several hours in research at this point, and it appears that since PDFKit is [not supported in iOS], my only option is to parse the PDF metadata manually.
I have looked at several solutions, but all of them are silent on one point - how to associate a page in the "outline" metadata with the real page number of the item. I have examined my PDF document with [the Voyeur tool] and I can see the outline in the tree.
[This solution] helped me figure out how to navigate down the Outline/A/S/D tree to find the "Dest" object, but it performs some kind of object comparison using [self.pages indexOfObjectIdenticalTo:destPageDic] that I don't understand.
I have read the [official PDF spec from Adobe], and section "12.3.2.3 Named Destinations" describes the way that an outline entry can point to a page:
"Instead of being defined directly with the explicit syntax shown in Table 151, a destination may be referred to indirectly by means of a name object (PDF 1.1) or a byte string (PDF 1.2)."
And it continues with this passage, which is utterly incomprehensible to me:
"The value of this entry shall be a dictionary in which each key is a destination name and the corresponding value is either an array defining the destination, using the syntax shown in Table 151, or a dictionary with a D entry whose value is such an array."
This refers to page 366, "12.3.2.2 Explicit Destinations" where a table describes a page: "In each case, page is an indirect reference to a page object"
So is the result of CGPDFDocumentGetPage or CGPDFPageGetDictionary an "indirect reference to a page object"?
I found a [thread on lists.apple.com] that discusses this. [This comment] implies that you can take the address (in memory?) of the CGPDFPageGetDictionary result for a given page and compare it to the pages in the "Outline" tree of the PDF metadata.
However, when I look at the addresses of page objects in the Outline tree and compare them to those addresses, they are never the same. The line used in that thread, "TTDPRINT(@"%d => %p", k+1, dict);", prints "dict" as a pointer in memory. There's no reason to believe that an object returned there would be the same as one returned somewhere else. They'd be in different places in memory!
My last hope was to look at the source code of Apple's command line "outline" tool [mentioned in this book] (as [suggested by this thread]), but I can't find it anywhere.
Bottom line - does anyone have some insight into how PDF outlines work, or know of some open source code (preferably Objective-C) that reads PDF outlines?
ARGG: I had all kinds of links posted here, but apparently a new user can only post one link at a time
The dictionary of a page returned by CGPDFDocumentGetPage (obtained with CGPDFPageGetDictionary) is the same object as the indirect page reference you get when resolving a destination in an outline item. Both are dictionaries, and you can compare them using ==. When you have a CGPDFDictionaryRef that you want to know the page number of, you can do something like this:
CGPDFDocumentRef doc = ...;
CGPDFDictionaryRef outlinePageRef = ...;
size_t pageCount = CGPDFDocumentGetNumberOfPages(doc);
for (size_t p = 1; p <= pageCount; p++) {
    CGPDFPageRef page = CGPDFDocumentGetPage(doc, p);
    // Compare the page's dictionary, not the CGPDFPageRef itself
    if (CGPDFPageGetDictionary(page) == outlinePageRef) {
        printf("found the page number: %zu\n", p);
        break;
    }
}
An explicit destination, however, is not a page but an array whose first element is the page. The other elements give the scroll position on the page and so on.
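To make the lookup concrete, here is a minimal sketch in Python that models the idea rather than the Core Graphics API. Everything in it is an assumption for illustration: pages are plain dicts, an "indirect reference" is simply a reference to the same dict object, and the names pages, named_dests, resolve_destination and page_number are hypothetical. It shows the three forms a destination can take (a name, a dictionary with a D entry, or an explicit array) and why an identity comparison finds the page.

```python
# Toy model of a PDF: page dictionaries are plain Python dicts, and an
# "indirect reference" is modeled as a reference to the same dict object.
pages = [{"Type": "Page", "Label": n} for n in ("i", "ii", "1", "2")]

# Named destinations map a name either to an explicit destination array
# or to a dictionary whose D entry holds that array.
named_dests = {
    "chapter1": [pages[2], "XYZ", 0, 792, None],
    "chapter2": {"D": [pages[3], "Fit"]},
}


def resolve_destination(dest):
    """Return the explicit destination array for a name, dict, or array."""
    if isinstance(dest, str):      # named destination: look it up first
        dest = named_dests[dest]
    if isinstance(dest, dict):     # dictionary form: the array is under D
        dest = dest["D"]
    return dest


def page_number(dest):
    """Return the 1-based page number a destination points at, or None."""
    target = resolve_destination(dest)[0]   # first element is the page
    for number, page in enumerate(pages, start=1):
        if page is target:         # identity test, like == on the C pointers
            return number
    return None


print(page_number("chapter1"))
print(page_number([pages[0], "Fit"]))
```

The comparison uses identity (is) rather than equality because two different pages could have equal contents; what matters is that the destination and the page list refer to the very same object.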
I am trying to parse HTML files to get weather forecasts. However, when I view the source, the numbers are missing. When I inspect the element, the numbers are present. This is an example:
When I inspect the element:
As seen, the temperature is 33.2!
When I view the source:
<div class="st-otlk-temp st-otlk-box-l mapInfoBoxS bFontEn posAbsolute" tt-title="Temperature">
What is the reason for this, and how can I solve it so that I can parse the values?
Note: I would like to save the source file and then parse it.
The page source is just the static content served at the URL, whereas Inspect Element shows the live DOM, which can change dynamically (through scripts or user interaction).
I guess in this case the temperatures are filled in when the actual web page loads.
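The difference can be illustrated with a small Python sketch using only the standard library. The two HTML strings are made up for the example: saved_source stands for what "view source" shows, and rendered_dom stands for what the browser holds after scripts have run. The class and function names are hypothetical, and the depth tracking assumes the element contains no unclosed void tags such as a bare br.

```python
from html.parser import HTMLParser


class TemperatureText(HTMLParser):
    """Collect the text found inside elements with tt-title="Temperature"."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # greater than 0 while inside the target element
        self.text = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("tt-title") == "Temperature":
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text.append(data.strip())


def temperature(html):
    """Return the text inside the temperature element ('' if it is empty)."""
    parser = TemperatureText()
    parser.feed(html)
    return "".join(parser.text)


# Assumed inputs: the static source has an empty element, the live DOM does not.
saved_source = '<div class="st-otlk-temp" tt-title="Temperature"></div>'
rendered_dom = '<div class="st-otlk-temp" tt-title="Temperature">33.2</div>'

print(repr(temperature(saved_source)))
print(repr(temperature(rendered_dom)))
```

If this is what is happening, parsing a saved copy of the source will never yield the number; the file to parse would have to be the DOM after rendering, or the data the page's script fetches.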
I'm having quite a problem with viewing PDFs in iOS. I want to list all the chapters of a file for the user and, after they tap one of them, show that chapter directly in the view.
I know that I can get all chapters with:
PDFDocument* pdfDoc = [[PDFDocument alloc] initWithURL:... ];
PDFOutline* pdfOutline = [pdfDoc outlineRoot];
So far I have been using a web view, and I was even able to write my own goToPage method, but I don't know how to implement this feature. Maybe there is a library I can use for this purpose, or can I get the information from the PDFOutline object (I haven't found anything useful in the documentation)?
I'm not sure why you think you're stuck because you're very close to what you want to do. You can check how many outline items there are and get all of them:
- (NSUInteger)numberOfChildren
returns the number of outline items and
- (PDFOutline *)childAtIndex:(NSUInteger)index
can be used to retrieve each item.
Once you have such an item, you can get its destination:
- (PDFDestination *)destination
If you read the documentation and the PDF specification, you'll see that an outline item can either have a destination or an action that specifies what needs to happen when it is executed (clicked).
You'll have to interpret what the destination means and use your goToPage method to go to the correct page (PDFDestination has a "page" property that tells you exactly what page it refers to).
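The traversal described above (count the children, fetch each child, read its destination, then descend) can be sketched as a recursive walk. This is a model in Python, not PDFKit: OutlineItem is a hypothetical stand-in for an outline node, and page_index stands in for whatever page the item's destination resolves to.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OutlineItem:
    """Hypothetical stand-in for an outline node (not the PDFKit class)."""
    label: str
    page_index: Optional[int] = None    # None models an item with no destination
    children: List["OutlineItem"] = field(default_factory=list)


def flatten(item, depth=0):
    """Yield (depth, label, page_index) for every descendant, in order."""
    for child in item.children:
        yield depth, child.label, child.page_index
        yield from flatten(child, depth + 1)


root = OutlineItem("root", children=[
    OutlineItem("Chapter 1", 0, [OutlineItem("Section 1.1", 2)]),
    OutlineItem("Chapter 2", 5),
])

for depth, label, page_index in flatten(root):
    print("  " * depth + label, "-> page index", page_index)
```

The flat list is a convenient shape for a table view: each row carries the indentation level, the title to show, and the page to jump to when the row is tapped.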
I am using Yahoo pipes to make automated Twitter Searches using terms from the description fields of an RSS feed.
Pipes makes one search from each item in the feed. Each search returns a set of results, which are assigned as item.twitloop (all results).
I would like to replace the link of each item in the results with the link from the original query item.
So far I am only able to assign the original link to the first item in the results list rather than to each item.
http://pipes.yahoo.com/pipes/pipe.edit?_id=01f5f60eb8f3c22b45aa3708e5ae057a
Can anyone see where I'm going wrong?
The pipe isn't loading for me - perhaps you didn't set it as public? In any event, I have solved similar problems in the past by using the Loop module. You put the assignment into the loop (usually a string builder works well), and then have the Loop put that original link into item.link.
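Outside of Pipes, the same idea (doing the assignment inside the loop so that it runs once per result instead of once per feed item) looks roughly like this in Python. The search function and the feed data are hypothetical stand-ins, and the example.invalid URLs are placeholders.

```python
def search(term):
    """Hypothetical stand-in for the Twitter search; returns fake results."""
    return [
        {"title": f"{term} tweet {n}", "link": f"http://example.invalid/{term}/{n}"}
        for n in (1, 2)
    ]


feed = [
    {"description": "cats", "link": "http://example.invalid/feed/cats"},
    {"description": "dogs", "link": "http://example.invalid/feed/dogs"},
]

results = []
for item in feed:
    for result in search(item["description"]):
        # The assignment sits inside the inner loop, so every result
        # (not just the first one) receives the original item's link.
        result["link"] = item["link"]
        results.append(result)

for result in results:
    print(result["title"], "->", result["link"])
```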
I have an iOS App disassembly which has the following block:
There are 'greyed out' comments in the picture that are of great interest, and we want to capture them from IDAPython: for example, which selectors are used on imported Framework objects such as UIWindow, CLHeading etc. IDAPython, however, only has calls to get repeatable comments, regular comments and function comments. Any idea which idc/IDAPython function gets these 'greyed out' comments? I assume they are repeatable comments from somewhere. Thanks.
UPDATES
The greyed-out comments are repeatable comments, so I tried following the labeled address (selRef_setLastHeading on the third line) to the repeatable comment and arrived at this line:
However, when I did a RptCmt(here()) at that address, I was expecting #selector(setLastHeading:) to be returned as the comment, but it returned an empty string.
The grey comments are repeating comments from the referenced item, thus for the first grey comment on the third line, if you went to the selRef_setLastHeading_ it should have a repeating comment.
If this was in a structured data block, I'd say read the address and then use that for the comment request function (sorry, no IDAPython experience, just IDC script). But as they are an operand of an instruction, for this type of thing I'd tend to write a script that has a switch based on the instruction, so you know how to decode the reference address.
I found a crude way to get the grey comments, something like the following.
widget = ida_kernwin.open_xrefs_window(pk_ea)
title = ida_kernwin.get_widget_title(widget)
ida_kernwin.close_widget(widget,0)
print(title)
I am working on some code that scrapes a page for two css classes on a page. I am simply using the Hpricot search method for this as so:
webpage.search("body").search("div.first_class | div.second_class")
...for each item found I create an object and put it into an array. This works great except for one thing.
The search will go through the entire HTML page and add an object into an array every time it comes across '.first_class', and then it will go through the document again looking for '.second_class'. The final array therefore contains all of the searched items in the wrong order, i.e. all of the '.first_class' objects, followed by all the '.second_class' objects.
Is there a way I can get this to search the document in one go and add an object into the array each time it comes across one of the specified classes, giving me an array of items in the order they are encountered on the page I am scraping?
Any help much appreciated. Thanks
See the section here on "Checking for a Few Attributes":
http://wiki.github.com/why/hpricot/hpricot-challenge
You should be able to stack the elements in the same way as you do attributes. This feature is apparently possible in Hpricot versions after 2006 Mar 17... An example with elements is:
doc.search("[@href][@type]")
OK, so it turned out I was mistaken, and this didn't do anything different from what I previously had at all. However, I have come up with a solution; whether it is the most suitable or not, I am not sure. It seems like a fairly straightforward fix for an annoying problem though.
I now perform the search for the two classes as I mentioned above:
webpage.search("body").search("[@class~='first_class']|[@class~='second_class']")
However, this still returned an array containing first all the divs with a class of 'first_class', followed by all the divs with a class of 'second_class'. So to fix this and get an array of all the items in the order they appear on the page, I simply chain the 'add_class' method with my own custom class, e.g. 'foo_bar'. This then allows me to perform another search on the page for all divs with just this one class, thus returning an array of all the items I am after, in the order they appear on the page.
webpage.search("body").search("[@class~='first_class']|[@class~='second_class']").add_class("foo_bar")
webpage.search("body").search("[@class~='foo_bar']")
Thanks for the tip. I hadn't spotted that in the documentation, and I also found another page I hadn't seen either. I have fixed this with the following line:
webpage.search("body").search("[@class~='first_class']|[@class~='second_class']")
This now adds an object into the array each time it comes across one of the above classes in the document. Brilliant!
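For comparison, the "one pass, document order" behaviour asked for in the question can be sketched without Hpricot. The example below is Python using the standard html.parser module, not Ruby, and the ClassCollector name, the sample markup and the id attributes are all made up for illustration. It visits each start tag once and records a div as soon as it carries either class, so the results come out in page order.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Record divs carrying any wanted class, in document order."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.found = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        hits = self.wanted.intersection(classes)
        if tag == "div" and hits:
            # Store the matching class and the element id for inspection.
            self.found.append((sorted(hits)[0], attributes.get("id")))


html = """
<body>
  <div class="first_class" id="a"></div>
  <div class="second_class" id="b"></div>
  <div class="other" id="x"></div>
  <div class="first_class extra" id="c"></div>
</body>
"""

collector = ClassCollector(["first_class", "second_class"])
collector.feed(html)
print(collector.found)
```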