Is it possible to extract font & spacing information from a Google Doc or PDF?

I have a 400+ page Google Doc of a poetry anthology, which I want to parse so I can build several reference indices at the end of the doc. I want to extract author names, poem titles, page numbers, and first lines. Each of these is formatted in a unique, consistent way (font size, location on page, spacing, etc.), and it would be easy to parse if I could say "look for this font size" and "read till the new line" and "skip to the next page". I could work with any data structure; I just need to get the doc in its current form into one.
Is there a way to do this? Maybe as a Google Doc or maybe if I download it as a pdf or some other form? Thanks in advance.


Google sheet embed URL documentation

Does anyone know if there is any official documentation for Google Sheets embed URL parameters?
That is, given an embed URL from Google Sheets like this:
https://docs.google.com/a/aicr.org/spreadsheet/pub?key=0AhExuVBhVYT1dGxxejBmUHAzYUhGb25veTRkdW1YekE&single=true&gid=1&output=html&gridlines=false
What do the arguments do, and
What other arguments are available, that aren't included by default?
After much digging and searching, I have found:
Some parameters don't seem to do anything (&single=true, &embedded=true)
Some parameters are declared confidently in google search results, but don't work (&gridlines=false)
Some parameters don't seem to appear in any searches I have done (&output=csv)
... and no search I have done has produced anything even remotely approaching either of:
an official, google-maintained document for embed URLs
a code view of the code that is used to parse the embed URLs
By trial and error I have found:
&key=[ID]
google sheet ID
&single=[true|false]
true: ??? (present when I have published only a single sheet)
false: ???
&gid=[#]
sheet ID ??? (present when I have published only a single sheet)
perhaps this can be used to specify a sheet and range when your entire google sheets doc has been 'published to the web' (instead of just one sheet from your doc)
&range=[CellAddress1:CellAddress2]
specify a range of cells to include, eg "B1:C20"
if 'widget=' is false or not present, suppresses display of the usual google header & footer info
if the range specified is larger than the published sheet, displays only the sheet while still suppressing the header and footer.
&embedded=[true|false]
true: ???
false: ???
this item is included in the embed code offered from within google sheets (set to "true"), but doesn't seem to have any effect.
&widget=[true|false]
true: display entire shared item. Overrides "range=". Does NOT include the google disclaimer footer.
false: include google disclaimer footer in output (unless 'range=' is also present)
&output=[html|txt|csv]
html (default): output as an html table within code that also includes Google tracking code
txt: output the content of the specified range or sheet as tab separated text
csv: output as csv
&gridlines=[???]
this apparently used to work but doesn't work for me.
To suppress gridlines in embedded sheets I set borders on all cells, then color the borders to match the sheet's background color (eg solid white borders on a white-background sheet).
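For anyone repeating this trial and error, a small helper that swaps one parameter at a time may save some typing. This is only a minimal Python sketch: `with_params` is a hypothetical name, and the code says nothing about which parameters Google actually honors; it just rewrites the query string of the example URL above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

EXAMPLE_URL = (
    "https://docs.google.com/a/aicr.org/spreadsheet/pub"
    "?key=0AhExuVBhVYT1dGxxejBmUHAzYUhGb25veTRkdW1YekE"
    "&single=true&gid=1&output=html&gridlines=false"
)


def with_params(url, **overrides):
    """Return url with the given query parameters added or replaced."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))  # existing parameters, in order
    params.update(overrides)               # overrides win over existing values
    # safe=":" keeps ranges such as B1:C20 readable in the result
    return urlunsplit(parts._replace(query=urlencode(params, safe=":")))


csv_url = with_params(EXAMPLE_URL, output="csv")
range_url = with_params(EXAMPLE_URL, range="B1:C20", widget="false")
print(csv_url)
print(range_url)
```

Loading each variant in a browser and comparing the result is still the only way I know of to find out what a parameter does.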
Here are some of the parameters I found for Google Docs (thanks goes to Joel http://obstruction.tumblr.com/post/60784440737/google-docs-url-parameters-rm-minimal-rm-full):
Google Docs URL parameters:
rm=minimal
rm=full
rm=embedded
rm=demo
rm=(render mode)
ui=2 (select the interface version)
chrome=false (full screen mode)
frameborder=(size of border)
q=(Whatever) Search Query
gid=24 (Which sheet you want to display)
widget=false
single=true
range=A2:AA26
Output=html
format=(export spreadsheet)
format=xlsx
format=csv
widget=false
width=(width)
height=(height)
viewer?
start=
channel=
ibd=
client=
I've been looking for the same thing! One more URL parameter I have found useful is
&rm=[minimal|?]
minimal: hides the top menu and cell inspector, but still shows row numbers, column letters, and the Add More Rows feature at the bottom.
This resource describes some of the parameters, though I can't vouch for its accuracy.
http://www.goopal.org/google-sites-business/google-spreadsheets/spreadsheet-output/publish-spreadsheet#TOC-Other-Export-Parameters
The most helpful list of parameters I found comes from Steegle.com.
You can use the htmlembed URL to display just a range from a Google Sheet - here's how to structure the URL
https://docs.google.com/spreadsheets/d/SpreadsheetID/htmlembed?single=true&gid=SheetID&range=D15:E15&widget=false&chrome=false&headers=false
SpreadsheetID should be the long string of letters, numbers and characters you get in the normal URL
htmlembed is for sheets you have not published: use pubhtml instead if you have chosen to publish the sheet (if you want the public to see it, it's the best way)
single: we have never been sure what it does, but we think it helps with showing only a single sheet instead of multiple sheets
SheetID is the sheet number you get in the normal URL after the ?gid= (this is not the sheet name you have specified but the automatic number that Google Sheets provides)
range lets you specify the range of cells you want to display
widget lets you choose whether to display the sheet tabs at the bottom
chrome lets you choose whether to display the spreadsheet title (& sheetname) at the top
headers lets you choose whether to display the row numbers and column letters
Source: https://www.steegle.com/google-sites/how-to/insert-websites-apps-scripts-and-gadgets/embed-google-sheet-range
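The URL structure described above can be assembled in code. The following is a minimal Python sketch under the assumptions in that description: `range_embed_url` and its arguments are hypothetical names, the parameter meanings are taken from the list above and not from any official documentation, and I am assuming pubhtml takes the same query string as htmlembed.

```python
from urllib.parse import urlencode


def range_embed_url(spreadsheet_id, sheet_gid, cell_range, published=False):
    """Build an embed URL that shows only cell_range of one sheet."""
    # Per the description above: pubhtml for published sheets, htmlembed otherwise.
    endpoint = "pubhtml" if published else "htmlembed"
    query = urlencode(
        {
            "single": "true",
            "gid": sheet_gid,
            "range": cell_range,
            "widget": "false",   # hide the sheet tabs
            "chrome": "false",   # hide the title at the top
            "headers": "false",
        },
        safe=":",  # keep the colon in ranges such as D15:E15 unescaped
    )
    return f"https://docs.google.com/spreadsheets/d/{spreadsheet_id}/{endpoint}?{query}"


url = range_embed_url("SpreadsheetID", "SheetID", "D15:E15")
print(url)
```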

Text File Index?

Good afternoon,
I want to create a text file with many different paragraphs and show a different one every day. Is there a way to create a sort of index and display a new paragraph each day?
I created the app already, and the way I am updating it is by going into my server, editing the text file, and then downloading it into the app. I have to update it every day and I don't want to do that anymore. I have about 2 years' worth of daily paragraphs that I can simply put in one text file. I'm not sure if an index is the right approach to this.
I want to be able to have a huge list of text paragraphs and then display a different one each day. Is there any way to do this? I am open to different suggestions! I just want to get it to work! Maybe someone can guide me down the right path.
Thank you!
First, start with the simplest thing that could possibly work:
Make a text file with all your paragraphs. Download the entire file into your app. In your app, split the file into paragraphs, choose one at random, and display it.
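The simple approach could look something like this. It is a minimal Python sketch (not the app's own language), and it assumes paragraphs are separated by blank lines. Since the question asks for a different paragraph each day, the sketch picks by day number instead of at random, which is a small variation on the step above; `split_paragraphs`, `paragraph_for` and the start date are made-up names and values.

```python
import datetime
import re


def split_paragraphs(text):
    """Split text on blank lines and drop empty chunks."""
    return [chunk.strip() for chunk in re.split(r"\n\s*\n", text) if chunk.strip()]


def paragraph_for(day, paragraphs, start=datetime.date(2024, 1, 1)):
    """Pick the paragraph for a given day, wrapping around at the end."""
    offset = (day - start).days
    return paragraphs[offset % len(paragraphs)]


text = "First.\n\nSecond.\n\nThird."
paragraphs = split_paragraphs(text)
print(paragraph_for(datetime.date.today(), paragraphs))
```

Swapping `paragraph_for` for `random.choice(paragraphs)` gives the purely random version.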
Now, if the above proves too slow, then consider optimising it. You could:
After downloading the text file (the first time), read through the file once and create an index with the offset of the start of each paragraph. Then, choose an index entry at random, seek to that point in the text file, read the paragraph, and display it.
Or, you could:
Create the index on the server and download it along with the text file. That saves the app from having to create the index itself.
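If it does come to that, the offset index might be sketched as follows. This is a minimal Python illustration, assuming blank-line-separated paragraphs in a UTF-8 file; `build_index` and `read_paragraph` are hypothetical names, and the demo file only stands in for the downloaded text file.

```python
import os
import random
import tempfile


def build_index(path):
    """Return the byte offset at which each paragraph starts."""
    offsets = []
    position = 0
    in_paragraph = False
    with open(path, "rb") as handle:
        for line in handle:
            if line.strip():
                if not in_paragraph:
                    offsets.append(position)  # first line of a new paragraph
                    in_paragraph = True
            else:
                in_paragraph = False          # blank line ends the paragraph
            position += len(line)
    return offsets


def read_paragraph(path, offset):
    """Seek to offset and read lines up to the next blank line."""
    lines = []
    with open(path, "rb") as handle:
        handle.seek(offset)
        for line in handle:
            if not line.strip():
                break
            lines.append(line)
    return b"".join(lines).decode("utf-8").strip()


# Throwaway file standing in for the downloaded text file.
demo_path = os.path.join(tempfile.mkdtemp(), "paragraphs.txt")
with open(demo_path, "wb") as handle:
    handle.write(b"First paragraph\nstill first.\n\nSecond paragraph.\n\n\nThird.\n")

index = build_index(demo_path)
print(read_paragraph(demo_path, random.choice(index)))
```

The same `build_index` could run on the server, with the resulting list shipped alongside the text file.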
There are probably easier/better ways to do this, but here's what I'd do...
I'd reorganize your text file into a CSV with two columns. The column on the left has the date the paragraph should be displayed (in an easy-to-parse format), and the column on the right has the actual paragraph. When the app is first launched, it goes to the web, then downloads and parses this whole file.
In your app, store these paragraphs in an NSDictionary, using the date as the key, and the paragraph as the value.
Now encode this NSDictionary to disc.
From now on, you don't need to redownload/reparse the file. You can just check in that dictionary, find the entry with the right date, and display that.
Now, ideally, you'd want your server to be able to tell your app when the file was last updated, and for your app to keep track of when it last downloaded the file. Any time the server's last update date is more recent than the app's last download date, the app should redownload, reparse, resave the file.
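The logic is the same in any language, so here is a rough Python sketch of it in place of the NSDictionary version. It assumes ISO dates (YYYY-MM-DD) in the left column; `load_schedule` and `needs_refresh` are hypothetical names, and the download and save-to-disk steps are left out.

```python
import csv
import datetime
import io


def load_schedule(csv_text):
    """Parse 'date,paragraph' rows into a dict keyed by date."""
    schedule = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        day = datetime.date.fromisoformat(row[0].strip())
        schedule[day] = row[1]
    return schedule


def needs_refresh(server_updated, last_downloaded):
    """True if the file was never downloaded or the server copy is newer."""
    return last_downloaded is None or server_updated > last_downloaded


csv_text = '2024-01-01,First paragraph.\n2024-01-02,"Second, with a comma."\n'
schedule = load_schedule(csv_text)
# Look up the paragraph for a given day; None if that day has no entry.
print(schedule.get(datetime.date(2024, 1, 2)))
```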
If you don't want to store the dates, you can simply put the paragraphs in a line-separated .txt file. When you read the file in, you can store each paragraph at a separate array index very simply by doing something like this:
NSArray *paragraphs = [myTextDocContents componentsSeparatedByCharactersInSet:
[NSCharacterSet newlineCharacterSet]];

How can I smartly extract information from an HTML page?

I am building something that can more or less extract key information from an arbitrary web site. For example, if I crawled a McDonalds page and wanted to figure out programmatically the opening and closing time of McDonalds, what is an intelligent way to do it?
In a general case, maybe I also want to find out whether McDonalds sells chicken wings, or the address of McDonalds.
What I am thinking is that I will have a specific case for time, wings, and address and have code that is unique for each of those 3 cases.
But I am not sure how I can approach this. I have the sites crawled and HTML and related information parsed into JSON already. My current approach is something like finding the title tag and checking if the title tag contains key words like address or location, etc. If the title contains those key words, then I will look through the current page and identify chunks of content that resemble an address, such as content that names cities or countries or content that has the word St or Street inside.
I am wondering if there is a better approach to look for key data, and looking for a nicer starting point or bounce some ideas and whatnot. Or even if there are good articles to read about this would be great as well.
Let me know if this is unclear.
Thanks for the help.
In order to parse such HTML pages you have to have knowledge about their structure. There's no general solution for this problem. Each webpage needs its own solution. However, a good approach would be to ensure the HTML code is valid XML too and then use XPath to access elements at known positions. Maybe there's even an XPath-like solution for standard HTML (which is not always valid XML). This way you can define a set of XPaths for each page which give you the specific elements if they exist.
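As a rough illustration of the per-page XPath idea, here is a minimal Python sketch using the standard library's ElementTree, which supports a limited XPath subset and only accepts well-formed XML. The page markup, the ids and classes, and the `extract` helper are all invented for the example; a real page would need its own set of paths.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed page standing in for one crawled site.
PAGE = """<html><body>
<div id="hours"><span class="open">06:00</span><span class="close">23:00</span></div>
<div id="address">123 Example St</div>
</body></html>"""

# One set of paths per site; these match only the made-up page above.
XPATHS = {
    "open": ".//div[@id='hours']/span[@class='open']",
    "close": ".//div[@id='hours']/span[@class='close']",
    "address": ".//div[@id='address']",
    "wings": ".//li[@class='wings']",
}


def extract(page, xpaths):
    """Return the text at each path, or None where the element is missing."""
    root = ET.fromstring(page)
    result = {}
    for field, path in xpaths.items():
        node = root.find(path)
        result[field] = node.text.strip() if node is not None and node.text else None
    return result


info = extract(PAGE, XPATHS)
print(info)
```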

How to add hyperlinks to Text columns in Fusion Tables?

As I enter data into a TEXT column in my fusion tables, I find that while I can use html formatting in the text column, the hyperlinks don't display properly in the info balloons.
For example:
<p>See specimen record at <a href="...">Caterpillars</a></p>
This is read simply as See specimen record at Caterpillars, with the link stripped out.
Is there a way to add hyperlinked text into the database?
Please note: I know it is possible to add hyperlinked text in the CONFIGURE INFO WINDOW area, but then that link appears for every marker. I want to add different links for different markers.
Thanks!
Wendy
Rod's answer is correct, but I'm guessing you are having a different problem: which columns are being displayed by the default info window? You must check the list of columns in the CONFIGURE INFO WINDOW dialog to ensure that your column with the link is checked. I think the default is only the first several columns. Don't worry about formatting the info window, just the checkboxes.
Try this:
<p>See specimen record at <a href="..." target="_blank">Caterpillars</a></p>
This will cause the link to open in a new window or tab rather than taking over the map, and should not get scrubbed from the info window HTML.
One other possible option is to add a column with the links for each info window. Then, when configuring the content of the info windows, use a template like the following:
<p>See specimen record at <a href="{<link_column_name>}">Caterpillar</a></p>
Substitute <link_column_name> with the column that contains the links.

Represent the search result by adding relevant description

I'm developing a simple search engine. If I search for something using my search engine, it produces a list of URLs related to the search query.
I want to present the search results with a small, relevant description under each resulting URL (e.g. if you search for something on Google, you can see that they provide a small description with each resulting link).
Any ideas?
Thanks in advance!
You need to store the position of each word in a webpage while indexing.
Your index should contain: the word ID, the document ID of the document containing this word, the number of occurrences of the word in that document, and all the positions where the word occurred.
For more info you can read the research paper by the Google founders:
The Anatomy of a Large-Scale Hypertextual Web Search Engine
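A toy version of such a positional index, with a snippet built from the stored positions, might look like this. It is a minimal Python sketch, not the scheme from the paper: it uses the words themselves as keys in place of word IDs, derives the occurrence count from the length of the position list, and the names `build_index` and `snippet` are made up.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Map word -> {doc_id: [positions of the word in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(position)
    return index


def snippet(documents, index, doc_id, word, radius=2):
    """Return the words around the first occurrence of word in doc_id."""
    positions = index.get(word.lower(), {}).get(doc_id)
    if not positions:
        return None
    tokens = tokenize(documents[doc_id])
    first = positions[0]
    start = max(0, first - radius)
    return " ".join(tokens[start:first + radius + 1])


documents = {"d1": "The quick brown fox jumps over the lazy dog near the river bank"}
index = build_index(documents)
print(index["the"]["d1"])        # positions of "the" in d1
print(len(index["the"]["d1"]))   # number of occurrences
print(snippet(documents, index, "d1", "fox"))
```

A real engine would store the original text offsets too, so the snippet keeps the page's punctuation and capitalization.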
You can fetch the meta content of that page and display it as a small description. Google also does this.
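Pulling the meta description out of a fetched page can be done with the standard library alone. This is a minimal Python sketch, assuming the page uses the common `<meta name="description" content="...">` tag; `MetaDescriptionParser` and `meta_description` are hypothetical names, and fetching the page is left out.

```python
from html.parser import HTMLParser


class MetaDescriptionParser(HTMLParser):
    """Remember the content of the first <meta name="description"> tag."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.description is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "description":
            self.description = attributes.get("content")


def meta_description(html):
    """Return the meta description of an HTML page, or None if absent."""
    parser = MetaDescriptionParser()
    parser.feed(html)
    return parser.description


page = '<html><head><title>Demo</title><meta name="Description" content="A short summary."></head></html>'
print(meta_description(page))
```

Pages without the tag return None, so a fallback such as the first few words of the body text is still needed.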
