Parsing a website

I want to make a program that takes a website address as user input. The program then goes to that website, downloads it, and parses the information inside, then outputs a new HTML file built from that information.
Specifically, the program will take certain links from the website, put those links in the output HTML file, and discard everything else.
Right now I just want it to work for websites that don't require a login, but later on I want it to work for sites where you have to log in, so it will have to be able to deal with cookies.
Later on I'll also want the program to be able to follow certain links and download information from those other sites.
What are the best programming languages or tools to do this?

Beautiful Soup (Python) comes highly recommended, though I have no experience with it personally.

Python.
It's fairly easy to write a simple crawler using Python's standard library, and you'll also be able to find existing Python crawler libraries on the web.
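To make that concrete, here is a minimal sketch of the kind of program the question describes, using only the standard library (urllib.request plus html.parser). The keyword filter and the output file name are invented for the example; adjust them to whatever "certain links" means in your case.

```python
# Minimal sketch: fetch a page, keep links whose text matches a keyword,
# and write them into a new HTML file, discarding everything else.
import sys
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []            # (absolute_url, link_text) pairs
        self._current_href = None
        self._current_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._current_href = urljoin(self.base_url, href)
                self._current_text = []

    def handle_data(self, data):
        if self._current_href:
            self._current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href:
            self.links.append((self._current_href, "".join(self._current_text).strip()))
            self._current_href = None

def main(url, keyword, out_path="links.html"):
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    collector = LinkCollector(url)
    collector.feed(html)
    wanted = [(href, text) for href, text in collector.links
              if keyword.lower() in text.lower()]
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("<html><body><ul>\n")
        for href, text in wanted:
            f.write(f'<li><a href="{href}">{text or href}</a></li>\n')
        f.write("</ul></body></html>\n")

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])   # e.g. python links.py https://example.com news
```

When you later need logins, the standard library's http.cookiejar.CookieJar together with urllib.request.HTTPCookieProcessor and build_opener() can carry cookies across requests, or you can switch to a third-party library such as requests.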

Related

URL structure for multilingual websites

I'm developing a SPA web app and it will support various languages. It is built with AngularJS and I am using angular-translate to provide i18n.
But I am struggling a little bit with how the URL structure should be. I do not plan on using either gTLDs or ccTLDs, so that leaves me with three options.
Use query params: ?locale=en-us
Use url paths: /en-us/page
Store the chosen locale in localStorage or a cookie
The first option is a no-go according to Google's SEO guidelines for web apps. So that leaves me with the last two options.
I have a hard time deciding which is more beneficial, though I am inclined to believe that using URL paths would probably be more crawler friendly.
P.S: Not sure if this is the best place to ask such a question either.
The second option is your safest bet: according to https://webmasters.stackexchange.com/questions/59652/what-happens-if-i-try-to-set-a-cookie-on-a-bot, cookies are ignored by bots. You can test this yourself by going to the Google Search Console and fetching your website.
As of now most crawlers ignore cookies and DO NOT execute JavaScript. This means that they usually just download the html and make their judgements from there.
Some developers get around the no javascript problem by pre-rendering parts of their content. I haven't done it personally but you might want to check out https://prerender.io/
Edit
As rolandjitsu mentioned, Google does crawl and execute JavaScript content.
You should go with the second option: provide the language tag (and, optionally, region subtags) as the first segment of the URL path.
For the simple reason that it allows you, visitors, and bots to link to specific translations.
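To illustrate why path-based locales are friendlier to crawlers and to linking, here is a small sketch; the domain, locales, and routes below are made up. Every translation gets its own plain URL, and each page can advertise its siblings with hreflang alternate links.

```python
# Sketch: path-based locale URLs plus hreflang alternates. The domain,
# locales, and routes are illustrative only.
BASE = "https://example.com"
LOCALES = ["en-us", "de-de", "fr-fr"]
ROUTES = ["", "about", "contact"]

def localized_url(locale, route):
    # e.g. https://example.com/en-us/about (home page is just /en-us)
    return f"{BASE}/{locale}/{route}".rstrip("/")

for route in ROUTES:
    urls = {loc: localized_url(loc, route) for loc in LOCALES}
    for loc, url in urls.items():
        print(url)  # every translation is a distinct, crawlable, linkable URL
        for alt_loc, alt_url in urls.items():
            # each page can point crawlers at its translations
            print(f'  <link rel="alternate" hreflang="{alt_loc}" href="{alt_url}" />')
```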

Download entire website

I want to be able to download the entire contents of a website and use the data in my app. I've used NSURLConnection to download files in the past, but I don't believe it is capable of downloading all files from an entire website. I'm aware of the app Site Sucker, but don't think there is a way to integrate its functionality into my app. I looked into AFNetworking & ASIHttpRequest, but didn't see anything useful to me. Any ideas / thoughts? Thanks.
I doubt there is anything out of the box that you can use, but the existing libraries you mentioned (AFNetworking & ASIHttpRequest) will get you pretty far.
The way this works is: you load the main website, then go through its source and find any resources that the page uses to display its contents and link to other pages. You then recursively download the contents of those resources, as well as their resources (a rough sketch follows the caveats below).
As you can imagine, there are a few caveats to this approach:
You will only be able to download files that are referenced in the source code. Hidden files, or files that aren't used by any page, will not be downloaded, as the app doesn't know of their existence.
Be aware of relative and absolute paths: ./image.jpg, /image.jpg, http://website.com/image.jpg, www.website.com/image.jpg, etc. could all link to the same image.
Keep in mind that page1.html could link to page2.html and vice versa. If you don't put any checks in place, this could lead to an infinite loop.
Check for pages that link to external websites. You probably don't want to download those: many websites link to the outside, and you could end up downloading half the Internet to an iPhone with 8 GB of storage.
Any dynamic pages (those generated by a server-side scripting language such as PHP) will become static, because they lose the server backend that provided them with dynamic data.
Those are the ones I could come up with, but I'm sure that there's more.
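As a rough sketch of the recursive approach and of two of the caveats above (the visited-set guard against link cycles, and skipping external hosts), here is the same logic written in Python for brevity; the original question targets Objective-C, so treat this as an illustration of the algorithm rather than drop-in code.

```python
# Sketch: crawl pages on the starting host only, keeping a "visited" set so
# page1 -> page2 -> page1 cannot loop forever.
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class HrefParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

def crawl(start_url, max_pages=50):
    start_host = urlparse(start_url).netloc
    visited, queue, pages = set(), [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop()
        if url in visited:
            continue                       # guards against link cycles
        visited.add(url)
        try:
            html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
        except Exception:
            continue
        pages[url] = html
        parser = HrefParser()
        parser.feed(html)
        for href in parser.hrefs:
            # urljoin normalizes ./x, /x, and absolute URLs; drop fragments
            absolute = urljoin(url, href).split("#")[0]
            if urlparse(absolute).netloc == start_host:
                queue.append(absolute)     # external hosts are skipped
    return pages

# pages = crawl("https://example.com")
```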

Web Source into NSString

How can I access a website and turn components of it into strings? For example, taking information from Facebook posts. I have done a little searching but can't find any good tutorials or anything useful.
Try looking at this tutorial. It should get you more familiar on the subject and start you off on the right track.
As it states at the beginning of the tutorial...
How to Parse HTML on iOS
Let’s say you want to find some information inside a web page and display it in a custom way in your app. This technique is called “scraping.” Let’s also assume you’ve thought through alternatives to scraping web pages from inside your app, and are pretty sure that’s what you want to do. Well then you get to the question – how can you programmatically dig through the HTML and find the part you’re looking for, in the most robust way possible? Believe it or not, regular expressions won’t cut it! Well, in this tutorial you’ll find out how! You’ll get hands-on experience with parsing HTML into an Objective-C data model that your apps can use.
http://www.raywenderlich.com/14172/how-to-parse-html-on-ios
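The tutorial itself is written in Objective-C. Purely to illustrate the idea it describes (parse the HTML into a data model instead of attacking it with regular expressions), here is a small Python sketch; the <div class="post"> markup and the Post type are invented for the example, not taken from the tutorial.

```python
# Illustration only: turn pieces of a page into plain strings / simple objects
# rather than regexing the raw HTML. Nested <div>s inside a post are not handled.
from dataclasses import dataclass
from html.parser import HTMLParser

@dataclass
class Post:
    text: str

class PostParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.posts = []
        self._in_post = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("class") == "post":
            self._in_post = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_post:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._in_post:
            self.posts.append(Post("".join(self._buffer).strip()))
            self._in_post = False

parser = PostParser()
parser.feed('<div class="post">Hello from the web</div><div class="post">Second post</div>')
print([p.text for p in parser.posts])   # ['Hello from the web', 'Second post']
```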

Making a Firefox/Chrome extension from scratch

I have a website for exchanging links, files, and so on; to put it quickly, it's my 'version' of Twitter + Megaupload.
Users add links all the time, and I would like a user to be able to sync his browser bookmarks with the ones he has on his profile on my website.
Where should I look?
Basically I need to be able to:
- Access the bookmarks file (1)
- Send the URLs to my service (2)
- Maybe add a login feature (in the future)
I was Googling about this for ages a few weeks ago and kind of gave up, because I'm OK with PHP and JS, but with these extension languages I'm very lost. So I decided to post here, which always brings positive answers.
(1) -> I don't even know where to start.
(2) -> I was thinking of having website.com/auto_import_no_confirm.php?url=[URL] and calling it in a foreach loop.
How many different languages and extension file types do I have to work with? I really need any kind of tip for point (1).
-edit-
Just found this -> https://developer.mozilla.org/En/Code_snippets/Bookmarks
which really looks like what I need, but where do I place this code?
thanks!
Might not be a bad question, but there are too many subtopics raised to answer that. (And there is too much tagspam as well. Break up your question into PHP- and Javascript-specific tasks, when you have devised the general application scheme.)
But to get started, download similar Firefox extensions (.xpi) and unzip them to inspect the general structure. You'll find example code for bookmark handling and invoking remote APIs pretty quickly. And basically you only need JavaScript for the extension itself. (It sounds like your extension does not need much UI.)
And there are many tutorials on designing Firefox addons: http://roachfiend.com/archives/2004/12/08/how-to-create-firefox-extensions/ or http://www.google.com/search?q=firefox+develop+an+xpi
The good news first: you won't need much more than JavaScript if you just want to access bookmarks and send them to a server, on either Firefox or Chrome.
But you'll still have to familiarize yourself with the browsers' APIs and learn how to develop extensions.
However, both Mozilla and Google provide all necessary information on their developer sites.
For Chrome, this is a good place to start; you'll find the API for bookmark access here.
The corresponding site for Firefox can be found here, with information on bookmark access here.
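For point (2), the receiving end can be very small. The question sketches a PHP endpoint (auto_import_no_confirm.php?url=[URL]); as a hypothetical stand-in, here is roughly what such an endpoint looks like, written in Python only for illustration. A real version would need authentication and duplicate checks. On the extension side, the browser's bookmarks API (for example chrome.bookmarks in Chrome) gives you the bookmark tree whose URLs you would send to an endpoint like this.

```python
# Hypothetical stand-in for the PHP endpoint described in the question:
# accepts GET /import?url=... and appends each URL to a file.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

class ImportHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        urls = query.get("url", [])
        with open("imported_bookmarks.txt", "a", encoding="utf-8") as f:
            for url in urls:
                f.write(url + "\n")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(f"imported {len(urls)} url(s)\n".encode())

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), ImportHandler).serve_forever()
```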

Resume parser in Ruby/(Rails Plugin/Gem)

Is there any Ruby gem / Rails plugin available for parsing a résumé and importing that information into an object/form?
I may be wrong, but I don't think you'll find anything completely automated to do this, because a résumé (or CV) can be structured in so many different ways and can contain very different types of data. Any completely automated solution is likely to have accuracy problems, since it is technically a difficult problem to solve.
You may find this answer useful.
Here are some other suggestions that might help:
Require a user to enter their details into a form on your website instead of uploading a Word document. You'll then be able to explicitly ask for the data you want and you'll be able to store the data in a structure that suits you. However, this may be too much of a barrier to entry for your users.
Allow a user to submit the URL of their résumé published using the hResume microformat. Sites like LinkedIn already publish résumés in this format. There is a Ruby gem mofo which can parse microformats including hResumes. However, not all users will have an on-line résumé like this.
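The mofo gem is the Ruby route mentioned above. Purely as an illustration of what hResume parsing involves, here is a minimal Python sketch that collects text from elements carrying common hResume property class names. The class names and the sample markup are assumptions based on the hResume draft, so check them against the spec (and against mofo's output) before relying on them.

```python
# Sketch only: collect text from elements whose class attribute includes
# common hResume property names; verify the names against the hResume spec.
# Void elements such as <br> and other edge cases are not handled.
from html.parser import HTMLParser

HRESUME_PROPS = {"summary", "contact", "experience", "education", "skill"}

class HResumeExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = {prop: [] for prop in HRESUME_PROPS}
        self._open = []   # matched property names for each open tag

    def handle_starttag(self, tag, attrs):
        classes = set(dict(attrs).get("class", "").split())
        self._open.append(classes & HRESUME_PROPS)

    def handle_endtag(self, tag):
        if self._open:
            self._open.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            for props in self._open:
                for prop in props:
                    self.found[prop].append(text)

sample = ('<div class="hresume"><p class="summary">Rubyist since 2008</p>'
          '<span class="skill">Rails</span><span class="skill">PostgreSQL</span></div>')
parser = HResumeExtractor()
parser.feed(sample)
print(parser.found["skill"])    # ['Rails', 'PostgreSQL']
```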
