I'm trying to crawl the whole of youtube.com using Apache Nutch. The problem is I need a significant number of seed URLs to make sure almost all of YouTube's URLs get crawled, but I couldn't find any sitemap or list of URLs for YouTube. For example, to crawl apple.com I can provide the URLs from Apple's sitemap as seeds - http://www.apple.com/sitemap.xml
Currently my only seed is - https://www.youtube.com.
And my regex-urlfilter.txt contains -
+^https://www.youtube.com/?(watch\?([^#\&\?]*).*)?$
I tried a Google search like filetype:xml site:youtube.com but nothing turned up.
Can anyone help me find a way to get a collection of seeds to crawl youtube.com?
Here is the sitemap I got from robots.txt: https://www.youtube.com/yt/sitemap.xml. Beyond that, try to follow the outgoing links from one page to the next and do it iteratively.
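A minimal sketch of that iterative approach (plain TypeScript rather than Nutch configuration, just to illustrate the idea; the crawl budget and the same-host filter are assumptions):

```typescript
// Minimal breadth-first link discovery sketch (TypeScript, Node 18+).
// This is not Nutch itself; it only illustrates "follow outgoing links
// iteratively" to build up a larger seed list. The limits are arbitrary.
const SEED = "https://www.youtube.com/";
const MAX_PAGES = 50;                       // assumed crawl budget
const discovered = new Set<string>([SEED]);
const frontier: string[] = [SEED];

async function crawl(): Promise<void> {
  while (frontier.length > 0 && discovered.size < MAX_PAGES) {
    const url = frontier.shift()!;
    let html: string;
    try {
      html = await (await fetch(url)).text();
    } catch {
      continue; // skip pages that fail to load
    }
    // Naive href extraction; a real crawler would use an HTML parser.
    for (const match of html.matchAll(/href="([^"#]+)"/g)) {
      let link: URL;
      try {
        link = new URL(match[1], url); // resolve relative links
      } catch {
        continue;
      }
      // Stay on the same host, mirroring the intent of the regex-urlfilter.
      if (link.hostname.endsWith("youtube.com") && !discovered.has(link.href)) {
        discovered.add(link.href);
        frontier.push(link.href);
      }
    }
  }
  // Every discovered URL can be written out, one per line, as a Nutch seed.
  console.log([...discovered].join("\n"));
}

crawl();
```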
We are working on a website with a static frontend, API Gateway + Lambda as the backend, and DynamoDB as the database. I saw there were a couple of similar questions to this one, but I'm looking to understand the matter thoroughly so I can build a complete and robust solution, since I hope to build several websites using this stack.
This one is a fairly basic website: we have an index.html page, a blog.html page and a portfolio.html page. We also have an HTML page for single portfolio entries (let's call it portfolio-entry.html) and a page for single blog articles (let's call it blog-post.html).
So I see there's a way to specify an index page and an error page, so you can have a nice clean URL for your index. There are also rewrite rules, which are more like redirects.
I guess my best bet for delivering different blog posts would be to pass a query string to blog-post.html ("mywebsite.com/blog-post.html?post=post-alias") and have the JavaScript ask the API for different content depending on the query string.
Is there a way, using S3, to route mywebsite.com/blog/post-alias/ to mywebsite.com/blog-post.html?post=post-alias and serve the response to the client without redirecting? I'm interested both in "client-side URL rewriting" via JS, to have nice URLs for humans, and in server-side routing, to catch crawler requests and get SEO/indexing of the per-post pages.
How should I go about this? Is there a way to achieve all this using what S3 and JS provide, or do I have to put a proxy/router (like nginx) in front of S3 to handle the routing?
We are really committed to the whole S3-ApiGateway-Lambda-Dynamo architecture and we would really love to do without a server.
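For what it's worth, here is a minimal sketch of the client-side half of the query-string approach described above; the /posts/{alias} API Gateway route, the response shape, and the #post-container element are hypothetical names, not part of the original setup:

```typescript
// blog-post.html companion script: read the ?post= alias and fetch the content.
// The https://api.mywebsite.com/posts/{alias} endpoint and the #post-container
// element are assumptions for illustration only.
const params = new URLSearchParams(window.location.search);
const alias = params.get("post"); // e.g. blog-post.html?post=post-alias

async function renderPost(): Promise<void> {
  if (!alias) return;
  const res = await fetch(`https://api.mywebsite.com/posts/${encodeURIComponent(alias)}`);
  const post = await res.json(); // assumed shape: { title, body }
  const container = document.getElementById("post-container");
  if (container) {
    container.innerHTML = `<h1>${post.title}</h1><div>${post.body}</div>`;
  }
}

renderPost();
```

Note that this only covers the human-facing half; serving mywebsite.com/blog/post-alias/ to crawlers without a redirect is exactly the part the static-hosting features alone don't seem to do, which is why the question about putting something in front of S3 comes up.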
We have a relatively large website, and looking at Google Search Console we have found a lot of strange errors. By a lot, I mean 199 URLs give a 404 response.
My problem is that I don't think these URLs can be found on any of our pages, even though we have a lot of dynamically generated content.
Because of this, I wonder whether these are URLs the crawler actually found on our pages or just requests coming in to our site, like mysite.com/foobar, which would obviously return 404.
Google Search Console reports every backlink to your website that delivers a 404, regardless of whether a page has ever existed at that URL.
When you click a URL in the list of pages with errors, a pop-up window gives you details. There is a "Linked from" tab listing all (external) links to that page.
(Some of the entries can be outdated. But if these links still exist, try to get them updated or set up redirects for them. The goal in the end is to improve the user experience.)
I want a program that, given a website, gets all the URLs indexed for it, with clean output: all the URLs, one per line. In particular I want the URLs that are not actually used on the website (a spider can already find the ones that are).
I have been searching and only finding sloppy options; what I want is accurate and simple. INPUT: a URL. OUTPUT: all the URLs.
I don't know of such an application right now, but I'll try to simplify your task by dividing it:
1. You need a list of your website's internal links. Any web crawler tool can do that.
2. You need a list of your website's pages indexed by Google. There are plenty of search-engine index checkers; you can google them.
3. Compare the second list to the first one, and find all the links present in Google's index but missing from your website.
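A minimal sketch of step 3, assuming the two lists from steps 1 and 2 have been saved as plain text files with one URL per line (the file names are placeholders):

```typescript
// Find URLs that Google has indexed but that no longer appear among the
// site's internal links. The input files are assumptions: one URL per line,
// produced by whatever crawler and index checker you used in steps 1 and 2.
import { readFileSync } from "node:fs";

const readUrlList = (path: string): Set<string> =>
  new Set(
    readFileSync(path, "utf8")
      .split("\n")
      .map((line) => line.trim())
      .filter((line) => line.length > 0)
  );

const internalLinks = readUrlList("internal-links.txt"); // step 1: crawler output
const indexedUrls = readUrlList("google-indexed.txt");   // step 2: index checker output

// Step 3: indexed by Google, but not linked anywhere on the site.
const orphaned = [...indexedUrls].filter((url) => !internalLinks.has(url));
console.log(orphaned.join("\n"));
```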
My site is not able to show embedded YouTube videos when the URL points to the mobile (m.) site, but it works for the normal YouTube site. It seems to me that the mobile and normal URLs differ by a fixed pattern, as shown below:
http://www.youtube.com/watch?v=5ILbPFSc4_4
http://m.youtube.com/#/watch?v=5ILbPFSc4_4&desktop_uri=%2Fwatch%3Fv%3D5ILbPFSc4_4
Obviously, the m. is added, as is the /#, and all the &desktop_uri... stuff.
And again:
http://www.youtube.com/watch?v=8To-6VIJZRE
http://m.youtube.com/#/watch?v=9To-6VIJZRE&desktop_uri=%2Fwatch%3Fv%3D8To-6VIJZRE
What we hope to do is check whether the URL is for the mobile site and, if it is, rewrite it so it shows as the normal site.
Does anyone know whether all YouTube URLs work this way, i.e. whether this pattern holds for the same video on both the mobile and normal sites?
In general, any time you attempt to parse URLs for sites (as opposed to web APIs) by hand, you're leaving yourself open to breakage. There's no "contract" in place that states that a common format will always be used for watch page URLs on the mobile site, or on the desktop site.
The oEmbed service is what you should use whenever you want to take a YouTube watch page URL as input and get information about the underlying video resource as output in a programmatic fashion. That being said, the oEmbed response doesn't include a canonical link to the desktop YouTube watch page, so it's not going to give you exactly what you want in this case. For many use cases, such as when you want to get the embed code for a video given its watch page URL, it's the right choice.
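For illustration, a minimal sketch of an oEmbed lookup (Node 18+ with the global fetch; the fields read here, title and html, are standard oEmbed response fields):

```typescript
// Look up a watch page URL via YouTube's oEmbed endpoint.
// The response follows the standard oEmbed format (title, author_name,
// html with ready-made embed code, etc.); as noted above, it does not
// include a canonical desktop watch page URL.
async function lookupVideo(watchUrl: string): Promise<void> {
  const endpoint =
    "https://www.youtube.com/oembed?format=json&url=" + encodeURIComponent(watchUrl);
  const res = await fetch(endpoint);
  if (!res.ok) throw new Error(`oEmbed lookup failed: ${res.status}`);
  const data = await res.json();
  console.log(data.title); // video title
  console.log(data.html);  // embed code for the video
}

lookupVideo("http://www.youtube.com/watch?v=5ILbPFSc4_4");
```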
If you do code something by hand, please make sure your code is deployed somewhere where it would be easy to update if the format of the watch pages ever does change.
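With that caveat, here is a sketch of normalizing a mobile watch URL to the desktop form, based only on the two example patterns shown in the question; if YouTube changes the mobile URL format, it will silently break:

```typescript
// Convert a mobile (m.youtube.com) watch URL into the desktop form by
// extracting the video id. Based solely on the example URLs above.
function toDesktopUrl(url: string): string | null {
  // On the mobile site the watch path lives after the "#" fragment,
  // so move it back into a normal path + query string first.
  const normalized = url.replace("m.youtube.com/#/", "m.youtube.com/");
  const videoId = new URL(normalized).searchParams.get("v");
  return videoId ? `http://www.youtube.com/watch?v=${videoId}` : null;
}

console.log(
  toDesktopUrl("http://m.youtube.com/#/watch?v=5ILbPFSc4_4&desktop_uri=%2Fwatch%3Fv%3D5ILbPFSc4_4")
); // -> http://www.youtube.com/watch?v=5ILbPFSc4_4
```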
I have a website where I basically want to allow people to display several YouTube videos on the same page.
For example, I have a friend who has 3 different videos. Instead of sending a link to the three videos individually, they would go to my site, see the 3 search boxes, search for the videos individually (the search is done on YouTube), then pick the videos and click "done", at which point the 3 videos would be embedded on their page.
I'm trying to figure out how to approach this in Ruby on Rails, but I'm not finding much information on how to do it.
Here's a link from 2009 where someone says he can actually do the search and retrieval from YouTube: http://railsforum.com/viewtopic.php?id=30443
But I don't know how to do the search & retrieve, and I don't know how to do the embed. I think I can figure out the embed, but what's the best way to do the search/display results?
Thanks a lot for your help stackoverflow, you're my only hope (besides google, but google failed me today).
All you need to know about the search feature is described in the YouTube Data API documentation. Your app will need to communicate with this API. The best thing may be to look for a gem specialized in this; there is a list available in another Stack Overflow question.
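To make the shape of the search call concrete, here is a minimal sketch against the current YouTube Data API v3 search endpoint (shown in TypeScript for brevity rather than Ruby; YOUR_API_KEY is a placeholder, and in Rails the same HTTP call would typically sit behind a gem or a small service object):

```typescript
// Search YouTube for videos matching a query and print an embed snippet
// for each result. Requires a Data API v3 key; YOUR_API_KEY is a placeholder.
async function searchVideos(query: string): Promise<void> {
  const endpoint =
    "https://www.googleapis.com/youtube/v3/search" +
    `?part=snippet&type=video&maxResults=3&q=${encodeURIComponent(query)}` +
    "&key=YOUR_API_KEY";
  const res = await fetch(endpoint);
  const data = await res.json();
  for (const item of data.items) {
    // Each result carries a video id, which is all the embed markup needs.
    console.log(item.snippet.title);
    console.log(`<iframe src="https://www.youtube.com/embed/${item.id.videoId}"></iframe>`);
  }
}

searchVideos("ruby on rails tutorial");
```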