Can malicious code, a virus, etc. be loaded onto a site that accepts embed links?

I operate a CMS site (a YouTube-like video server) that permits users to embed links to videos hosted elsewhere on the web, e.g. www.vimeo.com/videos/sjek3469df
Is there any way someone could input a URL or "link" that could infect my website?
Thanks in advance all!

It really depends on how your site is set up, but yes, there would be XSS concerns. At the very least, I'd suggest a whitelist for allowed video hosts (with particular URL patterns, not just acceptable domains). You should also consider parsing the URLs to obtain the video IDs, and using those to generate your own embedding code on a per-host basis. That would give you more customization power, not just more security.
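For illustration, here is a minimal C# sketch of that kind of whitelist; the VideoEmbedder helper, its regex patterns, and the embed templates are all hypothetical and real patterns would need to cover each host's full range of URL formats:

using System;
using System.Text.RegularExpressions;

public static class VideoEmbedder
{
    // Hypothetical whitelist: one URL pattern per allowed host, each capturing the video ID
    private static readonly (Regex Pattern, string EmbedTemplate)[] AllowedHosts =
    {
        (new Regex(@"^https?://(www\.)?youtube\.com/watch\?v=(?<id>[\w-]{11})"),
         "<iframe src=\"https://www.youtube.com/embed/{0}\"></iframe>"),
        (new Regex(@"^https?://(www\.)?vimeo\.com/(?<id>\d+)$"),
         "<iframe src=\"https://player.vimeo.com/video/{0}\"></iframe>"),
    };

    // Returns embed markup built from the extracted video ID,
    // or null if the URL does not match any whitelisted pattern.
    public static string TryBuildEmbed(string userSuppliedUrl)
    {
        foreach (var (pattern, template) in AllowedHosts)
        {
            Match match = pattern.Match(userSuppliedUrl ?? string.Empty);
            if (match.Success)
                return string.Format(template, match.Groups["id"].Value);
        }
        return null; // reject anything that is not explicitly whitelisted
    }
}

Because the video ID comes from a tightly constrained capture group, the user's raw URL never ends up in your markup, which is what closes the XSS hole.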

Related

URL structure for multilingual websites

I'm developing a SPA web app and it will support various languages. It is built with AngularJS and I am using angular-translate to provide i18n.
But I am struggling a little bit with how the URL structure should be. I do not plan on using either gTLDs or ccTLDs, so that leaves me with three options.
Use query params: ?locale=en-us
Use url paths: /en-us/page
Store the chosen locale in localStorage or a cookie
The first option is a no-go according to Google's SEO guidelines for web apps. So that leaves me with the last two options.
I have a hard time deciding which is more beneficial, though I am inclined to believe that using URL paths would probably be more crawler-friendly.
P.S.: Not sure if this is the best place to ask such a question, either.
The second option is your safest bet: according to https://webmasters.stackexchange.com/questions/59652/what-happens-if-i-try-to-set-a-cookie-on-a-bot, cookies are ignored by bots. You can test this yourself by fetching your website in Google Search Console.
As of now, most crawlers ignore cookies and DO NOT execute JavaScript. This means that they usually just download the HTML and make their judgements from there.
Some developers get around the no-JavaScript problem by pre-rendering parts of their content. I haven't done it personally, but you might want to check out https://prerender.io/
Edit
As rolandjitsu mentioned, Google does crawl and execute JavaScript content.
You should go with the second option: provide the language tag (and, optionally, region subtags) as the first segment of the URL path.
For the simple reason that it allows you, visitors, and bots to link to specific translations.

Want to make a Search Engine

I want to make a torrent search engine which will provide links to other torrent sites, so I need data from those sites to index in my database. Is it legal to crawl a website for this purpose, or is there some other way to do it?
Depending on the site, crawling without permission may not be legal.
You might wish to investigate Common Crawl, a project that has already crawled a large portion of the web. Check out their Terms of Use to confirm whether your intended use is permitted.

Protecting website content from crawlers

The contents of a commerce website (ASP.NET MVC) are regularly crawled by the competition. These people are programmers and they use sophisticated methods to crawl the site so identifying them by IP is not possible.
Unfortunately replacing values with images is not an option because the site should still remain readable by screen readers (JAWS).
My personal idea is to use robots.txt: prohibit crawlers from accessing one particular URL on the site. The URL could be disguised as a normal item-detail link but hidden from normal users (valid URL: http://example.com?itemId=1234; prohibited: http://example.com?itemId=123, i.e. any itemId under 128). If a visitor's IP requests the prohibited link, show a CAPTCHA validation.
A normal user would never follow a link like this because it is not visible, and Google does not have to crawl it because it is bogus. The issue is that a screen reader would still read the link, and I don't think this would be effective enough to be worth implementing.
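For illustration, a rough sketch of that trap as an ASP.NET MVC action; the controller name, the ID threshold, the CAPTCHA view, and the in-memory blacklist are all assumptions used purely to show the idea:

using System.Collections.Concurrent;
using System.Web.Mvc;

public class ItemController : Controller
{
    // Hypothetical in-memory blacklist of IPs that followed the hidden trap link
    private static readonly ConcurrentDictionary<string, bool> SuspectIps =
        new ConcurrentDictionary<string, bool>();

    public ActionResult Detail(int itemId)
    {
        string ip = Request.UserHostAddress;

        // Item IDs under 128 are never linked visibly and are disallowed in robots.txt,
        // so any request for one is treated as coming from a crawler
        if (itemId < 128)
            SuspectIps[ip] = true;

        if (SuspectIps.ContainsKey(ip))
            return View("Captcha"); // assumed CAPTCHA challenge view

        return View(); // normal item detail page
    }
}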
Your idea could possibly work for a few basic crawlers, but it would be very easy to work around. They would just need to use a proxy and do a GET on each link from a new IP.
If you allow anonymous access to your website then you can never fully protect your data. Even if you manage to prevent crawlers with lots of time and effort, they could just get a human to browse the site and capture the content with something like Fiddler. The best way to prevent your data being seen by your competitors is not to put it on a public part of your website.
Forcing users to log in might help matters; at least then you could identify who is crawling your site and ban them.
As mentioned, it's not really going to be possible to hide publicly accessible data from a determined user. However, since these are automated crawlers, you could make life harder for them by altering the layout of your pages regularly.
It is probably possible to use different master pages to produce the same (or similar) layout, and you could swap the master page in on a random basis; this would make writing an automated crawler that bit more difficult.
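A rough sketch of that random swap in an ASP.NET MVC controller, where the layout names and the data-access stub are assumptions:

using System;
using System.Web.Mvc;

public class ProductController : Controller
{
    // Hypothetical interchangeable master pages that all render the same model
    private static readonly string[] Layouts = { "_Layout_A", "_Layout_B", "_Layout_C" };
    private static readonly Random Rng = new Random();

    public ActionResult Details(int id)
    {
        object model = LoadProduct(id);
        // Pick a master page at random so the generated markup varies between requests
        return View("Details", Layouts[Rng.Next(Layouts.Length)], model);
    }

    private static object LoadProduct(int id)
    {
        return new { Id = id }; // placeholder for the real data access
    }
}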
I am about to reach the phase of protecting my content from crawlers as well.
I am thinking of limiting what an anonymous user can see of the website and requiring them to register for full functionality.
Example:
public ActionResult Index()
{
    // Authenticated users are redirected to the full listing
    if (User.Identity.IsAuthenticated)
        return RedirectToAction("IndexAll");

    // Show only limited content to anonymous users
    return View();
}

[Authorize(Roles = "Users")]
public ActionResult IndexAll()
{
    // Show everything
    return View();
}
Since you now know who your users are, you can ban any of them that turn out to be crawlers.

How do URL shortening sites work?

How do URL shortening sites like bit.ly or goo.gl work? Does anyone know what technique or algorithm they use?
Save the URL, generate a unique key for it, and store both in the DB. Use the key to look up and redirect to the original URL (a minimal sketch is at the end of this answer).
Do you need a complex algorithm for this? :-)
If you want to make it more complex, you could:
Check for malicious URLs and block them
Keep stats based on the number of clicks
Have registrations so users can manage their own short URLs
Develop browser plugins to generate short URLs
etc.
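A minimal C# sketch of the basic approach, with an in-memory dictionary standing in for the real database and a base-62 encoding of an auto-incrementing ID used as the key:

using System.Collections.Generic;
using System.Text;

public class UrlShortener
{
    private const string Alphabet =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Stand-in for the database table (key -> original URL)
    private readonly Dictionary<string, string> _store = new Dictionary<string, string>();
    private long _nextId = 1;

    // Store the URL under a freshly generated key and return that key
    public string Shorten(string longUrl)
    {
        string key = Encode(_nextId++);
        _store[key] = longUrl;
        return key;
    }

    // Look up the original URL so the short link can redirect to it (null if unknown)
    public string Resolve(string key)
    {
        return _store.TryGetValue(key, out var url) ? url : null;
    }

    // Convert a numeric ID into a compact base-62 string
    private static string Encode(long id)
    {
        var sb = new StringBuilder();
        do
        {
            sb.Insert(0, Alphabet[(int)(id % Alphabet.Length)]);
            id /= Alphabet.Length;
        } while (id > 0);
        return sb.ToString();
    }
}

The redirect endpoint then just calls Resolve and issues an HTTP 301 or 302 to the stored URL.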

What is the advantage of putting the language indicator into the URL?

I'm building a site which supports multiple languages. At the moment, I put /en/ (and so on) in the URL path and use .htaccess to determine which language the user is viewing. It is very common for multilingual sites to use either http://en.example.com or http://example.com/en/.
My question is: Why is it so common to show in the URL which language the user is viewing? I can't see any technical advantages. Is it for optimizing user experience?
You could easily just use sessions/cookies and hide it from the user, which is what I'm leaning towards at the moment.
Thanks in advance :)
For easy bookmarking probably.
Specifying the language in the URL is one way to indicate that you want to view the page in that particular language, regardless of your current locale.
Carrying this information in the URL is better than using a cookie, for example, as some users delete all cookies after each browsing session.
And because of this pseudo-REST-like URL (/en/), the page is easily bookmarkable and search-engine friendly.
I think it's used as a substitute for not owning the domain under each TLD (i.e. company.co.uk and company.com).
It's also useful because the URI itself can be localised: ikea.com/se/stolar could be the localised variant of ikea.com/en/chairs, which benefits both the end user and SEO.
It is not a real directory but mod_rewrite: a URL such as
http://google.pl/en
gets rewritten server-side to
http://google.pl?lang=en
and the same rule handles every language, which is more convenient.
Why? Because if a client saves a link to our page in their favourites and sends it to a friend, the link also carries the language of the page they were viewing. If the default language is, for example, Polish and they had switched it to English, the link saves the friend the time needed to find and click the language switcher.
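A minimal .htaccess sketch of the kind of rule being described; the two-letter pattern and the lang/path parameter names are assumptions:

RewriteEngine On
# Rewrite /en, /pl, ... (a two-letter code as the first path segment)
# to an internal query-string parameter, keeping the rest of the path
RewriteRule ^([a-z]{2})(/.*)?$ index.php?lang=$1&path=$2 [QSA,L]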
If you put it in the URL, search engines will index every page in every language. If you use cookies, they will only index one. So I think it's more of an SEO advantage.
