How can I identify ad links on a website? I am doing research on malvertising. As a part of that, I need to extract all the advertisement URLs from a website. How can I do that?
(Of course it’s impossible to correctly identify all URLs.)
You could make use of the filter lists of various ad filtering tools. They typically contain absolute URLs (submitted by the community) and strings that often appear in such URLs.
For example, Adblock Plus hosts some filter lists.
Examples from EasyList (a big text file):
&adbannerid=
.com/js/adsense
/2013/ads/*
/60x468.
/ad-rotator-
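If you only need a rough heuristic, you can match the links found on a page against such patterns yourself. Below is a minimal Python sketch: it treats the filter entries as plain substrings with '*' as a wildcard, which is a deliberate simplification of the real Adblock Plus filter syntax (no ||, ^ or $ options), and the pattern list is just the handful of entries quoted above.

import re
from urllib.parse import urljoin
from html.parser import HTMLParser

# Simplified patterns taken from a filter list such as EasyList.
FILTER_PATTERNS = ["&adbannerid=", ".com/js/adsense", "/2013/ads/*", "/60x468.", "/ad-rotator-"]

def pattern_to_regex(pattern):
    # '*' in a filter entry matches any run of characters.
    return re.compile(".*".join(re.escape(part) for part in pattern.split("*")))

COMPILED = [pattern_to_regex(p) for p in FILTER_PATTERNS]

def looks_like_ad(url):
    return any(rx.search(url) for rx in COMPILED)

class LinkExtractor(HTMLParser):
    # Collect href/src attributes from a page so they can be checked.
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []
    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.urls.append(urljoin(self.base_url, value))

# Usage (assuming `html` holds the page source of `page_url`):
# extractor = LinkExtractor(page_url)
# extractor.feed(html)
# ad_urls = [u for u in extractor.urls if looks_like_ad(u)]

A real implementation would also need to handle ad scripts injected dynamically, which plain HTML parsing will miss.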
I operate a CMS site (a YouTube-like video server) and it permits users to embed links to videos elsewhere on the web, e.g. www.vimeo.com/videos/sjek3469df
Is there any way someone could input any type of URL "link" that could infect my website?
Thanks in advance all!
It really depends on how your site is set up, but yes, there would be XSS concerns. At the very least, I'd suggest a whitelist for allowed video hosts (with particular URL patterns, not just acceptable domains). You should also consider parsing the URLs to obtain the video IDs, and using those to generate your own embedding code on a per-host basis. That would give you more customization power, not just more security.
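A rough Python sketch of that idea, with hypothetical host patterns and embed templates (the exact URL formats per host are assumptions you would need to verify):

import re
from urllib.parse import urlparse

# Whitelist: per-host path pattern that captures the video ID, plus our own embed markup.
ALLOWED_HOSTS = {
    "vimeo.com": (re.compile(r"^/(?P<id>\d+)$"),
                  '<iframe src="https://player.vimeo.com/video/{id}"></iframe>'),
    "youtu.be":  (re.compile(r"^/(?P<id>[\w-]{11})$"),
                  '<iframe src="https://www.youtube.com/embed/{id}"></iframe>'),
}

def build_embed(url):
    # Return our own embed markup for a whitelisted video URL, else None.
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return None                           # reject javascript:, data:, etc.
    entry = ALLOWED_HOSTS.get(parts.netloc.lower())
    if entry is None:
        return None                           # host not on the whitelist
    pattern, template = entry
    match = pattern.match(parts.path)
    return template.format(id=match.group("id")) if match else None

Because only the captured ID ever ends up in your markup, the user-supplied URL never reaches the page verbatim, which removes most of the XSS surface.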
In my application I have localized urls that look something like this:
http://examle.com/en/animals/elephant
http://examle.com/nl/dieren/olifant
http://examle.com/de/tiere/elefant
This question is mainly for Facebook Likes, but I guess I will hit similar problems when I start thinking about search engine crawlers.
What kind of URL would you expect as the canonical URL? I don't want to use the exact English URL, because I want people clicking the link to be forwarded to their own language (based on browser settings or IP).
The IP lookup is not something that I want to do on every page hit. Besides that, I would need to incorporate more 'state' in my application, because I have to check whether a user has already been forwarded to his own locale or is browsing the English version on purpose.
I guess it is going to be something like:
http://example.com/something/animals/elephant
or maybe without any language identifier at all:
http://example.com/animals/elephant
but that is a bit harder to implement and has a bigger chance of URL clashes in the future (in the rare case I get a category called en or de).
Summary
What kind of URL would you expect as the canonical URL? Is there already a standard for this?
I know this question is a bit old, but I was facing the same issue.
I found this:
Different language versions of a single page are considered duplicates only if the main content is in the same language (that is, if only the header, footer, and other non-critical text is translated, but the body remains the same, then the pages are considered to be duplicates).
That can be found here: https://developers.google.com/search/docs/advanced/crawling/consolidate-duplicate-urls
From this I can conclude that we should add locales to canonicals.
I did find one resource that recommends not using the canonical tag with localized addresses. However, Google's documentation does not say this explicitly and only mentions subdomains in another context.
There is more than just language that you need to think of.
It is typically a tuple of three: {region, language, property}.
If you only have one website, then you have {region, language} only.
Every piece of content can be different in this 3-dimensional space, or at least presented differently. But it is still the same piece of content, so you'd like to centralize the management of editorial signals, promotions, tracking, etc. Think about search systems: you'd like PageRank to be merged across all instances of the article, not spread out thinly.
I think there is a standard solution: Canonical URL
Put language/region into the domain name
example.com
uk.example.com
fr.example.com
Now you have a choice of attaching a cookie to the subdomain (for language/region) or to the domain (for user tracking)!
On every HTML page, add a link to the canonical URL:
<link rel="canonical" href="http://example.com/awesome-article.html" />
Now you are done.
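As a small illustration of that last step, here is a Python sketch assuming the subdomain scheme above and one shared main domain:

# example.com is assumed to be the "main" domain that all regional subdomains point back to.
CANONICAL_DOMAIN = "example.com"

def canonical_link(path):
    # Build the <link rel="canonical"> tag for any regional subdomain.
    return '<link rel="canonical" href="http://%s%s" />' % (CANONICAL_DOMAIN, path)

# On uk.example.com/awesome-article.html and fr.example.com/awesome-article.html alike:
# canonical_link("/awesome-article.html")
# -> '<link rel="canonical" href="http://example.com/awesome-article.html" />'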
There certainly is no "standard" beyond the fact that it has to be a URL. What you do see on many commercial websites is exactly what you describe:
<protocol>://<server>/<language>/<more-path>
For the "language-tag" you may follow RFCs as well. I guess your 2-letter-abbrev is quite fine.
I only disagree on the <more-path> part of the URL. If I understand you correctly, you are thinking about translating each page's path into the local language? I would not do that. Maybe I am not the standard user, but I personally like to manually tinker with URLs: if the URL shown is http://examle.com/de/tiere/elefant but I don't trust the content to be translated well, I would manually try http://examle.com/en/tiere/elefant, and that would not bring me to the expected page. And since I also dislike URLs like http://ex.com/with-the-whole-title-in-the-url-so-the-page-will-be-keyworded-by-search-engines, my favorite would be to just exchange the <language> part and use generic English (or any other language) for <more-path>. E.g.:
http://examle.com/en/animals/elephant
http://examle.com/nl/animals/elephant
http://examle.com/de/animals/elephant
If your site is something like Wikipedia, then I would agree with your scheme of translating the <more-path> as well.
Maybe these Google guidelines can help with your issue: https://support.google.com/webmasters/answer/189077?hl=en
They say that many websites serve users across the world with content targeted to users in a certain region, and advise using the rel="alternate" hreflang="x" attributes so that the correct language or regional URL is served in search results.
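For the URLs from the question, emitting those tags could look like this Python sketch (the mapping of languages to localized URLs is assumed to be known for each page):

# Localized counterparts of one page, reusing the URLs from the question.
LOCALIZED_URLS = {
    "en": "http://examle.com/en/animals/elephant",
    "nl": "http://examle.com/nl/dieren/olifant",
    "de": "http://examle.com/de/tiere/elefant",
}

def hreflang_links(urls, default_lang="en"):
    # Emit one rel="alternate" hreflang tag per language, plus x-default as the fallback.
    lines = ['<link rel="alternate" hreflang="%s" href="%s" />' % (lang, url)
             for lang, url in urls.items()]
    lines.append('<link rel="alternate" hreflang="x-default" href="%s" />' % urls[default_lang])
    return "\n".join(lines)

# print(hreflang_links(LOCALIZED_URLS)) prints one tag per language version of the page.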
I know similar questions have already been asked, but I want to know whether there exists some code/package, or some ideas on how to tell if two URLs are the same page.
For motivation, assume that what I want to do is write a chrome extension that tells you how many of your facebook friends visited a link.
Of course simply comparing URLs won't work, as some URL parameters might be critical while others are not, e.g. google.com?query=help is not the same page as google.com?query=idea because the query parameter is critical, while google.com?referrer=facebook is the same as google.com?referrer=twitter (I am of course making up these examples).
Also, comparing the pages' content is not guaranteed to work, as there may be randomized parts ("related stories") or user-specific content (a headline like "Hi Noam, we haven't seen you in a while").
Of course, I am not looking for a foolproof method, just something that works on most normally behaving sites.
Any good recommendations of packages (any language) or ideas on how to do this?
Any standard string distance metric should give you a similarity score for the contents of various URLs. Presumably, contents that are more similar will score better than less similar ones, so rank the results and compare.
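A minimal Python sketch of that approach, using the standard library's sequence matcher; the 0.9 threshold is an arbitrary assumption you would tune:

import difflib
from urllib.request import urlopen

def fetch(url):
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def same_page(url_a, url_b, threshold=0.9):
    # Treat two URLs as "the same page" if their contents are similar enough.
    ratio = difflib.SequenceMatcher(None, fetch(url_a), fetch(url_b)).ratio()
    return ratio >= threshold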
There is no way to make sure that two pages are the same. There may be user-specific content (login buttons for some users, a personal greeting for others), advertising, or browser-specific content (CSS3 for Chrome, CSS2 for Opera, a drive-by-download exploit for IE6 users :)).
The same resource may be available under different URLs (/article/4-funny-ways-to-encrypt-your-shellcode-123456 or /article.php?id=123456). There might be two domains for the same content (www.domain.com and domain.com, maybe even domain.co.uk). You could get some clues from the Last-Modified: header, which might contain the file modification date, but when it comes to dynamic content it can also contain the generation date. There may be an ETag header that contains a hash of the underlying resource, at least in Ruby on Rails, if correctly implemented, which is not often the case.
So the only thing left you could probably do is to compare the pages and calculate some metrics. I would consider the domain, IP address, and page content for the comparison, with a higher weight on the IP address and domain (or parts of the domain). That way you can calculate certain probabilities, but there is no way to make sure two pages are the same.
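A Python sketch of that weighted comparison; the weights are arbitrary assumptions, and the content check is the same plain text similarity as in the previous answer:

import socket
import difflib
from urllib.parse import urlparse
from urllib.request import urlopen

def similarity(url_a, url_b):
    a, b = urlparse(url_a), urlparse(url_b)
    # Domain match: ignore a leading "www." so www.domain.com equals domain.com.
    host_a = (a.hostname or "").removeprefix("www.")
    host_b = (b.hostname or "").removeprefix("www.")
    domain_score = 1.0 if host_a and host_a == host_b else 0.0
    # IP match: two domains resolving to the same server are a strong hint.
    try:
        ip_score = 1.0 if socket.gethostbyname(a.hostname) == socket.gethostbyname(b.hostname) else 0.0
    except (socket.gaierror, TypeError):
        ip_score = 0.0
    # Content match: plain similarity ratio of both responses.
    content_a = urlopen(url_a).read().decode("utf-8", errors="replace")
    content_b = urlopen(url_b).read().decode("utf-8", errors="replace")
    content_score = difflib.SequenceMatcher(None, content_a, content_b).ratio()
    # Weight domain and IP higher than content, as suggested above.
    return 0.35 * domain_score + 0.35 * ip_score + 0.30 * content_score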
I have a multilanguage website. Actually, the website language is chosen according to the web browser language.
Is there any way to set the language according to the search engine spider? For example:
Display the website in Chinese for Baidu search engine spider,
Display the website in Russian for Yandex spider?
This is called crawler identification. When a request is made to your website, the User-Agent header contains information about the browser or the crawler.
Depending on the crawler, the value of this field will be different. You can then associate different values with different languages. You can also take a look at the large list of user agents.
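A minimal sketch of that mapping in Python; "Baiduspider" and "YandexBot" are the names those crawlers send in their User-Agent strings, and everything else falls back to your existing browser-based language choice:

CRAWLER_LANGUAGES = {
    "Baiduspider": "zh",   # Baidu's crawler -> Chinese
    "YandexBot": "ru",     # Yandex's crawler -> Russian
}

def pick_language(user_agent, browser_language="en"):
    for crawler, lang in CRAWLER_LANGUAGES.items():
        if crawler in user_agent:
            return lang
    return browser_language   # normal visitors keep the browser-based choice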
I'm still pretty sure that by doing this, you'll lower your rank in search engines since you provide different responses to crawlers than to real users, but I don't have solid references to support this statement.
In all cases, crawlers are expected to gather resources in different languages, and those crawlers know how to deal with multilingual websites, except maybe the ones which try to follow every worst practice. Also, the search engines you quoted are not limited to one language. Yandex is available, for example, in Turkish. As for Baidu, according to Wikipedia, it serves China, Japan, Thailand, Egypt and India.
Our application (C#/.NET) needs a lot of search queries. Google's 50,000-queries-per-day policy is not enough. We need something that would crawl Internet websites by specific rules we set (for example, country domains), gather URLs, texts, keywords, and names of websites, and build our own internal catalogue, so we wouldn't be limited to any massive external search engine like Google or Yahoo.
Is there any free open source solution we could use to install it on our server?
No point in re-inventing the wheel.
DataparkSearch might be the one you need. Or review this list of other Open Source Search Engines.
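If you do end up rolling your own, the core loop is small. Here is a very reduced Python sketch of a rule-restricted crawler (your application is C#/.NET, so treat this only as an illustration of the approach; the ".de" country-domain rule and the 100-page limit are placeholder assumptions):

from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
from html.parser import HTMLParser

ALLOWED_TLDS = (".de",)          # example rule: only crawl German country domains

class PageParser(HTMLParser):
    # Collect outgoing links, the page title, and the visible text.
    def __init__(self):
        super().__init__()
        self.links, self.text, self.title, self.in_title = [], [], "", False
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self.in_title = True
    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.title += data
        self.text.append(data)

def crawl(seed, limit=100):
    seen, queue, catalogue = set(), deque([seed]), []
    while queue and len(catalogue) < limit:
        url = queue.popleft()
        host = urlparse(url).hostname or ""
        if url in seen or not host.endswith(ALLOWED_TLDS):
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except OSError:
            continue
        parser = PageParser()
        parser.feed(html)
        catalogue.append({"url": url, "title": parser.title.strip(),
                          "text": " ".join(parser.text)})
        queue.extend(urljoin(url, link) for link in parser.links)
    return catalogue

A production crawler also needs robots.txt handling, politeness delays, deduplication, and persistent storage, which is exactly what the existing open source engines give you for free.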