What is the best way to determine the language of Twitter posts?
The streaming API does include a language parameter, but it doesn't seem very accurate: even many Japanese posts are labelled as English.
What have others done to sort out the languages?
I've had very good results with this PHP package:
http://pear.php.net/package/Text_LanguageDetect/
It is fast and open source. We use it to select English-only posts for a site we run at http://2012twit.com.
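If PHP isn't an option, the same filtering idea can be sketched in Python with the langdetect package; note this is a different library from the PEAR one above, so treat it as an illustrative sketch rather than a drop-in replacement:

    # pip install langdetect
    from langdetect import detect, DetectorFactory

    DetectorFactory.seed = 0  # make detection deterministic across runs

    tweets = [
        "Just finished reading a great book!",
        "今日はとてもいい天気ですね",
    ]
    # Keep only posts detected as English; short, noisy tweets
    # may still be misclassified by any statistical detector.
    english_only = [t for t in tweets if detect(t) == "en"]
    print(english_only)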
Google offers language detection within its Translate API, if using an external service is acceptable:
http://code.google.com/apis/language/translate/v1/reference.html#detectResult
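The v1 API linked above was later retired; here is a hedged sketch against its v2 REST successor (the key is a placeholder, and the exact URL and response shape should be checked against the current documentation):

    import requests

    API_KEY = "YOUR_API_KEY"  # placeholder
    resp = requests.post(
        "https://translation.googleapis.com/language/translate/v2/detect",
        params={"key": API_KEY},
        data={"q": "今日はとてもいい天気ですね"},
    )
    resp.raise_for_status()
    # The response nests one list of detections per input string.
    detection = resp.json()["data"]["detections"][0][0]
    print(detection["language"])  # e.g. "ja"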
Is there a way with the API to convert/translate Revit standard terms such as 'Insulation', '3D view', 'View Templates', 'Detail Level' and other baked-in terms to a given language (such as German, Russian, Chinese, etc.)? I'd like to ensure that the messages I provide in my localized add-in use Revit terms that the user is familiar with.
I think Jeremy's answer is probably the way to go for a comprehensive approach.
However, if you're looking for something more self-contained and quick-and-dirty, you could try the LabelUtils class in the Revit API. :)
LabelUtils lets you look up the translated values of thousands of built-in parameters, parameter groups, unit types, etc.
All of the pieces of text that you mentioned above are available as BuiltInParameter translations (although, admittedly, some are not available as plurals).
For example:
string label = LabelUtils.GetLabelFor( BuiltInParameter.RBS_WIRE_INSULATION_PARAM );
// label == "Insulation" when Revit is running in English
(You can see all of the translated English BuiltInParameter values in the Revit API reference, under the BuiltInParameter enumeration page.)
Good Luck!
Matt
The Autodesk localisation team uses a cross-product corpus database, NeXLT, for terminology and message translation:
http://langtech.autodesk.com/nexlt
This link is accessible from outside the company, and translation companies working with the localisation team around the world make use of it when translating products for Autodesk platforms.
This answer is already published with a little more background on The Building Coder blog:
http://thebuildingcoder.typepad.com/blog/2014/10/autodesk-open-source-all-over-germany-and-japan.html#4
I have a multilanguage website. Currently, the website language is chosen according to the web browser language.
Is there any way to set the language according to the search engine spider? For example:
Display the website in Chinese for Baidu search engine spider,
Display the website in Russian for Yandex spider?
This is called crawler identification. When a request is made to your website, the User-Agent field contains information about the browser or crawler making it.
Depending on the crawler, the value of this field will be different, so you can associate different values with different languages. You can also take a look at the large list of user agents.
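To make the mechanism concrete, here is a minimal Python sketch; the User-Agent tokens are real crawler identifiers, but the web-framework plumbing around it is omitted and would depend on your stack:

    CRAWLER_LANGUAGES = {
        "Baiduspider": "zh",  # Baidu's crawler
        "YandexBot": "ru",    # Yandex's crawler
    }

    def pick_language(user_agent: str, default: str = "en") -> str:
        # Serve a language based on which crawler token appears in the UA.
        for token, lang in CRAWLER_LANGUAGES.items():
            if token in user_agent:
                return lang
        return default

    print(pick_language("Mozilla/5.0 (compatible; YandexBot/3.0)"))  # ru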
I'm still pretty sure that by doing this you'll lower your rank in search engines, since serving different responses to crawlers than to real users is the kind of cloaking they penalize, but I don't have solid references to support this statement.
In any case, crawlers are expected to gather resources in different languages, and they know how to deal with multilingual websites, except maybe the ones which try to follow every worst practice. Also, the search engines you mention are not limited to one language: Yandex is available in Turkish, for example, and according to Wikipedia, Baidu serves China, Japan, Thailand, Egypt and India.
I'm planning to release a community website whose primary audience is not English-speaking. This means that URLs pointing to /profile, /forums and so on will be in English rather than in my users' native language. I'm not concerned about users accessing different URL paths in English, but would a search engine pick up pages on the website better or worse if I used non-English URLs?
Anyone care to share their opinions?
In my opinion, it would be better to have URLs that reflect the primary language of your users, as that makes it easier for them to find your website on search engines (supposing they search in their primary language). From an SEO perspective, try to also include in your URLs the relevant search terms you think your audience would use. If you have a forum, for example, include the full thread title in thread URLs if possible, and so on.
Sources: my own experience with building and managing powershell.it and sqlserver.it, two of the most important Italian technology-related communities.
The best place to start on this issue would be Google's Webmaster Central section on Internationalization.
If you will have versions of the same URL in multiple languages, you can connect them using the rel="alternate" hreflang mechanism, which is explained at Google's Webmaster Tools page.
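In practice the markup looks roughly like this (the URLs are placeholders); each language version of a page lists all of its alternates in its <head>:

    <link rel="alternate" hreflang="en" href="http://example.com/en/page" />
    <link rel="alternate" hreflang="de" href="http://example.com/de/seite" />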
1. Summary
Using non-English URLs for non-English websites is fine.
2. Argumentation
Google Senior Webmaster Trends Analyst John Mueller said in a recent SEO snippets video that using non-English URLs for non-English websites is fine and that Google is able to crawl, index and rank them.
This includes non-Latin characters in your URLs. John Mueller said “as long as URLs are valid and unique, that’s fine.” He added, “So to sum it up, yes, non-English words and URLs are fine, and we recommend using them for non-English websites.”
Read the full article here.
3. Disclaimer
The information in this answer was current as of March 2018 and may become obsolete in the future.
Is there a way to collect web content for use in a search engine without going through a web crawling phase? Is there any alternative to web crawling?
Thanks
No, to collect the content you have to...collect the content. :-)
Yes (and sort-of no).
:)
You can download existing data dumps from various websites (Wikipedia, Stack Overflow, etc.) and construct a partial index that way. It obviously won't be a complete index of the internet.
You could also use meta-search to construct your search engine. This is where you use the APIs of other search engines and use their search results as the basis of your index. Examples include citosearch and opensearch. DuckDuckGo uses Yahoo's BOSS API (and now Yahoo uses Bing...) as part of its search engine.
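As an illustrative sketch of the meta-search idea in Python: query another engine's API and normalize its results into your own index format. The endpoint, parameters and response fields below are hypothetical placeholders, since each real API (BOSS, etc.) has its own schema and keys:

    import requests

    def meta_search(query: str) -> list[dict]:
        # Hypothetical upstream search API; swap in a real one.
        resp = requests.get(
            "https://api.example-search.com/v1/search",
            params={"q": query, "key": "YOUR_KEY"},
        )
        resp.raise_for_status()
        # Normalize upstream results into our own index format.
        return [{"title": r["title"], "url": r["url"]}
                for r in resp.json()["results"]]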
There are also real-time streaming APIs that you could use instead of crawling the web; look at DataSift as an example. There are lots more resources you could cleverly use to avoid or minimize crawling.
If you want to stay updated with the latest content on pages, you can use something like the PubSubHubbub protocol to get push notifications for subscribed links.
Or use a paid service like Superfeedr that makes use of the same protocol.
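Here is a sketch of what a PubSubHubbub (since standardized as WebSub) subscription request looks like: you POST to the hub advertised by the feed, and the hub later pushes updates to your callback URL. The hub and URLs below are placeholders:

    import requests

    resp = requests.post(
        "https://hub.example.com/",  # the hub advertised by the feed
        data={
            "hub.mode": "subscribe",
            "hub.topic": "http://example.com/feed.xml",  # feed to follow
            "hub.callback": "http://yoursite.com/push",  # where pushes arrive
            "hub.verify": "async",
        },
    )
    print(resp.status_code)  # 202 means the hub accepted the subscription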
Directly or indirectly, you have to crawl the web in order to get the content.
Well, if you don't want to crawl, you can follow a wiki-like approach, where users submit links to sites (with a title, description and tags). That way a collaborative link collection can be built.
To avoid spam, a +/- voting system can be added, to vote useful sites or tags up and useless ones down.
To keep spammers from mass-voting SERPs up, you can weight votes by user reputation.
User reputation can be gained by submitting useful sites, or by somehow tracing usage patterns.
And by considering other abuse patterns too.
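As a toy illustration of that reputation weighting (the log dampening and the numbers are arbitrary choices, not recommendations):

    import math

    def weighted_score(votes: list[tuple[int, int]]) -> float:
        # votes is a list of (vote, reputation) pairs; vote is +1 or -1.
        return sum(v * math.log1p(rep) for v, rep in votes)

    # Ten zero-reputation sock puppets vs. three established users:
    spam = [(+1, 0)] * 10
    legit = [(-1, 500)] * 3
    print(weighted_score(spam + legit))  # negative: the spam votes lose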
Well, you got the point, I think.
As spammers gradually discover the weaknesses of traditional search engines (see Google bombs, content scraper sites, etc.), a community-based approach may work. But it would suffer severely from the cold-start effect, and when the community is small the system is easy to abuse and poison...
At least Wikipedia and Stack Exchange have not been spammed to useless levels so far...
PS: http://xkcd.com/810/
I would like to know: how is language translation done in Facebook?
Are they using Google Translate, or some licensed software?
I want to enable language translation on my website, similar to what Facebook has.
How can this be done, if it is possible at all?
Google has a good translation API that will convert your text into a given language. However, if you want to translate larger paragraphs, you need to go for human translation, because machine translation does not handle the grammar of other languages well. There are now good services that let you automate ordering human translation, such as http://mygengo.com/.
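A hedged sketch of the machine-translation half in Python, against the Google Translate v2 REST endpoint (the key is a placeholder; check the current docs for the exact URL and quota rules):

    import requests

    resp = requests.post(
        "https://translation.googleapis.com/language/translate/v2",
        params={"key": "YOUR_API_KEY"},
        data={"q": "Hello, world", "target": "ja"},
    )
    resp.raise_for_status()
    print(resp.json()["data"]["translations"][0]["translatedText"])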
Facebook's partner for its search and translate functions is Microsoft Bing.
To do something similar, you need to use the API provided; see 'Translator' on their developer page:
http://www.bing.com/dev/en-us/dev-center
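The Bing developer center linked above has since been folded into the Microsoft Translator Text API; here is a hedged sketch against its v3 shape (the key and endpoint should be verified against current Microsoft docs, and some resources also require a region header):

    import requests

    resp = requests.post(
        "https://api.cognitive.microsofttranslator.com/translate",
        params={"api-version": "3.0", "to": "fr"},
        headers={
            "Ocp-Apim-Subscription-Key": "YOUR_KEY",  # placeholder
            "Content-Type": "application/json",
        },
        json=[{"Text": "Hello, world"}],
    )
    resp.raise_for_status()
    print(resp.json()[0]["translations"][0]["text"])  # the French translation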
Source of my info: several websites including
http://translation-blog.multilizer.com/how-to-use-facebook-translate-button/