Boilerpipe to extract non-english news articles - character-encoding

I am trying to use boilerpipe to extract news articles from non-english text. I have already seen this and its not working for me. I made following changes
1) Modified HTMLfetcher.java. Appended following lines before end of method fetch
byte[] utf8 = new String(data, cs.displayName()).getBytes("UTF-8"); //new one (convertion)
cs = Charset.forName("UTF-8"); //set the charset to UFT-8
Or/And then
2) Changes code in class by using UTF-8 charset with Inuts
`URL url = new URL(urls);
InputSource is = new InputSource();
is.setEncoding("ISO-8859-1");
is.setByteStream(url.openStream());
text = ArticleExtractor.INSTANCE.getText(is);`
Still it did not work
Test URL: http://www.sandesh.com/article.aspx?newsid=2905443
Text: મુંબઈ, 30 જાન્યુઆરી
સલમાન ખાને ગુજરાતમાં આવીને નરેન્દ્ર મોદીના વખાણ શુ કર્યા તેની મુસીબતોમાં ખૂબ વધારો થઈ ગયો છે. સલમાન ખાન ફિલ્મ 'જય હો'ના પ્રમોશન માટે ઉત્તરાયણમાં અમદાવાદ આવ્યા હોવાથી અને તે સમયે તેણે નરેન્દ્ર મોદીના વખાણ કર્યા હોવાથી કોંગ્રેસ દ્વારા મુસ્લિમોને તેની ફિલ્મ 'જય હો' ના જોવાની અરજી કરવામાં આવી હતી અને હવે મુસ્લિમ મૌલવીઓ દ્વારા તેના સામે ફતવો જાહેર કરી દેવામાં આવ્યો છે.
Please help me.

You've clearly been able to get ArticleExtractor to parse utf-8 text. The (likely) problem is that boilerplate's algorithms are specifically tailored to English and aren't working so well on a Gujarati (?) article. The algorithms use verbosity of phrases (eg: number of words per phrase) as well as some specific phrases (comments, have your say, etc) to determine the barriers of the article, as well as what pieces within the article are content or non content.
Have a look in the boilerpipe/filters/english directory of the library for more info on the algorithms. Unfortunately to get the same level of accuracy in non-English languages you would need to repeat their study on each language, or have a list of translated stop words and an idea about verbosity for each language you use.

First - The accepted answer is correct. Boilerpipe's algorithms are specifically tailored to English. However that does not mean it cannot return rough content in other languages. Please read complete accepted answer, below can be a crapshoot and you may not always get good content...
Java-
import java.net.URL;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
public class BoilerpipeTest {
public static void main(String[] args) {
try{
//some wrestling match in Russian from Russian newspaper
URL url = new URL("http://www.azeri.ru/az/traditions/kuraj_pehlevanov/");
InputSource is = new InputSource();
is.setEncoding("UTF-8");
is.setByteStream(url.openStream());
String text = ArticleExtractor.INSTANCE.getText(is);
System.out.println(text);
}catch(Exception e){
e.printStackTrace();
}
}
}
Next, if you are using Eclipse-
Click on Run > Run Configurations > and select the Common Tab, then Encoding to Other(UTF-8), then click Run like so:

Related

Weka ARFF generation

I am trying to generate an .arff file from a csv data file I have. Now I am totally new to Weka and have started using it just a day back. I am trying out a simple twitter sentiment analysis with this for starters. I have generated training data in CSV. Contents of CSV file are as follows:
tweet,affinScore,polarity
ATAUTHORcfoblog is giving away a $25 Amex gift card (enter to win over $600 in prizes!) http://t.co/JD8EP14c ,4,4
"American Express has always been my dark horse acquirer of ATAUTHORFoursquare. Bundle in Square-like payments & its a lite-retailer platform, no? ",0,1
African-American Demos Express Ethnic Identity Differently http://t.co/gInv4bKj via ATAUTHORmediapost ,0,3
Google ???????? Visa ? American Express http://t.co/eEZTSiHY ,0,4
Secrets to Success from Small-Business Owners : Lifestyle :: American Express OPEN Forum http://t.co/b85F8JX0 via ATAUTHOROpenForum ,2,1
RT ATAUTHORhunterwalk: American Express has always been my dark horse acquirer of ATAUTHORFoursquare. Bundle in Square-like payments & its a lite ... ,0,1
Winning Surveys $1500 american express Huggies Sweeps http://t.co/WoaTFowp ,4,1
I root for Square mostly because a small business that takes Square is also one that takes American Express. ,0,1
I dont know how bitch be acting American Express but they cards be saying DEBIT ON IT HAVE A ?? PLEASE!!! ,-5,2
Uh oh... RT ATAUTHORBlackArrowBella: I dont know how bitch be acting American Express but they cards be saying DEBIT ON IT HAVE A ?? PLEASE!!! ,-5,2
Just got another credit card. A Blue Sky card with American Express. Its gonna help pay for the honeymoon! ATAUTHORAmericanExpress ,-1,1
Follow ATAUTHORShaveMagazine and ReTweet this msg to be entered to #Win an American Express Gift card. Winners contacted bi-weekly by direct msg! ,2,4
American Express Gold zakelijk aanvragen: http://t.co/xheZwmbt ,0,3
RT ATAUTHORhunterwalk: American Express has always been my dark horse acquirer of ATAUTHORFoursquare. Bundle in Square-like payments & its a lite ... ,0,1
Here first attribute is actual tweet, second is AFFIN score and third is actual classification class (1- Positive, 2-Negative, 3-Neutral, 4-Spam)
Now I try to generate .arff format from it using code:
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;
import java.io.File;
public class CSV2Arff {
/**
* takes 2 arguments:
* - CSV input file
* - ARFF output file
*/
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.out.println("\nUsage: CSV2Arff <input.csv> <output.arff>\n");
System.exit(1);
}
// load CSV
CSVLoader loader = new CSVLoader();
loader.setSource(new File(args[0]));
Instances data = loader.getDataSet();
// save ARFF
ArffSaver saver = new ArffSaver();
saver.setInstances(data);
saver.setFile(new File(args[1]));
saver.setDestination(new File(args[1]));
saver.writeBatch();
}
}
This generates .arff file that looks somewhat like:
#relation file
#attribute tweet {_ATAUTHORcfoblog_is_giving_away_a_$25_Amex_gift_card_(enter_to_win_over_$600_in_prizes!)_http://t.co/JD8EP14c_,'American_Express_has_always_been_my_dark_horse_acquirer_of__ATAUTHORFoursquare._Bundle_in_Square-like_payments_&_its_a_lite-retailer_platform,_no?_',African-American_Demos_Express_Ethnic_Identity_Differently_http://t.co/gInv4bKj_via__ATAUTHORmediapost_,Google_????????_Visa_?_American_Express__http://t.co/eEZTSiHY_,Secrets_to_Success_from_Small-Business_Owners_:_Lifestyle_::_American_Express_OPEN_Forum_http://t.co/b85F8JX0_via__ATAUTHOROpenForum_,RT__ATAUTHORhunterwalk:_American_Express_has_always_been_my_dark_horse_acquirer_of__ATAUTHORFoursquare._Bundle_in_Square-like_payments_&_its_a_lite_..._
#data
_ATAUTHORcfoblog_is_giving_away_a_$25_Amex_gift_card_(enter_to_win_over_$600_in_prizes!)_http://t.co/JD8EP14c_,4,4
'American_Express_has_always_been_my_dark_horse_acquirer_of__ATAUTHORFoursquare._Bundle_in_Square-like_payments_&_its_a_lite-retailer_platform,_no?_',0,1
African-American_Demos_Express_Ethnic_Identity_Differently_http://t.co/gInv4bKj_via__ATAUTHORmediapost_,0,3
Google_????????_Visa_?_American_Express__http://t.co/eEZTSiHY_,0,4
Secrets_to_Success_from_Small-Business_Owners_:_Lifestyle_::_American_Express_OPEN_Forum_http://t.co/b85F8JX0_via__ATAUTHOROpenForum_,2,1
RT__ATAUTHORhunterwalk:_American_Express_has_always_been_my_dark_horse_acquirer_of__ATAUTHORFoursquare._Bundle_in_Square-like_payments_&_its_a_lite_..._,0,1
I am new to Weka but from what I have read, I have a suspicion that this ARFF is not correctly formed. Can anyone comment on it?
Also if it is wrong, can someone point me to where exactly am I going wrong?
Make sure to set the type of the tweet attribute to be arbitrary strings, not a categorial attribute, which seems to be the default. This doesn't scale well, as it puts a copy of every tweet in the type definition otherwise.
Note that for actual analysis of the tweet contents, you will likely need to preprocess them much further. You will likely want a sparse vector representation of the text instead of a long string.
If you are using the UI as previously mentioned then you can just load the file directly into Weka.
If you just want to generate an ARFF file based on the CSV file you can do the following. This was taken from the CSV2Arff tool which comes as part of Weka.
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;
import java.io.File;
public class CSV2Arff {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.out.println("\nUsage: CSV2Arff <input.csv> <output.arff>\n");
System.exit(1);
}
// load CSV
CSVLoader loader = new CSVLoader();
loader.setSource(new File(args[0]));
Instances data = loader.getDataSet();
// save ARFF
ArffSaver saver = new ArffSaver();
saver.setInstances(data);
saver.setFile(new File(args[1]));
saver.setDestination(new File(args[1]));
saver.writeBatch();
}
}

I have 30 or so items in a .txt file which I would like to display in a listbox in wp7

.txt file looks like
Euro
US Dollar
Australian Dollar
Pounds Sterling
Swiss Franc
and so on.
I have tried things like XDocument and ObservableCollection but can't seem to get them to work.
I would rather not hard code so much into xaml.
thanks,
Assuming you're sending the file with the project (file Build action set to "Content"), here's what you need:
First, add a ListBox to the Page called CurrenciesListBox, then add this code on the page load event or constructor:
var xapResolver = new System.Xml.XmlXapResolver();
using (var currenciesStream = (Stream)xapResolver.GetEntity(new Uri("Currencies.txt", UriKind.RelativeOrAbsolute), "", typeof(Stream)))
{
using (var streamReader = new StreamReader(currenciesStream))
{
while (!streamReader.EndOfStream)
{
CurrenciesListBox.Items.Add(streamReader.ReadLine());
}
}
}
Remember to change the filename above to match your file!
This is for starter, there are better ways to do the job (using MVVM)!

Phonegap/Sencha Language Localization

The Problem:
I'm about to implement language localization to an already very large ipad application that was built using sencha touch wrapped in phonegap. I have the english and spanish translations in json files.
What I'm Planning on Doing:
I plan to load the json files into a sencha touch store, creating a global object. Then in each place where I call text that is displayed, i will replace the text with a call to the global object.
My question(s):
Is there an easier way to implement language localization with my
setup?
Will I run into issues with native sencha stuff (like datepickers)?
When loading/reloading language json files, will I have performance
issues (require a webview reload?, sencha object resizing issues,
etc)
edit 1 : Useful Related Info:
For those going down this road, it quickly becomes useful to write a simple phonegap plugin to get the ipad/iphone device's language setting into your javascript. This requires a plugin, which will look something like this:
Javascript :
part 1:
PhoneGap.exec("PixFileDownload.getSystemLanguage");
part 2(callback Function):
setLanguage(returnedLanguage)
{
GlobalVar.CurrentLanguage = returnedLanguage; //GloablVar.CurrentLanguage already defined
}
Objective C:
-(void)getSystemLanguage:(NSMutableArray*)paramArray withDict:(NSMutableDictionary*)options
{
/*Plugin Details
PhoneGap.exec("PixFileDownload.getSystemLanguage");
Returns Language Code
*/
NSUserDefaults* defs = [NSUserDefaults standardUserDefaults];
NSArray* languages = [defs objectForKey:#"AppleLanguages"];
NSString *language = [languages objectAtIndex:0];
NSLog(#"####### This is the language code%#",language);
NSString *jsCallBack;
jsCallBack = [NSString stringWithFormat:#"setLanguage('%#');",language];
[self.webView stringByEvaluatingJavaScriptFromString:jsCallBack];
}
edit 2: character encoding
When adding additional language characters to a sencha project (or any webview phonegap project), ensure that you have the proper encoding specified in the index file. This is the tag i needed.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
I've finished this language localization plugin. It's not amazing, but it worked better than I originally speculated. Here are the short answers to each of the questions.
1- Is there an easier way to implement language localization with my
setup?
Not that i'm aware of. The comment by Stuart provided this link Sencha-touch localization. Use a store or a global JSON object? which had some good information on one way that you can use class overrides. I didn't like this approach, because it spread my language translations into different classes. But certainly, if you are doing something simple, or you want something that could be more powerful, perhaps you should investigate that.
2- Will I run into issues with native sencha stuff (like datepickers)?
I ended up specifically leaving "datepickers" in english for now. But everything else was really relatively easy to customize. Almost every graphical UI element can have it's text altered.
3- When loading/reloading language json files, will I have performance
issues (require a webview reload?, sencha object resizing issues,
etc).
The method that i employed (see below) worked exceptionally well in regards to performance. The one issue that you have is right when you switch the languages, you need that specific page to reload. Sencha handled resizing without any flaws, except where I had been foolish and statically set sizes.
Some of what I did was described in edits to the question. Here is a detailed overview of my solution. (warning, it's not the most elegant solution.)
Instead of using a pure JSON file, I ended up just using a javascript function. This isn't the greatest solution because it requires some minimal maintenance, but JSON parsing with phonegap/sencha isn't the best. (I get JSON files from translater's, and quickly paste into the javascript file. Takes around 2 minutes, see further explanation below).
Language.js
function setLanguage(language)
{
if(language == "en")
{
//console.log("inside if Language == en");
GlobalLanguage.CurrentLanguage = language;
GlobalLanguage.ID = {"glossary": [
{
//CONVERTED JSON
about : 'About',
checking_for_updates : 'Checking for updates...(This may take a few minutes.)'
//Any additional translations
}
]};
}
if (language == "es"){
//console.log("inside language == es");
GlobalLanguage.CurrentLanguage = language;
GlobalLanguage.ID = {"glossary": [
{
//CONVERTED JSON
about : 'Acerca de ',
checking_for_updates : 'Verificando actualizaciones... (Capas que demore algunos minutos).'
//Any additional translations
}]};
}
if (language == "pt"){
//console.log("inside language == pt");
GlobalLanguage.CurrentLanguage = language;
GlobalLanguage.ID = {"glossary": [
{
//CONVERTED JSON
about : 'Sobre',
checking_for_updates : 'Verificando se há atualizações... (pode demorar alguns minutos.)'
//Any additional translations
}]};
}
}
As you can see, this file allows for 3 languages (Portugese, English, and Spanish). After setting the language, you could access each localized string anywhere in your object. For example, if you need to access the word "about" simply use:
GlobalLanguage.ID.glossary[0]["about"]
This will access the GlobalLanguage object which will have whatever language loaded into the properties. So throughout your project, you could have these calls. However, I'd recommend taking it one step further
function langSay(languageIdentifier){
// console.log("inside langSay");
if(!GlobalLanguage.ID.glossary[0][languageIdentifier]){
return "[! LANGUAGE EXCEPTION !]";
}
else{
return GlobalLanguage.ID.glossary[0][languageIdentifier];
}
}
This protects you from having language exceptions and having your program crash without knowing where (you may have hundreds or thousands of properties being set in that language.js file). So now simply :
langSay("about")
One additional note about formatting from JSON. The format you want your language files in is:
languageIdentifier : 'Translation',
languageIdentifier : 'Translation',
languageIdentifier : 'Translation'
I used Excel for the formatting. Also the languageIdentifiers are unique identifiers without spaces. I recommend just using Excel to format first 3 to 4 words word1_word2_word3_word4 of the english translation.
word1_word2_word3 : 'word1 word2 word3'
Hope this helps you out! I'd be happy to respond to any questions

How to show Persian numbers on ASP.NET MVC page?

I'm building a site which needs to support both English and Persian language. The site is built with ASP.NET MVC 3 and .NET 4 (C#). All my controllers inherit from a BaseController, which sets culture to "fa-IR" (during test):
Thread.CurrentThread.CurrentCulture = new CultureInfo("fa-IR");
Thread.CurrentThread.CurrentUICulture = new CultureInfo("fa-IR");
In my views, I'm using static helper classes to convert into right timezone and to format date. Like:
DateFormatter.ToLocalDateAndTime(model.CreatedOnUtc)
I'm doing the same for money with a MoneyFormatter. For money, the currency prints correctly in Persian (or, at least I think so as I don't know any Persian. It will be verified later on though by our translators). The same goes for dates, where the month is printed correctly. This is what it displays as currently:
For money:
قیمت: ريال 1,000
For dates:
21:01 ديسمبر 3
In these examples, I want the numbers to also print in Persian. How do you accomplish that?
I have also tried adding:
<meta http-equiv="Content-Type" content="text/html; charset=windows-1256" />
in the HEAD tag, as recommended on some forums. Does not change anything though (from what I can see). I have also added "fa-IR" as the first language in my browser languages (both IE and Firefox). Didn't help either.
Anyone got any ideas on how to solve this? I'd rather avoid creating translation tables between English numbers and Persian numbers, if possible.
Best regards,
Eric
After some further research I found an answer from a Microsoft employee stating that they don't translate numbers, though it's fully possible to do it yourself using the array of digits for a specific culture, that is provided by the NativeDigits property (see http://msdn.microsoft.com/en-us/library/system.globalization.numberformatinfo.nativedigits.aspx).
To find out if text that is submitted in a form is Arabic or Latin I'm now doing:
public bool IsContentArabic(string content)
{
string pattern = #"\p{IsArabic}";
return Regex.IsMatch(
content,
pattern,
RegexOptions.RightToLeft | RegexOptions.IgnoreCase | RegexOptions.Multiline);
}
public bool IsContentLatin1(string content)
{
string pattern = #"\p{IsBasicLatin}";
return Regex.IsMatch(
content,
pattern,
RegexOptions.IgnoreCase | RegexOptions.Multiline);
}
And to convert Persian digits into their "Latin" equivalents, I wrote this helper:
public static class NumberHelper
{
private static readonly CultureInfo arabic = new CultureInfo("fa-IR");
private static readonly CultureInfo latin = new CultureInfo("en-US");
public static string ToArabic(string input)
{
var arabicDigits = arabic.NumberFormat.NativeDigits;
for (int i = 0; i < arabicDigits.Length; i++)
{
input = input.Replace(i.ToString(), arabicDigits[i]);
}
return input;
}
public static string ToLatin(string input)
{
var latinDigits = latin.NumberFormat.NativeDigits;
var arabicDigits = arabic.NumberFormat.NativeDigits;
for (int i = 0; i < latinDigits.Length; i++)
{
input = input.Replace(arabicDigits[i], latinDigits[i]);
}
return input;
}
}
I've also hooked in before model binding takes place and there I convert digit-only input from forms into Latin digits, if applicable.
Ako, for what direction goes we managed to solve quite a bit of issues using the 'dir' attribute (dir => direction) on the html tag for our Web page. Like this:
<html dir="rtl">
The 'dir' attribute takes either "rtl" (right-to-left) or "ltr" (left-to-right).
Hope this helps someone!
Maybe you have to change the font.
If you know your client supports a particular font, say Badr or Nazanin, then you can change the font-family for those numbers, and I think that will work for you.
You can supply for the font for your page, but I think this only works in modern browsers and fails in the old ones. You can check this.
Hope that helps.
How are you printing the numbers to the output? I don't have much experience with localization but you might have to use the appropriate NumberFormatInfo or something similar to format the number.
Also for greatest portability you should probably be using UTF8 as the encoding.
firs of all you should use a font-face that supports Persian number.
I use this technique and I can show Persian number on my web site.
fonts like:BNazanin.

When parsing XML, the character é is missing

I have an XML as input to a Java function that parses it and produces an output. Somewhere in the XML there is the word "stratégie". The output is "stratgie". How should I parse the XML as to get the "é" character as well?
The XML is not produced by myself, I get it as a response from a web service and I am positive that "stratégie" is included in it as "stratégie".
In the parser, I have:
public List<Item> GetItems(InputStream stream) {
try {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(stream);
doc.getDocumentElement().normalize();
NodeList nodeLst = doc.getElementsByTagName("item");
List<Item> items = new ArrayList<Item>();
Item currentItem = new Item();
Node node = nodeLst.item(0);
if (node.getNodeType() == Node.ELEMENT_NODE) {
Element item = (Element) node;
if(node.getChildNodes().getLength()==0){
return null;
}
NodeList title = item.getElementsByTagName("title");
Element titleElmnt = (Element) title.item(0);
if (null != titleElmnt)
currentItem.setTitle(titleElmnt.getChildNodes().item(0).getNodeValue());
....
Using the debugger, I can see that titleElmnt.getChildNodes().item(0).getNodeValue() is "stratgie" (without the é).
Thank you for your help.
I strongly suspect that either you're parsing it incorrectly or (rather more likely) it's just not being displayed properly. You haven't really told us anything about the code or how you're using the result, which makes it hard to give very concrete advice.
As ever with encoding issues, the first thing to do is work out exactly where data is getting lost. Lots of logging tends to be the way forward: create a small test case that demonstrates the problem (as small as you can get away with) and log everything about the data. Don't just try to log it as raw text: log the Unicode value of each character. That way your log will have all the information even if there are problems with the font or encoding you use to view the log.
The answer was here: http://www.yagudaev.com/programming/java/7-jsp-escaping-html
You can either use utf-8 and have the 'é' char in your document instead of é, or you need to have a parser that understand this entity which exists in HTML and XHTML and maybe other XML dialects but not in pure XML : in pure XML there's "only" ", <, > and maybe &apos; I don't remember.
Maybe you can need to specify those special-char entities in your DTD or XML Schema (I don't know which one you use) and tell your parser about it.

Resources