How to index xml data directly on elasticsearch server - ruby-on-rails

I have about 250 XML data files (each file contains 1,000 XML-formatted items) and one Elasticsearch server. My application is built on Ruby on Rails. I know how to index a model from the Rails application (ModelName.import), which creates the index entries on the Elasticsearch server.
Is there another way to index the XML data files directly on the Elasticsearch server, without using the .import method?
An XML file looks like this (a file may contain 1,000 items):
<?xml version="1.0" encoding="UTF-8"?>
<catalog items="2" total-pages="260" page="1" per-page="2" status="complete">
  <item>
    <sku>1</sku>
    <vbid>1</vbid>
    <created>Sun, 05 Oct 2014 03:35:58 +0000</created>
    <updated>Sun, 06 Mar 2016 12:44:48 +0000</updated>
    <subjects>
      <subject schema="bisac" code="HIS027090">World War I</subject>
      <subject schema="coursesmart" code="cs.soc_sci.hist.milit_hist">Social Sciences -> History -> Military History</subject>
    </subjects>
    <aliases>
      <eisbn-canonical>1</eisbn-canonical>
      <isbn-canonical>1</isbn-canonical>
      <print-isbn-canonical>9780752460864</print-isbn-canonical>
      <fpid/>
      <isbn13>1</isbn13>
      <isbn10>0750951796</isbn10>
      <additional-isbns>
        <isbn type="print-isbn-10">0752460862</isbn>
        <isbn type="print-isbn-13">9780752460864</isbn>
      </additional-isbns>
    </aliases>
  </item>
  <item>
    <sku>2</sku>
    <vbid>2</vbid>
    <created>Sun, 05 Oct 2014 03:35:58 +0000</created>
    <updated>Sun, 06 Mar 2016 12:44:48 +0000</updated>
    <subjects>
      <subject schema="bisac" code="HIS027090">World War I</subject>
      <subject schema="coursesmart" code="cs.soc_sci.hist.milit_hist">Social Sciences -> History -> Military History</subject>
    </subjects>
    <aliases>
      <eisbn-canonical>2</eisbn-canonical>
      <isbn-canonical>2</isbn-canonical>
      <print-isbn-canonical>9780752460864</print-isbn-canonical>
      <fpid/>
      <isbn13>2</isbn13>
      <isbn10>0750951796</isbn10>
      <additional-isbns>
        <isbn type="print-isbn-10">0752460862</isbn>
        <isbn type="print-isbn-13">9780752460864</isbn>
      </additional-isbns>
    </aliases>
  </item>
</catalog>
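One possible route that avoids ModelName.import is to turn each XML file into a request body for Elasticsearch's _bulk endpoint, which takes newline-delimited JSON: an action line followed by a document line for every item. The following is only a minimal sketch under assumptions: the method name catalog_to_bulk, the index name "catalog", and the choice of fields are hypothetical, and it uses REXML, which ships with Ruby. Older Elasticsearch versions also expect a _type in the action line.

```ruby
require "json"
require "rexml/document"

# Convert one catalog XML string into a _bulk request body (NDJSON).
# index_name and the selected fields are illustrative assumptions.
def catalog_to_bulk(xml, index_name: "catalog")
  doc = REXML::Document.new(xml)
  lines = []
  doc.elements.each("catalog/item") do |item|
    sku = item.elements["sku"].text
    source = {
      "sku"      => sku,
      "vbid"     => item.elements["vbid"].text,
      "isbn10"   => item.elements["aliases/isbn10"].text,
      "subjects" => item.get_elements("subjects/subject").map(&:text)
    }
    # Action line: index this document under the item's sku as its id.
    lines << JSON.generate("index" => { "_index" => index_name, "_id" => sku })
    # Document line: the fields to be indexed.
    lines << JSON.generate(source)
  end
  lines.join("\n") + "\n" # a bulk body must end with a newline
end
```

The resulting string could then be sent with an HTTP POST to the server's /_bulk endpoint using the Content-Type application/x-ndjson, one request per XML file.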

Related

Rails: open() returns StringIO instead of Tempfile

I have two valid URLs to two images.
When I run open() on the first URL, it returns an object of type Tempfile (which is what the fog gem expects to upload the image to AWS).
When I run open() on the second URL, it returns an object of type StringIO (which causes the fog gem to crash and burn).
Why is open() not returning a Tempfile for the second URL?
Further, can open() be forced to always return Tempfile?
From my Rails Console:
2.2.1 :011 > url1
=> "https://fbcdn-profile-a.akamaihd.net/hprofile-ak-xpf1/v/t1.0-1/c0.0.448.448/10298878_10103685138839040_6456490261359194847_n.jpg?oh=e2951e1a1b0a04fc2b9c0a0b0b191ebc&oe=56195EE3&__gda__=1443959086_417127efe9c89652ec44058c360ee6de"
2.2.1 :012 > url2
=> "https://fbcdn-profile-a.akamaihd.net/hprofile-ak-xfa1/v/t1.0-1/c0.17.200.200/1920047_10153890268465074_1858953512_n.jpg?oh=5f4cdf53d3e59b8ce4702618b3ac6ce3&oe=5610ADC5&__gda__=1444367255_396d6fdc0bdc158e4c2e3127e86878f9"
2.2.1 :013 > t1 = open(url1)
=> #<Tempfile:/var/folders/58/lpjz5b0n3yj44vn9bmbrv5180000gn/T/open-uri20150720-24696-1y0kvtd>
2.2.1 :014 > t2 = open(url2)
=> #<StringIO:0x007fba9c20ae78 #base_uri=#<URI::HTTPS https://fbcdn-profile-a.akamaihd.net/hprofile-ak-xfa1/v/t1.0-1/c0.17.200.200/1920047_10153890268465074_1858953512_n.jpg?oh=5f4cdf53d3e59b8ce4702618b3ac6ce3&oe=5610ADC5&__gda__=1444367255_396d6fdc0bdc158e4c2e3127e86878f9>, #meta={"last-modified"=>"Tue, 25 Feb 2014 19:47:06 GMT", "content-type"=>"image/jpeg", "timing-allow-origin"=>"*", "access-control-allow-origin"=>"*", "content-length"=>"7564", "cache-control"=>"no-transform, max-age=1209600", "expires"=>"Mon, 03 Aug 2015 22:01:40 GMT", "date"=>"Mon, 20 Jul 2015 22:01:40 GMT", "connection"=>"keep-alive"}, #metas={"last-modified"=>["Tue, 25 Feb 2014 19:47:06 GMT"], "content-type"=>["image/jpeg"], "timing-allow-origin"=>["*"], "access-control-allow-origin"=>["*"], "content-length"=>["7564"], "cache-control"=>["no-transform, max-age=1209600"], "expires"=>["Mon, 03 Aug 2015 22:01:40 GMT"], "date"=>["Mon, 20 Jul 2015 22:01:40 GMT"], "connection"=>["keep-alive"]}, #status=["200", "OK"]>
This is how I'm using fog:
tempfile = open(params["avatar"])
user.avatar.store!(tempfile)
I assume you are using Ruby's built-in open-uri library that allows you to download URLs using open().
In this case Ruby is only obligated to return an IO object. There is no guarantee that it will be a file. My guess is that Ruby makes a decision based on memory consumption: if the download is large, it puts it into a file to save memory; otherwise it keeps it in memory with a StringIO.
As a workaround, you could write a method that writes the stream to a tempfile if it is not already downloaded to a file:
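require "open-uri"
require "tempfile"

def download_to_file(uri)
  # On Ruby 2.5+ call URI.open instead; Kernel#open no longer accepts URLs in Ruby 3.0.
  stream = open(uri, "rb")
  return stream if stream.respond_to?(:path) # Already file-like
  # An explicit basename keeps this working on Ruby versions older than 2.3.
  Tempfile.new("download").tap do |file|
    file.binmode
    IO.copy_stream(stream, file)
    stream.close
    file.rewind
  end
end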
If you're looking for a full-featured gem that does something similar, take a look at "down": https://github.com/janko-m/down
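The open-uri library uses a 10 KB size threshold to choose between StringIO and Tempfile: responses up to that size are kept in a StringIO, larger ones are written to a Tempfile.
My suggestion is to override the constant OpenURI::Buffer::StringMax, which open-uri uses as that threshold.
In an initializer you could set it to 0, so that any non-empty response is written to a Tempfile: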
require 'open-uri'
OpenURI::Buffer.send :remove_const, 'StringMax' if OpenURI::Buffer.const_defined?('StringMax')
OpenURI::Buffer.const_set 'StringMax', 0
This doesn't answer my question - but it provides a working alternative using the httparty gem:
require "httparty"
# Open read/write so the uploader can read the file back after it is written.
File.open("file.jpg", "w+b") do |tempfile|
  tempfile.write HTTParty.get(params["avatar"]).body
  tempfile.rewind
  user.avatar.store!(tempfile)
end

Open URI Wrong Output

I am trying to download images from the web and upload them back to Cloudinary. The code I have works for some images, but not for others. I have isolated the problem down to this line (it requires open-uri):
image = open(params[:product_image][:main])
For this image, it works fine. image is
#<Tempfile:/var/folders/49/bmhbmmzj5fl31dm9j6m6gxr00000gn/T/open-uri20150526-7662-1b676ws>
and cloudinary accepts this. However, when I try to pull this image, image becomes
#<StringIO:0x007fa0267c8f80 #base_uri=#<URI::HTTP:0x007fa0267c92c8 URL:http://www.spiresources.net/WebImages/480/swatch/CELW.JPG>,
#meta={"date"=>"Tue, 26 May 2015 22:17:47 GMT", "server"=>"Apache/2.2.22 (Ubuntu)",
"last-modified"=>"Mon, 29 Jun 2009 00:00:00 GMT", "etag"=>"\"44700f-c35-46d715f090000\"",
"accept-ranges"=>"bytes", "content-length"=>"3125", "content-type"=>"image/jpeg"}, #metas={"date"=>["Tue, 26 May 2015 22:17:47 GMT"], "server"=>["Apache/2.2.22 (Ubuntu)"],
"last-modified"=>["Mon, 29 Jun 2009 00:00:00 GMT"], "etag"=>["\"44700f-c35-46d715f090000\""], "accept-ranges"=>["bytes"],
"content-length"=>["3125"], "content-type"=>["image/jpeg"]}, #status=["200", "OK"]>
which cloudinary rejects and raises an error of "No conversion of StringIO to string". Why does open-uri return different objects for what would seem like similar images? How can I make open-uri return a tempfile or at least turn my StringIO to a tempfile?
You can simply give the URL to the Cloudinary upload method. Then Cloudinary will fetch the remote resource directly.
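If a file-backed object is still needed locally, the StringIO can be copied into a Tempfile by hand. This is a minimal sketch using only the standard library; the method name ensure_tempfile and the "download" basename are hypothetical, and whether a given uploader accepts the result is an assumption to verify.

```ruby
require "stringio"
require "tempfile"

# Return io unchanged if it is already file-backed (responds to #path),
# otherwise copy its contents into a new binary-mode Tempfile.
def ensure_tempfile(io, basename = "download")
  return io if io.respond_to?(:path)

  file = Tempfile.new(basename)
  file.binmode
  io.rewind                 # copy from the start of the in-memory data
  IO.copy_stream(io, file)
  file.rewind               # leave the Tempfile ready for reading
  file
end
```

With open-uri this could be used as image = ensure_tempfile(open(url)), so that small and large downloads both end up as a Tempfile.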

Following links to get RSS entry content with feedzirra

I have a Rails app (3.2.11, Ruby 1.9.3) and I'm trying to read the feed at http://www2c.cdc.gov/podcasts/createrss.asp?t=r&c=66 using feedzirra. Looking at the source XML of the feed, this is what the entries look like:
<item>
<title>In the News - Novel (New) Coronavirus in the Arabian Peninsula and United Kingdom</title>
<description>Novel (New) Coronavirus in the Arabian Peninsula and United Kingdom</description>
<link>http://wwwnc.cdc.gov/travel/notices/in-the-news/coronavirus-arabian-peninsula-uk.htm</link>
<guid isPermaLink="true">http://wwwnc.cdc.gov/travel/notices/in-the-news/coronavirus-arabian-peninsula-uk.htm</guid>
<pubDate>Thu, 07 Mar 2013 05:00:00 EST</pubDate>
</item>
<item>
<title>Outbreaks - Dengue in Madeira, Portugal</title>
<description>Dengue in Madeira, Portugal</description>
<link>http://wwwnc.cdc.gov/travel/notices/outbreak-notice/dengue-madeira-portugal.htm</link>
<guid isPermaLink="true">http://wwwnc.cdc.gov/travel/notices/outbreak-notice/dengue-madeira-portugal.htm</guid>
<pubDate>Wed, 20 Feb 2013 05:00:00 EST</pubDate>
</item>
As you can see, this feed doesn't seem to expose the entry contents, just a link to the underlying article. My question is this: can I use feedzirra to access the content of the original article? If not, any recommendations on good tools out there? wget? mechanize? httparty? Thanks!
Well, I don't know if it's possible with feedzirra, but from what I see in the XML, all you can get is the title and a few other fields such as the description and publication date. I can, however, recommend a tool for this: check out FeedsAPI. It has a simple RSS feeds API and can do what you are trying to achieve. I hope this helps.
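Since each entry only carries a link, another option is to collect the links and download each page separately. The sketch below does not use feedzirra; it relies on REXML and Net::HTTP from Ruby's standard distribution, and the method names item_links and fetch_article are hypothetical. It does not follow redirects or handle errors, and extracting the article text from the downloaded HTML would still need an HTML parser.

```ruby
require "net/http"
require "rexml/document"
require "uri"

# Return the <link> URL of every <item> in an RSS 2.0 document.
def item_links(rss_xml)
  doc = REXML::Document.new(rss_xml)
  REXML::XPath.match(doc, "//item/link").map { |link| link.text.to_s.strip }
end

# Download the raw HTML behind one entry link (no redirects, no error handling).
def fetch_article(url)
  Net::HTTP.get(URI(url))
end
```

A caller could then run item_links(feed_xml).each { |url| html = fetch_article(url) } and pass each page to whatever HTML extraction tool it prefers.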

JqGrid DataBinding exception while exporting to excel file

I am trying to export a JqGrid to Excel, so I followed this instruction and used it as shown below.
var grid = new JqGridModelParticipiant().JqGridParticipiant;
var query = db.ReservationSet.Select(r => new
{
    r.Id,
    Name = r.Doctor.Name,
    Identity = r.Doctor.Identity,
    Title = r.Doctor.Title.Name,
    Total = r.TotalTL,
    Organization = r.Organization.Name
});
grid.ExportToExcel(query,"file.xls");
And I get the exception below on the line grid.ExportToExcel(query,"file.xls");
Data binding directly to a store query (DbSet, DbQuery, DbSqlQuery) is
not supported. Instead populate a DbSet with data, for example by
calling Load on the DbSet, and then bind to local data. For WPF bind
to DbSet.Local. For WinForms bind to DbSet.Local.ToBindingList().
As far as I understand, it expects an ObservableCollection, which is available on the DbSet.Local member. But I am working with a projected query, so I can't do that.
What is the solution to this problem?
In the answer I posted the demo which shows how to implement export to Excel (a real *.XLSX file instead of an HTML fragment renamed to *.XLS, as used here).
The method used for exporting to Excel in jqSuite (the demo) produces an HTML fragment like
HTTP/1.1 200 OK
Cache-Control: private
Content-Type: application/excel; charset=utf-8
Server: Microsoft-IIS/7.0
X-AspNetMvc-Version: 2.0
content-disposition: attachment; filename=grid.xls
X-AspNet-Version: 2.0.50727
X-Powered-By: ASP.NET
Date: Fri, 29 Jun 2012 14:24:54 GMT
Connection: close
<table cellspacing="0" rules="all" border="1" id="_exportGrid" style="border-collapse:collapse;">
<tr>
<td>OrderID</td><td>CustomerID</td><td>OrderDate</td><td>Freight</td><td>ShipName</td>
</tr><tr>
<td>10248</td><td>VINET</td><td>1996/07/04</td><td>32.3800</td><td>Vins et alcools Chevalier</td>
</tr><tr>
<td>10249</td><td>TOMSP</td><td>1996/07/05</td><td>11.6100</td><td>Toms Spezialitäten</td>
</tr><tr>
<td>10250</td><td>HANAR</td><td>1996/07/08</td><td>65.8300</td><td>Hanari Carnes</td>
</tr><tr>
...
</table>
instead of creating a real Excel file. This approach is very unsafe because when the file is opened, the "Standard" data type is always used. For example, if you export data like
<td>10249</td><td>TOMSP</td><td>1996/07/05</td><td>11.02.12</td><td>Toms Spezialitäten</td>
the text "11.02.12" will be automatically converted to the date 11.02.2012 if the German locale is used as the default.
The name "Toms Spezialitäten" will be displayed incorrectly, with the "ä" garbled.
This can be especially dangerous with a large table, where a small part of the data in the middle of the grid is converted incorrectly. In one project I displayed information about software, and some software versions were wrongly converted to the Date type.
Because of these and other related problems I create a real Excel file on the server using Open XML SDK 2.5 or Open XML SDK 2.0. That way one has none of the problems described above. So I recommend that you follow the approach described in my old answer.

how to load a properties file with non-ascii in ant

Suppose I have a properties file test.properties, which is saved as UTF-8:
testOne=测试
I am using the following ant script to load it and echo it to another file:
<loadproperties srcFile="test.properties" encoding="utf-8"/>
<echo encoding="utf-8" file="text.txt">${testOne}</echo>
When I open the generated text.txt file using "utf-8" encoding I see:
??
What's wrong with my script?
Use "encoding" and "escapeunicode" together. It works fine.
<loadproperties srcfile="${your.properties.file}" encoding="UTF-8">
  <filterchain>
    <escapeunicode />
  </filterchain>
</loadproperties>
I found a workaround, but I still don't understand why the original one doesn't work:
<native2ascii src="." dest=".">
  <mapper type="glob" from="test.properties" to="testASCII.properties"/>
</native2ascii>
<loadproperties srcFile="testASCII.properties"/>
Then the echo works as expected.
I don't know why the encoding in loadproperties doesn't work.
Can anyone explain?
Try it this way:
<loadproperties srcfile="non_ascii_property.properties">
  <filterchain>
    <escapeunicode/>
  </filterchain>
</loadproperties>
Apparently the InputStreamReader used here reads with the ISO Latin-1 charset, which kills your non-ASCII characters. I ran into the same issue with Arabic.
What editor were you using and what platform are you on?
Your generated property file might actually be good, but the editor you're using to examine it may be incapable of viewing it. For example, on my Mac, the VIM command line editor can view it (which surprises me), but in Eclipse, it looks like this:
testOne=������
If you're on Unix/Linux/Mac, try using od to dump your generated file, and examine the actual hex code to see what it should be.
For example, I copied your property file, and ran od on a Mac:
$ od -t x1 -t c test.property
0000000 74 65 73 74 4f 6e 65 3d e6 b5 8b e8 af 95 0a
t e s t O n e = 测 ** ** 试 ** ** \n
Here I can see that the code for 测 is e6 b5 8b and 试 is e8 af 95, which is the correct UTF-8 representation for these two characters. (Or, at least, I think so. It shows up correctly in the Character Viewer panel of Mac OS X.)
The right answer is pointed to in this comment by David W.:
how to load a properties file with non-ascii in ant
Java Property Files must be encoded in ISO-8859-1:
http://docs.oracle.com/javase/7/docs/api/java/util/Properties.html
But Unicode escape sequences like \u6d4b can/must be used to encode unicode characters therein.
Tools/ANT-Targets like <native2ascii>, generating ascii-encoded files from natively maintained ones, can help here.
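To make the escaping concrete, here is a minimal sketch of the kind of conversion native2ascii performs, written in Ruby with no external dependencies. The method name escape_non_ascii is hypothetical, and the sketch covers only the \uXXXX escaping of non-ASCII characters, not every detail of the real tool.

```ruby
# Replace every non-ASCII character with Java-style \uXXXX escapes.
# Going through UTF-16 yields one escape per UTF-16 code unit, so characters
# outside the Basic Multilingual Plane become a surrogate pair of escapes.
def escape_non_ascii(text)
  text.encode("UTF-16BE").unpack("n*").map { |unit|
    unit < 0x80 ? unit.chr : format("\\u%04x", unit)
  }.join
end

# "\u6D4B\u8BD5" is the two-character value from test.properties above.
puts escape_non_ascii("testOne=\u6D4B\u8BD5")
```

The output is plain ASCII, so the resulting file can be read with the ISO-8859-1 decoding that Java property files require.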
You can write your own task that reads properties from your default java character encoding for your OS (mine is utf-8), instead of converting your property files to unreadable unicode-escaped ASCII files (designed by people who read and write only English). Here's how to do it by copying and modifying Property.java from Ant's source code to your own package (e.g. org.my.ant). I used Ant 1.10.1.
Download Ant's source code in your format of choice from here:
http://ant.apache.org/srcdownload.cgi
Copy src/main/org/apache/tools/ant/taskdefs/Property.java to your own project (such as a new Java project), as org/my/ant/Property.java.
replace:
package org.apache.tools.ant.taskdefs;
with:
package org.my.ant;
Fix any imports needed by the package change. I just needed to add:
import org.apache.tools.ant.taskdefs.Execute;
In the method:
loadProperties(Properties props, InputStream is, boolean isXml)
replace:
props.load(is);
with:
props.load(new InputStreamReader(is));
In your project's resources folder (could be the same as your source folder), add the file org/my/ant/antlib.xml, with the content:
<?xml version="1.0" encoding="UTF-8"?>
<antlib>
  <taskdef name="property" classname="org.my.ant.Property"/>
</antlib>
Compile this project (Property.java + antlib.xml).
Put the resulting jar in Ant's classpath, as explained here:
http://ant.apache.org/manual/using.html#external-tasks
Then use it in a build.xml file as follows:
<?xml version="1.0" encoding="UTF-8"?>
<project name="example"
    xmlns:my="antlib:org.my.ant"
    default="print"
>
  <my:property file="greek.properties" prefix="example" />
  <target name="print">
    <echo message="${example.a}"/>
  </target>
</project>
The file greek.properties contains:
a: ΑΒΓ
