I am evaluating OSS to implement crawling, indexing and searching a mid-sized ASP.NET (MVC4) website.
So far it looks promising.
Here are some basic questions, which I could not find in the docs:
German Umlauts:
the Renderer/Search for German Umlauts 'ä, ü, ö' fails:
http://localhost:8080/renderer?use=haas&name=gSearch&query=küche
returns
"küche in the search box with no results - there should be results in the index!"
(I created a query "gSearch" with language=German
can OSS return Synonyms like "...did you mean..." WITHOUT having to manually insert every thinkable or unthinkable synonym MANUALLY??
I did not get results until I added "aspx" in Schema->Parser_list-> HTML -> supported extensions
is this correct - or should I add another parser for ASP - ... can I have more than one parser for HTML, ASP, PDF...etc...?
after doing 3. I got results - both aspx and pdf documents... but I did not get a clickable link (filename) for the PDF-Files ??
what would be the best way to call search from MVC? Via Webservices...? I do not want to include an IFRAME
It's always troublesome when several different questions are gathered in one., but here's my take on number 4:
I use a WebRequest, very straightforward.
var webRequest = WebRequest.Create("http://localhost:8080/select?use=haas&query=kitchen");
webRequest.Timeout = 10000;
WebResponse webResponse;
try
{
webResponse = webRequest.GetResponse();
}
catch (WebException ex)
{
WriteToEventLog(ex.Message);
}
var xmlStream = webResponse.GetResponseStream();
var reader = XmlReader.Create(xmlStream);
var doc = XDocument.Load(reader, LoadOptions.PreserveWhitespace);
Then you have yourself an XML with the returned fields set up in your OSS index query.
Related
I tried Importhtml ("https://nepsealpha.com/investment-calandar/dividend","table",) and then Importxml("https://nepsealpha.com/investment-calandar/dividend",xpath). I found out xpath from "selectorgadget" extension of googlechrome, but still couldn't import it. It shows either "empty content" or formula parse error".
You can retrieve quite all the informations this way
=importxml(url,"//div/#data-page")
and then parse the json.
By script : =getData("https://nepsealpha.com/investment-calandar/dividend")
function getData(url) {
var from='data-page="'
var to='"></div></body>'
var jsonString = UrlFetchApp.fetch(url).getContentText().split(from)[1].split(to)[0].replace(/"/g,'"')
var json = JSON.parse(jsonString).props.today_prices_summary.top_volume
var headers = Object.keys(json[0]);
return ([headers, ...json.map(obj => headers.map(header => obj[header]))]);
}
edit
to update periodically, add this script
function update(){
var chk = SpreadsheetApp.getActiveSpreadsheet().getSheets()[0].getRange('A1')
chk.setValue(!chk.getValue())
}
put a trigger as you wish on the update function and change as follows
=getData("https://nepsealpha.com/investment-calandar/dividend",$A$1)
I know that's not the answer you want to see.
It's impossible to get any content from this website using IMPORTXML or other tools included in Google Sheets.
It's generated using Javascript. Once Javascript is disabled no content is displayed:
It's done on purpose. Financial companies pay for live stock data and they don't want to share it with us for free.
So the site is protected against tools like importxml.
I really hope you will be able to help me out on this one.
I am new to pdf.js so for the moment, I am playing around with the pre-built version to see if I can integrate this into my web app.
My problem:
I am using tcpdf to generate a pdf file which I would like to visualize using pdf.js without having to save it to a file on the server.
I have a php file (generate_document.php) that I use to generate the pdf. The file ends with the following:
$pdf->Output('test.pdf', 'I');
according to the tcpdf documentation, the second parameter can be used to generate the following formats:
I: send the file inline to the browser (default). The plug-in is used if available. The name given by name is used when one selects the "Save as" option on the link generating the PDF.
D: send to the browser and force a file download with the name given by name.
F: save to a local server file with the name given by name.
S: return the document as a string (name is ignored).
FI: equivalent to F + I option
FD: equivalent to F + D option
E: return the document as base64 mime multi-part email attachment (RFC 2045)
Then, I would like to view the pdf using pdf.js without creating a file on the server (= not using 'F' as a second parameter and passing the file name to pdf.js).
So, I thought I could simply create an iframe and call the pdf.js viewer pointing to the php file:
<iframe width="100%" height="100%" src="/pdf.js_folder/web/viewer.html?file=get_document.php"></iframe>
However, this is not working at all....do you have any idea what I am overlooking? Or is this option not available in pdf.js?
I have done some research and I have seen some posts here on converting a base64 stream to a typed array but I do not see how this would be a solution to this problem.
Many thanks for your help!!!
EDIT
#async, thanks for your anwer.
I got it figured out in the meantime, so I thought I'd share my solution with you guys.
1) In my get_document.php, I changed the output statement to convert it directly to base64 using
$pdf_output = base64_encode($pdf->Output('test_file.pdf', 'S'));
2) In viewer.js, I use an XHR to call the get_document.php and put the return in a variable (pdf_from_XHR)
3) Next, I convert what came in from the XHR request using the solution that was already mentioned in a few other posts (e.g. Pdf.js and viewer.js. Pass a stream or blob to the viewer)
pdf_converted = convertDataURIToBinary(pdf_from_XHR)
function convertDataURIToBinary(dataURI) {
var base64Index = dataURI.indexOf(BASE64_MARKER) + BASE64_MARKER.length;
var base64 = dataURI.substring(base64Index);
var raw = window.atob(base64);
var rawLength = raw.length;
var array = new Uint8Array(new ArrayBuffer(rawLength));
for (i = 0; i < rawLength; i++) {
array[i] = raw.charCodeAt(i);
}
return array;
}
et voilà ;-)
Now i can inject what is coming from that function into the getDocument statement:
PDFJS.getDocument(pdf_converted).then(function (pdf) {
pdfDocument = pdf;
var url = URL.createObjectURL(blob);
PDFView.load(pdfDocument, 1.5)
})
I'm working on a Grails / Backbone / Handlebars application that's a front end to a much larger legacy Java system in which (for historical & customizability reasons) internationalization messages are deep in a database hidden behind a couple of SOAP services which are in turn hidden behind various internal Java libraries. Getting at these messages from the Grails layer is easy and works fine.
What I'm wondering, though, is how to get (for instance) internationalized labels into my Handlebars templates.
Right now, I'm using GSP fragments to generate the templates, including a custom tag that gets the message I'm interested in, something like:
<li><myTags:message msgKey="title"/> {{title}}</li>
However, for performance and code layout reasons I want to get away from GSP templates and get them into straight HTML. I've looked a little into client-side internationalization options such as i18n.js, but they seem to depend on the existence of a messages file I haven't got. (I could generate it, possibly, but it would be ginormous and expensive.)
So far the best thing I can think of is to wedge the labels into the Backbone model as well, so I'd end up with something like
<li>{{titleLabel}} {{title}}</li>
However, this really gets away from the ideal of building the Backbone models on top of a nice clean RESTful JSON API -- either the JSON returned by the RESTful service is cluttered up with presentation data (i.e., localized labels), or I have to do additional work to inject the labels into the Backbone model -- and cluttering up the Backbone model with presentation data seems wrong as well.
I think what I'd like to do, in terms of clean data and clean APIs, is write another RESTful service that takes a list of message keys and similar, and returns a JSON data structure containing all the localized messages. However, questions remain:
What's the best way to indicate (probably in the template) what message keys are needed for a given view?
What's the right format for the data?
How do I get the localized messages into the Backbone views?
Are there any existing Javascript libraries that will help, or should I just start making stuff up?
Is there a better / more standard alternative approach?
I think you could create quite an elegant solution by combining Handelbars helpers and some regular expressions.
Here's what I would propose:
Create a service which takes in a JSON array of message keys and returns an JSON object, where keys are the message keys and values are the localized texts.
Define a Handlebars helper which takes in a message key (which matches the message keys on the server) and outputs an translated text. Something like {{localize "messageKey"}}. Use this helper for all template localization.
Write a template preprocessor which greps the message keys from a template and makes a request for your service. The preprocessor caches all message keys it gets, and only requests the ones it doesn't already have.
You can either call this preprocessor on-demand when you need to render your templates, or call it up-front and cache the message keys, so they're ready when you need them.
To optimize further, you can persist the cache to browser local storage.
Here's a little proof of concept. It doesn't yet have local storage persistence or support for fetching the texts of multiple templates at once for caching purposes, but it was easy enough to hack together that I think with some further work it could work nicely.
The client API could look something like this:
var localizer = new HandlebarsLocalizer();
//compile a template
var html = $("#tmpl").html();
localizer.compile(html).done(function(template) {
//..template is now localized and ready to use
});
Here's the source for the lazy reader:
var HandlebarsLocalizer = function() {
var _templateCache = {};
var _localizationCache = {};
//fetches texts, adds them to cache, resolves deferred with template
var _fetch = function(keys, template, deferred) {
$.ajax({
type:'POST',
dataType:'json',
url: '/echo/json',
data: JSON.stringify({
keys: keys
}),
success: function(response) {
//handle response here, this is just dummy
_.each(keys, function(key) { _localizationCache[key] = "(" + key + ") localized by server"; });
console.log(_localizationCache);
deferred.resolve(template);
},
error: function() {
deferred.reject();
}
});
};
//precompiles html into a Handlebars template function and fetches all required
//localization keys. Returns a promise of template.
this.compile = function(html) {
var cacheObject = _templateCache[html],
deferred = new $.Deferred();
//cached -> return
if(cacheObject && cacheObject.ready) {
deferred.resolve(cacheObject.template);
return deferred.promise();
}
//grep all localization keys from template
var regex = /{{\s*?localize\s*['"](.*)['"]\s*?}}/g, required = [], match;
while((match = regex.exec(html))) {
var key = match[1];
//if we don't have this key yet, we need to fetch it
if(!_localizationCache[key]) {
required.push(key);
}
}
//not cached -> create
if(!cacheObject) {
cacheObject = {
template:Handlebars.compile(html),
ready: (required.length === 0)
};
_templateCache[html] = cacheObject;
}
//we have all the localization texts ->
if(cacheObject.ready) {
deferred.resolve(cacheObject.template);
}
//we need some more texts ->
else {
deferred.done(function() { cacheObject.ready = true; });
_fetch(required, cacheObject.template, deferred);
}
return deferred.promise();
};
//translates given key
this.localize = function(key) {
return _localizationCache[key] || "TRANSLATION MISSING:"+key;
};
//make localize function available to templates
Handlebars.registerHelper('localize', this.localize);
}
We use http://i18next.com for internationalization in a Backbone/Handlebars app. (And Require.js which also loads and compiles the templates via plugin.)
i18next can be configured to load resources dynamically. It supports JSON in a gettext format (supporting plural and context variants).
Example from their page on how to load remote resources:
var option = {
resGetPath: 'resources.json?lng=__lng__&ns=__ns__',
dynamicLoad: true
};
i18n.init(option);
(You will of course need more configuration like setting the language, the fallback language etc.)
You can then configure a Handlebars helper that calls i18next on the provided variable (simplest version, no plural, no context):
// namespace: "translation" (default)
Handlebars.registerHelper('_', function (i18n_key) {
i18n_key = Handlebars.compile(i18n_key)(this);
var result = i18n.t(i18n_key);
if (!result) {
console.log("ERROR : Handlebars-Helpers : no translation result for " + i18n_key);
}
return new Handlebars.SafeString(result);
});
And in your template you can either provide a dynamic variable that expands to the key:
<li>{{_ titleLabeli18nKey}} {{title}}</li>
or specify the key directly:
<li>{{_ "page.fancy.title"}} {{title}}</li>
For localization of datetime we use http://momentjs.com (conversion to local time, formatting, translation etc.).
I'm Extremely new to this and I've been trying to get the title of each unique forum page (or topic) here is the code I have so far:
function GraalGet() {
//parses forums for ALL posts one by one, extract <title> from HTML webpage
var sheet = SpreadsheetApp.getActiveSheet();
var i = 31
var url = "http://www.graalians.com/forums/showthread.php?p="+i;
//var params = {method : "post"}; can this be used at all?
//The aim: loop this once you can get 1 result.
var geturl = UrlFetchApp.fetch(url).getContentText(); //maybe .getContentText should be elsewhere?
var parseurl = Xml.parse(geturl, true); //confirmed - this is true because it wont parse HTML if false
var titleinfo = parseurl.getElement().getElement("html"); //.getElement('body');//.getElements("title");
sheet.appendRow([titleinfo, i]);
}
In addition the script would write down the topic number in the adjoining cell.
There's a lot of answered questions about extracting XML data, and this example is about parsing HTML but I couldn't pull up any results - I'm honestly stumped and any help about finding and extracting the tag will be appreciated. (If you have the time, please feel free to explain as well, but I'll be thankful for any help really.)
For reference I have used these:
Google's Kevin Bacon Script
The authors comments on bugs with the script & some explanation
I'm sorry if I'm being pedantic, this is my first post & I don't want to anger anyone, please do tell me if I've broken any rules, I'll do my best to fix them. I've left the comments I made for myself for your perusal too.
You can use Logger.log to print out debugging information. I did this with your function and figured out that the title tag is embedded within the tag. So you should use something like this. Also, getElement returns an XmlElement object which you should convert to String using getText().
var titleinfo = parseurl.getElement().getElement('head').getElement('title');
sheet.appendRow([titleinfo.getText(), i]);
I have an XML as input to a Java function that parses it and produces an output. Somewhere in the XML there is the word "stratégie". The output is "stratgie". How should I parse the XML as to get the "é" character as well?
The XML is not produced by myself, I get it as a response from a web service and I am positive that "stratégie" is included in it as "stratégie".
In the parser, I have:
public List<Item> GetItems(InputStream stream) {
try {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(stream);
doc.getDocumentElement().normalize();
NodeList nodeLst = doc.getElementsByTagName("item");
List<Item> items = new ArrayList<Item>();
Item currentItem = new Item();
Node node = nodeLst.item(0);
if (node.getNodeType() == Node.ELEMENT_NODE) {
Element item = (Element) node;
if(node.getChildNodes().getLength()==0){
return null;
}
NodeList title = item.getElementsByTagName("title");
Element titleElmnt = (Element) title.item(0);
if (null != titleElmnt)
currentItem.setTitle(titleElmnt.getChildNodes().item(0).getNodeValue());
....
Using the debugger, I can see that titleElmnt.getChildNodes().item(0).getNodeValue() is "stratgie" (without the é).
Thank you for your help.
I strongly suspect that either you're parsing it incorrectly or (rather more likely) it's just not being displayed properly. You haven't really told us anything about the code or how you're using the result, which makes it hard to give very concrete advice.
As ever with encoding issues, the first thing to do is work out exactly where data is getting lost. Lots of logging tends to be the way forward: create a small test case that demonstrates the problem (as small as you can get away with) and log everything about the data. Don't just try to log it as raw text: log the Unicode value of each character. That way your log will have all the information even if there are problems with the font or encoding you use to view the log.
The answer was here: http://www.yagudaev.com/programming/java/7-jsp-escaping-html
You can either use utf-8 and have the 'é' char in your document instead of é, or you need to have a parser that understand this entity which exists in HTML and XHTML and maybe other XML dialects but not in pure XML : in pure XML there's "only" ", <, > and maybe ' I don't remember.
Maybe you can need to specify those special-char entities in your DTD or XML Schema (I don't know which one you use) and tell your parser about it.