Lucene/Solr unexpected query answer - parsing

I'm using Solr 4.4.0 running on Tomcat 7.0.29.
The solrconfig.xlm is as-delivered (excepted for the Solr home directory of course).
I could pass on the schema.xml, though I doubt this would help much, as the following will show.
If I select all documents containing "russia" in the text, which is the default field, ie if I execute the query "russia", I find only 1 document, which is correct.
If I select all documents containing "web" in the text ("web"), the result is 29, which is also correct.
If I search for all documents that do not contain "russia" ("NOT(russia)"), the result is still correct (202).
If I search for all documents that contain "web" and do not contain "russia" ("web AND NOT(russia)"), the result is, once again, correct (28, because the document containing "russia" also contains "web").
But if I search for all documents that contain "web" or do not contain "russia" ("web OR NOT(russia)"), the result is still 28, though I should get 203 matches (the whole set).
Has anyone got an explanation ??
For information, the AND and OR work correctly if I don't use a NOT somewhere in the query, i.e. :
"web AND russia" --> OK
"web OR russia" --> OK

I got a solution from Yonik Seeley, which is to translate the NOT(russia) into (*:* -russia), so that a positive value (: i.e. all documents) can be used to substract from (-russia). This solution works very well. I still believe that it would be a good idea to modify the parser so that the strainghtforward request "web OR NOT(russia)" would work without translation.

Related

importxml of url with Hebrew returns in encoding other than UTF-8 that chrome doesn't recognize

For example, in the dummy spreadsheet (tab 'desired outcome'), under "Link 1" you will see this URL:
http://www.promotion-il.co.il/service/%5DE%5E4%5D9%5E5-%5E8%5D9%5D7-%5D7%5E9%5DE%5DC%5D9-%5DC%5E2%5E1%5E7%5D9%5DD/
However, the actual URL in UTF-8 is:
http://www.promotion-il.co.il/service/%D7%9E%D7%A4%D7%99%D7%A5-%D7%A8%D7%99%D7%97-%D7%97%D7%A9%D7%9E%D7%9C%D7%99-%D7%9C%D7%A2%D7%A1%D7%A7%D7%99%D7%9D/
The actual URL string that contains Hebrew is:
http://www.promotion-il.co.il/service/מפיץ-ריח-חשמלי-לעסקים/
I will also add that the same URL has returned with a proper UTF-8 encoding for other blog posts. (See second example in the same tab).
Why is it happening?
How can it be fixed?
Thanks in advance!
This is the solution I came up with eventually:
I saw that for the imported urls - in order to fix a broken url 2 repalcements were needed:
5D --> D7%9
5E --> D7%A
I used this formula in a separate column to achieve it:
==ARRAYFORMULA(SUBSTITUTE(SUBSTITUTE((<COLUMN WITH IMPORTED URLS HERE>),"5D","D7%9"),"5E","D7%A"))

SEC company filings: Is the <SEC-HEADER> tag valid SGML? If so, how to parse it?

I tried to parse SEC company filings from sec.gov. Starting from fb 10-Q index.htm let's look at a complete text submission filing like complete submission text filing. It has a structure like:
<SEC-DOCUMENT>
<SEC-HEADER>
<ACCEPTANCE-DATETIME>"some content" This tag is not closed.
"some lines resembling yaml markup"
These are indented lines with a
"key": "value" structure.
</SEC-HEADER>
<DOCUMENT>
.
.
some content
.
.
</DOCUMENT>
"several DOCUMENT tags" ...
</SEC-DOCUMENT>
I tried to figure out the structure of the <SEC-HEADER> tag and found some information under Public Dissemination
Service (PDS) Technical
Specification (pdf) and concluded that the content of the header should be SGML.
Nevertheless, I am clueless about the formatting, since there are no angle brackets, and the keys - value paires are separated by colons like key: value instead of <key>value</key>. In the pdf link I could not find anything about colons.
Question: Is the <SEC-HEADER> tag valid SGML? If it is, how to parse it?
I'd be glad at any help.
The short answer is no. The <SEC-HEADER> tag in the raw filing is not a valid SGML.
However, it is my understanding that this section in the raw filing is parsed automatically from the header file <accession_num>.hdr.sgml, which does follow SGML. This header file can be found in the same directory as the raw filing (i.e., the <accession_num>.txt file).
I use a REGEX of the form: ^<(.+?)>(.+?)$ (with re.MULTILINE option) to capture each (tag, value) tuple and get the results directly in a dict().
I believe the only tag in that file that has a closing tag is the </FILER> tag, where there could be multiple filers in each filing. You can first extract those using a REGEX of the form: <FILER>(.+?)</FILER> and then employ the same REGEX as above to get the inner tags for each filer.
Note that other than 'FILER', there could be other tags, representing different relations of the entities to the filing. Those are 'ISSUER', 'SUBJECT COMPANY', 'FILED BY', 'FILED FOR', 'SERIAL COMPANY', 'REPORTING OWNER'.

Using google query to download parts of a published sheet

This works:
curl 'https://docs.google.com/spreadsheets/d/e/2PACX-1vS3iBtVf4i_won5zAN9NGPqhcd6CcTb-4QHxpisSjCmlgV95B6mFmZvtMaC9GPvD7m8kD-6XLkVAhfc/pub?gid=911257845&single=true&output=csv'
however I want to only pick up rows where count > 300.
The query before encoding would be
select * where F > 300
After encoding
select%20*%20where%20F%3E300
So the url becomes
https://docs.google.com/spreadsheets/d/e/2PACX-1vS3iBtVf4i_won5zAN9NGPqhcd6CcTb-4QHxpisSjCmlgV95B6mFmZvtMaC9GPvD7m8kD-6XLkVAhfc/pub?gid=911257845&output=csv&tq=select%20*%20where%20F%3E300
The line above works retrieves a file, but it returns the whole file, and doesn't filter.
Note that a published web sheet has the form
https://docs.google.com/spreadsheets/d/e/KEY/pub?gid=GID
https://docs.google.com/spreadsheets/d/e/2PACX-1vS3iBtVf4i_won5zAN9NGPqhcd6CcTb-4QHxpisSjCmlgV95B6mFmZvtMaC9GPvD7m8kD-6XLkVAhfc/pub?gid=911257845
This works. Adding &output=csv to it (no space before the &) works, and it downloads as a csv file. This opens in excel and shows the data in the table.
I tried this:
https://docs.google.com/spreadsheets/d/e/2PACX-1vS3iBtVf4i_won5zAN9NGPqhcd6CcTb-4QHxpisSjCmlgV95B6mFmZvtMaC9GPvD7m8kD-6XLkVAhfc/pub?gid=911257845&output=csv&tq=select%20*%20where%20F%3E%20300
and
https://docs.google.com/spreadsheets/d/e/2PACX-1vS3iBtVf4i_won5zAN9NGPqhcd6CcTb-4QHxpisSjCmlgV95B6mFmZvtMaC9GPvD7m8kD-6XLkVAhfc/gviz/tq?gid=911257845&output=csv&tq=select%20*%20where%20F%3E300
and get errors -- resource not available.
The page above should be public for people who want to try.
This may be an issue between publishing a sheet, and sharing a whole spread sheet to anyone who has the link.
I've created a new page that uses importrange() that slurps up the page from the main sheet, and that one is public.
https://docs.google.com/spreadsheets/d/1-lqLuYJyHAKix-T8NR8wV8ZUUbVOJrZTysccid2-ycs/edit?usp=sharing
How about this modification?
Modification points :
When it uses query, please use like https://docs.google.com/spreadsheets/d/### file ID ###/gviz/tq?gid=###&tq=### query ###.
When select%20*%20where%20%F%3E300 is decoded, it is select * where %F>300.
select * where F > 300 is select%20%2a%20where%20F%20%3e%20300.
In order to output CSV, please use tqx=out:csv.
Please share the Spreadsheet.
On Google Drive
On the Spreadsheet file
right-click -> Share -> Advanced -> Click "change" at "Private - Only you can access"
Check "On Anyone with the link"
Click "Save"
At "Link to share", copy URL.
Retrieve file ID from https://docs.google.com/spreadsheets/d/### file ID ###/edit?usp=sharing
Modified curl command :
curl 'https://docs.google.com/spreadsheets/d/### file ID ###/gviz/tq?gid=911257845&tq=select%20%2a%20where%20F%20%3e%20300&tqx=out:csv'
Reference :
Query Language Reference
If I misunderstand your question, I'm sorry.
Edit :
The following 2 URLs are the comparison between your URL and my answer. The URL of my answer was matched to your URL.
1. Your URL
https://docs.google.com/spreadsheets/d/e/2PACX-1vS3iBtVf4i_won5zAN9NGPqhcd6CcTb-4QHxpisSjCmlgV95B6mFmZvtMaC9GPvD7m8kD-6XLkVAhfc/gviz/tq?gid=911257845&output=csv&tq=select%20*%20where%20F%3E300
When above URL is separated,
https://docs.google.com/spreadsheets/d/e/
e/ is not required.
2PACX-1vS3iBtVf4i_won5zAN9NGPqhcd6CcTb-4QHxpisSjCmlgV95B6mFmZvtMaC9GPvD7m8kD-6XLkVAhfc
This is not the file ID of spreadsheet.
/gviz/tq
gid=911257845
output=csv
tq=select%20*%20where%20F%3E300
2. In my answer matched to your URL
https://docs.google.com/spreadsheets/d/### file ID ###/gviz/tq?gid=###&tqx=out:csv&tq=### query ###
When above URL is separated,
https://docs.google.com/spreadsheets/d/
### file ID ###
You can see the detail of the file ID of spreadsheet at here.
/gviz/tq
gid=###
You can use gid=911257845.
tqx=out:csv
This has to be used instead of output=csv.
tq=### query ###
You can use tq=select%20*%20where%20F%3E300.
Note :
Each number corresponds.
And please share the Spreadsheet as follows. This is difference from "Publish to the web" on Spreadsheet.
On Google Drive
On the Spreadsheet file
right-click -> Share -> Advanced -> Click "change" at "Private - Only you can access"
Check "On Anyone with the link"
Click "Save"
At "Link to share", copy URL.
Retrieve file ID from ``https://docs.google.com/spreadsheets/d/###

PHPEXCEL weird characters on form inputs

I need some help with PHPEXCEL library, everything works great, I'm successfully extracting my SQL query to excel5 file, I need to give this file to transport company in order to auto collect informations about packages, unfotunately the generated excel file has some ascii characters between each letter of the cell text, and when the excel file is imported you need to manually delete these charaters.
If I open the excel file, everything is fine I see: COMPANY NAME, If I open the excel file with notepad++, I see the cell values this way: C(NUL)O(NUL)M(NUL)P(NUL)A(NUL)N(NUL)Y N(NUL)A(NUL)M(NUL)E
If I open again the file with excel and save, then reopen with notepad++ I see COMPANY NAME.
So I do not understan why every time I create an excel file using PHPEXCEL my every letter of all words are filled with (nul) every letter.
So how do I prevent the generated excel file to include (nul) between every word????
Also if you open the original excel file generated from PHPExcel samples are also filled with (nul) and if you open and save it, the (nul) is gone.
Any help would be appreciated, thanks.
what is the (nul) ??? 0x00??? char(0)???
ok, here is the example:
error_reporting(E_ALL);
ini_set('display_errors', TRUE);
ini_set('display_startup_errors', TRUE);
date_default_timezone_set('Europe/London');
if (PHP_SAPI == 'cli')
die('Disponibile solo su browser');
require_once dirname(__FILE__) . '/Classes/PHPExcel.php';
$objPHPExcel = new PHPExcel();
$objPHPExcel->getProperties()->setCreator("Solidus")
->setLastModifiedBy("Solidus")
->setTitle("Import web")
->setSubject("Import File")
->setDescription("n.a")
->setKeywords("n.a")
->setCategory("n.a");
$objPHPExcel->setActiveSheetIndex(0)
->setCellValueExplicit("A1", "COMPANY")
->setCellValue('A2', 'SAMSUNG');
$objPHPExcel->getActiveSheet()->setTitle('DDT');
$objPHPExcel->setActiveSheetIndex(0);
header('Content-Type: application/vnd.ms-excel');
header('Content-Disposition: attachment;filename="TEST.xls"');
header('Cache-Control: max-age=0');
header('Cache-Control: max-age=1');
header('Cache-Control: private',false);
$objWriter = PHPExcel_IOFactory::createWriter($objPHPExcel, 'Excel5');
ob_end_clean();
$objWriter->save('php://output');
As you can see from this little example, this scripts creates a file excel5 with 2 cells, A1 = COMPANY, A2 = SAMSUNG
when I send this file to the transport company, they import the file into their system, but as you can see from the picture, there is an weird character between each letter.
so I noticed every time I open the generated Excel5 with notepad++ file I get:
S(nul)A(nul)M(nul)S(nul)U(nul)N(nul)G
If I save the save with excel and then open it again with notepad++ I get:
SAMSUNG
and this file is ok for the transport company
so my question is, how should I avoid the file generated to contain thi '(nul) charachter between each letter????
some help?
weird characters
SAMSUNG
I found the soluion by myself, I explain just in case anyone has also this problem:
there is not way to change the way the excelfile is encoded by PHPEXCEL
so I figured out the problem was reading the file, I did some simulations and reproduce the problem, every time a read the file and put the result into inputs a get weird characters:
C�O�M�P�A�N�Y�
If I set the output enconding enconding as follows:
$excel->setOutputEncoding('UTF-8');
the file loads fine, so the problem was not creating the excel file, but reading the excel file.
If I print the variable with ECHO I get: "COMPANY",
if I put the variable on input as value I get: "C�O�M�P�A�N�Y�"
setting the output solves the problem, but I would like to know why the difference when I put the variable on input as value, thanks

Text searching PDF

When parsing a PDF, given a string (popped from the Tj or TJ operator callbacks) with the Identity-H encoding how do you map that string to a unicode (say UTF8) representation?
If I need a CMap for this, how do I create (or retrieve) and apply the CMap?
You'll probably have to parse the font data itself. Identity-H just means "use the bytes as raw glyph indexes into the given font". That's why you MUST embed fonts when using Identity-H... different versions of the same font need not have the same glyph order.
There's example code on how to do this sort of thing in several different open source projects. iText, for example (yes, I'm biased).
You'd mentioned a CMap. Identity-H fonts can have a CMap but aren't required to do so. The /ToUnicode entry will be a stream that is a CMap, as defined in some adobe spec somewhere. They aren't all that complex:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (TTX+0)
/Ordering (T42UV)
/Supplement 0
>> def
/CMapName /TTX+0 def
/CMapType 2 def
1 begincodespacerange
<0000><FFFF>
endcodespacerange
80 beginbfrange
<0003><0003><0020>
<0024><0024><0041>
<0025><0025><0042>
<0026><0026><0043>
<0027><0027><0044>
<0028><0028><0045>
<0029><0029><0046>
<002a><002a><0047>
<002b><002b><0048>
<002c><002c><0049>
<002d><002d><004a>
<002e><002e><004b>
<002f><002f><004c>
<0030><0030><004d>
<0031><0031><004e>
<0032><0032><004f>
<0033><0033><0050>
<0034><0034><0051>
<0035><0035><0052>
<0036><0036><0053>
<0037><0037><0054>
<0038><0038><0055>
<0039><0039><0056>
<003a><003a><0057>
<003b><003b><0058>
<003c><003c><0059>
<003d><003d><005a>
<0065><0065><00c9>
<00c8><00c8><00c1>
<00cb><00cb><00cd>
<00cf><00cf><00d3>
<00d2><00d2><00da>
<00e2><00e2><0160>
<00e4><00e4><017d>
<00e9><00e9><00dd>
<00fd><00fd><010c>
<0104><0104><0104>
<0106><0106><010e>
<0109><0109><0118>
<010b><010b><011a>
<0115><0115><0147>
<011b><011b><0158>
<0121><0121><0164>
<0123><0123><016e>
<01a0><01a0><0116>
<01b2><01b2><012e>
<01cb><01cb><016a>
<01cf><01cf><0172>
<022c><022c><0401>
<023b><023b><0411>
<023c><023c><0412>
<023d><023d><0413>
<023e><023e><0414>
<023f><023f><0415>
<0240><0240><0416>
<0241><0241><0417>
<0242><0242><0418>
<0243><0243><0419>
<0244><0244><041a>
<0245><0245><041b>
<0246><0246><041c>
<0247><0247><041d>
<0248><0248><041e>
<0249><0249><041f>
<024a><024a><0420>
<024b><024b><0421>
<024c><024c><0422>
<024d><024d><0423>
<024e><024e><0424>
<024f><024f><0425>
<0250><0250><0426>
<0251><0251><0427>
<0252><0252><0428>
<0253><0253><0429>
<0254><0254><042a>
<0255><0255><042b>
<0256><0256><042c>
<0257><0257><042d>
<0258><0258><042e>
<0259><0259><042f>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end
Wow. That particular CMap is horribly inefficient. A "bfrange" starts from parameter 1, and goes to and includes parameter 2, maping values starting at parameter 3 (and continuing on until there are no more things to map.
For example:
<0003><0003><0020>
<0024><0024><0041>
<0025><0025><0042>
<0026><0026><0043>
<0027><0027><0044>
<0028><0028><0045>
<0029><0029><0046>
<002a><002a><0047>
<002b><002b><0048>
<002c><002c><0049>
<002d><002d><004a>
<002e><002e><004b>
<002f><002f><004c>
<0030><0030><004d>
<0031><0031><004e>
<0032><0032><004f>
could be represented as
<0003><0003><0020>
<0024><0032><0041>
A quick google search turned up the CMap/CID font spec.
There are also beginbfchar/endbfchar which just take two parameters (src and dest values, no ranges), CID based versions (at which point you need to have access to Adobe's character ID tables. They're part of Acrobat/Reader installations, though Reader will need to be prodded into downloading the various Language Packs (or kits or whatever they're called)), and various other stuff you really out to read that spec to find out about.
There are multiple ways this data may be encoded (some using CMAPs). You can also have custom encodings (http://www.jpedal.org/PDFblog/2011/04/understanding-the-pdf-file-format-%E2%80%93-custom-font-encodings/). You also need to understand CID fonts (http://www.jpedal.org/PDFblog/2011/03/understanding-the-pdf-file-format-%E2%80%93-what-are-cid-fonts/)

Resources