Using "import" vs. "from X import" - python-import

I'm working through Head First Python, and there's an example:
from datetime import datetime

odds = [ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19,
         21, 23, 25, 27, 29, 31, 33, 35, 37, 39,
         41, 43, 45, 47, 49, 51, 53, 55, 57, 59 ]

right_this_minute = datetime.today().minute

if right_this_minute in odds:
    print("This minute seems a little odd.")
else:
    print("Not an odd minute.")
Now if I substitute "import datetime" for the "from datetime import datetime", the interpreter gives me an error:
right_this_minute = datetime.today().minute
AttributeError: module 'datetime' has no attribute 'today'
I don't understand why the "from datetime import datetime" works, but "import datetime" does not. I've gone through a number of stackoverflow Q&A's about this, but I'm obviously missing something.
Any suggestions would be greatly appreciated.

First of all, there are two "things" called datetime: the module and a class defined by the module.
The two import options you use have different behaviours.
When you run:
from datetime import datetime
the first datetime is the module and the second is the class; Python imports only that one class from the module. From then on, Python will understand the name datetime to refer to the class.
When you run:
import datetime
you import the whole module, so Python will understand datetime to be the module. To access the datetime class, you need to write datetime.datetime.
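To make the difference concrete, here is a minimal sketch (not from the book) of both forms side by side; the only thing that changes is which object the name datetime is bound to:

# Form 1: bind the class. The name "datetime" now refers to the class,
# so today() can be called on it directly.
from datetime import datetime
print(datetime.today().minute)

# Form 2: bind the module (run this in a fresh interpreter). The name
# "datetime" now refers to the module, and the class lives inside it.
import datetime
print(datetime.datetime.today().minute)
# datetime.today()  # would raise AttributeError: module 'datetime' has no attribute 'today'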


RSpec expect is failing while comparing equal strings

I tried to update this test but it is failing to compare identical strings, even though I copied and pasted the "got" output back into the test case. Why is this RSpec test failing?
Failure/Error: expect(first_item_cost).to eq("12 x $499 = $5,988")
expected: "12 x $499 = $5,988"
got: "12 x $499 = $5,988"
(compared using ==)
Code:
first_item_cost = find('.cart-item-cost', match: :first).text
expect(first_item_cost).to eq("12 x $499 = $5,988")
RSpec 3.9
I checked the encoding and bytes and discovered:
puts "Encoding: " + first_item_cost.encoding.to_s
puts "Bytes: " + first_item_cost.bytes.to_s
Output:
Encoding: UTF-8
Bytes: [49, 50, 32, 195, 151, 32, 36, 52, 57, 57, 32, 61, 32, 36, 53, 44, 57, 56, 56]
The 'x' has too many bytes! I looked in the template and sure enough it used ×. When I copied and pasted from the console, it must have lost the original character (or RSpec translated it before output). I changed the spec and template to x.
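If you want to double-check which character those extra bytes are, here is a quick sanity check (sketched in Python purely to decode the bytes; any language with UTF-8 support would do). It confirms that the pair 195, 151 is the multiplication sign ×, while a plain ASCII x is the single byte 120:

# Decode the suspicious byte pair from the RSpec "got" output.
print(bytes([195, 151]).decode("utf-8"))   # × (U+00D7 MULTIPLICATION SIGN)

# Compare the UTF-8 encodings of the two look-alike characters.
print(list("x".encode("utf-8")))           # [120]  -> plain ASCII x
print(list("\u00d7".encode("utf-8")))      # [195, 151] -> multiplication sign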

What is the Dart equivalent of Java byte[]

I am experimenting with Flutter and need to make a plugin package for Android and iOS, and have started with Android. The Android Java code I need to communicate with uses a byte array (byte[]) both as input and as return type for some of its methods. What does this map to in Dart?
Here is the standard type mapping for platform channels:
https://flutter.io/platform-channels/#codec
On Android, byte[] maps to Uint8List.
Dart has a dart:typed_data core library exactly for this purpose:
https://api.dartlang.org/stable/1.24.3/dart-typed_data/dart-typed_data-library.html
I'm not 100% sure of how this maps to the Flutter plugin model, though I suspect a Flutter user or developer can fill us in :)
You can use List<int> like so:
List<int> data = [102, 111, 114, 116, 121, 45, 116, 119, 111, 0];
Or Uint8List like this:
import 'dart:typed_data';

Uint8List data = Uint8List.fromList([102, 111, 114, 116, 121, 45, 116, 119, 111, 0]);
Also check out BytesBuilder, ByteData, and ByteBuffer for more byte-manipulation options, and read Working with bytes in Dart for more info.

Splitting complex PDF files using Watson Document Conversion Service

We are implementing a question-answering system using the Watson Discovery Service (WDS). We need each answer unit to be available as a single document. Our corpus consists of complex PDF files that contain two-column layouts, tables, and images. Instead of ingesting the whole PDF files into WDS and relying on passage retrieval, we are using the Watson Document Conversion Service (WDC) to split each PDF file into answer units, and we then ingest these answer units into WDS.
We are facing two issues with the Document Conversion service when splitting complex PDFs:
1. We expect each heading to become a title and the corresponding text to become its data (the answer). However, the service splits each chapter into a single answer unit. Is there any way to split a two-column document based on its headings?
2. When the input PDF file contains a table, the Document Conversion service reads the structured data in the PDF as plain text, losing the table formatting. Is there any way to carry structured data from the PDF into an answer unit?
I would recommend that you first convert your PDF to normalized HTML by using this setting:
"conversion_target": "normalized_html"
and inspect the generated HTML. Look for the places where headings (<h1>, <h2>, ..., <h6>) are detected. Those are the tags that will be used to split by answer units when you switch back to answer_units.
The reason you are currently seeing each chapter being split as an answer unit is that each chapter probably starts with a heading, but no headings are detected within the chapter.
In order to generate more answer units, you will need to tweak the PDF input configurations as described here, so that more headings are generated from the PDF to HTML conversion step and hence more answer units are generated.
For example, the following configuration will detect headings at 6 different levels, based on certain font characteristics for each level:
{
  "conversion_target": "normalized_html",
  "pdf": {
    "heading": {
      "fonts": [
        {"level": 1, "min_size": 24},
        {"level": 2, "min_size": 18, "max_size": 23, "bold": true},
        {"level": 3, "min_size": 14, "max_size": 17, "italic": false},
        {"level": 4, "min_size": 12, "max_size": 13, "name": "Times New Roman"},
        {"level": 5, "min_size": 10, "max_size": 12, "bold": true},
        {"level": 6, "min_size": 9, "max_size": 10, "bold": true}
      ]
    }
  }
}
You can start with a configuration like this and keep tweaking it until the produced normalized HTML contains the headings at the places that you expect the answer units to be. Then, take the tweaked configuration, switch to answer_units and put it all together:
{
  "conversion_target": "answer_units",
  "answer_units": {
    "selector_tags": ["h1", "h2", "h3", "h4", "h5", "h6"]
  },
  "pdf": {
    "heading": {
      "fonts": [
        {"level": 1, "min_size": 24},
        {"level": 2, "min_size": 18, "max_size": 23, "bold": true},
        {"level": 3, "min_size": 14, "max_size": 17, "italic": false},
        {"level": 4, "min_size": 12, "max_size": 13, "name": "Times New Roman"},
        {"level": 5, "min_size": 10, "max_size": 12, "bold": true},
        {"level": 6, "min_size": 9, "max_size": 10, "bold": true}
      ]
    }
  }
}
Regarding your second question about tables, unfortunately there is no way to convert table content into answer units. As explained above, answer-unit generation is based on heading detection. That being said, if there is a table between two detected headings, that table will be included in the answer unit like any other content between the two headings.

Neo4j: indexing properties that are longer than 32k with node_auto_index / lucene index

I am trying to build a little file and email search engine. I'd also like to use more advanced search queries for the full-text search, hence I am looking at Lucene indexes. From what I have seen, there are two approaches: node_auto_index and apoc.index.addNode.
Setting the index up works fine, and indexing nodes with small properties works. When trying to index nodes with properties that are larger than 32k, Neo4j fails (and gets into an unusable state).
The error message boils down to:
WARNING: Failed to invoke procedure apoc.index.addNode: Caused by:
java.lang.IllegalArgumentException: Document contains at least one
immense term in field="text_e" (whose UTF8 encoding is longer than the
max length 32766), all of which were skipped. Please correct the
analyzer to not produce such terms. The prefix of the first immense
term is: '[110, 101, 111, 32, 110, 101, 111, 32, 110, 101, 111, 32,
110, 101, 111, 32, 110, 101, 111, 32, 110, 101, 111, 32, 110, 101,
111, 32, 110, 101]...', original message: bytes can be at most 32766
in length; got 40000
I have checked this on 3.1.2 and on 3.1.0 + APOC 3.1.0.3.
A much longer description of the problem can be found at https://baach.de/Members/jhb/neo4j-full-text-indexing.
Is there any way to fix this? E.g. have I done anything wrong, or is there something to configure?
Thx a lot!
Neo4j does not support index values that are longer than ~32k because of an underlying Lucene limitation.
For some details around that area you can look at:
https://github.com/neo4j/neo4j/pull/6213 and https://github.com/neo4j/neo4j/pull/8404.
You need to split such long values into multiple terms.
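As a rough illustration of that last point, here is a minimal sketch (in Python; the 32,000-byte margin and the helper name are illustrative assumptions, not part of the Neo4j or APOC API) that splits a long text value into chunks that each stay below Lucene's 32766-byte per-term limit, so each chunk can be indexed separately:

MAX_TERM_BYTES = 32_000  # stay safely under the 32766-byte hard limit

def split_for_indexing(text, max_bytes=MAX_TERM_BYTES):
    # Greedily pack whole words into chunks whose UTF-8 size stays under max_bytes.
    # A single word longer than max_bytes would still overflow and needs separate handling.
    chunks, current, current_len = [], [], 0
    for word in text.split():
        word_len = len(word.encode("utf-8")) + 1  # +1 for the joining space
        if current and current_len + word_len > max_bytes:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(word)
        current_len += word_len
    if current:
        chunks.append(" ".join(current))
    return chunks

# Each returned chunk can then be stored/indexed as its own value.
parts = split_for_indexing("neo " * 20_000)
print(len(parts), max(len(p.encode("utf-8")) for p in parts))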

Would Sklearn GridSearchCV go through all the possible default options of the estimator's parameters?

Algorithms in scikit-learn can have parameters whose default value is chosen from a fixed set of options, e.g.:
sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1, **kwargs)
and the algorithm parameter has a default value of 'auto', with the following options: algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}.
My question is: when using GridSearchCV to find the best set of values for an algorithm's parameters, would GridSearchCV go through all the default options of a parameter even though I don't add it to the parameter_list?
For example, I want to use GridSearchCV to find the best parameter values for kNN, and I need to examine the n_neighbors and algorithm parameters. Is it possible that I just pass the n_neighbors values as below (because the algorithm parameter has default options),
parameter_list = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]}
or, I have to specify all the options that I want to examine?
parameter_list = {
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
                    16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
}
Thanks.
No, you are misunderstanding the parameter defaults and the available options.
Looking at the documentation of KNeighborsClassifier, the parameter algorithm is an optional parameter (i.e. you may or may not specify it in the constructor of KNeighborsClassifier).
But if you decide to specify it, then it only accepts the listed options: {'auto', 'ball_tree', 'kd_tree', 'brute'}. This means you can only give algorithm a value from these options and cannot use any other string. The default is 'auto', which means that if you don't supply any value, it will internally use 'auto'.
Case 1: KNeighborsClassifier(n_neighbors=3)
Here, since no value for algorithm has been specified, it will by default use algorithm='auto'.
Case 2: KNeighborsClassifier(n_neighbors=3, algorithm='kd_tree')
Here, as algorithm has been specified, it will use 'kd_tree'.
Now, GridSearchCV will only pass to the estimator those parameters which are specified in the param_grid. So in your case, when you use the first parameter_list from the question, it will give only n_neighbors to the estimator, and algorithm will keep its default value ('auto').
If you use the second parameter_list, then both n_neighbors and algorithm will be passed on to the estimator.
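To make this concrete, here is a small sketch (the iris data and the 1-30 neighbor range are just placeholders) showing that only the parameters named in param_grid are varied; anything not listed, such as algorithm in the first grid, keeps its default:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Grid 1: only n_neighbors is searched; algorithm stays at its default 'auto'.
grid1 = GridSearchCV(KNeighborsClassifier(),
                     param_grid={'n_neighbors': list(range(1, 31))},
                     cv=5)
grid1.fit(X, y)
print(grid1.best_params_)   # contains only 'n_neighbors'

# Grid 2: both parameters are searched, so 4 x 30 = 120 combinations are tried.
grid2 = GridSearchCV(KNeighborsClassifier(),
                     param_grid={'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
                                 'n_neighbors': list(range(1, 31))},
                     cv=5)
grid2.fit(X, y)
print(grid2.best_params_)   # contains both 'algorithm' and 'n_neighbors'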
