I'm currently using your gem to transform a CSV that was web-scraped from a personnel database that has no API.
The scraping left me with a CSV, which I can process fine using your gem; there's only one bit I am wondering about.
Consider the following data:
====================================
| name | article_1 | article_2 |
------------------------------------
| Andy | foo | bar |
====================================
I can turn this into this:
======================
| name | article |
----------------------
| Andy | foo |
----------------------
| Andy | bar |
======================
(I used this tutorial to do this: http://thibautbarrere.com/2015/06/25/how-to-explode-multivalued-attributes-with-kiba/)
I apply the normalization logic in my source for this. The code looks like:
source RowNormalizer, NormalizeArticles, CsvSource, 'RP00119.csv'
transform AddColumnEntiteit, :entiteit, "ocmw"
What I am wondering is whether I can achieve the same using a transform, so that the code would look like this:
source CsvSource, 'RP00119.csv'
transform NormalizeArticles
transform AddColumnEntiteit, :entiteit, "ocmw"
So the question is: can I duplicate a row with a transform class?
EDIT: Kiba 2 supports exactly what you need. Check out the release notes.
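As a rough sketch of what such a transform could look like, assuming the Kiba 2 streaming runner is enabled (it is what lets a class transform yield several rows from process). The class body and the column names are taken from the question's example and are hypothetical, not part of Kiba itself:

```ruby
# Hypothetical transform; column names come from the question's example data.
class NormalizeArticles
  ARTICLE_KEYS = [:article_1, :article_2].freeze

  # Assumption: the runner passes a block to process, so each yield
  # emits one output row (Kiba 2 streaming runner behaviour).
  def process(row)
    base = row.reject { |key, _| ARTICLE_KEYS.include?(key) }
    ARTICLE_KEYS.each do |key|
      value = row[key]
      next if value.nil? || value.to_s.empty?
      yield base.merge(article: value)
    end
    nil # nothing to emit beyond the rows already yielded
  end
end

# Calling the transform by hand, outside of any Kiba job:
normalized = []
NormalizeArticles.new.process({ name: "Andy", article_1: "foo", article_2: "bar" }) do |row|
  normalized << row
end
```

In a job this would be declared as transform NormalizeArticles, as in the question.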
In Kiba as currently released, a transform cannot yet yield more than one row: it's either one or zero.
The Kiba Pro offering I'm building includes a multithreaded runner which happens (as a side effect rather than as an actual goal) to allow transforms to yield an arbitrary number of rows, which is what you are looking for.
But that said, without Kiba Pro, here are a number of techniques which could help.
The first possibility is to split your ETL script into 2. Essentially you would cut it at the step where you want to normalize the articles, and put a destination there instead. Then in your second ETL script, you would use a source able to explode each row into many. This is, I think, what I'd recommend in your case.
If you do that, you can use either a simple Rake task to invoke the ETL scripts as a sequence, or you can alternatively use post_process to invoke the next one if you prefer (I prefer the first approach because it makes it easier to run either one or another).
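For the second script, the exploding source could be sketched roughly like this. It is plain Ruby following the usual source shape (a constructor plus an each that yields rows); the class name, the article_ prefix convention, and the in-memory rows standing in for the intermediate file are all assumptions for illustration:

```ruby
# Hypothetical source for the second ETL script: wraps any enumerable of
# row hashes and yields one row per non-empty article_N column.
class ExplodedArticlesSource
  def initialize(rows)
    @rows = rows
  end

  def each
    @rows.each do |row|
      article_keys = row.keys.select { |key| key.to_s.start_with?("article_") }
      base = row.reject { |key, _| article_keys.include?(key) }
      article_keys.each do |key|
        next if row[key].nil? || row[key].to_s.empty?
        yield base.merge(article: row[key])
      end
    end
  end
end

# Stand-in for the rows written by the first script's destination.
intermediate = [{ name: "Andy", article_1: "foo", article_2: "bar" }]
exploded = []
ExplodedArticlesSource.new(intermediate).each { |row| exploded << row }
```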
Another approach (but too complicated for your current scenario) would be to declare the same source N times, but only yield a given subset of data, e.g.:
# Computed at declaration time: a local assigned inside a pre_process block
# would not be visible to the source declarations below.
field_count = number_of_exploded_columns # extract from CSV?
(0...field_count).each do |shard|
  source MySource, shard: shard, shard_count: field_count
end
then inside MySource you would only conditionally yield like this:
yield row if row_index % field_count == shard
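To make the sharding idea concrete, here is a minimal sketch of what MySource might look like. The extra rows argument is an assumption added so the example runs on its own; a real source would read its rows from the CSV instead:

```ruby
# Hypothetical MySource: every instance sees the same rows, but only yields
# those whose index falls into its own shard.
class MySource
  def initialize(rows, shard:, shard_count:)
    @rows = rows
    @shard = shard
    @shard_count = shard_count
  end

  def each
    @rows.each_with_index do |row, row_index|
      yield row if row_index % @shard_count == @shard
    end
  end
end

# Two shards over five rows: together they cover every row exactly once.
all_rows = %w[a b c d e]
shards = (0...2).map do |shard|
  collected = []
  MySource.new(all_rows, shard: shard, shard_count: 2).each { |row| collected << row }
  collected
end
```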
Those are the 2 patterns I can think of!
I would definitely recommend the first one to get started though, as it is easier.
I would like to add the vertical pipes so that I can put my data table under the "Examples" section in SpecFlow. Can anyone give me a tip on how to do this? My scenario outline looks like:
#mytag
Scenario Outline: Check Noun Words
Given Access the AnalyzeStateless URL
And language code
And content of <Sentence>
And the Expected KeyWord <Expected KeyWords>
And the Expected Family ID <Expected FID>
And the index <Index>
When return the XML response
Then the keyword should contain <FamilyID>
Examples:
| Index | Sentence | Expected KeyWords | Expected FID |
| 1 | I need a personal credit card | personal | 92289 |
The "Examples" table has been entered manually in the case above. I have about a thousand rows in an Excel file; is there an appropriate way to get all of the values in one go?
Have you looked at Specflow.Excel, which allows you to keep your examples in your Excel files?
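If you would rather paste the table into the feature file, a small script can add the pipes for you. This is only a sketch, assuming the sheet has already been exported to rows of cells (for instance via CSV); gherkin_table and sheet_rows are hypothetical names, not part of SpecFlow:

```ruby
# Hypothetical helper: formats rows (header first) as a pipe-delimited
# Examples table, padding each column to its widest cell.
def gherkin_table(rows)
  widths = rows.transpose.map { |column| column.map { |cell| cell.to_s.length }.max }
  rows.map do |row|
    cells = row.each_with_index.map { |cell, i| cell.to_s.ljust(widths[i]) }
    "| #{cells.join(' | ')} |"
  end.join("\n")
end

# Stand-in for rows exported from the spreadsheet.
sheet_rows = [
  ["Index", "Sentence", "Expected KeyWords", "Expected FID"],
  [1, "I need a personal credit card", "personal", 92289]
]
table = gherkin_table(sheet_rows)
puts table
```

All rows must have the same number of cells, since the column widths are computed by transposing the rows.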
I am new to BDD with SpecFlow.
I have to write a scenario in which, after I capture an image, I have to select a value from a selection list for each attribute defined for that image.
For example:
|Body Part |Location |Group |
| Leg | Left | Skin |
| Hand | Upper | Burn |
| Arm | Right | Ulcer |
I need a way to select a different value for each attribute every time.
Thanks in advance!
You are looking for Scenario Outline.
Scenario outlines allow us to express these examples more concisely through a template with placeholders, using Scenario Outline, Examples with tables, and < > delimited parameters.
SpecFlow takes each row of the Examples table and creates a scenario to execute from it.
I understand that Julia has a complete set of low-level tools for interfacing with binary files on the one hand, and some powerful utilities such as readdlm to load text files containing rectangular data into Array structures on the other hand.
What I cannot discover in the standard library docs, however, is how to easily get input from less structured text files. In particular, what would be the Julia equivalent of the C++ idiom
some_input_stream >> a_variable_int_perhaps;
Given that this is such a common usage scenario, I am surprised something like this does not feature prominently in the standard library...
You can use readuntil http://docs.julialang.org/en/latest/stdlib/io-network/#Base.readuntil
shell> cat test.txt
1 2 3 4
julia> i,j = open("test.txt") do f
           parse(Int, readuntil(f," ")), parse(Int, readuntil(f," "))
       end
(1,2)
EDIT: To address comments
To get the last integer on a line of an irregularly formatted ASCII file, you could use split if you know the character preceding the integer (I've used a blank space here):
shell> cat test.txt
1.0, two five:$#!() + 4
last line 3
julia> i = open("test.txt") do f
           parse(Int, split(readline(f), " ")[end])
       end
4
As far as code length is concerned, the above examples are completely self-contained, and the file is opened and closed in an exception-safe manner (i.e. wrapped in a try-finally block). Doing the same in C++ would be quite verbose.
I have a query
start ko=node:koid('ko:"ko:K01963"')
match p=(ko)-[r:abundance*1]-(n:Taxon)
with p, extract(nn IN relationships(p) | nn.`c1.mean`) as extracted
return extracted;
I would like to sum the values in extracted by using return sum(extracted); however, this throws the following error:
SyntaxException: Type mismatch: extracted already defined with conflicting type Collection<Boolean>, Collection<Number>, Collection<String> or Collection<Collection<Any>> (expected Relationship)
Also, when I return extracted, my values are enclosed in square brackets:
+---------------+
| extracted |
+---------------+
| [258.98813] |
| [0.0] |
| [0.0] |
| [0.8965624] |
| [0.85604626] |
| [0.0] |
Any idea how I can solve this? That is, how can I sum the whole column that is returned?
Use reduce which is a fold operation:
return p, reduce(sum=0,nn IN relationships(p) | sum + nn.`c1.mean`) as sum
Values in square brackets are collections/arrays.
First off, given your use of "WITH" and labels, I'm going to assume you're using Cypher 2.x.
Also, to be honest, it's not entirely clear what you're after here, so I'm making some assumptions and stating them.
Second off, some parts of the query are unnecessary. Also, the *1 in your relationship means that there will only be one "hop" in your path. I don't know if that's what you're after, so I'm going to assume that you may want to go several levels deep (but we'll cap it so as not to kill your Neo4j instance; alternatively, you could use something like "allShortestPaths", but we won't go into that). This assumption can easily be changed by removing the cap and specifying a single hop.
As for the results being returned in brackets, extract returns a list of values, potentially only a single one.
So let's rewrite the query a little (note that for ko the identifier is a little confusing above so replace it with whatever you need to).
If we assume that you just want the sum per path, we can do:
MATCH p=(ko:koid {ko: 'K01963'})-[r:abundance*1..5]-(n:Taxon)
WITH reduce(val = 0, nn IN relationships(p) | val + nn.`c1.mean`) as summed
RETURN summed;
(This can also be modified to sum over all paths, I believe.)
If we want the total sum of ALL relationships returned, we need something a bit different, and it's even simpler (assuming in this case you really only do have a single hop):
MATCH p=(ko:koid {ko: 'K01963'})-[r:abundance]-(n:Taxon)
RETURN sum(r.`c1.mean`);
Hopefully even if I'm off in my assumptions and how I've read things this will at least get you thinking in the right way.
Mostly, the idea of using paths when you only have 1 hop to make in a path is a bit confusing, but perhaps this will help a little.
I am testing out F# and using NUnit as my test library; I have discovered that double backticks allow arbitrary method naming, which makes my method names even more human-readable.
I was wondering, whether rightly or wrongly, if it is possible to parameterise the method names when using NUnit's TestCaseAttribute to change the method name, for example:
[<TestCase("1", 1)>]
[<TestCase("2", 2)>]
let ``Should return #expected when "#input" is supplied`` input expected =
...
This might not be exactly what you need, but if you want to go beyond unit testing, then TickSpec (a BDD framework using F#) has a nice feature that lets you write parameterized scenarios based on backtick methods that contain regular expressions as placeholders.
For example, in Phil Trelford's blog post, he uses this to define a tic-tac-toe scenario:
Scenario: Winning positions
Given a board layout:
| 1 | 2 | 3 |
| O | O | X |
| O | | |
| X | | X |
When a player marks X at <row> <col>
Then X wins
Examples:
| row | col |
| middle | right |
| middle | middle |
| bottom | middle |
The method that implements the When clause of the scenario is defined in F# using something like this:
let [<When>] ``a player marks (X|O) at (top|middle|bottom) (left|middle|right)``
        (mark:string,row:Row,col:Col) =
    let y = int row
    let x = int col
    Debug.Assert(System.String.IsNullOrEmpty(layout.[y].[x]))
    layout.[y].[x] <- mark
This is a neat feature, but it might be overkill if you just want to write a simple parameterized unit test. BDD is useful if you want to produce human-readable specifications of different scenarios (and there are actually other people reading them!).
This is not possible.
The basic issue is that you would need to create a unique function for every input and expected pair. You would then need to pick the correct function to call (or your stack trace wouldn't make sense). As a result, this is not possible.
Having said that, if you hacked around with something like eval (which must exist inside fsi), it might be possible to create something like this, but it would be very slow.