CSV bounded source with a custom line delimiter - google-cloud-dataflow

I want to read a CSV file with a line delimiter other than the default one. Each CSV record spans multiple lines, so TextIO.Read does not suffice.
Should I extend FileBasedSource, or is there an existing CsvBasedSource (with a custom line/field delimiter)?
I was also looking into the splitIntoBundles() API: XmlSource does not override isSplittable(), so it can be split into bundles, and I was wondering how XmlSource handles this, since a split can land in the middle of a <record> when bundling is based only on desiredBundleSize.

You're correct that this will need a custom FileBasedSource implementation to work. Regarding XmlSource, record and root element names have to be unique (i.e. no other elements can have those names); that is what lets a reader whose bundle starts in the middle of a record scan forward to the next record start element, so splitting on desiredBundleSize alone stays safe. We'll update the documentation to reflect that, and look at improving this in the future.

Related

References and Bibliography in two distinct chapters of Quarto book

I would like to put References and Bibliography in two distinct chapters of the book. The "References" are the things actually cited in the text, but the "Bibliography" is just a manually created chapter before or after the References chapter.
So, I would like to write a chapter file bib.qmd like:
# The bibliography
#source1
#source2
#source3
... etc
However, I haven't found a way to obtain the full content using cites; I only get the author or number, depending on the CSL. Obviously I could write all that content by hand, but I'd prefer to do it through the corresponding citation.
I have read about including uncited items, and that sounds like what I want, but I need them in a different chapter, not merged into the references.
I'm thinking of writing a Lua filter to run after Quarto's citeproc and somehow reusing citeproc's output, but I'm not sure whether this is a viable path.
The idea of using a Lua filter is a good one. The biggest challenge is collecting all uncited items. You'd first get the full list of available items with the pandoc.utils.references function, collect all used keys by filtering on all Cite elements, and then use pandoc.utils.citeproc to generate and process a document with the uncited references.
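As a rough, untested sketch of those steps (this assumes pandoc >= 2.19.1, which provides both pandoc.utils.references and pandoc.utils.citeproc; the "The bibliography" header text is just a placeholder):

function Pandoc(doc)
  -- 1. Collect the keys of every item that is actually cited.
  local cited = {}
  doc:walk {
    Cite = function(cite)
      for _, c in ipairs(cite.citations) do
        cited[c.id] = true
      end
    end
  }

  -- 2. Build a throwaway document that cites only the uncited items.
  local extra = {}
  for _, ref in ipairs(pandoc.utils.references(doc)) do
    if not cited[ref.id] then
      table.insert(extra,
        pandoc.Cite({}, {pandoc.Citation(ref.id, 'NormalCitation')}))
    end
  end
  local tmp = pandoc.Pandoc({pandoc.Para(extra)}, doc.meta)

  -- 3. Let citeproc format them; the rendered entries land in a Div with id "refs".
  local bib
  pandoc.utils.citeproc(tmp):walk {
    Div = function(d)
      if d.identifier == 'refs' then bib = d end
    end
  }

  -- 4. Append the formatted uncited entries as their own section.
  if bib then
    table.insert(doc.blocks, pandoc.Header(1, 'The bibliography'))
    table.insert(doc.blocks, bib)
  end
  return doc
end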
If you have all uncited items in a single .bib file then you could use a pre-existing filter like multibib. Otherwise you might be able to adapt that filter to fit your requirements.

How to define a large set of properties of a node without having to type them all?

I have imported a CSV file into Neo4j. I have been trying to define a large number of properties (all the columns) for each node. How can I do that without having to type in each name?
I have been trying this:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///frozen_catalog.csv" AS line
// Creating nodes for each product id with its properties
CREATE (product:product {id: line.`o_prd`,
  Gross_Price_Average: TOINT(line.`Gross_Price_Average`),
  O_PRD_SPG: TOINT(line.`O_PRD_SPG`)});
You can add properties from maps. For example:
LOAD CSV WITH HEADERS FROM "http://data.neo4j.com/northwind/products.csv" AS row
MERGE (P:Product {productID: row.productID})
SET P += row
http://neo4j.com/docs/developer-manual/current/cypher/clauses/set/#set-adding-properties-from-maps
The LOAD CSV command cannot perform automatic type conversion to ints on certain fields; that must be done explicitly (though you can avoid having to mention all the other fields explicitly by using the map projection feature to transform your line data before setting it via stdob--'s suggestion, as sketched below).
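For example, a rough sketch combining both suggestions, reusing the file and column names from the question and assuming a reasonably recent Neo4j version where toInteger() replaces TOINT():

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///frozen_catalog.csv" AS line
// Create the node, then copy every column onto it via a map projection,
// overriding just the fields that need an explicit numeric conversion.
// (Note that .* also copies o_prd itself as a property; remove it afterwards if unwanted.)
CREATE (product:product {id: line.`o_prd`})
SET product += line {
  .*,
  Gross_Price_Average: toInteger(line.`Gross_Price_Average`),
  O_PRD_SPG: toInteger(line.`O_PRD_SPG`)
};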
You may want to take a look at Neo4j's import tool, as it allows you to specify field types in the headers, which should perform the type conversion for you.
That said, 77 columns is a lot of data to all store on individual nodes. You may want to take another look at your data and figure out if some of those properties would be better modeled as nodes with their own label with relationships to your product nodes. You mentioned some of these were categorical properties. Categories are well suited to be modeled separately as nodes instead of as properties, and maybe some of your other properties would work better as nodes as well.

How to add a property and value from LOAD CSV

I have a CSV with a header like:
string,alias,source
I was trying to use a query like this:
USING PERIODIC COMMIT 1000 LOAD CSV WITH HEADERS FROM file AS line WITH line
Match (p{name:SUBSTRING(line.string ,7)})
Create (p:line.source:line.alias)
but I get an error about the last line. Is it possible to add a new property to an existing node using LOAD CSV?
I think you're looking for the SET clause. CREATE is for creating nodes, relationships, and patterns.
You may want to review the documentation or the Cypher reference.
Also, your MATCH doesn't seem to be using a label (you're only using the variable p). If at all possible, use labels in your graph; without them you can't take advantage of your indexes or unique constraints, and even without those, a label ensures that you're only scanning over nodes of that label instead of your entire graph.
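A minimal sketch of the SET approach (the :Team label, the file URL, and the decision to store source and alias as plain properties are assumptions here; adjust them to your actual model):

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS line
// Match the existing node by label plus the name derived from the CSV row,
// then add the new properties with SET.
MATCH (p:Team {name: substring(line.string, 7)})
SET p.source = line.source,
    p.alias = line.alias;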

Mapping flat fields to sequential records

I have a source schema that defines a "ShippingCharge" and a "DiscountAmount". My destination schema is an EDI X12 850 message.
I need to create two "fake" iterations of the SAC loop: the first iteration should use the ShippingCharge and the second the DiscountAmount. There are also a few "default values" that I need to set on SAC01, which likewise depend on the iteration (1 or 2).
What functoid should I be using? Any suggestions?
Have you tried the Table Looping functoid? You can use the table looping functoid to define multiple rows using input links (ShippingCharge and DiscountAmount) and constants (the SAC01 values). The output would then loop through these rows and create the two SACLoop1 elements.
You will need to use the Table Extractor functoid as well to deal with each data value in the table.
Complete instructions on using Table Looping and Table Extractor can be found here: http://msdn.microsoft.com/en-us/library/aa559310%28v=bts.20%29.aspx

RegExp as table entries

I'm building an application that takes input from SMS texts through Twilio. I'd like to build a table that matches the incoming SMS body to the appropriate response.
For example, imagine I'm building an NFL text message thing.
Someone texts in 'Redskins' and we text back, "The Redskins play at FedEx field"
Someone texts in 'Colts' and we text back, "The Colts are the pride of Indiana."
Here's the tricky part:
Of course, our Rails app is going to need to interpret the incoming team names through Regular Expressions, as many people will text in: Redskins or REDSKINS or REDSKIN or Redskin or REDskin.....
With one or two teams, one could just hardcode the RegExp and response into the controller... but with 30 teams, that seems wrong. (And with 120 entries -- say, all pro sports -- even worse.)
Does anyone have any tips on getting the team names from the input stage through the DB table stage, with a 'RegExp' conversion in the middle?
Thanks in advance.
For a modest number of keywords, I recommend a two-table approach with Keywords and Aliases, always stored in lower case. Convert the input to lower case. For each Keyword (say, redskins) you manually add 5-10 variations (including the correct one) in Aliases, all of which have Alias.keyword_id = the id of the keyword. Then you simply search Aliases for the user input, and if you find a match you have the keyword_id of the keyword.
This has two advantages: it's fast and easy to extend. If you log the "no matches" you'll get a list of new aliases to add to the database. MUCH easier and more reliable than trying to do it via regex.
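A minimal Rails-flavoured sketch of that idea (the Keyword and Alias models, their columns, and the keyword_for helper are illustrative names, not anything from the question):

# Hypothetical schema: keywords(name) and aliases(name, keyword_id),
# with every name stored in lower case.
class Keyword < ActiveRecord::Base
  has_many :aliases
end

class Alias < ActiveRecord::Base
  belongs_to :keyword
end

# Normalize the incoming SMS body and look it up in the aliases table.
def keyword_for(sms_body)
  Alias.find_by(name: sms_body.strip.downcase)&.keyword
end

keyword_for('REDskins')  # => the "redskins" Keyword, if that alias was seeded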
I don't think you want regexps here. What about spelling errors? For helpfulness (esp coming from a txt msg) I think you want to allow shortenings too.
Maybe a Soundex-based library or a spelling-correction tool would be best. You want a nearest-match algorithm, not a pattern-match one.
If the text message is not too long, you could first chop it into words and then take the intersection with the list of team names.
array_of_team_names = %w(Redskins Colts ... ) # keep it all capitalized
'cOLts blah blah'.scan(/\w+/).map{|word| word.capitalize} & array_of_team_names
# => ['Colts']
If you want to handle mistypes as suggested by drysdam, or if you want to handle larger text with more accuracy, you should use some library specific to that.
I think what you are asking is "how do I avoid hardcoding a regexp into my code, since I might have a lot of them, and they are really a data element"?
If you want to do the matching with regexps, you should note that you can create a regexp from a string, so you could easily have a table containing a column of regexps in string form. You can then dynamically build the array of Regexp objects that you'd use to search the incoming string. The trick is what to do when you have a match. You'll need to develop a set of rules (yet another table) that basically says which response to pick based on the incoming text. For example, if your rule is simply "match the team name and say where they play", that's pretty easy: each regexp that you are searching for maps to exactly one action ("The Bears play in Chicago"). If your rules are more complicated (look for the Bears, and then also check whether "schedule" or "first game(s)" appears), then you'd need another table that maps a collection of matches to a response.
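A small sketch of the pattern-table idea for the simple one-match-one-response case (the RESPONSES constant and the reply_for helper are illustrative; in Rails the rows would live in a database table rather than a literal array):

# Each row stores a pattern as a plain string plus the canned response.
RESPONSES = [
  { pattern: 'redskins?', reply: 'The Redskins play at FedEx Field.' },
  { pattern: 'colts',     reply: 'The Colts are the pride of Indiana.' },
]

# Build a Regexp from each stored string and return the first matching reply.
def reply_for(body)
  row = RESPONSES.find { |r| body =~ Regexp.new(r[:pattern], Regexp::IGNORECASE) }
  row && row[:reply]
end

reply_for('Do the REDSKINS play today?')  # => "The Redskins play at FedEx Field."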
