How to transform nested XML to CSV? - xml-parsing

While not strictly XML at this point, I'm looking to transform a "list" into something that would work for XML shredding, i.e. something more like a table, suitable for import into a spreadsheet or a table in a SQL database.
How would this input:
people
    joe
        phone1
        phone2
        phone3
    sue
        cell4
        home5
    alice
        atrib6
        x7
        y9
        z10
get transformed to a structure like:
joe phone1 phone2 phone3
sue cell4 home5
alice atrib6 x7 y9 z10
It only matters that the "names" end up in the first column and the other "attributes" in any of the following columns of a CSV or similar, or something exportable as such a CSV. It doesn't have to be CSV itself, perhaps just something convertible to that structure, because BaseX exports to CSV quite nicely through the GUI.
or, perhaps:
joe phone1
joe phone2
joe phone3
sue cell4
sue home5
alice atrib6
alice x7
alice y9
alice z10
Although I prefer the former for this specific data.

I think in another question you already got the suggestion to use windowing:
declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";
declare option output:method 'text';
declare option output:item-separator '&#10;';
for tumbling window $group in ul/*
  start $s next $n when $s[self::li] and $n[self::ul]
return
  tail($group) ! ($s || ',' || .)
https://xqueryfiddle.liberty-development.net/6qVSgeS
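For reference, here is a self-contained version of that query with the input inlined, so it can be pasted straight into BaseX. The element names and nesting below are an assumption inferred from the query's start condition (each name as an <li> followed by its attributes in <ul> items); the fiddle's actual input may be shaped differently:
xquery version "3.1";
declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";
declare option output:method 'text';
declare option output:item-separator '&#10;';
(: assumed input shape: each name is an <li> followed by one <ul> per attribute :)
let $list :=
  <ul>
    <li>joe</li>   <ul>phone1</ul> <ul>phone2</ul> <ul>phone3</ul>
    <li>sue</li>   <ul>cell4</ul>  <ul>home5</ul>
    <li>alice</li> <ul>atrib6</ul> <ul>x7</ul> <ul>y9</ul> <ul>z10</ul>
  </ul>
for tumbling window $group in $list/*
  start $s next $n when $s[self::li] and $n[self::ul]
return tail($group) ! ($s || ',' || .)
With this shape it produces the second layout (one row per name/attribute pair), e.g. joe,phone1 on its own line.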

Related

Rails ActiveRecord - find by substring of another string

id | name
1 | jack
2 | tomas
I want to find a row if the name is a substring of Ttomas.
So the result should be
id | name
2 | tomas
Is this possible?
Yes, this is possible in Ruby on Rails and SQL. Depends a bit on the database you use, but something like this should work:
Modelname.where("? LIKE CONCAT('%', name, '%')", 'Ttomas')
It's possible.
The brute-force way would be to iterate through all your users and check if name is a substring. But this is horribly inefficient.
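A rough illustration of that brute-force approach, assuming the model from the question is called Modelname and the search string is 'Ttomas' (both placeholders):
# brute force: loads every record into memory and filters in Ruby
matches = Modelname.all.select { |record| 'Ttomas'.include?(record.name) }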
If you want to utilize SQL lookups, you need to look into gems like pg_search or elasticsearch for full-text search functionality.

querying on inverted file index

I have a project in school in which I need to create a search engine using an inverted index, and I'm a bit blocked on how to continue.
I stored all the words that were in my docs (4 docs) using an inverted file index, in such a way that each word in a specific file has a row. So, let's say, the word 'like' can appear in doc 2 three times and in doc 4 twice, so it will have 2 rows: word:like docid:2 hit:3 instoplist:0 and word:like docid:4 hit:2 instoplist:0 (hit is the number of times the word appeared inside the doc, and instoplist marks whether it is one of the words in the stop-list).
Now I need to be able to run queries on that index.
Let's say I need to find: car AND (motorcycle OR bicycle).
What is the best way to do that? How do I work out the order of the search? How do I know to take motorcycle and bicycle first, do an 'or' between them, and then do an 'and' with car?
PS: I'm using PHP to write the code.
I will appreciate any kind of help, thanks!
You can take the intersection of the documents containing car with the union of the documents containing motorcycle or bicycle.
car: doc1, doc2, doc3
motorcycle: doc1, doc4
bicycle: doc1, doc2
So your final list of documents should be doc1, doc2.
To find the intersection and union in PHP, let's say you have 3 arrays, $car, $motorcycle and $bicycle, containing the documents that have these words:
<?php
$car = ['doc1', 'doc2', 'doc3'];
$motorcycle = ['doc1', 'doc4'];
$bicycle = ['doc1', 'doc2'];
// union of the documents containing motorcycle or bicycle
$union = array_unique(array_merge($motorcycle, $bicycle));
// intersection with the documents containing car
$result = array_intersect($car, $union);
// array_intersect() preserves the original keys, so iterate with foreach
foreach ($result as $doc) {
    echo $doc;
    echo "<br>";
}
?>

Cleaning data in SPSS with name misspellings

I have a 5M-record dataset in this basic format:
FName LName UniqueID DOB
John Smith 987678 10/08/1976
John Smith 987678 10/08/1976
Mary Martin 567834 2/08/1980
John Smit 987678 10/08/1976
Mary Martin 768987 2/08/1980
The DOB is always unique, but I have cases where:
same ID with different name spellings, or different IDs for the same name.
I got as far as making SPSS recognize that John Smit and John Smith with the same DOB are the same person, and I used AGGREGATE to show, next to the name, how many times each spelling was used (John Smith, 10; John Smit, 5).
Case 1:
What I would like to do is loop through all the records identified as belonging to the same person, get the most common spelling of that person's name, and use it as their standard name.
Case 2:
If I have multiple IDs for the same person, take the lowest one and make that the standard.
I am comfortable using basic syntax to clean my data, but this is the only thing that I'm stuck on.
If UniqueID is a real unique ID for individuals in the population, and you want to find variations of name spellings (within groupings of these IDs) and assign the modal occurrence, then something like this would work:
STRING FirstLastName (A99).
COMPUTE FirstLastName = CONCAT(FName," ",LName).
* Count how often each spelling occurs within each UniqueID, and find the highest count.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES /BREAK=UniqueID FirstLastName /Count=N.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES /BREAK=UniqueID /MaxCount=MAX(Count).
* Blank out the non-modal spellings, then spread the modal spelling across the group.
IF (Count<>MaxCount) FirstLastName=''.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES OVERWRITE=YES /BREAK=UniqueID /FirstLastName=MAX(FirstLastName).
You could then also overwrite the FName and LName fields, but then more assumptions would have to be made, for example whether FName or LName can contain space characters, etc.
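Case 2 from the question (keeping the lowest of several IDs for the same person) could be handled along the same lines. A rough sketch, assuming a person can be identified by the FirstLastName/DOB combination and that UniqueID is numeric (both are assumptions, not established above):
* Case 2 sketch: within each name+DOB group, keep the lowest UniqueID.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES /BREAK=FirstLastName DOB /MinID=MIN(UniqueID).
COMPUTE UniqueID = MinID.
EXECUTE.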

k nearest neighbor in SAS: how to get the neighbor list for each row?

Currently I'm using PROC DISCRIM in SAS to run a kNN analysis on a data set, but the problem may require me to get the top-k neighbor list for each row in my table. How can I get this list from SAS?
Thanks for the answer, but I'm looking for the neighbor list for each data point. For example, if I have the data set:
name age zipcode alcohol
John 26 08439 yes
Cathy 49 47789 no
smith 37 90897 no
Tom 34 88642 yes
then I need the list:
name neighbor1 neighbor2
John Tom cathy
Cathy Tom Smith
Smith Cathy Tom
Tom John Cathy
I could not find this output from SAS; is there any way I can program it to get this list? Thank you!
I am not a SAS user, but a quick web lookup seems to give a good answer to your problem:
As far as I know you do not have to implement it yourself; PROC DISCRIM is enough.
Code for iris data from http://www.sas-programming.com/2010/05/k-nearest-neighbor-in-sas.html
ods select none;
proc surveyselect data=iris out=iris2
  samprate=0.5 method=srs outall;
run;
ods select all;
%let k=5;
proc discrim data=iris2(where=(selected=1))
    test=iris2(where=(selected=0))
    testout=iris2testout
    method=NPAR k=&k
    listerr crosslisterr;
  class Species;
  var SepalLength SepalWidth PetalLength PetalWidth;
  title2 'Using KNN on Iris Data';
run;
The long and detailed description is also available here:
http://analytics.ncsu.edu/sesug/2012/SD-09.pdf
And from the SAS community:
Simply ask PROC DISCRIM to use a nonparametric method via the option "METHOD=NPAR K=". Note: do not use the "R=" option at the same time, which corresponds to the radius-based nearest-neighbor method. Also pay attention to how PROC DISCRIM treats categorical data automatically; sometimes you may want to convert categorical data into metric coordinates in advance. Since PROC DISCRIM doesn't output the tree it builds internally, use the "data= test= testout=" options to score a new data set.

Data collection task

I have data that follows this kind of pattern:
ID Name1 Name2 Name3 Name4 .....
41242 MCJ5X TUAW OXVM4 Kcmev 1
93532 AVEV2 WCRB3 LPAQ 2 DVL2
.
.
.
As of now this is just formatted in a spreadsheet and has about 6000 lines. What I need to do is create a new row for each Name after Name1 and associate it with the ID on its current row. For example, see below:
ID Name1
41242 MCJ5X
41242 TUAW
41242 OXVM4
41242 Kcmev 1
93532 AVEV2
93532 WCRB3
93532 LPAQ 2
93532 DVL2
Any ideas how I could do this? I feel like this shouldn't be too complicated, but I'm not sure of the best approach. Whether it's a script or some function, I'd really appreciate the help.
If possible, you might want to use a CSV file. These files are plain text and most spreadsheet programs can open/modify them (I know Excel and the OpenOffice version can). If you go with this approach, your algorithm will look something like this:
read everything into a string array
create a 1-to-many data structure (maybe a Dictionary<string, List<string>> or a list of (string, string) tuple types)
loop over each line of the file
    split the current line on the ','s and loop over those pieces
        if this is the first piece, add a new item to the 1-to-many data structure with the current piece as the ID
        otherwise, add this piece to the "many" (name) part of the last item in the data structure
create a new CSV file or open the old one for writing
output the "ID, Name1" header row
loop over each 1-to-many item in the data collection
    loop over the many items in the current 1-to-many item
        output the 1 (ID) + "," + current many item (current name)
You could do this in just about any language. If it's a one-time script then Python, Ruby, or PowerShell (depending on platform) would probably be a good choice; a Python sketch is below.
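A minimal Python sketch of the algorithm above, assuming the spreadsheet has been saved as a CSV with the ID in the first column and the names in the following columns ("input.csv" and "output.csv" are placeholder file names):
import csv

# Read wide rows (ID, Name1, Name2, ...) and write one (ID, Name) pair per name.
with open("input.csv", newline="") as src, open("output.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(["ID", "Name1"])   # header row of the new file
    next(reader, None)                 # skip the original header row
    for row in reader:
        if not row:
            continue
        row_id, *names = row
        for name in names:
            if name.strip():           # skip empty trailing cells
                writer.writerow([row_id, name])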
