Reading tabular data from a flat file in J - parsing

I have a file whose contents look something like this:
A 12 17.5 3.2
B 7 12 11
C 6.2 9.3 13
The whitespace between cells can vary and is not significant, though there must be at least one space. Additionally, the first column only contains (or should only contain) one of those three letters, and I am content to work with 0-2 instead if it simplifies life with J (I suspect it would).
I'm not even sure how to approach this in J. Two approaches jump out at me:
Use ;: to break the file contents into "words". This will produce something like this for me:
(;: file)
┌─┬───────────┬─┬─┬───────┬─┬─┬──────────┐
│A│12 17.5 3.2│ │B│7 12 11│ │C│6.2 9.3 13│
└─┴───────────┴─┴─┴───────┴─┴─┴──────────┘
This is interesting, because it has grouped the numeric values together. I could see then selecting out those columns like so:
(0=3|i.#;:file)#;:file
I could use ". to convert the other rows to numbers. For some reason, doing it piecemeal like this feels hackish.
Use the sequential machine (dyadic ;:)
The documentation on this verb is making my head spin, but I think if I drew a state transition diagram I could get the words broken up. I don't know if it would be possible to convert any of the words to numbers at the same time though, or if it's possible to return a matrix this way. Is it?
I worry that I'm bringing too much of my experience with other languages to bear on this and it's actually a simple problem in J, if you know how to do it. Is that the case? What's a more idiomatic way to do this with J?

If the file is all numbers, things get a bit easier, so I will replace your A B C with 1 2 3; I will also add a couple of rows to show how filtering can be done.
file is the string of characters.
[ file=.'1 12 17.5 3.2 2 7 12 11 3 6.2 9.3 13 2 2.3 3.6 12 1 3.4 2 3.4'
1 12 17.5 3.2 2 7 12 11 3 6.2 9.3 13 2 2.3 3.6 12 1 3.4 2 3.4
Convert file to numbers using ". , then take the numbers four at a time to build a table using _4 ]\ , which makes use of dyadic Infix \ (http://www.jsoftware.com/help/dictionary/d430.htm)
[ array=. _4]\ ". file
1 12 17.5 3.2
2 7 12 11
3 6.2 9.3 13
2 2.3 3.6 12
1 3.4 2 3.4
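For comparison, here is the same reshape sketched in Python (my own illustration, not part of the J answer): parse all the numbers, then take them four at a time.

file = '1 12 17.5 3.2 2 7 12 11 3 6.2 9.3 13 2 2.3 3.6 12 1 3.4 2 3.4'
nums = [float(tok) for tok in file.split()]              # plays the role of ". file
array = [nums[i:i + 4] for i in range(0, len(nums), 4)]  # plays the role of _4 ]\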
Once that is done you can group the rows according to their first column and perform any operation you would like using v/. , where v is any verb attached to the Key conjunction /. (http://www.jsoftware.com/help/dictionary/d421.htm)
({."1 </. }."1) array
+------------+----------+----------+
| 12 17.5 3.2|  7  12 11|6.2 9.3 13|
|3.4    2 3.4|2.3 3.6 12|          |
+------------+----------+----------+
For example, you can take the average of the entries for each row, grouped by the category in the first column.
({."1 (+/ % #)/. }."1) array
7.7 9.75 3.3
4.65 7.8 11.5
6.2 9.3 13
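The same keyed average, sketched in Python for comparison (an illustration, assuming array holds the table above): group rows by their first column, then average the remaining entries column-wise.

from collections import defaultdict

array = [[1, 12, 17.5, 3.2], [2, 7, 12, 11], [3, 6.2, 9.3, 13],
         [2, 2.3, 3.6, 12], [1, 3.4, 2, 3.4]]

groups = defaultdict(list)
for row in array:
    groups[row[0]].append(row[1:])      # key on the first column

# column-wise mean per key, mirroring ({."1 (+/ % #)/. }."1) array
averages = {k: [sum(col) / len(col) for col in zip(*rows)]
            for k, rows in groups.items()}
print(averages)  # {1: [7.7, 9.75, 3.3], 2: [4.65, 7.8, 11.5], 3: [6.2, 9.3, 13.0]}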
From the comments, using the ;: trick you can end up with the shape and type that you would like from the original file:
;"1 ".each(('123'{~ 'ABC'&i.) each #:{. , }.)"1[ _2 [\ ;: 'A 1.1 2.2 3.3 B 3.4 4.5 5.6 C 6.7 7.8 8.9'
1 1.1 2.2 3.3
2 3.4 4.5 5.6
3 6.7 7.8 8.9

I think using all-numeric data, if possible, is probably preferable, as bob suggests. But if you need to parse a flat file containing fields of mixed type delimited by one or more spaces, the following should do the job pretty well:
]cut;._2 freads 'myfile.txt'
┌─┬───┬────┬───┐
│A│12 │17.5│3.2│
├─┼───┼────┼───┤
│B│7 │12 │11 │
├─┼───┼────┼───┤
│C│6.2│9.3 │13 │
└─┴───┴────┴───┘
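For comparison, a hedged Python sketch of the same whitespace parse, mapping the leading letter to 0-2 as the question suggested; 'myfile.txt' is the same hypothetical input file.

rows = []
with open('myfile.txt') as f:
    for line in f:
        fields = line.split()                        # splits on one or more spaces
        if fields:
            rows.append(['ABC'.index(fields[0])] +   # A B C -> 0 1 2
                        [float(x) for x in fields[1:]])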

Related

Cycling through a sequence 1-12 at different offsets

In Sheets I would like a fixed sequence of 1-12. I have set it up as =SEQUENCE(3,4), and I would like it to roll and wrap when I change the first number.
Apologies in advance for the formatting. I would like the array to roll and wrap when I change the first number in the sequence. So the starting array is 1-12, but when I change the first number to 4 I would like the sequence to run from there and wrap around back to 1.
1 2 3 4
5 6 7 8
9 10 11 12
But if I start at 4 I would like it to read
4 5 6 7
8 9 10 11
12 1 2 3
Say your start number is in A1:
=ArrayFormula(MOD(SEQUENCE(3,4,A1-1,1),12)+1)
This uses MOD to cycle through the sequence.
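To see the arithmetic, here is the same wrap-around sketched in Python (my illustration of what the formula computes, not a Sheets feature):

def wrapped_sequence(start, rows=3, cols=4, n=12):
    # mirrors MOD(SEQUENCE(rows, cols, start-1, 1), n) + 1
    return [[(start - 1 + r * cols + c) % n + 1 for c in range(cols)]
            for r in range(rows)]

print(wrapped_sequence(4))  # [[4, 5, 6, 7], [8, 9, 10, 11], [12, 1, 2, 3]]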

Lookup data from two ranges and if text matches, then assign numeric value

I am trying to calculate points in a Formula 1 racing league. I'm having trouble with a 15-point bonus awarded when a constructor qualifies 1st and finishes the race 1st. The issue is that two different drivers could do this. For example, in the data below, HAM qualified 1st and ROS finished 1st in the race. Because they both drive for Mercedes, the 15 points need to be awarded to Mercedes. The data can't be moved around as it's imported using an API (not in the example), but a copy of the layout can be found here
Qualifying Race Driver Team
14 1 ROS mercedes
1 15 HAM mercedes
3 3 VET ferrari
8 4 RIC red_bull
6 5 MAS williams
19 6 GRO haas
10 7 HUL force_india
16 8 BOT williams
7 9 SAI toro_rosso
5 10 VES toro_rosso
13 11 PAL renault
Put this in I2 and copy down. See if that is how you want it:
=IF(AND(VLOOKUP(1, $A$2:$H$12, 8, FALSE)=VLOOKUP(1, $B$2:$H$12, 7, FALSE), VLOOKUP(1, $B$2:$H$12, 7, FALSE)=H2, MATCH(H2, H:H, 0)=ROW(H2)), 15, 0)
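The logic the formula implements, sketched in Python for clarity (the tuples are sample rows from the table above; this is an illustration, not the spreadsheet itself): look up the team of the pole-sitter and of the race winner, and award the bonus once if they match.

rows = [(14, 1, 'ROS', 'mercedes'), (1, 15, 'HAM', 'mercedes'),
        (3, 3, 'VET', 'ferrari'), (8, 4, 'RIC', 'red_bull')]

pole_team = next(team for q, r, d, team in rows if q == 1)
winner_team = next(team for q, r, d, team in rows if r == 1)

seen = set()
for q, r, driver, team in rows:
    # 15 bonus points, awarded once, to the first row of the matching team
    bonus = 15 if (team == pole_team == winner_team and team not in seen) else 0
    seen.add(team)
    print(driver, bonus)  # ROS 15, HAM 0, VET 0, RIC 0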

Clustering unique datasets based on similarities (equality)

I have just entered the space of data mining, machine learning and clustering. I have a specific problem and do not know which technique to use to solve it.
I want to perform clustering of observations (objects or whatever) in a specific data format. All variables in each observation are numeric. My data input looks like this:
1 2 3 4 5 6
1 3 5 7
2 9 10 11 12 13 14
45 1 22 23 24
Let's say that n represents a row (an observation, or 1D vector) and m represents a column (the variable index within each vector). n could be a very large number, and 0 < m < 100. An important point is that values cannot repeat within one observation (in the first row, each value can appear only once).
So, I want to perform clustering where observations are put in one cluster based on the number of identical values the rows/observations share.
If there are two rows like:
1
1 2 3 4 5
They should be clustered into the same cluster; if there is no match, then definitely not. Also, the number of rows in one cluster should not exceed 100.
Sick problem? If not, just for info: I didn't mention the time dimension. But let's skip that for now.
So, any directions from you would be appreciated.
Thanks and best regards,
JDK
It's hard to recommend anything since your problem is totally vague, and we have no information on the data. Data mining (and in particular explorative techniques like clustering) is all about understanding the data. So we cannot provide the ultimate answer.
Two things for you to consider:
1. if the data indicates presence of species or traits, Jaccard similarity (and other set based metrics) are worth a try.
2. if absence is less informative, maybe you should be mining association rules, not clusters
Either way, without understanding your data these numbers are as good as random numbers. You can easily cluster random numbers, and spend weeks to get the best useless result!
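For the first suggestion, a minimal sketch of Jaccard similarity between two rows treated as sets (the rows are from the question's example):

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

print(jaccard([1, 2, 3, 4, 5, 6], [1, 3, 5, 7]))  # 3 shared of 7 total: 0.428...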
Can your problem be treated as a Bag-of-words model, where each article (observation row) has no more than 100 terms?
Anyway, I think you have to give more information and examples about "why" and "how" you want to cluster these data. For example, we have:
1 2 3
2 3 4
2 3 4 5
1 2 3 4
3 4 6
6 7 8
9 10
9 11
10 11 12 13 14
What is your expected clustering? How many clusters are there in this clustering? Only two clusters?
Before you give more information, going by your current description, I think you do not need a clustering algorithm but a connected-components structure. In the first round you process the dataset to build the connected-components information, and in a second round you check which connected component each row belongs to. Taking the example above, first round:
1 2 3 : 1 <- 1, 1 <- 2, 1 <- 3 (all points are linked to the smallest point,
to represent that they belong to the cluster of that smallest point)
2 3 4 : 2 <- 4 (2 and 3 are already linked to 1, which is <= 2, so they do
not need to change)
2 3 4 5 : 2 <- 5
1 2 3 4 : 1 <- 4 (in fact this change is not essential because we already have
1 <- 2 <- 4, but making it can speed up the second round)
3 4 6 : 3 <- 6
6 7 8 : 6 <- 7, 6 <- 8
9 10 : 9 <- 9, 9 <- 10
9 11 : 9 <- 11
10 11 12 13 14 : 10 <- 12, 10 <- 13, 10 <- 14 (11 is already linked to 9)
Now we have a forest structure representing the connected components of the points. In the second round you can pick one point from each row (the smallest is best) and trace its root in the forest. Rows that share the same root are in the same, in your words, cluster. For example:
1 2 3 : 1 <- 1, cluster root 1
2 3 4 5 : 1 <- 1 <- 2, cluster root 1
6 7 8 : 1 <- 1 <- 3 <- 6, cluster root 1
9 10 : 9 <- 9, cluster root 9
10 11 12 13 14 : 9 <- 9 <- 10, cluster root 9
This process takes O(k) space, where k is the number of points, and O(nm + nh) time, where h is the height of the forest structure, with h << m.
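A minimal sketch of this two-round idea in Python, using a standard union-find (disjoint-set) structure over the example rows above (my illustration of the described process; a slight variant in that it always links to the component's smallest root):

rows = [[1, 2, 3], [2, 3, 4], [2, 3, 4, 5], [1, 2, 3, 4],
        [3, 4, 6], [6, 7, 8], [9, 10], [9, 11], [10, 11, 12, 13, 14]]

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:                # walk up to the root
        parent[x] = parent[parent[x]]    # path halving keeps the forest shallow
        x = parent[x]
    return x

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        if ra > rb:                      # link under the smaller root,
            ra, rb = rb, ra              # as in the trace above
        parent[rb] = ra

# first round: link every point in a row to the row's smallest point
for row in rows:
    for p in row[1:]:
        union(row[0], p)

# second round: a row's cluster is the root of any of its points
for row in rows:
    print(find(row[0]), row)   # root 1 for the first six rows, root 9 after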
I am not sure if this is the result you want.

Formatting credit card track II data separator using Cobol

We have a legacy COBOL program that formats the ISO 8583 0100 authorization request. Recently we were told the track II data was invalid due to the separator. The track II data is in a PIC X() field, and we simply replace the = with the character D before running the data through a binary intrinsic two bytes at a time.
We are told that the character is converting to 4 on their side. My question is: What character should we use to replace the = character? Or do we leave the = character alone?
Thanks for any guidance.
Track 2 data is stored on a credit card as binary-coded decimal (BCD) with parity; the other possible binary values are used for control characters.
Hex  ASCII  Meaning
0    0      0
1    1      1
2    2      2
3    3      3
4    4      4
5    5      5
6    6      6
7    7      7
8    8      8
9    9      9
A    :      (not used)
B    ;      Start Sentinel
C    <      (not used)
D    =      Field Separator
E    >      (not used)
F    ?      End Sentinel
I have a feeling that the "binary intrinsic" is simply converting ASCII to BCD. If you use the standard ASCII characters you will get what you want: the = is 0x3D in ASCII, and if you strip off the high nibble you are left with 0xD.
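To illustrate that last point, a hedged Python sketch of ASCII-to-BCD packing (an assumption about what the intrinsic does, with a made-up sample value): keep the low nibble of each character, two characters per output byte.

def ascii_to_bcd(track2):
    if len(track2) % 2:
        track2 += '?'       # pad to an even length; real padding rules vary
    out = bytearray()
    for a, b in zip(track2[::2], track2[1::2]):
        out.append(((ord(a) & 0x0F) << 4) | (ord(b) & 0x0F))
    return bytes(out)

# '=' is 0x3D, so its low nibble is already 0xD; no replacement is needed
print(ascii_to_bcd('123=456').hex())   # '123d456f'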

neo4j: Invalid encoding issue

I've imported the data into the database using batch-import. Starting the neo4j server, I can see in the dashboard that all the nodes and relationships have been imported. When I try to execute a cypher query, however, it returns the message Invalid encoding '12':
neo4j-sh (0)$ start a = node(1) return a
==> Invalid encoding '12'
neo4j-sh (0)$
In the past I have successfully imported data using batch-import, and now, following the same steps, I can't figure out what causes the error.
The file nodes.csv contains the data in the following format:
node_id msisdn:string:users
1 000000F8BE951D6DE6480F4AFDFB670C553E47C0
2 0000021449360C1A398ED9A18800B2B13AA098A4
3 00000DABDE4C555FC82F7D534835247B94873C2C
4 00001BE4128DB41729365A41D3AC1D019E5ED8A6
5 00002506A1BC28F5DAE937703106CE6B39B857A0
6 00002781A2ADA816CDB0D138146BD63323CCDAB2
and relationships.csv the following:
calling_party called_party connection_type:string:connection consecutive_day:int:connection called_network_id:int:connection calls_count:int duration_sum:int
6 209339 CALLED 3 9800 1 532
6 667602 CALLED 25 9800 1 31
6 917611 CALLED 54 9811 2 17
6 1057687 CALLED 14 9800 1 29
6 1070735 CALLED 41 9800 1 285
6 1070735 CALLED 43 9800 1 18
6 2106202 CALLED 29 9802 1 26
6 2106202 CALLED 0 9802 1 10
The settings in batch.properties are configured as follows:
use_memory_mapped_buffers=true
neostore.nodestore.db.mapped_memory=100M
neostore.relationshipstore.db.mapped_memory=500M
neostore.propertystore.db.mapped_memory=4G
neostore.propertystore.db.strings.mapped_memory=200M
neostore.propertystore.db.arrays.mapped_memory=0M
neostore.propertystore.db.index.keys.mapped_memory=15M
neostore.propertystore.db.index.mapped_memory=15M
batch_import.node_index.users=exact
batch_import.relationship_index.connection=exact
I am not sure whether the problem might be related to the fact that I prepared the CSV files on a Windows machine while neo4j and batch-import are running on a Linux machine.
Anyway, I would be really thankful if anyone could help solve the problem.
Perhaps a \r line ending? I fixed that yesterday; perhaps you want to re-import your data with the latest build?
Or clean the line endings before importing with dos2unix or tr -d '\r' <input >output
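Equivalently, a tiny Python sketch that normalizes the line endings in place before importing (the file name is assumed):

with open('nodes.csv', 'rb') as f:
    data = f.read().replace(b'\r\n', b'\n').replace(b'\r', b'\n')
with open('nodes.csv', 'wb') as f:
    f.write(data)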
