I would like to parse a text file that has the section of interest as follows:
mesh 0 400 12000
400 300 400
1 0 -1
300 500 600
0 0 1
etc....
12000
1300
1100
etc..
I only want the rows that immediately follow the row starting with the string mesh, and every other row after that, provided they have 3 columns. I would like this output written to a separate text file with a modified name.
So desired output text file:
400 300 400
300 500 600
I tried to do this with Python and loops, but it literally took hours and never finished, as there are thousands to hundreds of thousands of lines in the original text file.
Is there a more efficient way to do this with a bash script using awk?
awk to the rescue!
$ awk '/^mesh/{n=NR;next} NF==3 && n && NR%2==(n+1)%2' file > filtered_file
400 300 400
300 500 600
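The awk one-liner records the line number of the mesh line, then keeps every 3-field line whose parity matches the line right after it. For comparison, the same single-pass filter in plain Python handles files of this size in seconds; hours-long runtimes usually come from re-scanning the file or growing lists line by line. A sketch (the function and file names are illustrative):

```python
# Stream the file once: after a line starting with "mesh", keep the 1st,
# 3rd, 5th, ... following lines, but only if they have exactly 3 columns.
def filter_mesh(src, dst):
    with open(src) as fin, open(dst, "w") as fout:
        start = None                       # line number of the last "mesh" line
        for lineno, line in enumerate(fin, 1):
            fields = line.split()
            if fields and fields[0] == "mesh":
                start = lineno
                continue
            # keep 3-column lines whose parity matches the line after "mesh"
            if start is not None and len(fields) == 3 and lineno % 2 == (start + 1) % 2:
                fout.write(line)

# demo on the sample data from the question
with open("file", "w") as f:
    f.write("mesh 0 400 12000\n400 300 400\n1 0 -1\n300 500 600\n0 0 1\n")
filter_mesh("file", "filtered_file")
print(open("filtered_file").read())
# 400 300 400
# 300 500 600
```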
I have a PySpark dataframe of ids showing the results of a series of linkage operations. It shows how the records within a series of dataframes are linked across those dataframes. An example is below, where df_a represents the first dataframe in a pair comparison and df_b the second; link_a is the id of a record in df_a that links to an entry in df_b, and link_b is the id of that entry.
So the first row is saying that dataframe 1 entry 100 links to entry 200 in dataframe 2. In the next row entry 100 in dataframe 1 links to entry 300 in dataframe 3.
df_a  df_b  link_a  link_b
1     2     100     200
1     3     100     300
1     4     100     400
1     5     100     500
2     3     200     300
2     4     200     400
2     5     200     500
3     4     300     400
4     5     400     500
1     2     101     201
1     3     101     301
2     3     201     301
2     3     202     302
1     3     103     303
1     5     103     503
In the real table there are many dataframe comparisons and, across all the dataframes, hundreds of thousands of links.
What I am trying to achieve is to reformat this table to show the links across all the dataframes. You can see in the above example that 100 links to 200, 200 to 300, 300 to 400 and 400 to 500, so all 5 records link together. Not all cases will have a record in each dataframe, so in some cases you will end up with incomplete chains and empty values.
The end result would look like:
df_1  df_2  df_3  df_4  df_5
100   200   300   400   500
101   201   301   -     -
-     202   302   -     -
103   -     303   -     503
I will then use the above to add a common id to the underlying data.
I have gotten part of the way to this using a series of joins, but that seems clumsy and quickly becomes difficult to follow. I have now run out of ideas as to how to solve this, so assistance getting closer, or even a full solution, would be gratefully received.
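One way to avoid the chain of joins is to treat each link row as an edge in a graph between (dataframe, id) nodes and group records into connected components; in PySpark this grouping is what GraphFrames' connectedComponents computes. A minimal union-find sketch in plain Python on the example data, just to show the grouping logic (not the distributed implementation):

```python
from collections import defaultdict

# Each link row is an edge between (dataframe number, record id) nodes.
links = [
    (1, 2, 100, 200), (1, 3, 100, 300), (1, 4, 100, 400), (1, 5, 100, 500),
    (2, 3, 200, 300), (2, 4, 200, 400), (2, 5, 200, 500), (3, 4, 300, 400),
    (4, 5, 400, 500), (1, 2, 101, 201), (1, 3, 101, 301), (2, 3, 201, 301),
    (2, 3, 202, 302), (1, 3, 103, 303), (1, 5, 103, 503),
]

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]    # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for df_a, df_b, link_a, link_b in links:
    union((df_a, link_a), (df_b, link_b))

# One output row per component: the id per dataframe, "-" where missing.
groups = defaultdict(dict)
for (df, rec) in parent:
    groups[find((df, rec))][df] = rec

rows = sorted((tuple(g.get(df, "-") for df in range(1, 6))
               for g in groups.values()), key=str)
for row in rows:
    print(row)
# ('-', 202, 302, '-', '-')
# (100, 200, 300, 400, 500)
# (101, 201, 301, '-', '-')
# (103, '-', 303, '-', 503)
```

The component root then serves directly as the "common id" to attach back to the underlying data.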
I am trying to print special characters on a Zebra Printer (é, à, Ô).
So far, I've tried solutions found on StackOverflow (like this one: Print characters with an acute in ZPL). In that particular one, the special characters do print correctly, but the font is big and the printer unrolls a few inches of paper before actually printing.
I've read Zebra Programming doc but I can't seem to make it work.
Also, it does not look at all like the code I have so far:
T 0 3 40 0 ^FDHimudit\82
T 0 3 40 30
T 0 3 40 60
T 0 3 40 90 Déroulage Réduction Öyster
T 0 3 40 120 Règle ÀAA ÂA
SETFF 100 2.5 FORM PRINT
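Note that the snippet above is CPCL-style (T, SETFF, FORM, PRINT), not ZPL, which may be why the linked examples look so different. If the printer accepts ZPL II, one common approach is to switch it to UTF-8 with ^CI28 and use a scalable font; a sketch, assuming a ZPL-capable printer and that the label is sent as UTF-8 (positions and font sizes are illustrative):

```
^XA
^CI28^FX select UTF-8 so accented characters print as-is^FS
^FO40,90^A0N,30,30^FDDéroulage Réduction Öyster^FS
^FO40,120^A0N,30,30^FDRègle ÀAA ÂA^FS
^XZ
```

^A0N,30,30 is the built-in scalable font at a modest size, which should also avoid the oversized text seen with the linked solution.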
I am working on a ML problem to predict house prices and Zip Code is one feature which will be useful. I am also trying to use Random Forest Regressor to predict the log of the price.
However, should I use One Hot Encoding or Label Encoder for Zip Code? Because I have about 2000 Zip Codes in my dataset and performing One Hot Encoding will expand the columns significantly.
https://datascience.stackexchange.com/questions/9443/when-to-use-one-hot-encoding-vs-labelencoder-vs-dictvectorizor
To rephrase: does it make sense to use LabelEncoder instead of One Hot Encoding on Zip Codes?
Like the link says:
LabelEncoder can turn [dog,cat,dog,mouse,cat] into [1,2,1,3,2], but then the imposed ordinality means that the average of dog and mouse is cat. Still there are algorithms like decision trees and random forests that can work with categorical variables just fine and LabelEncoder can be used to store values using less disk space.
And yes, you are right: when you have 2000 categories for zip codes, one hot encoding may blow up your feature set massively. In many cases when I had such problems, I opted for binary encoding and it worked out fine most of the time, so it is worth a shot for you.
Imagine you have 9 categories, mark them from 1 to 9, and binary encode them; you will get:
cat 1 - 0 0 0 1
cat 2 - 0 0 1 0
cat 3 - 0 0 1 1
cat 4 - 0 1 0 0
cat 5 - 0 1 0 1
cat 6 - 0 1 1 0
cat 7 - 0 1 1 1
cat 8 - 1 0 0 0
cat 9 - 1 0 0 1
There you go: you overcome the LabelEncoder ordinality problem, and you also get 4 feature columns instead of the 8 you would need with one hot encoding. This is the basic intuition behind the Binary Encoder.
PS: Given that 2 to the power 11 is 2048 and you have 2000 categories for zip codes, you can reduce your feature columns to 11 instead of 1999 in the case of one hot encoding!
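The table above can be reproduced with a few lines of plain Python; the category_encoders package provides a ready-made BinaryEncoder for pandas dataframes, but the idea is just this (the helper name is illustrative):

```python
# Binary-encode a categorical column: map each category to an integer,
# then spread that integer's bits over ceil(log2(n_categories + 1)) columns.
from math import ceil, log2

def binary_encode(values):
    cats = sorted(set(values))
    index = {c: i + 1 for i, c in enumerate(cats)}   # 1-based, as in the table
    width = max(1, ceil(log2(len(cats) + 1)))
    return [[(index[v] >> bit) & 1 for bit in range(width - 1, -1, -1)]
            for v in values]

print(binary_encode(["dog", "cat", "dog", "mouse", "cat"]))
# cat=1, dog=2, mouse=3 -> 2-bit codes:
# [[1, 0], [0, 1], [1, 0], [1, 1], [0, 1]]
```

With 2000 zip codes this yields ceil(log2(2001)) = 11 columns, matching the PS above.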
Say I have this file, test.log:
blabla test test
20 30 40
hello world
100 100
34 506 795
blabla test2
50 60 70
hello
10 10
200 200
blabla test BB
30 40 50
100 100
20 20 20 20
I would like to print all lines with blabla in them, plus the line after each one, with the match number prepended.
Without match number, it is easy:
$ grep -A1 "blabla" test.log
blabla test test
20 30 40
--
blabla test2
50 60 70
--
blabla test BB
30 40 50
With a prepended match number, it would look like this:
1: blabla test test
1: 20 30 40
--
2: blabla test2
2: 50 60 70
--
3: blabla test BB
3: 30 40 50
The tricky part is that I want to preserve the match number regardless of whether I grep for a single-line match or with context (X lines after or before the match).
Is there an easy way to do this? If I could do a format specifier for the number, as in %03d, even better - but just a usual number would be fine too...
Something like
grep -A1 blabla test.log | awk -v n=1 '$0 == "--" { n += 1; print; next }
{ printf("%03d: %s\n", n, $0) }'
Perl to the rescue!
perl -ne '/blabla/ and print ++$i, ":$_" and print "$i:", scalar <>' -- file
-n reads the input line by line
each line is read into the special variable $_
the diamond operator <> reads the next line from the input file
scalar makes it read just one line, not all the remaining ones
the variable $i is incremented each time blabla is encountered and is prepended to each output line.
Your specification doesn't handle the case when two blablas are present on adjacent lines.
To format the numbers, use sprintf:
perl -ne 'if (/blabla/) { $f = sprintf "%03d", ++$i; print $f, ":$_"; print "$f:", scalar <>}'
I know I can get a random number, for example from 0 to 361, using:
arc4random() % 362
But how can I get a random number in between for example 200 to 300?
(arc4random() % 100) + 200
Not to be picky, but I am pretty sure you have to add a number...
if you want all the numbers from 200 to and including 300... use
200 + arc4random()%101;
arc4random()%100 would give a random number from 0 to 99 so 300 would never occur...
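The arithmetic generalizes to min + arc4random() % (max - min + 1); on platforms that have it, arc4random_uniform(101) + 200 is preferable since it also avoids modulo bias. A quick sketch of the same bounds check in Python (random.randrange stands in for the 32-bit arc4random output):

```python
import random

def rand_between(lo, hi):
    # mirrors: lo + arc4random() % (hi - lo + 1)
    return lo + random.randrange(2**32) % (hi - lo + 1)

samples = [rand_between(200, 300) for _ in range(10000)]
print(min(samples), max(samples))  # always stays within 200..300
```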