How to use some text processing (awk etc.) to put some character in a text file at certain lines - parsing

I have a text file of hex values, one value per line, with many such values one below another. To analyse the values I need to put some kind of delimiter/marker, say a '#', in this file before line numbers 32, 47, 62, 77, ... The difference between two consecutive line numbers in this pattern is always 15.
I am trying to do it using awk. I tried a few things but they didn't work.
What is the command in awk to do it?
Any other solution involving some other language/script/tool is also welcome.
Thank you.
-AD

This is how you can use AWK for it,
awk 'BEGIN{ i=0; } \
{if (FNR<32) {print $0} \
else {if (i++%15) {print $0} else {printf "#%s\n",$0}}\
}' inputfile.txt > outputfile.txt
How it works,
BEGIN sets a counter that starts at your first marked line, 32
FNR<32 prints the first 31 records unchanged; counting starts at record 32 (the first one that needs a #)
input lines are called records and FNR is an AWK variable that counts them
Once we start counting, i++%15 is 0 on record 32 and on every 15th record after it (47, 62, 77, ...), and those records get a # prefix
$0 is the record (the line) as is
You can also type the whole program on a single command line, separating the parts with spaces and leaving out the trailing '\'.
Or, you can use it as an AWK file,
# File: comment.awk
BEGIN { i = 0 }
{
    if (FNR < 32) {
        print $0                 # first 31 records pass through unchanged
    } else {
        if (i++ % 15) {
            print $0
        } else {
            printf "#%s\n", $0   # records 32, 47, 62, 77, ...
        }
    }
}
And run it as,
awk -f comment.awk inputfile.txt > outputfile.txt
Hope this will help you to use more AWK.

Python:
offset = 4  # 0 <= offset < 15 ; first marker after fourth line in this example
with open("file.txt") as f_in, open("file_out.txt", "w") as f_out:
    for num, line in enumerate(f_in):
        if not (num - offset) % 15:
            f_out.write("#\n")
        f_out.write(line)
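For the exact line numbers in the question (a marker before lines 32, 47, 62, 77, ...), the same idea can be sketched as a small function. The name insert_markers and its parameters first, step and marker are made up for this illustration, and it works on a list of lines in memory rather than on files:

```python
def insert_markers(lines, first=32, step=15, marker="#"):
    """Return a new list with `marker` inserted before line `first`
    (1-based) and before every `step`-th line after it."""
    out = []
    for number, line in enumerate(lines, start=1):
        # Lines first, first + step, first + 2 * step, ... get a marker.
        if number >= first and (number - first) % step == 0:
            out.append(marker)
        out.append(line)
    return out


if __name__ == "__main__":
    sample = [format(n, "x") for n in range(1, 81)]  # 80 dummy hex values
    for entry in insert_markers(sample):
        print(entry)
```

Unlike the AWK answer, which prefixes the marker to the line itself, this sketch puts the marker on a line of its own, as the Python snippet above does.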

Haskell:
offset = 31;
chunk_size = 15;
main = do
  {
    (h, t) <- fmap (splitAt offset . lines) getContents;
    mapM_ putStrLn h;
    mapM_ ((putStrLn "#" >>) . mapM_ putStrLn) $
      map (take chunk_size) $
      takeWhile (not . null) $
      iterate (drop chunk_size) t;
  }

Related

Lua pattern matching problem with escaped letter

I already have a rule that \ should be replaced with \\\\, so the existing code is
string.gsub(s, '\\', '\\\\\\\\')
but there is some data that should not be converted, such as abc\"cba, which will be replaced with abc\\\\"cba.
How can I constrain the replacement so that only a \ that is not followed by " is replaced, such as
'abc\abc' -> 'abc\\\\abc'
'abc\"abc' -> 'abc\"abc'
I have used patterns like \\[^\"]- and \\[^\"]+- but none of them works.
Thanks
You can use
string.gsub((s .. ' '), '\\([^"])', '\\\\\\\\%1'):sub(1, -2)
See the online demo:
local s = [[abc\abc abc\"abc\]];
s = string.gsub((s .. ' '), '\\([^"])', '\\\\\\\\%1'):sub(1, -2)
print( s );
-- abc\\\\abc abc\"abc\\\\
Notes:
\\([^"]) - matches two chars, a \ and then any one char other than a " char (that is captured into Group 1)
\\\\\\\\%1 - replacement pattern that replaces each match with 4 backslashes and the value captured in Group 1
(s .. ' ') - a space is appended at the end of the input string so that the pattern could consume a char other than a " char
:sub(1, -2) - removes the last "technical" space that was added.
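Lua patterns have no lookahead, which is why the answer has to consume the following character and append a temporary space. For comparison only, here is a minimal sketch of the same rule in a regex engine that does support lookahead (Python's re module). The function name escape_backslashes is made up for this illustration:

```python
import re


def escape_backslashes(text):
    # Replace each backslash that is NOT followed by a double quote with
    # four backslashes. (?!") is a negative lookahead, so the character
    # after the backslash is checked but not consumed.
    return re.sub(r'\\(?!")', lambda match: "\\" * 4, text)


sample = r'abc\abc abc\"abc' + "\\"   # same input as the Lua demo
print(escape_backslashes(sample))
```

Because the lookahead consumes nothing, no padding character is needed at the end of the string. Note that the two versions differ on runs of consecutive backslashes: the Lua pattern consumes the second backslash of a pair as the captured character, while the lookahead version replaces each backslash separately.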

How to parse values with AWK when column number is inconsistent

Input file:
6 31236622 HLA_C*05:01:01:01 A T . PASS AF=0.07724;MAF=0.07724;R2=0.98466;IMPUTED GT:DS:HDS:GP 1|0:0.999:0.999,0.000:0.001,0.999,0.000 0|0:0:0,0:1,0,0 1|1:1.994:0.995,1.000:0.000,0.006,0.994
6 29910248 HLA_A*01:01 A T . PASS AF=0.15969;MAF=0.15969;R2=0.97333;IMPUTED GT:DS:HDS:GP 0|0:0:0,0:1,0,0 1|0:1.000:1.000,0.000:0.000,1.000,0.000 0|0:0:0,0:1,0,0
6 31322134 HLA_B*55:01 A T . PASS AF=0.01091;MAF=0.01091;R2=0.94511;IMPUTED GT:DS:HDS:GP 0|0:0:0,0:1,0,0 0|0:0:0,0:1,0,0 0|0:0:0,0:1,0,0
6 31322132 HLA_B*55 A T . PASS AF=0.01091;MAF=0.01091;R2=0.94485;IMPUTED GT:DS:HDS:GP 0|0:0:0,0:1,0,0 0|0:0:0,0:1,0,0 0|0:0:0,0:1,0,0
6 31322006 HLA_B*44:02:01:01 A T . PASS AF=0.08074;MAF=0.08074;R2=0.97706;IMPUTED GT:DS:HDS:GP 1|0:0.999:0.999,0.000:0.001,0.999,0.000 0|0:0:0,0:1,0,0 1|1:1.997:0.998,0.999:0.000,0.003,0.997
I want to parse a specific number from each column after the "GT:DS:HDS:GP" column, specifically, the numbers after "x|x:". So desired output is:
0.999, 0, 1.994
0, 1.000, 0
0, 0, 0
0, 0, 0
0.999, 0, 1.997
To parse the desired values from (e.g.) line 4, I can use:
awk -F: '{for (i=5; i<=NF; i+=3) printf "%s%s", $i, (i+3 <= NF ? ", " : ORS)}'
Line 5 would require:
awk -F: '{for (i=8; i<=NF; i+=3) printf "%s%s", $i, (i+3 <= NF ? ", " : ORS)}'
So the problem with the input file is that column 3 (space delimited) contains a variable number of colons, which makes colons a poor delimiter for this particular input file (but the desired values are surrounded by colons!)
I thought about using "|" as the delimiter, with substr($i,3,?), but the desired values have an inconsistent number of digits (hence the "?").
Is there a flexible awk code to get the desired output?
You may try this awk:
awk -v OFS=', ' '$9 == "GT:DS:HDS:GP" {for (i=10; i<=NF; ++i) if ($i ~ /^[0-9]+\|[0-9]+:/ && split($i, a, /:/)) printf "%s", (i == 10 ? "" : OFS) a[2]; print ""}' file
0.999, 0, 1.994
0, 1.000, 0
0, 0, 0
0, 0, 0
0.999, 0, 1.997
An expanded form:
awk -v OFS=', ' '
$9 == "GT:DS:HDS:GP" {
   for (i=10; i<=NF; ++i)
      if ($i ~ /^[0-9]+\|[0-9]+:/ && split($i, a, /:/))
         printf "%s", (i == 10 ? "" : OFS) a[2]
   print ""
}' file
Why do you care about the space-delimited columns at all?
awk '{ sub(/.* GT:DS:HDS:GP */, "");
    i = split($0, n, /[0-9]\|[0-9]:/);
    sep = "";
    for(x=2; x<=i; x++) {
        sub(/:.*/, "", n[x]); printf("%s%s", sep, n[x]); sep=", " }
    printf "\n"; }' file
We successively pick apart each line, first by removing everything through GT:DS:HDS:GP from the line, then by splitting the remaining string into n on the specified delimiter, and then cleaning up the resulting fields by removing everything after the first colon in each, and printing the result. (We skip the first one, which only contains the useless short or empty string before the first delimiter.)
Output for your sample:
0.999, 0, 1.994
0, 1.000, 0
0, 0, 0
0, 0, 0
0.999, 0, 1.997
I have no idea what these fields stand for, so I just picked single-letter variable names; you can probably improve the readability by giving these variables more descriptive names.
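The approach shared by both answers (ignore the colons in the leading columns, locate the GT:DS:HDS:GP column, then take the second colon-separated value of every column after it) can also be sketched in Python. The function name extract_second_values and the format_key parameter are made up for this illustration, and the sketch assumes every line contains the GT:DS:HDS:GP column exactly as in the sample input:

```python
def extract_second_values(line, format_key="GT:DS:HDS:GP"):
    """Return the second colon-separated value of every
    whitespace-separated column that follows the format_key column."""
    fields = line.split()
    start = fields.index(format_key) + 1   # first sample column
    return [sample.split(":")[1] for sample in fields[start:]]


line = (
    "6 31322006 HLA_B*44:02:01:01 A T . PASS "
    "AF=0.08074;MAF=0.08074;R2=0.97706;IMPUTED GT:DS:HDS:GP "
    "1|0:0.999:0.999,0.000:0.001,0.999,0.000 0|0:0:0,0:1,0,0 "
    "1|1:1.997:0.998,0.999:0.000,0.003,0.997"
)
print(", ".join(extract_second_values(line)))  # 0.999, 0, 1.997
```

Splitting on whitespace first means the variable number of colons in column 3 never affects the result.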

line break in a Function in Mathjax

A very basic question about line breaks. I'm a newbie at MathJax but understand LaTeX well. I'm using MathJax to make a quiz.
I tried to use \\ in MathJax but it doesn't show the line break. I'd like the question to say:
If a + 10 = 2,
then what is the value of a
{
    op1: 0,
    question: function() {
        op1 = Math.ceil(Math.random() * 5) + 1;
        return `If $ a + ${op1} = 2 \\ $, then what is the value of $a$`;
    },
    answer: function() {
        return op1 * op2;
    }
}
thanks
Since \ is a special character in JavaScript string literals (template literals included), you need to double it if you want an actual \ in your string. So you would need to write \\\\ to get \\ in the resulting string. Your \\ produces just a single \, which (together with the following space) forms the "\ " control sequence, and that only adds a space at the end of the expression.
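The same doubling rule applies to ordinary (non-raw) string literals in Python, which makes the effect easy to check in an interpreter. This is only an analogy in a different language, not the quiz code itself, and the variable names are made up for the illustration:

```python
one_backslash = "\\ "    # two backslashes in the source become one, plus the space
two_backslashes = "\\\\"  # four backslashes in the source become two

print(len(one_backslash))    # 2: a backslash and a space
print(len(two_backslashes))  # 2: two backslashes
print(two_backslashes)       # \\
```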

Applying quantifier to a sentence in Lua pattern

So I am trying to parse #define statements out of a C file using Lua patterns, but there is the case of multiline defines, where the newline character may be escaped with a backslash.
In order for me to know where the define ends, I need to be able to define backslash + linebreak as if it were a single character so I can get the complement of that and then use the * quantifier on it and then count until the first non-escaped linebreak.
How do I do that?
You cannot simply replace all occurrences of "\\\n" with some temporary symbol, because a problem will arise with the line "c\\\\\n" in the following example.
Instead, you should implement a mini-scanner for C source files:
local str = [[
#define x y
#define a b\
c\\
d();
#define z
]]
-- Print all #defines found in the text
local line = ""
for char in str:gmatch"\\?." do
    if char == "\n" then
        if line:sub(1, #"#define") == "#define" then
            print(line)
        end
        line = ""
    else
        line = line..char
    end
end
Output:
#define x y
#define a b\
c\\
#define z
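The scanner works because the pattern \\?. always consumes a backslash together with the character after it, so an escaped newline never appears as a token on its own, while the newline after c\\ does. A minimal sketch of the same idea in Python, using a regex with the DOTALL flag so that . also matches newlines (the function name find_defines is made up for this illustration):

```python
import re

SAMPLE = r"""#define x y
#define a b\
c\\
d();
#define z
"""


def find_defines(source):
    """Collect every logical line that starts with #define, treating
    backslash + newline as part of the line rather than its end."""
    defines = []
    line = ""
    # Each token is either a single character or a backslash plus the
    # character that follows it.
    for token in re.findall(r"\\?.", source, flags=re.DOTALL):
        if token == "\n":          # an unescaped newline ends the line
            if line.startswith("#define"):
                defines.append(line)
            line = ""
        else:
            line += token
    return defines


for define in find_defines(SAMPLE):
    print(define)
```

Like the Lua version, this only reports a line once it is terminated by an unescaped newline, so a final #define without a trailing newline would be missed.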

mnist database parsing c

I am trying to parse the MNIST Database of handwritten numbers. However, when I look at the values that it is giving me when I use fread, they aren't right. I have changed the endianness, but the numerical values aren't correct still. Link to the database is here: http://yann.lecun.com/exdb/mnist/
int ChangeEndianness(int value) {
    int result = 0;
    result |= (value & 0x000000FF) << 24;
    result |= (value & 0x0000FF00) << 8;
    result |= (value & 0x00FF0000) >> 8;
    result |= (value & 0xFF000000) >> 24;
    return result;
}
FILE *imageTestFiles = fopen("train-images-idx3-ubyte.gz","r");
if(imageTestFiles == NULL) {
    perror("File Not Found");
}
int magic_number_bytes;
fread(&magic_number_bytes, sizeof(int), 1, imageTestFiles);
printf("%d\n", ChangeEndianness(magic_number_bytes));
All this is supposed to do is print the "magic number" which is 2049 or 0x00000801, but it instead prints a 529205256 which is 0x1F8B0808. I am sorta new to C, always used Java beforehand. Thanks in advance!
The file must first be decompressed rather than simply removing the gz extension.
One can tell your code is operating on a compressed file because 0x1F8B is the magic number for the gzip file format.
If xxd is used to display the file contents after downloading you get the observed 0x1F8B0808:
$ xxd -p train-images-idx3-ubyte.gz | head -c 8
1f8b0808
However, if you decompress the file:
$ gunzip train-images-idx3-ubyte.gz
$ xxd -p train-images-idx3-ubyte | head -c 8
00000803
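you get the expected magic number for the MNIST image file, 0x00000803 (2051). The 2049 (0x00000801) mentioned in the question is the magic number of the label files, not of the image files.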
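The check described above (look for the gzip signature 0x1F8B, decompress if it is present, then read the first four bytes as a big-endian integer) can be sketched in Python with the standard gzip and struct modules. The function name read_idx_magic is made up for this illustration:

```python
import gzip
import struct

GZIP_SIGNATURE = b"\x1f\x8b"  # first two bytes of every gzip file


def read_idx_magic(path):
    """Return the 32-bit big-endian magic number at the start of a file,
    decompressing on the fly if the file is gzip-compressed."""
    with open(path, "rb") as raw:
        is_gzip = raw.read(2) == GZIP_SIGNATURE
    opener = gzip.open if is_gzip else open
    with opener(path, "rb") as stream:
        # ">i" means big-endian signed 32-bit, so no manual byte swap is needed.
        (magic,) = struct.unpack(">i", stream.read(4))
    return magic
```

Assuming the file is the one from the question, read_idx_magic("train-images-idx3-ubyte.gz") and read_idx_magic("train-images-idx3-ubyte") should both return 2051.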
