Hi, I have a directory of files, and each file contains text strings in multiple languages spread over multiple lines. Using grepWin, I would like to extract all the English text strings and save them to another text file. Typically, the English text in each file is inside a Switch/Case condition like this:
Default //English
bitmap 8 20 "bmp5/warning.bmp"
Ltext 5 1 11 "USB Device Overload"
LText 85 20 13 "USB"
Ltext 50 33 13 "Device Overload!"
Break
Case _French
bitmap 8 20 "bmp5/warning.bmp"
LTEXT 5 1 11 "Surcharge clé USB!"
LTEXT 45 30 13 "Surcharge clé USB"
Break
Since all the English text is always between 'Default' and 'Break', I want to use those two keywords as the delimiters. Finally, all the text between the two keywords needs to be saved to another text file.
Can anyone help?
Thanks guys
Through grep.
$ grep -oPz '(?s)(?<=\n|^)Default\b[^\n]*\n\K.*?(?=\nBreak\b)' file
bitmap 8 20 "bmp5/warning.bmp"
Ltext 5 1 11 "USB Device Overload"
LText 85 20 13 "USB"
Ltext 50 33 13 "Device Overload!"
To save the result to another file, use the output redirection operator.
grep -oPz '(?s)(?<=\n|^)Default\b[^\n]*\n\K.*?(?=\nBreak\b)' infile > outfile
From man grep
-P, --perl-regexp PATTERN is a Perl regular expression
-z, --null-data a data line ends in 0 byte, not newline
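Since the question mentions a whole directory of files, here is a minimal Python sketch of the same extraction, in case grep with -P is not available. The function names and the "*.txt" glob pattern are my own assumptions, and it assumes 'Default' and 'Break' always start at the beginning of a line, as in the sample.

```python
import re
from pathlib import Path

# A line starting with "Default", then (lazily) everything up to,
# but not including, the next line that starts with "Break".
BLOCK_RE = re.compile(r"^Default\b[^\n]*\n(.*?)^Break\b", re.DOTALL | re.MULTILINE)


def extract_default_blocks(text):
    """Return the body of every Default ... Break block found in text."""
    return [m.group(1) for m in BLOCK_RE.finditer(text)]


def extract_from_directory(src_dir, out_file, pattern="*.txt"):
    """Append the Default blocks of every matching file to out_file.

    The glob pattern is an assumption; adjust it to the real file names.
    """
    with open(out_file, "w", encoding="utf-8") as out:
        for path in sorted(Path(src_dir).glob(pattern)):
            text = path.read_text(encoding="utf-8", errors="replace")
            for body in extract_default_blocks(text):
                out.write(body)
```

Calling extract_from_directory("some_dir", "english.txt") would then collect the blocks from all matching files into one output file; both names are placeholders.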
I have two files with thousands of lines:
file1:
COL22A1 LCT 1 12 0.149667616334 2.16226378401
GPRIN2 TP53 12 170 0.0455368539793 44.2359753827
MUC3A TP53 12 170 0.0455368539793 44.2359753827
file2:
COL22A1 LCT 12 41 23 0.0296296296296 0.101234567901 0.0567901234568 2.36563
MEGF10 SORCS1 10 21 39 0.0246913580247 0.0518518518519 0.0962962962963 2.30599
I want to compare the first two columns of these files, and if they match, print the whole line of the second file followed by the last column of the first file:
output:
COL22A1 LCT 12 41 23 0.0296296296296 0.101234567901 0.0567901234568 2.36563 2.16226378401
I tried awk, grep, and join, but I always get the output of just one file.
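Could you please try the following and let us know how it goes? It stores the last column of the first file keyed on the first two columns, then prints each line of the second file whose key was seen (the "in" test also works when the stored value is 0 or empty).
awk 'FNR==NR{a[$1,$2]=$NF;next} ($1,$2) in a{print $0,a[$1,$2]}' Input_file1 Input_file2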
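For anyone who wants to see the logic spelled out, here is a rough Python sketch of the same two-pass join. The function name is hypothetical, and it assumes whitespace-separated columns as in the samples.

```python
def join_on_first_two(lines1, lines2):
    """For each line of lines2 whose first two columns also appear in lines1,
    return that line with the last column of the matching lines1 row appended."""
    last_col = {}
    for line in lines1:
        fields = line.split()
        if len(fields) >= 2:
            # Later duplicates overwrite earlier ones, as in the awk version.
            last_col[(fields[0], fields[1])] = fields[-1]

    joined = []
    for line in lines2:
        fields = line.split()
        if len(fields) >= 2 and (fields[0], fields[1]) in last_col:
            joined.append(line.rstrip("\n") + " " + last_col[(fields[0], fields[1])])
    return joined
```

With real files you would pass the open file objects (or their readlines() results) as the two arguments.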
With a PostScript driver (Xerox, Canon, HP, all of them), when I create a PS file, for example by printing the test page from the printer properties, I get:
OK:
The result displays correctly (with GSview, for example).
Not OK:
The file size is too big, more than 4 MB.
When I open the file in an editor, I see one big image (doNimage). I think this is the reason for the large file size.
The example file : https://drive.google.com/open?id=0B9bet657DEU5alV6WFZZdDFjMmc
I'm on Windows 10, and I see a similar problem with Windows Server 2012 R2.
I left the driver configuration at its defaults.
Does anyone have an idea?
Thanks a lot.
Regards.
I don't understand your problem: the file you posted a link to contains text. Here's an example:
360 4485 M <202530360E0F1102381030100D100B0824152D30103102020C302A1E19181B1E1730132E28301530132D3B02230B2A2E22081308>[46 16 28 70 18 42 44 44 54 32 28 32 36 32 25 39 65 40 40 28 32 44 44 44 18 28 53 45 20 47 38 45
40 28 34 40 40 28 40 28 34 40 18 44 44 25 53 40 16 39 34 0]xS
M is a moveto and xS uses the xshow operator to draw the glyphs represented by the character codes in the hexstring, using the values in the array to modify the width of each glyph.
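To make that pairing concrete, here is a small Python sketch that splits such a hexstring into character codes and lines them up with the numbers from the array. It assumes one byte per character code and one array entry per glyph; the function name is made up for illustration.

```python
def pair_codes_with_advances(hexstring, advances):
    """Pair each one-byte character code in a PostScript hexstring
    with the corresponding number from the xshow array."""
    codes = list(bytes.fromhex(hexstring))
    if len(codes) != len(advances):
        raise ValueError("expected one array entry per character code")
    return list(zip(codes, advances))


# First three codes and array values from the example line above.
pairs = pair_codes_with_advances("202530", [46, 16, 28])
```

The codes that come out are just indices into the downloaded subset font, not ASCII.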
If you were expecting to see ASCII character codes, you are going to be sadly disappointed. The file uses an incrementally downloaded subset TrueType font, so the character codes are defined as they are encountered: the first glyph used is given character code 1, the second character code 2, and so on.
Even without that, using ASCII would limit the languages that could be supported. Back in the 1980s that maybe didn't seem like a problem, but it's a long time since that was considered acceptable.
If you were expecting to be able to modify the text by editing it in a text editor, forget it. PostScript is a programming language, and the output of a PostScript printer driver is a machine-generated program. It's a lengthy process, even for a skilled user of the language, to decipher what the program is doing. The program is not amenable to alteration; if there's a fault in the output, correct the original document and recreate the PostScript program from it.
PostScript is not an editable format.
Thanks all for your response. I see I was not very clear in my question.
Here is the situation:
With the PS driver on Windows Server 2008, I get this file:
http://expirebox.com/download/0bb511565377e8b74eead67641fe7f68.html
Inside the file I can see the text "Page de test d\222imprimante"
On Windows Server 2012 R2:
http://expirebox.com/download/60fa957cba97c82bbcd5c0e975825b52.html
I can't see any text. It's a printer test page too.
I need to see the text because I'll be printing documents with codes inside them. The printer uses these codes to identify the page type (for example, a white page from tray 1, a yellow page from tray 2).
KenS: I understand your point. But why does the same driver produce different files?
I checked whether it's really the same driver. The only difference I can see is the OS: one is x86, the other x64.
Thanks.
Regards.
I am trying to extract the text of page 5 of a PDF.
The PDF has a font, YLJAAA+CMSY10, which has no mappings (CMap) or even encodings (default encoding or /Differences). While extracting text, after the string "tetex package", CGPDFScanner returns a "\x15" character, which is encountered many times. When this character is encountered, the current font is the one mentioned above, which offers nothing for mapping the PDF string to text.
What is this \x15 character?
Thanks.
I found 2 (not "many") occurrences of this:
[ (\025) ] TJ
which is a number in octal; 025 octal is 21 decimal, the same number as \x15 in hexadecimal.
The font definition for "YLJAAA+CMSY10" in the PDF carries no special encoding, so it has the default encoding for "CMSY" ("Computer Modern Symbol"):
114 0 obj
<<
/Type /Font
/Subtype /Type1
/BaseFont 210 0 R % -> "/YLJAAA+CMSY10"
/FirstChar 0
/FontDescriptor 211 0 R
/LastChar 127
/Widths 204 0 R
>>
211 0 obj
<<
/Ascent 750
/CapHeight 683
/CharSet (/bullet/greaterequal/arrowright/arrowdblright/element/negationslash/backslash/radical)
/Descent 0
/Flags 4
/FontBBox [ -29 -960 1116 775 ]
/FontFile 205 0 R
/FontName 210 0 R % -> '/YLJAAA+CMSY10'
/ItalicAngle -14
/StemV 85
/XHeight 430
>>
endobj
In itself, this still says nothing definitive: a PDF producer may reorder glyphs and encodings at will, as long as it does the same with the embedded font. Assuming the font set is not reordered, checking a random list of CMxx encodings shows that the character code 0x15 could well be GREATER-THAN OR EQUAL TO (Unicode U+2265).
Acrobat agrees; inspecting the font in the PDF shows that character code 21 (decimal) is named 'GREATER-THAN OR EQUAL' and looks like it as well.
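If the octal/hex/decimal relation is not obvious, a quick check in Python shows that the three notations name the same character code. The helper name is made up, and it only handles the simple backslash-plus-octal-digits form seen in this PDF string.

```python
def octal_escape_to_code(escape):
    """Convert a PDF string escape such as r"\025" to its numeric character code."""
    if not escape.startswith("\\"):
        raise ValueError("expected a backslash followed by octal digits")
    return int(escape[1:], 8)


code = octal_escape_to_code(r"\025")
print(code, hex(code), chr(0x2265))
```

The last value printed is the Unicode character U+2265 that the glyph appears to correspond to, per the inspection above.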
I have a large text file as below imported in MATLAB:
Run Lat Long Time
1 32 32 34
1 23 22 21
2 23 12 11
2 11 11 11
2 33 11 12
up to 10 runs etc.
So I'm trying to break up each section in the file: section 1, section 2, etc and write it to 10 different text files. File 1 will have data from Run 1. File 2 will have data from Run 2.
What you're looking for is MATLAB's textread function. I'll give you the pieces you need and frame out the logic, but you'll need to connect the pieces yourself :)
Your read would look something like this (the second call skips the header line):
[head1, head2, head3, head4] = textread(file_name,'%s %s %s %s',1);
[run, lat, long, time] = textread(file_name,'%u %u %u %u','headerlines',1);
and your write method would use a loop to iterate over the values in
unique(run)
creating a file, opened for writing, with
fout = fopen([base_file_name_out num2str(run_number)],'w');
and writing to it the values selected with a logical mask, for example
lat_this_run = lat(run==run_number);
using the method (the columns are combined and transposed so that fprintf writes one row per line)
fprintf(fout,'%u %u %u\n', [lat_this_run long_this_run time_this_run]');
Call fclose(fout) once each file has been written.
If your data is already loaded into MATLAB as a numeric matrix named A, you could do:
>> a = max(A(:,1));
>> AA={};
>> for i = 1:a
AA{i}=A(A(:,1)==i,:);
name=sprintf('%d.txt',i);
dlmwrite(name,AA{i},'\t');
end
The output will be one .txt file per run (1.txt, 2.txt, ...), each containing tab-delimited data.
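For comparison, the same split can be sketched outside MATLAB. This Python version assumes the first line is the header and the first column is the run number; the function names and the optional file-name prefix are my own choices.

```python
from collections import defaultdict


def split_by_run(lines):
    """Group the data rows (everything after the header) by their first column."""
    groups = defaultdict(list)
    for line in lines[1:]:  # skip the "Run Lat Long Time" header
        fields = line.split()
        if fields:
            groups[fields[0]].append("\t".join(fields))
    return dict(groups)


def write_runs(groups, prefix=""):
    """Write one tab-delimited file per run, named <prefix><run>.txt."""
    for run, rows in groups.items():
        with open(f"{prefix}{run}.txt", "w", encoding="utf-8") as out:
            out.write("\n".join(rows) + "\n")
```

Reading the input with open(path).read().splitlines() and passing the result to split_by_run, then to write_runs, gives the per-run files.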
gsub('$0\n','') isn't working, and I would prefer a solution along similar lines.
I want to remove every line that contains only 0
(note that 10 and 20 have to keep working: the 0 in them must not be replaced).
If I have:
23
12
0
15
9
0
10
20
0
I want:
23
12
15
9
10
20
You may want to convert this to an array to re-process it, but the same thing can be done with a regular expression:
string.gsub(/^[ \t]*0+[ \t]*$\n?/, '')
The anchors are key: in Ruby, ^ and $ always refer to the beginning and end of a line, not of the whole string (use \A and \z for that), so no /m flag is needed; in Ruby, /m only makes . match newlines. Because the pattern must span a whole line, the 0 in 10 and 20 is left alone, and the trailing \n? removes the line break so that no empty lines are left behind.
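As a cross-check in another language, here is a minimal Python sketch of the same filtering. In Python the re.MULTILINE flag is what makes ^ match at the start of every line; the function name is hypothetical.

```python
import re

# A whole line consisting only of zeros (optionally padded with spaces or tabs),
# together with its line break, or the end of the string for the last line.
ZERO_LINE = re.compile(r"^[ \t]*0+[ \t]*(?:\n|\Z)", re.MULTILINE)


def drop_zero_lines(text):
    """Remove the lines that contain nothing but zeros."""
    return ZERO_LINE.sub("", text)


cleaned = drop_zero_lines("23\n12\n0\n15\n9\n0\n10\n20\n0\n")
```

Lines such as 10 and 20 survive because the pattern has to cover the entire line.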