How to access the prefix when using uniq -c - dash-shell

I encountered a problem in my program. I have a list of files and I sort them with this code to find out the 10 most frequent file types in the list.
find $DIR -type f | file -b $SAVEFILES | cut -c1-40 | sort -n | uniq -c | sort -nr | head -10
My output looks like this:
168 HTML document, ASCII text
114 C source, ASCII text
102 ASCII text
33 ASCII text, with very long lines
30 HTML document, UTF-8 Unicode text, with
26 HTML document, ASCII text, with very lon
21 C source, UTF-8 Unicode text
20 LaTeX document, UTF-8 Unicode text, with
15 SVG Scalable Vector Graphics image
12 LaTeX document, ASCII text, with very lo
What I want to do is access the values before the file types and replace them with #s. I can do that with a for loop, but first I have to somehow access them.
The expected output is something like this:
__HTML document, ASCII text : ################
__C source, ASCII text : ###########
__ASCII text : ##########
__ASCII text, with very long lines : ########
__HTML document, UTF-8 Unicode text, with : #######
__HTML document, ASCII text, with very lon: ####
__C source, UTF-8 Unicode text : ####
__LaTeX document, UTF-8 Unicode text, with: ###
__SVG Scalable Vector Graphics image : #
__LaTeX document, ASCII text, with very lo: #
EDIT: The #s in my example do not represent the exact numbers. The first line should have 168 #s, the second 114 #s, and so on.

Append this:
| while read -r n text; do printf "__%s%$((48-${#text}))s: " "$text"; for ((i=0;i<$n;i++)); do printf "%s" "#"; done; echo; done
Change 48 according to your needs.
Output with your input:
__HTML document, ASCII text : ########################################################################################################################################################################
__C source, ASCII text : ##################################################################################################################
__ASCII text : ######################################################################################################
__ASCII text, with very long lines : #################################
__HTML document, UTF-8 Unicode text, with : ##############################
__HTML document, ASCII text, with very lon : ##########################
__C source, UTF-8 Unicode text : #####################
__LaTeX document, UTF-8 Unicode text, with : ####################
__SVG Scalable Vector Graphics image : ###############
__LaTeX document, ASCII text, with very lo : ############
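For readability, the appended loop can also be written out over several lines. This is a sketch (untested) that pads with a plain %-48s field instead of the arithmetic padding, and uses a POSIX while loop for the repetition, since the C-style for (( ... )) loop is a bashism that dash does not support:
| while read -r n text; do
    # pad the file type to 48 characters so the colons line up
    printf '__%-48s: ' "$text"
    # print one '#' per occurrence counted by uniq -c
    i=0
    while [ "$i" -lt "$n" ]; do
        printf '#'
        i=$((i + 1))
    done
    echo
done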

A shell loop is never the right way to manipulate text; see "Why is using a shell loop to process text considered bad practice?".
You can do what you asked for with this awk command:
$ awk '{printf "%-40s: %s\n", substr($0,9), gensub(/ /,"#","g",sprintf("%*s",$1,""))}' file
HTML document, ASCII text : ########################################################################################################################################################################
C source, ASCII text : ##################################################################################################################
ASCII text : ######################################################################################################
ASCII text, with very long lines : #################################
HTML document, UTF-8 Unicode text, with : ##############################
HTML document, ASCII text, with very lon: ##########################
C source, UTF-8 Unicode text : #####################
LaTeX document, UTF-8 Unicode text, with: ####################
SVG Scalable Vector Graphics image : ###############
LaTeX document, ASCII text, with very lo: ############
but the right way to do this is to get rid of everything from cut on and just do something like this:
find "$DIR" -type f | file -b "$SAVEFILES" |
awk '
    { types[substr($0,1,40)]++ }
    END {
        PROCINFO["sorted_in"] = "@val_num_desc"
        for (type in types) {
            printf "%-*s: %s\n", 40, type, gensub(/ /,"#","g",sprintf("%*s",types[type],""))
            if (++c == 10) {
                break
            }
        }
    }
'
The above uses GNU awk for sorted_in and gensub(), and the second script is untested since you only provided sample input for the last part, printing the "#"s.

The Perl approach. Append:
| perl -lpE 's/\s*(\d+)\s(.*)/sprintf "__%-40s: %s", $2, "#"x$1/e'
Output:
__HTML document, ASCII text : ########################################################################################################################################################################
__C source, ASCII text : ##################################################################################################################
__ASCII text : ######################################################################################################
__ASCII text, with very long lines : #################################
__HTML document, UTF-8 Unicode text, with : ##############################
__HTML document, ASCII text, with very lon: ##########################
__C source, UTF-8 Unicode text : #####################
__LaTeX document, UTF-8 Unicode text, with: ####################
__SVG Scalable Vector Graphics image : ###############
__LaTeX document, ASCII text, with very lo: ############
Following @Ed's approach, just using Perl:
find "$DIR" -type f | file -b "$SAVEFILES" |\
perl -lnE '$s{substr$_,0,40}++;}{printf"__%-40s: %s\n",$_,"#"x$s{$_}for(splice#{[sort{$s{$b}<=>$s{$a}}keys%s]},0,9)'
Readable version:
perl -lnE '
    $seen{ substr $_,0,40 }++;
    END {
        printf "__%-40s: %s\n", $_, "#" x $seen{$_}
            for( splice @{[sort { $seen{$b} <=> $seen{$a} } keys %seen]}, 0, 9 )
    }'
PS: Note that the file utility will only test the files listed in $SAVEFILES, so the find ... | file -b $SAVEFILES part is pointless; file ignores its standard input when it is given file name arguments.
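If the intent is for file to classify the files that find produces, one way to wire that up (a sketch, untested) is to let find pass the names to file directly and keep the rest of the pipeline, with a plain sort before uniq -c:
find "$DIR" -type f -exec file -b {} + |
cut -c1-40 | sort | uniq -c | sort -nr | head -10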

Related

Print double quotes in Forth

The word ." prints a string. More precisely it compiles the (.") and the string up to the next " in the currently compiled word.
But how can I print
That's the "question".
with Forth?
In a Forth-2012 system (e.g. Gforth) you can use string literals with escaping via the word s\" as:
: foo ( -- ) s\" That's the \"question\"." type ;
In a Forth-94 system (the majority of standard systems) you can use arbitrary parsing and the word sliteral as:
: foo ( -- ) [ char | parse That's the "question".| ] sliteral type ;
A string can also be extracted up to the end of the line (without a printable delimiter); a multi-line string can be extracted too.
Specific helpers for particular cases can be easily defined.
For example, see the word s$ for string literals that are delimited by any arbitrary printable character, e.g.:
s$ `"test" 'passed'` type
Old school:
34 emit
Output:
"
Using gforth:
: d 34 emit ;
cr ." That's the " d ." question" d ." ." cr
Output:
That's the "question".

R-markdown latex font encoding failing

Edit: Problem solved. It turns out I had some missing packages, and rmarkdown doesn't give a very good error message for this. If anyone else has this problem, you have to use the "keep_tex" option in the YAML. You can then see in the .tex file all of the required TeX packages.
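For reference, the front matter then looks something like this (a sketch; only the output block changes from the default shown further down):
---
title: "Untitled"
output:
  pdf_document:
    keep_tex: true
---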
I'm on Windows 7. I have the latest MiKTeX, R, rmarkdown, and RStudio.
When I try to produce a new rmarkdown document and output to pdf, it fails.
It seems to be a font encoding which is missing. I've tried searching the internet for omxenc.dfu and uenc.dfu files, but it turned up nothing. I was going to post this on the LaTeX exchange, but I can produce LaTeX documents fine without using rmarkdown.
Any help would be great!
Here's the R Markdown document:
---
title: "Untitled"
author: "Triceraflops"
date: "19 März 2018"
output: pdf_document
---
## R Markdown
This is an R Markdown document. Markdown is a simple formatting syntax for
authoring HTML, PDF, and MS Word documents. For more details on using R
Markdown see <http://rmarkdown.rstudio.com>.
And here's the output of the R Markdown console
"C:/Users/triceraflops/Documents/RStudio/bin/pandoc/pandoc" +RTS -K512m -RTS missing-encoding.utf8.md --to latex --from markdown+autolink_bare_uris+ascii_identifiers+tex_math_single_backslash --output missing-encoding.tex --template "C:\Users\triceraflops\Documents\R\win-library\3.4\rmarkdown\rmd\latex\default-1.17.0.2.tex" --highlight-style tango --latex-engine pdflatex --variable graphics=yes --variable "geometry:margin=1in"
output file: missing-encoding.knit.md
Output created: missing-encoding.pdf
Error in tools::file_path_as_absolute(output_file) :
file 'missing-encoding.pdf' does not exist
Calls: <Anonymous> -> <Anonymous>
In addition: Warning messages:
1: running command '"pdflatex" -halt-on-error -interaction=batchmode
"missing-encoding.tex"' had status 1
2: In readLines(logfile) :
incomplete final line found on 'missing-encoding.log'
Execution halted
And finally here's the last part of the output log file where it fails.
Now handling font encoding OMX ...
... no UTF-8 mapping file for font encoding OMX
Now handling font encoding U ...
... no UTF-8 mapping file for font encoding U
defining Unicode char U+00A9 (decimal 169)
defining Unicode char U+00AA (decimal 170)
defining Unicode char U+00AE (decimal 174)
defining Unicode char U+00BA (decimal 186)
defining Unicode char U+02C6 (decimal 710)
defining Unicode char U+02DC (decimal 732)
defining Unicode char U+200C (decimal 8204)
defining Unicode char U+2026 (decimal 8230)
defining Unicode char U+2122 (decimal 8482)
defining Unicode char U+2423 (decimal 9251)

Find all accented words (diacriticals) using grep?

I have a large list of words in a text file (one word per line). Some words have accented characters (diacriticals). How can I use grep to display only the lines that contain accented characters?
The best solution I have found, for a larger class of characters ("What words are not pure ASCII?"), is using PCRE with the -P option:
grep -P "[\x7f-\xff]" filename
This will find UTF-8 and ISO-8859-1(5) (Latin1, win1252, cp850) accented characters alike.
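A quick sanity check (a sketch with made-up sample lines; the exact -P behaviour can vary between grep builds and locales):
$ printf 'plain ascii\ncafé au lait\n' | grep -P "[\x7f-\xff]"
café au lait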
I have a solution. First strip the accents using iconv, then diff against the original file to find the lines that changed:
cat text-file | iconv -f utf8 -t ascii//TRANSLIT > noaccents-file
diff text-file noaccents-file | grep '<'
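If the leading "< " markers that diff adds are unwanted, they can be stripped off (a sketch):
diff text-file noaccents-file | grep '^<' | sed 's/^< //'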

How to know the charset of a character?

A Python script fails when trying to encode a supposedly UTF-8 string as ISO-8859-1:
>>> 'à'.encode('iso-8859-1')
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0300' in position 1: ordinal not in range(256)
How do I know which charset that character belongs to? When I encode it in UTF-8:
>>> 'à'.encode('utf-8')
b'a\xcc\x80'
a then \xcc\x80. I can find \xcc\x80 in the UTF-8 table at http://www.utf8-chartable.de/unicode-utf8-table.pl?start=768&names=-&utf8=string-literal.
But is it UTF-8? And if it is UTF-8, why can't 'à' be encoded as ISO-8859-1?
It's a bit unclear where the 'à' character comes from. In fact, it's a combining character sequence and you need to normalize it. The following Python script uses the unicodedata module to explain and answer your questions:
import sys, platform
print (sys.stdout.encoding, platform.python_version())
print ()
import unicodedata
agraveChar='à' # copied from your post
agraveDeco='à' # typed as Alt+0224 (Windows, us keyboard)
# print Unicode names
print ('agraveChar', agraveChar, agraveChar.encode('utf-8'))
for ins in range( 0, len(agraveChar)):
    print ( agraveChar[ins], unicodedata.name(agraveChar[ins], '???'))
print ('agraveDeco', agraveDeco, agraveDeco.encode('utf-8'))
for ins in range( 0, len(agraveDeco)):
    print ( agraveDeco[ins], unicodedata.name(agraveDeco[ins], '???'))
print ('decomposition(agraveChar)', unicodedata.decomposition(agraveChar))
print ('\nagraveDeco normalized:\n')
print ("NFC to utf-8", unicodedata.normalize("NFC" , agraveDeco).encode('utf-8'))
print ("NFC to latin", unicodedata.normalize("NFC" , agraveDeco).encode('iso-8859-1'))
print ("NFKC to utf-8", unicodedata.normalize("NFKC", agraveDeco).encode('utf-8'))
print ("NFKC to latin", unicodedata.normalize("NFKC", agraveDeco).encode('iso-8859-1'))
Output:
==> D:\test\Python\40422359.py
UTF-8 3.5.1
agraveChar à b'\xc3\xa0'
à LATIN SMALL LETTER A WITH GRAVE
agraveDeco à b'a\xcc\x80'
a LATIN SMALL LETTER A
̀ COMBINING GRAVE ACCENT
decomposition(agraveChar) 0061 0300
agraveDeco normalized:
NFC to utf-8 b'\xc3\xa0'
NFC to latin b'\xe0'
NFKC to utf-8 b'\xc3\xa0'
NFKC to latin b'\xe0'
==>

What character encoding uses 2 underscores and a letter?

I'm currently parsing what looks to be a proprietary file format from a third-party commercial application. They seem to use a funny character encoding system and I need some help determining what it is, assuming it's not a proprietary encoding system as well.
I don't have a whole lot of different characters to analyze, but here is what I have so far:
__b -> blank space
__f -> forward slash
So, for example, "Hello World" becomes "Hello__bWorld".
Does anybody have any idea what this is?
If not, do you know of a resource on the web that can help me? Maybe there is a tool out there that can help in identifying character encodings?
It seems to be a proprietary encoding used by Numara FootPrints. This list of mappings comes from the FootPrints User Group forum. There is also a Perl script for decoding it.
Code Character
__b (space)
__a ' (single quote)
__q " (double quote)
__t ` (backquote)
__m @ (at-sign)
__d . (period)
__u - (hyphen-minus)
__s ;
__c :
__p )
__P (
__3 #
__4 $
__5 %
__6 ^
__7 &
__8 *
__0 ~ (tilde)
__f / (slash)
__F \ (backslash)
__Q ?
__e ]
__E [
__g >
__G <
__B !
__W {
__w }
__C =
__A +
__I | (vertical line)
__M , (comma)
__Ux_ Unicode character with value 'x'
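Going by the table above, a rough decoder can be approximated with a chain of sed substitutions; this sketch covers only a handful of the codes and ignores the __Ux_ Unicode form, so the Perl script mentioned above remains the complete reference:
sed -e 's/__b/ /g' \
    -e 's|__f|/|g' \
    -e 's/__d/./g' \
    -e 's/__u/-/g' \
    -e 's/__M/,/g'
For example, printf '%s\n' 'Hello__bWorld__d' | sed -e 's/__b/ /g' -e 's/__d/./g' prints "Hello World.".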
