sha1sum in wikipedia dump - sha1

Does anyone know how the sha1 sum in Wikipedia dumps is built? I just found: "These contain information like the sha1 sum of each revision text..."
(http://meta.wikimedia.org/wiki/Data_dumps/Dump_format)
But when I try to calculate the sum of any revision text, I never get the same value, so I thought maybe something else influences it. I took all the text between the <text> tags.
Thanks

The sha1 sum is computed over just the revision text between the <text></text> tags, and the resulting hex digest is converted to a base-36 number. Thanks to MaxSem!
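For anyone trying to reproduce the value, here is a minimal Python sketch of that scheme (the UTF-8 encoding and the 31-character zero-padding are my assumptions, based on MediaWiki's rev_sha1 field; verify against a dump):

import hashlib

def wiki_sha1(revision_text):
    # SHA-1 of the raw revision text, then the hex digest re-encoded in base 36.
    digest = int(hashlib.sha1(revision_text.encode("utf-8")).hexdigest(), 16)
    alphabet = "0123456789abcdefghijklmnopqrstuvwxyz"
    out = ""
    while digest:
        digest, rem = divmod(digest, 36)
        out = alphabet[rem] + out
    return out.rjust(31, "0")  # assumed zero-padding to 31 characters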

Related

Finding characters in the middle of a string that includes "

I have the following extract of code. My aim is to extract the value 7.4e-07 after the symbol DAN. My usual go-to formula (using the MID & FIND formulas) can't work here because DAN is surrounded by ", which confuses the formula.
{"data":{"log":{"address":[{"balances":[{"currency":{"address":"example1","symbol":"ROB"},"value":0.0},{"currency":{"address":"example2","symbol":"DAN"},"value":7.4e-07},{"currency":{"address":"example3","symbol":"COLIN"},"value":0.0},{"currency":{"address":"example4","symbol":"BOB"},"value":0.0},{"currency":{"address":"example5","symbol":"PAUL"},"value":13426.64}}}
I will always need to find the number shown in the 'value' after DAN. However, all the surrounding data will change, so it cannot be used in the search formula.
Any help would be appreciated.
Extracting the value you want can be achieved using REGEXEXTRACT, SPLIT and INDEX. Here is the formula; accept it if it helps :)
=index(split(REGEXEXTRACT(A1,"\""DAN\""},\""value\"":[\d.a-zA-Z-]+"),":"),0,2)
This is the regex I used to extract the value including the beginning text
"DAN"},"value":[\d.a-zA-Z-]+
This is the outcome from the regex: "DAN"},"value":7.4e-07; the SPLIT on ":" and INDEX then leave just 7.4e-07.
You could try an arrayformula to work down the sheet, extracting all values after 'DAN':
=arrayformula(regexreplace(A1:A,".*(DAN...........)([\w\.\-]*)(\}.*)","$2"))
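For comparison, here is the same extraction outside Sheets as a small Python sketch (the raw variable and the character class in the regex are illustrative; the cell text is assumed to be available as a plain string):

import re

raw = '...,"symbol":"DAN"},"value":7.4e-07},...'  # sample cell contents
match = re.search(r'"DAN"},"value":([\d.eE+-]+)', raw)
if match:
    print(match.group(1))  # -> 7.4e-07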

ASCII Representation of Hexadecimal

I have a string that, by running each character through string.format("%02X", char), I've converted to the following:
74657874000000EDD37001000300
In the end, I'd like that string to look like the following:
t e x t NUL NUL NUL í Ó p SOH NUL ETX NUL (spaces are there just for clarification of characters desired in example).
I've tried to use \x..(hex#), string.char(0x..(hex#)) (where (hex#) is alphanumeric representation of my desired character) and I am still having issues with getting the result I'm looking for. After reading another thread about this topic: what is the way to represent a unichar in lua and the links provided in the answers, I am not fully understanding what I need to do in my final code that is acceptable for this to work.
I'm looking for some help in better understanding an approach that would let me achieve my desired result.
ETA:
Well I thought that I had fixed it with the following code:
function hexToAscii(input)
    local convString = ""
    for char in input:gmatch("(..)") do
        convString = convString..(string.char("0x"..char))
    end
    return convString
end
It appeared to work, but I didn't think about characters above 127. Rookie mistake. Now I'm unsure how to get the additional characters up to 255 to display their values.
I did the following to check since I couldn't truly "see" them in the file.
function asciiSub(input)
    input = input:gsub(string.char(0x00), "<NUL>") -- suggested by a coworker
    print(input)
end
I did a few gsub substitutions for other characters, and my file comes back with the replacement strings. But when I ran into characters in the extended ASCII table, the approach fell apart.
Can anyone assist me in understanding a fix or new approach to this problem? As I've stated before, I read other topics on this and am still confused as to the best approach towards this issue.
The simple way to transform a base16-encoded string is just to
function unhex( input )
    return (input:gsub( "..", function(c)
        return string.char( tonumber( c, 16 ) )
    end))
end
This is basically what you have, just a bit cleaner. (There's no need to say "(..)", ".." is enough – if you specify no captures, you'll automatically get the whole match. And while it might work if you write string.char( "0x"..c ), it's just evil – you concatenate lots of strings and then trigger the automatic conversion to numbers. Much better to just specify the base when explicitly converting.)
The resulting string should be exactly what went into the hex-dumper, no matter the encoding.
If you cannot correctly display the result, your viewer will also be unable to display the original input. If you used different viewers for the original input and the resulting output (e.g. a text editor and a terminal), try writing the output to a file instead and looking at it with the same viewer you used for the original input, then the two should be exactly the same.
Getting viewers that assume different encodings (e.g. one of the "old" 8-bit code pages or one of the many versions of Unicode) to display the same thing will require conversion between different formats, which tends to be quite complicated or even impossible. As you did not mention what encodings are involved (nor any other information like OS or programs used that might hint at the likely encodings), this could be just about anything, so it's impossible to say anything more specific on that.
You actually have a couple of problems:
First, make sure you know the meaning of the term character encoding, and that you know the difference between characters and bytes. A popular post on the topic is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Then, what encoding was used for the bytes you just received? You need to know this, otherwise you don't know what byte 234 means. For example it could be ISO-8859-1, in which case it is U+00EA, the character ê.
The characters 0 to 31 are control characters (eg. 0 is NUL). Use a lookup table for these.
Then, displaying the characters on the terminal is the hard part. There is no platform-independent way to display ê on the terminal. It may well be impossible with the standard print function. If you can't figure this step out you can search for a question dealing specifically with how to print Unicode text from Lua.
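To make the lookup-table advice concrete, here is a small sketch (in Python for brevity; the same table carries over to Lua one-to-one), which decodes the hex dump from the question assuming an ISO-8859-1 / Latin-1 viewer:

# Names of the 32 ASCII control characters, indexed by byte value.
CONTROL_NAMES = [
    "NUL", "SOH", "STX", "ETX", "EOT", "ENQ", "ACK", "BEL",
    "BS", "HT", "LF", "VT", "FF", "CR", "SO", "SI",
    "DLE", "DC1", "DC2", "DC3", "DC4", "NAK", "SYN", "ETB",
    "CAN", "EM", "SUB", "ESC", "FS", "GS", "RS", "US",
]

def describe(hex_string, encoding="latin-1"):  # the encoding is an assumption
    parts = []
    for b in bytes.fromhex(hex_string):
        if b < 32:
            parts.append(CONTROL_NAMES[b])          # control byte -> name
        elif b == 127:
            parts.append("DEL")
        else:
            parts.append(bytes([b]).decode(encoding))
    return " ".join(parts)

print(describe("74657874000000EDD37001000300"))
# -> t e x t NUL NUL NUL í Ó p SOH NUL ETX NUL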

Does the DSSP class of Biopython give the relative solvent accessibility value of amino acids?

I would like to obtain the relative solvent accessibilities of the amino acids in a protein, currently using the DSSP module of Biopython. I am not sure whether the output already contains RSA (relative solvent accessibility) or whether it needs to be calculated. Any help would be appreciated.
Thanks.
Short answer: yes, it does. It outputs the accessibility normalised per residue type.
https://github.com/biopython/biopython/blob/master/Bio/PDB/DSSP.py line 44
It normalises following the instructions of Sander & Rost (1994), Proteins, 20:216-226. The question is whether that is enough for you; it might not be.
Beware that the Biopython code does not allow RSA > 1, which makes sense, but I'd rather raise a warning than silently cap the number.
There are other ways of normalising it as well.
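For reference, a minimal Biopython sketch (the PDB file name is hypothetical, the external dssp/mkdssp program must be installed, and the RSA being the fourth element of each DSSP tuple is based on my reading of Bio.PDB.DSSP; double-check against the docs):

from Bio.PDB import PDBParser
from Bio.PDB.DSSP import DSSP

parser = PDBParser(QUIET=True)
structure = parser.get_structure("model", "protein.pdb")  # hypothetical file
model = structure[0]
dssp = DSSP(model, "protein.pdb")  # runs the external dssp/mkdssp binary

for key in list(dssp.keys())[:5]:
    aa, rsa = dssp[key][1], dssp[key][3]  # residue type and relative accessibility
    print(key, aa, rsa)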

How to print % symbol in receipt from raw text in ESC POS?

Please, I need help. What is the right command in ESC POS so I can print the % symbol in my receipt? I used raw text for my receipt, but when I write the % symbol, for example, Standard Tax 6%, instead of printing Tax 6%, it printed Tax 60... Please help me.
OK, I figured it out. I'm using printf in a Linux terminal to type the command and the text for my receipt. Since printf treats % as the start of a format specifier, I just write \x25 instead, where 25 is the hexadecimal code for the % symbol in the ASCII table. A simple thing, but it took me some time to figure out, as I'm a beginner and not so good at English. I hope this will help someone out there who ends up in the same situation writing raw text and commands in ESC POS.
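If you end up scripting the receipt rather than typing printf by hand, writing the raw bytes directly also side-steps the format-specifier problem; a rough Python sketch of that alternative, where /dev/usb/lp0 is an assumed device path:

ESC_INIT = b"\x1b\x40"   # ESC @ : initialise the printer
CUT = b"\x1d\x56\x00"    # GS V 0 : full paper cut (optional)

with open("/dev/usb/lp0", "wb") as printer:  # assumed device path
    printer.write(ESC_INIT)
    printer.write(b"Standard Tax 6%\n")      # '%' needs no escaping in raw bytes
    printer.write(CUT)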

which is faster to find a random string: random line order or sorted?

We want to find a random string, e.g. "ASDF555". We have a very BIG file of unique lines, one of which contains this string. Which is faster (in time, with an easy grep command) for finding the mentioned string, if the "BIG file" is:
sorted
or random?
Of course, the ASDF555 could be anything!
We think it may be faster to have the lines in random order, since the string could be random too, but we cannot prove this idea.
grep does not "know" your file is sorted, so it goes over it line by line either way; the fact that it is sorted is therefore inconsequential. To put it another way, a sorted file cannot hurt your search speed: you can still go over it line by line until you find the desired string.
However, if the file is indeed sorted, you may implement a better searching algorithm (e.g., binary searching) instead of using grep.
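If the file really is sorted, a seek-based binary search only touches O(log n) lines instead of scanning the whole file. A rough Python sketch, assuming one record per line and byte-wise (LC_ALL=C) sort order; the file and function names are illustrative:

def sorted_file_contains(path, target):
    # Binary search over byte offsets of a sorted text file opened in binary mode.
    with open(path, "rb") as f:
        size = f.seek(0, 2)                    # file size in bytes
        lo, hi = 0, size
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid)
            if mid > 0:
                f.readline()                   # skip the (possibly partial) current line
            line = f.readline().rstrip(b"\r\n")
            if line and line < target:
                lo = mid + 1                   # target can only start after this point
            else:
                hi = mid                       # this line (or EOF) is >= target
        # lo is now the lowest offset whose following full line is >= target
        f.seek(lo)
        if lo > 0:
            f.readline()
        return f.readline().rstrip(b"\r\n") == target

print(sorted_file_contains("big_sorted.txt", b"ASDF555"))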
