How can I encode/decode Shift-JIS in Elixir?

Given text in shift-jis encoding, how can I decode it into Elixir's native UTF-8 encoding, and vice-versa?

The Codepagex library supports this. You just need to figure out what it calls SHIFT_JIS.
Codepagex uses the mappings available from unicode.org. There is one for Shift-JIS, but it's marked as OBSOLETE, so it is not available in Codepagex. However, Microsoft's CP932 mapping is available, and it is effectively Shift-JIS, so you can use that.
Config
It's not enabled by default, so you need to enable it in config (and re-compile with mix deps.compile codepagex --force if necessary):
config :codepagex, :encodings, [
  "VENDORS/MICSFT/WINDOWS/CP932"
]
Encode/Decode
iex(1)> shift_jis = "VENDORS/MICSFT/WINDOWS/CP932"
"VENDORS/MICSFT/WINDOWS/CP932"
iex(2)> test = Codepagex.from_string!("テスト", shift_jis)
<<131, 101, 131, 88, 131, 103>>
iex(3)> Codepagex.to_string!(test, shift_jis)
"テスト"
Example repo
I made an example repo where you can see it in action.

Related

How to detect mime type of file in Rails in 2022?

Is there a built in way in Ruby/Rails to detect the mime type of a file given its URL, and not relying on its extension?
For example, let's say there is an image file at an external URL that I want to get the mime type of: https://www.example.com/images/file
The file does not have an extension, but let's assume it is a jpeg.
How can I verify this/get the file's mime type in Rails? Would ideally love a way to do this with built in functionality and not have to rely on a third party gem.
I've looked over this question.
I don't think it's worth avoiding a third-party gem for this. The problem space is well documented and stable, and most of the libraries are too.
But if you must, it can be done without an external gem. Especially if you're going to constrain yourself to a small subset of file types to "whitelist". The "magic number" pattern for most image files is pretty straightforward once you get the file on your disk:
image = File.new("filename.jpg","r")
irb(main):006:0> image.read(10)
=> "\xFF\xD8\xFF\xE0\x00\x10JFIF"
Marcel, which is linked from the question you referenced, can at the very least serve as a great reference for the magic number sequences you'll need:
MAGIC = [
  ['image/jpeg', [[0, "\377\330\377"]]],
  ['image/png', [[0, "\211PNG\r\n\032\n"]]],
  ['image/gif', [[0, 'GIF87a'], [0, 'GIF89a']]],
  ['image/tiff', [[0, "MM\000*"], [0, "II*\000"], [0, "MM\000+"]]],
  ['image/bmp', [[0, 'BM', [[26, "\001\000", [[28, "\000\000"], [28, "\001\000"], [28, "\004\000"], [28, "\b\000"], [28, "\020\000"], [28, "\030\000"], [28, " \000"]]]]]]],
  # .....
]

Bazel - depend on generated outputs

I have a yaml file in a Bazel monorepo that has constants I'd like to use in several languages. This is kind of like how protobufs are created and used.
How can I parse this yaml file at build time and depend on the outputs?
For instance:
item1: "hello"
item2: "world"
nested:
  nested1: "I'm nested"
  nested2: "I'm also nested"
I then need to parse this yaml file so it can be used in many different languages (e.g., Rust, TypeScript, Python, etc.). For instance, here's the desired output for TypeScript:
export default {
  item1: "hello",
  item2: "world",
  nested: {
    nested1: "I'm nested",
    nested2: "I'm also nested",
  }
}
Notice, I don't want TypeScript code that reads the yaml file and converts it into an object. That conversion should be done in the build process.
For the actual conversion, I'm thinking of writing that in Python, but it doesn't need to be. This would then mean the Python also needs to run at build time.
P.S. I care mostly about the functionality, so I'm flexible about exactly how it's done. I'm even fine using another file format aside from yaml.
Thanks to help from @oakad, I was able to figure it out. Essentially, you can use genrule to create outputs.
So, assuming you have some target (e.g. a Python binary) named parse_config set up to generate the output, you can just do this:
genrule(
    name = "generated_output",
    srcs = [],
    outs = ["output.txt"],
    cmd = "./$(execpath :parse_config) > $@",
    exec_tools = [":parse_config"],
    visibility = ["//visibility:public"],
)
The generated file is output.txt. And you can access it via //lib/config:generated_output.
Note that the cmd essentially redirects stdout into the generated file's contents. In Python, that means anything printed will appear in the generated file.
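For reference, here is a minimal sketch (my own illustration, not part of the original answer) of what such a parse_config tool could look like in Python, assuming PyYAML is available to the binary and the yaml file is passed as the first argument (the genrule above would then list the yaml file in srcs and pass it via $(location ...) in cmd):
import json
import sys

import yaml  # assumption: PyYAML is a dependency of the parse_config binary

def main():
    # Path to the yaml file, passed by the genrule, e.g. via $(location config.yaml)
    with open(sys.argv[1]) as f:
        config = yaml.safe_load(f)
    # For plain strings, numbers and nested mappings, a JSON dump is also a valid
    # TypeScript object literal, so printing it behind "export default" produces
    # the desired .ts content on stdout, which the genrule redirects into $@.
    print("export default " + json.dumps(config, indent=2) + ";")

if __name__ == "__main__":
    main()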

How to get the PDB id of a mystery sequence?

I have a bunch of proteins, from something called proteinnet.
Now the sequences there have some sort of ID, but it is clearly not a PDB id, so I need to find that in some other way. For each protein I have the amino acid sequence. I'm using biopython, but I'm not very experienced in it yet and couldn't find this in the guide.
So my question is: how do I find a protein's PDB id, given that I have the amino acid sequence of the protein? (Such that I can download the PDB file for the protein.)
Hi, I was playing a while ago with the RCSB PDB search API and ended up with this piece of code (I can't find the examples on the RCSB PDB website anymore):
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Sun Dec 27 16:20:43 2020
@author: Pietro
"""
import PDB_searchAPI_5
from PDB_searchAPI_5.rest import ApiException
import json
#"value":"STEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMSCKCVLS"
# Defining the host is optional and defaults to https://search.rcsb.org/rcsbsearch/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = PDB_searchAPI_5.Configuration(
    host = "http://search.rcsb.org/rcsbsearch/v1"
)
data_entry_1 = '''{
  "query": {
    "type": "terminal",
    "service": "sequence",
    "parameters": {
      "evalue_cutoff": 1,
      "identity_cutoff": 0.9,
      "target": "pdb_protein_sequence",
      "value": "STEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMSCKCVLS"
    }
  },
  "request_options": {
    "scoring_strategy": "sequence"
  },
  "return_type": "entry"
}'''
# Enter a context with an instance of the API client
with PDB_searchAPI_5.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = PDB_searchAPI_5.SearchServiceApi(api_client)
    try:
        # Submit the JSON query to the search service
        pippo = api_instance.run_json_queries_get(data_entry_1)
    except ApiException as e:
        print("Exception when calling SearchServiceApi->run_json_queries_get: %s\n" % e)
        exit()
    print(type(pippo))
    print(dir(pippo))
    pippox = pippo.__dict__
    print('\n bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb \n', pippox)
    print('\n\n ********************************* \n\n')
    print(type(pippox))
    pippoy = pippo.result_set
    print(type(pippoy))
    for i in pippoy:
        print('\n', i, '\n', type(i))
    print('\n LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL\n')
    for i in pippoy:
        for key in i:
            print('\n', i['identifier'], ' score : ', i['score'])
The search module (import PDB_searchAPI_5) was generated with openapi-generator-cli-4.3.1.jar.
The OpenAPI specs were at version 1.7.3; they are now at 1.7.15, see https://search.rcsb.org/openapi.json.
The data_entry_1 bit was copied from the RCSB PDB website, but I can't find it there anymore. It said something about MMseqs2 being the software doing the search. I played with the
"evalue_cutoff": 1,
"identity_cutoff": 0.9,
parameters, but didn't find a way to select only 100% identity matches.
Here is the PDB_searchAPI_5 package; install it in a virtual environment with:
pip install PDB-searchAPI-5-1.0.0.tar.gz
It was generated by openapi-generator-cli-4.3.1.jar with:
java -jar openapi-generator-cli-4.3.1.jar generate -g python -i pdb-search-api-openapi.json --additionalproperties=generateSourceCodeOnly=True,packageName=PDB_searchAPI_5
Don't put spaces in the --additionalproperties part (it took me a week to figure that out).
The README.md file is the most important part, as it explains how to use the OpenAPI client.
You need your FASTA sequence here:
"value":"STEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMSCKCVLS"
A score of 1 should be an exact match.
The Biopython BLAST module is probably easier, but it searches the NIH database instead of RCSB PDB. Sorry, I can't elaborate more on this; I still need to figure out what a JSON file is, and I wasn't able to find a better free tool that automatically generates a nicer OpenAPI Python client (I believe that's not an easy task... but we always want more...).
To get the API documentation, try:
java -jar openapi-generator-cli-4.3.1.jar generate -g html -i https://search.rcsb.org/openapi.json --skip-validate-spec
You get an HTML document, or for PDF use https://mrin9.github.io/RapiPdf/.
http://search.rcsb.org/openapi.json works as well as https://search.rcsb.org/openapi.json, so you can look at the exchanges between the client and server with Wireshark.
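If the generated client feels like more machinery than you need, the same JSON query can also be sent directly with requests. This is only a sketch under my own assumptions (a /query path on the v1 search host configured above; check the current openapi.json for the exact endpoint and version), not something from the RCSB examples:
import requests

query = {
    "query": {
        "type": "terminal",
        "service": "sequence",
        "parameters": {
            "evalue_cutoff": 1,
            "identity_cutoff": 0.9,
            "target": "pdb_protein_sequence",
            # your FASTA sequence goes here, as in data_entry_1 above
            "value": "STEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMSCKCVLS",
        },
    },
    "request_options": {"scoring_strategy": "sequence"},
    "return_type": "entry",
}

# Assumption: the search service accepts the query document as a POST body on /query.
response = requests.post("https://search.rcsb.org/rcsbsearch/v1/query", json=query)
response.raise_for_status()
for hit in response.json()["result_set"]:
    print(hit["identifier"], "score:", hit["score"])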

Read into Dask from Minio raises issue with reading / converting binary string JSON into utf8

I'm trying to read JSON-LD into Dask from Minio. The pipeline works, but the strings come from Minio as binary strings.
So
with oss.open('gleaner/summoned/repo/file.jsonld', 'rb') as f:
    print(f.read())
results in
b'\n{\n "@context": "http://schema.org/",\n "@type": "Dataset",\n ...
I can simply convert this with
with oss.open('gleaner/summoned/repo/file.jsonld', 'rb') as f:
    print(f.read().decode("utf-8"))
and now everything is as I expect it.
However, I am working with Dask, and when reading into a bag with
dgraphs = db.read_text('s3://bucket/prefa/prefb/*.jsonld',
                       storage_options={
                           "key": key,
                           "secret": secret,
                           "client_kwargs": {"endpoint_url": "https://example.org"}
                       }).map(json.loads)
I cannot get the content coming from Minio to become strings rather than binary strings. I suspect I need these converted before they hit the json.loads map.
I assume I can inject the "decode" in here somehow as well, but I can't work out how.
Thanks
As the name implies, read_text opens the remote file in text mode, equivalent to open(..., 'rt'). The signature of read_text includes the various decoding arguments, such as UTF-8 as the default encoding. You should not need to do anything else, but please post a specific error if you are having trouble, ideally with example file contents.
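For instance, here is a minimal sketch of the question's call with the decoding arguments written out explicitly (they already default to these values, so the behaviour should be the same); key, secret and the endpoint URL are the placeholders from the question, and .map(json.loads) still assumes each line is a complete JSON document:
import json
import dask.bag as db

dgraphs = db.read_text('s3://bucket/prefa/prefb/*.jsonld',
                       encoding='utf-8',   # default: lines arrive as str, not bytes
                       errors='strict',    # default decoding error handling
                       storage_options={
                           "key": key,
                           "secret": secret,
                           "client_kwargs": {"endpoint_url": "https://example.org"},
                       }).map(json.loads)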
If your data isn't delimited by lines, read_text might not be right for you, and you can do something like
import dask
import json

@dask.delayed
def read_a_file(fn):
    # or preferably open in text mode and json.load from the file
    with oss.open(fn, 'rb') as f:  # oss is the filesystem object from the question
        return json.loads(f.read().decode("utf-8"))

output = [read_a_file(f) for f in filenames]
and then you can create a bag or dataframe from this, as required.
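As a usage sketch for that last step (my addition, not from the answer above): dask.bag.from_delayed treats each delayed value as one partition, i.e. a list of records, so it's worth having read_a_file return a one-element list ([json.loads(...)]) before wrapping the outputs into a bag:
import dask.bag as db

# output is the list of delayed values built above; each delayed should
# yield a list of records, which becomes one bag partition.
bag = db.from_delayed(output)
print(bag.take(1))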

Which compression types support chunking in dask?

When processing a large single file, it can be broken up as so:
import dask.bag as db
my_file = db.read_text('filename', blocksize=int(1e7))
This works great, but the files I'm working with have a high level of redundancy and so we keep them compressed. Passing in compressed gzip files gives an error that seeking in gzip isn't supported and so it can't be read in blocks.
The documentation here http://dask.pydata.org/en/latest/bytes.html#compression suggests that some formats support random access.
The relevant internal code I think is here:
https://github.com/dask/dask/blob/master/dask/bytes/compression.py#L47
It looks like lzma might support it, but it's been commented out.
Adding lzma into the seekable_files dict like in the commented out code:
from dask.bytes.compression import seekable_files
import lzmaffi
seekable_files['xz'] = lzmaffi.LZMAFile
data = db.read_text('myfile.jsonl.lzma', blocksize=int(1e7), compression='xz')
Throws the following error:
Traceback (most recent call last):
  File "example.py", line 8, in <module>
    data = bag.read_text('myfile.jsonl.lzma', blocksize=int(1e7), compression='xz')
  File "condadir/lib/python3.5/site-packages/dask/bag/text.py", line 80, in read_text
    **(storage_options or {}))
  File "condadir/lib/python3.5/site-packages/dask/bytes/core.py", line 162, in read_bytes
    size = fs.logical_size(path, compression)
  File "condadir/lib/python3.5/site-packages/dask/bytes/core.py", line 500, in logical_size
    g.seek(0, 2)
io.UnsupportedOperation: seek
I assume that the functions at the bottom of that file (get_xz_blocks, for example) can be used for this, but they don't seem to be in use anywhere in the dask project.
Are there compression libraries that do support this seeking and chunking? If so, how can they be added?
Yes, you are right that the xz format can be useful to you. The confusion is that the file may be block-formatted, but the standard implementation lzmaffi.LZMAFile (or lzma) does not make use of this blocking. Note that block-formatting is optional for xz files and has to be requested explicitly, e.g. with --block-size=size in xz-utils.
The function compression.get_xz_blocks will give you the set of blocks in a file by reading the header only, rather than the whole file, and you could use this in combination with delayed, essentially repeating some of the logic in read_text. We have not put in the time to make this seamless; the same pattern could be used to write blocked xz files too.
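A practical workaround (my suggestion, not something stated in the answer above): if you control how the data is written, splitting it into many smaller compressed files avoids the seeking problem entirely, since with no blocksize each file is decompressed whole and becomes its own partition:
import json
import dask.bag as db

# Each compressed file is read and decompressed in full, so no seeking inside
# the compressed stream is needed; one file = one partition.
bag = db.read_text('data/part-*.jsonl.xz', compression='xz').map(json.loads)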
