Listing Chinese filenames in a directory with Python - character-encoding

I am trying to list the names and sizes of all files in a directory, but I get an error when there are files with Chinese characters in their names. I am using Python 2.7 on Windows 7.
This is my code:
import os
path = '\\'
listing = os.listdir(path)
for infile in listing:
    if infile.endswith(".csv"):
        print infile + ";" + str(os.path.getsize(path + infile))
This is the error I get:
Traceback (most recent call last):
File "file_size.py", line 8, in <module>
print infile + ";"+ str(os.path.getsize(path + infile))
File "C:\Python27\lib\genericpath.py", line 49, in getsize
return os.stat(filename).st_size
WindowsError: [Error 123] The filename, directory name, or volume label syntax is incorrect: '\DB?1333366331.436754.048342.csv'
C:\>python file_size.py
File "file_size.py", line 7
if infile.endswith(".csv"):
^
IndentationError: unindent does not match any outer indentation level
The name of the file that caused the error is DB表1333366331.436754.048342.csv
How can I avoid this problem?
Thanks in advance.

I would try making your root path a Unicode string. My guess is that listdir is returning names in the same encoding as the initial string and is failing on the non-ASCII character.
i.e.
path = u'\\'
Source:
http://docs.python.org/library/os.html#os.listdir
"Changed in version 2.3: On Windows NT/2k/XP and Unix, if path is a Unicode object, the result will be a list of Unicode objects. Undecodable filenames will still be returned as string objects."

Related

How to address a Snakemake error at the stage of DAG computation?

I'm running into an error with my Snakemake variant identification pipeline when the original DAG of jobs is built. I believe this is a memory issue; when I test with a short list of input files, the DAG is constructed without issue, however, when I try with 300+ paired-end fastq inputs, I receive the following error:
Building DAG of jobs...
Traceback (most recent call last):
File "/home//.conda/envs/snakemake/lib/python3.6/site-packages/snakemake/__init__.py", line 633, in snakemake
keepincomplete=keep_incomplete,
File "/home//.conda/envs/snakemake/lib/python3.6/site-packages/snakemake/workflow.py", line 568, in execute
dag.check_incomplete()
File "/home//.conda/envs/snakemake/lib/python3.6/site-packages/snakemake/dag.py", line 281, in check_incomplete
incomplete = self.incomplete_files
File "/home//.conda/envs/snakemake/lib/python3.6/site-packages/snakemake/dag.py", line 402, in incomplete_files
filterfalse(self.needrun, self.jobs),
File "/home/k/.conda/envs/snakemake/lib/python3.6/site-packages/snakemake/dag.py", line 399, in <genexpr>
job.output
File "/home//.conda/envs/snakemake/lib/python3.6/site-packages/snakemake/persistence.py", line 205, in incomplete
return any(map(lambda f: f.exists and marked_incomplete(f), job.output))
File "/home//.conda/envs/snakemake/lib/python3.6/site-packages/snakemake/persistence.py", line 205, in <lambda>
return any(map(lambda f: f.exists and marked_incomplete(f), job.output))
File "/home//.conda/envs/snakemake/lib/python3.6/site-packages/snakemake/persistence.py", line 203, in marked_incomplete
return self._read_record(self._metadata_path, f).get("incomplete", False)
File "/home//.conda/envs/snakemake/lib/python3.6/site-packages/snakemake/persistence.py", line 322, in _read_record_cached
return self._read_record_uncached(subject, id)
File "/home//.conda/envs/snakemake/lib/python3.6/site-packages/snakemake/persistence.py", line 328, in _read_record_uncached
return json.load(f)
File "/home//.conda/envs/snakemake/lib/python3.6/json/__init__.py", line 299, in load
parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
File "/home//.conda/envs/snakemake/lib/python3.6/json/__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "/home//.conda/envs/snakemake/lib/python3.6/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/home//.conda/envs/snakemake/lib/python3.6/json/decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
I'm not sure how to resolve this: is this a known bug, or is there a way to define my pipeline so that it builds a less complex DAG? I am including the first section of my Snakemake file as well. I use the rule all to define all desired output files.
################################
#### Mtb bwa/GATK Snakemake ####
################################
import numpy as np
from collections import defaultdict
import pandas as pd

samples_df = pd.read_table('config/tgen_samples2a.tsv',sep = ',').set_index("sample", drop=False)
sample_names = list(samples_df['sample'])
batch_names = list(samples_df['batch'])
#print(sample_names)

# fastq1 input function definition
def fq1_from_sample(wildcards):
    return samples_df.loc[wildcards.sample, "fastq_1"]

# fastq2 input function definition
def fq2_from_sample(wildcards):
    return samples_df.loc[wildcards.sample, "fastq_2"]

# Define config file. Stores sample names and other things.
configfile: "config/config.yaml"

# Define a rule for running the complete pipeline.
rule all:
    wildcard_constraints:
        batch="IS-.+"
    input:
        trim = expand(['results/{batch}/{samp}/trim/{samp}_trim_1.fq.gz'], zip, samp=sample_names, batch=batch_names),
        kraken = expand('results/{batch}/{samp}/kraken/{samp}_trim_kr_1.fq.gz', zip, samp=sample_names, batch=batch_names),
        bams = expand('results/{batch}/{samp}/bams/{samp}_{mapper}_{ref}_sorted.bam', zip, samp=sample_names, batch=batch_names, ref = config['ref']*len(sample_names), mapper = config['mapper']*len(sample_names)), # When using zip, need to use vectors of equal lengths for all wildcards.
        per_samp_run_stats = expand('results/{batch}/{samp}/stats/{samp}_{mapper}_{ref}_combined_stats.csv', zip, samp=sample_names, batch=batch_names, ref = config['ref']*len(sample_names), mapper = config['mapper']*len(sample_names)),
        amr_stats = expand('results/{batch}/{samp}/stats/{samp}_{mapper}_{ref}_amr.csv', samp=sample_names, batch=batch_names, ref=config['ref'], mapper=config['mapper']),
        cov_stats = expand('results/{batch}/{samp}/stats/{samp}_{mapper}_{ref}_cov_stats.txt', samp=sample_names, batch=batch_names, ref=config['ref'], mapper=config['mapper']),
        all_sample_stats = expand('results/{batch}/stats/combined_per_run_sample_stats.csv', batch = batch_names),
        vcfs = expand('results/{batch}/{samp}/vars/{samp}_{mapper}_{ref}_{caller}_qfilt.vcf.gz', samp=sample_names, batch=batch_names, ref=config['ref'], mapper=config['mapper'], caller = config['caller']),
        ann_vcfs = expand('results/{batch}/{samp}/vars/{samp}_{mapper}_{ref}_gatk_ann.vcf.gz', samp=sample_names, batch=batch_names, ref=config['ref'], mapper=config['mapper'], caller = config['caller']),
        fastas = expand('results/{batch}/{samp}/fasta/{samp}_{mapper}_{ref}_{caller}_{filter}.fa', samp=sample_names, batch=batch_names, ref=config['ref'], mapper=config['mapper'], caller = config['caller'], filter=config['filter']),
        profiles = expand('results/{batch}/{samp}/stats/{samp}_{mapper}_{ref}_lineage.csv', samp=sample_names, batch=batch_names, ref=config['ref'], mapper=config['mapper'])

# Trim reads for quality.
rule trim_reads:
    input:
        p1=fq1_from_sample,
        p2=fq2_from_sample
    output:
        trim1='results/{batch}/{sample}/trim/{sample}_trim_1.fq.gz',
        trim2='results/{batch}/{sample}/trim/{sample}_trim_2.fq.gz'
    log:
        'results/{batch}/{sample}/trim/{sample}_trim_reads.log'
    shell:
        '{config[scripts_dir]}trim_reads.sh {input.p1} {input.p2} {output.trim1} {output.trim2} &>> {log}'

# Filter reads taxonomically with Kraken.
rule taxonomic_filter:
    input:
        trim1='results/{batch}/{samp}/trim/{samp}_trim_1.fq.gz',
        trim2='results/{batch}/{samp}/trim/{samp}_trim_2.fq.gz'
    output:
        kr1='results/{batch}/{samp}/kraken/{samp}_trim_kr_1.fq.gz',
        kr2='results/{batch}/{samp}/kraken/{samp}_trim_kr_2.fq.gz',
        kraken_report='results/{batch}/{samp}/kraken/{samp}_kraken.report',
        kraken_stats = 'results/{batch}/{samp}/kraken/{samp}_kraken_stats.csv'
    log:
        'results/{batch}/{samp}/kraken/{samp}_kraken.log'
    threads: 8
    shell:
        '{config[scripts_dir]}run_kraken.sh {input.trim1} {input.trim2} {output.kr1} {output.kr2} {output.kraken_report} &>> {log}'
Thank you in advance for help using Snakemake!
All the best,
I kind of doubt memory is the issue; 300+ samples is not much, especially if each of them is processed independently of the others.
Try starting from the subset of samples that you say worked and gradually increasing it until the problem appears. Perhaps you have some odd value in your sample sheet or in your config? The json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0) hints at something like that, in my impression.
The answer was from @TroyComi, above: after deleting the .snakemake directory, the issue was resolved. Thank you!
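For what it's worth, the traceback shows the failure happening while Snakemake parses its own per-output metadata records, so a stale or truncated record is a more likely culprit than DAG complexity. Before wiping everything, a small diagnostic sketch like the one below can point at the offending files. It assumes the records live under .snakemake/metadata/ (an assumption about the Snakemake version, not something stated in the post):

import json
from pathlib import Path

# Hypothetical sketch: scan the (assumed) .snakemake/metadata/ store for
# records that are not valid JSON, since an empty or truncated record is
# what raises "Expecting value: line 1 column 1 (char 0)" during DAG building.
metadata_dir = Path(".snakemake/metadata")

if metadata_dir.is_dir():
    for record in (p for p in metadata_dir.rglob("*") if p.is_file()):
        try:
            json.loads(record.read_text())
        except json.JSONDecodeError:
            print("corrupt metadata record:", record)

Deleting the whole .snakemake directory (or the corrupt records it contains) forces Snakemake to rebuild its metadata, which matches the resolution reported above.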

Drake Visualizer : Unknown file extension in readPolyData when using .dae file

I am trying to add a custom mesh (a torus) .dae file for collision and visual to my .sdf model.
When I run my program, drake_visualizer gives the following error:
File "/opt/drake/lib/python2.7/site-packages/director/lcmUtils.py", line 119, in handleMessage
callback(msg)
File "/opt/drake/lib/python2.7/site-packages/director/drakevisualizer.py", line 352, in onViewerLoadRobot
self.addLinksFromLCM(msg)
File "/opt/drake/lib/python2.7/site-packages/director/drakevisualizer.py", line 376, in addLinksFromLCM
self.addLink(Link(link), link.robot_num, link.name)
File "/opt/drake/lib/python2.7/site-packages/director/drakevisualizer.py", line 299, in __init__
self.geometry.extend(Geometry.createGeometry(link.name + ' geometry data', g))
File "/opt/drake/lib/python2.7/site-packages/director/drakevisualizer.py", line 272, in createGeometry
polyDataList, visInfo = Geometry.createPolyDataFromFiles(geom)
File "/opt/drake/lib/python2.7/site-packages/director/drakevisualizer.py", line 231, in createPolyDataFromFiles
polyDataList = [ioUtils.readPolyData(filename)]
File "/opt/drake/lib/python2.7/site-packages/director/ioUtils.py", line 25, in readPolyData
raise Exception('Unknown file extension in readPolyData: %s' % filename)
Exception: Unknown file extension in readPolyData: /my_path/model.dae
Since prius.sdf also uses prius.dae, I assume this is possible. What am I doing wrong?
tl;dr: drake_visualizer doesn't load .dae files. If you put a similarly named .obj file in the same folder, it will load that instead (and you can leave your sdf file referencing the .dae file).
Long answer:
drake_visualizer has a very specific, arbitrary protocol for loading files. Given an arbitrary file name (e.g., my_geometry.dae) it will:
1. Strip off the extension.
2. Try the following files (in order), loading the first one it finds:
   - my_geometry.vtm
   - my_geometry.vtp
   - my_geometry.obj
   - the file with the original extension.
It can load vtm, vtp, ply, obj, and stl files.
The worst thing is if you have both a vtp and an obj file in the same folder with the same name and you specify the obj, it'll still favor the vtp file.
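If it helps to see that lookup order concretely, here is a small, hypothetical Python sketch of the same fallback logic (not the actual director code, just an illustration of the behaviour described above):

import os

# Candidate extensions tried before falling back to the original file,
# mirroring the order described above: .vtm, then .vtp, then .obj.
FALLBACK_EXTENSIONS = ['.vtm', '.vtp', '.obj']

def resolve_mesh_path(filename):
    base, original_ext = os.path.splitext(filename)
    for ext in FALLBACK_EXTENSIONS:
        candidate = base + ext
        if os.path.exists(candidate):
            return candidate
    # Nothing else found: fall back to the originally referenced file.
    return filename

So resolve_mesh_path('/my_path/model.dae') would return '/my_path/model.obj' if that file sits next to the .dae, which is why dropping in a same-named .obj makes the model appear.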

Command line args in F# fsx

I run my .fsx file like this:
>fsi A.fsx
In this file I read a CSV with CsvProvider, which needs the path to the CSV data.
type Data = CsvProvider<"my_data.txt", ";", Schema
I need to pass the file name as a command line argument, and that is possible:
>fsi A.fsx my_data.txt
I can read it like this:
let originalPath = fsi.CommandLineArgs.ElementAt(1)
The problem is that the file name used in the CsvProvider definition needs to be a constant, and a command line argument is not. How can I initialize CsvProvider from a command line argument?
The value inside the angle brackets <"my_data.txt"...> specifies an example format file and is checked at compile time, hence the need for it to be a constant string. Assuming your .fsx script merely wants to load a different CSV file of the same general format, you would use:
let contents = Data.Load(originalPath)

Encoding::UndefinedConversionError: "\x96" from ASCII-8BIT to UTF-8 error while writing to a file in ruby

I have a function which writes filenames to a file in Ruby,
but I am getting this error:
Encoding::UndefinedConversionError: "\x96" from ASCII-8BIT to UTF-8
To overcome this I used:
file = File.open("names", "w")
file.puts(filename.force_encoding("utf-8"))
Doing this solved the problem, but when I later read through the file and try to open the files whose names are stored in the names file,
I get an error saying CANNOT STAT: NO SUCH FILE OR FOLDER EXISTS.
Any suggestions are welcome!
Well, I will try to suggest something.
It sounds like you have received this file from a workstation running Windows. It looks like this file's original name is
Volunteer Log – in Page.docx
That said, it was stored using Encoding::CP1252, so you need to handle CP1252 properly:
file = File.open 'names', 'w'
file.puts filename.force_encoding(Encoding::CP1252).encode(Encoding::UTF_8)
Hope it helps.

How to detect and convert DOS/Windows end of line to UNIX end of line in Ruby

I have implemented a CSV upload in Ruby (on Rails) that works fine when the file is uploaded from a browser running on a UNIX-like system.
However, a file uploaded by a real customer contains the famous ^M as line endings (I guess it was uploaded from Windows).
I need to detect this situation and replace the character before the file is processed.
Here is the code that creates the file:
# create the file on the server
path = File.join(directory, name)
# write the file
File.open(path, 'wb') { |f| f.write(uploadData.read) }
Do I need to change the "wb" to "w", and would that solve the problem?
The CR (^M as you say it) char is "\r" in Ruby (and many other languages), so if you're sure your line endings also have the LF char (Windows uses CRLF as the line ending) then you can just remove all the CRs at the ends of the lines ($ matches at the end of a line, before the last "\n"):
uploadData.read.gsub /\r$/, ''
If you're not sure you're going to have the LF (e.g. Mac OS 9 used a plain CR at the end of the line), then replace any CR optionally followed by an LF with an LF:
uploadData.read.gsub /\r\n?/, "\n"
