Remove disconnected structures of compounds - java-stream

I am uploading 3 different chemical files to my application, one at a time. Each file contains the SMILES of a compound, but the tag name is different. I am creating an IAtomContainer stream by reading the file. I want to remove the disconnected structures from the stream. Is there a way to remove them instead of manually checking the SMILES? I am using CDK 1.5.13.

ConnectivityChecker.isConnected(IAtomContainer);
This works; it returns a boolean value.
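Since ConnectivityChecker.isConnected takes a single IAtomContainer and returns a boolean, it can be used directly as a stream predicate. A minimal sketch, assuming you already have a Stream<IAtomContainer> (the class and method names here are just for illustration):

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.openscience.cdk.graph.ConnectivityChecker;
import org.openscience.cdk.interfaces.IAtomContainer;

public class ConnectedOnly {

    // Keeps only fully connected structures; disconnected ones are filtered out.
    static List<IAtomContainer> removeDisconnected(Stream<IAtomContainer> molecules) {
        return molecules
                .filter(ConnectivityChecker::isConnected)
                .collect(Collectors.toList());
    }
}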

Related

Does apoc.import use merge or create to add new data?

CALL apoc.import.csv(
[{fileName: 'file:/persons.csv', labels: ['Person']}],
[{fileName: 'file:/knows.csv', type: 'KNOWS'}],
{delimiter: '|', arrayDelimiter: ',', stringIds: false}
)
For this example, does the import internally use MERGE or CREATE to add nodes, relationships, and properties? From my testing, it seems to use CREATE to add new rows even when a record has a new ID. Is there a way to control this? Also, when should apoc.load be used versus apoc.import? apoc.load seems a lot more flexible, since users can choose which Cypher commands to run for their specific purpose. Is that right?
From the source of CsvEntityLoader (which seems to be doing the work under the covers), nodes are blindly created rather than being merged.
While there's an ignoreDuplicateNodes configuration property you can set, it just ignores IDs duplicated within the incoming CSV (i.e. it's not de-duplicating the incoming records against your existing graph). You could protect yourself from creating duplicate nodes by creating an appropriate unique constraint on any uniquely-identifying properties, which would at least prevent you from accidentally running the same import twice.
Personally I'd only use apoc.import.csv to do a one-off bulk load of data into a fresh graph (or to load a dump from another graph that was exported as a CSV by something like apoc.export.csv.*). And even then, you've got the batch import tool that'll do that job with higher performance for large datasets.
I tend to use either the built-in LOAD CSV command or apoc.load.csv for most things, as you can control exactly what you do with each record coming in from the file (such as performing a MERGE rather than a CREATE).
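As a hedged sketch of that approach using the Neo4j Java driver (the connection URI, the credentials, and the id/name columns in persons.csv are assumptions for illustration; the constraint syntax shown is the Neo4j 4.4+/5 form):

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;

public class PersonCsvLoader {
    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {

            // Unique constraint: a repeated import cannot create duplicate Person nodes.
            // (Older Neo4j 4.x uses CREATE CONSTRAINT ... ON (p:Person) ASSERT p.id IS UNIQUE.)
            session.run("CREATE CONSTRAINT person_id IF NOT EXISTS "
                    + "FOR (p:Person) REQUIRE p.id IS UNIQUE");

            // LOAD CSV with MERGE de-duplicates against the existing graph,
            // unlike apoc.import.csv, which blindly creates nodes.
            session.run("LOAD CSV WITH HEADERS FROM 'file:///persons.csv' AS row "
                    + "FIELDTERMINATOR '|' "
                    + "MERGE (p:Person {id: row.id}) "
                    + "SET p.name = row.name");
        }
    }
}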
As indicated by Pablissimo's answer, the ignoreDuplicateNodes config option (when explicitly set to true) does not actually check for duplicates in the DB - it just checks within the file. A request to address this gap was raised before, but nothing has been done about it yet. So, if this is a concern for your use case, then you should not use apoc.import.csv.
The rest of this answer applies iff your files never specify nodes that already exist in your DB.
If your node CSV file follows the neo4j-admin import command's import file header format and has a header that specifies the :ID field for the column containing the node's unique ID, then the apoc.import.csv procedure should, by default, fail when it encounters duplicate node IDs (within the same file). That is because the procedure's ignoreDuplicateNodes config value defaults to false (you can specify true to skip duplicate IDs instead of failing).
However, since your node imports are not failing but are generating duplicate nodes, that implies your node CSV file does not specify the :ID field as appropriate. To fix this, you need to add the :ID field and call the procedure with the config option ignoreDuplicateNodes:true. Or, you can modify those CSV files somehow to remove duplicate rows.
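As a hedged illustration of such a header (the column names other than the :ID field, and the use of the '|' delimiter from the question's config, are assumptions), the first lines of the node CSV would look something like:

id:ID|name
1|Alice
2|Bob

With the :ID column present, duplicate IDs within the file make the procedure fail by default, or get skipped when ignoreDuplicateNodes:true is passed.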

ID3 Parser and Editor

I'm writing an ID3 parser and editor. It already supports ID3v1 and v2.1-2.3. Are there any other widely used ID3 versions or extensions? For example, I've read about the Enhanced ID3v1 tag (which goes before ID3v1 and starts with "TAG+"), but I've never seen it inside MP3 files. Should I implement support for it anyway?
"ID3v2.1" never existed.
Yes, the Enhanced TAG is identified by TAG+ and extends ID3v1.
For a list of all metadata systems to be expected in MP3 files, see https://stackoverflow.com/a/62366354 - top priority should be ID3v2.4, as you will encounter it most often aside from ID3v2.3. Then go for the informal and/or legacy ones, because those can still be encountered (just because files get old doesn't mean they cease to exist).
Keep the following things in mind when parsing files:
A file can have both an ID3v1 and an ID3v2 tag.
A file can have multiple ID3v2 tags (e.g. ID3v2.3 and ID3v2.4). Although this shouldn't occur, it should pose no problem for your parser to also accept multiple tags of the same version.
ID3v2 is not limited to MP3 files (but ID3v1 and all its informal extensions are).
Consider the following parsing order in an MP3 file (a detection sketch follows the list):
Check for ID3v1 at the end of the file.
Check for ID3v1.2 in front of ID3v1.
Check for Enhanced TAG in front of ID3v1.
Check for multiple ID3v2 tags at the start of the file and, in the case of ID3v2.4, for a footer at the end of the file in front of all ID3v1-style tags.
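A minimal detection sketch of that order in Java (the class and method names are just for illustration; the offsets and magic bytes are the documented ones: "TAG" in the last 128 bytes for ID3v1, "TAG+" 227 bytes before that for the Enhanced TAG, "ID3" at the start of the file for an ID3v2 header, and "3DI" for an ID3v2.4 footer):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class TagDetector {

    // Reads marker.length() bytes at the given offset and compares them to the marker.
    private static boolean hasMarkerAt(RandomAccessFile f, long offset, String marker)
            throws IOException {
        if (offset < 0 || offset + marker.length() > f.length()) {
            return false;
        }
        byte[] buf = new byte[marker.length()];
        f.seek(offset);
        f.readFully(buf);
        return new String(buf, StandardCharsets.ISO_8859_1).equals(marker);
    }

    public static void main(String[] args) throws IOException {
        try (RandomAccessFile f = new RandomAccessFile(args[0], "r")) {
            long end = f.length();
            boolean id3v1    = hasMarkerAt(f, end - 128, "TAG");        // ID3v1 / ID3v1.1
            boolean enhanced = hasMarkerAt(f, end - 128 - 227, "TAG+"); // Enhanced TAG
            boolean id3v2    = hasMarkerAt(f, 0, "ID3");                // ID3v2.x header
            // An ID3v2.4 footer ("3DI") sits directly in front of any ID3v1-style tags.
            long footerOffset = end - 10 - (id3v1 ? 128 : 0) - (enhanced ? 227 : 0);
            boolean id3v24Footer = hasMarkerAt(f, footerOffset, "3DI");
            System.out.printf("ID3v1=%b EnhancedTAG=%b ID3v2=%b ID3v2.4 footer=%b%n",
                    id3v1, enhanced, id3v2, id3v24Footer);
        }
    }
}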

Does the order of hdf5 close matter?

I am implementing an HDF5 layer in an interpreted language with automatic reclamation facilities (garbage collection).
When a proxy to an HDF5 entity (H5File, H5Group, H5Dataset, H5Dataspace, H5Datatype, etc.) is no longer referenced, it is automatically reclaimed. With an ephemeron-like facility, I can arrange to be notified and invoke the corresponding close function automatically (H5Fclose, H5Gclose, H5Dclose, etc.) in order to release the target resource.
By default, I have no control over the order of reclamation. However, if the order of close calls matters, I can arrange to keep a strong pointer to the parent proxy (for example the H5File) from within every other entity. If the order does not matter, then I will avoid this useless complication.
So my questions:
1. Can I invoke H5Fclose(fid); before H5Gclose(gid); where previously gid=H5Gcreate(fid,'/foo',H5P_DEFAULT,H5P_DEFAULT,H5P_DEFAULT);?
2. Can I continue to operate on the group once I have closed the containing file? For example, is it legal to call H5Fclose(fid); before gid2=H5Gcreate(gid,'bar',H5P_DEFAULT,H5P_DEFAULT,H5P_DEFAULT); in the above example? If not, are there other entities concerned, or is it just the file?
Doh, a case of blindness: the documentation says that closing is delayed until all objects have been closed, so 1. the order does not matter and 2. it is legal.
https://support.hdfgroup.org/HDF5/doc1.6/RM_H5F.html#File-Close
However, it may not work in every circumstance, so it's not recommended.
H5Fclose terminates access to an HDF5 file by flushing all data to storage and terminating access to the file through file_id.
If this is the last file identifier open for the file and no other access identifier is open (e.g., a dataset identifier, group identifier, or shared datatype identifier), the file will be fully closed and access will end.
Delayed close:
Note the following deviation from the above-described behavior. If H5Fclose is called for a file but one or more objects within the file remain open, those objects will remain accessible until they are individually closed. Thus, if the dataset data_sample is open when H5Fclose is called for the file containing it, data_sample will remain open and accessible (including writable) until it is explicitly closed. The file will be automatically closed once all objects in the file have been closed.
Be warned, however, that there are circumstances where it is not possible to delay closing a file. For example, an MPI-IO file close is a collective call; all of the processes that opened the file must close it collectively. The file cannot be closed at some time in the future by each process in an independent fashion. Another example is that an application using an AFS token-based file access privilege may destroy its AFS token after H5Fclose has returned successfully. This would make any future access to the file, or any object within it, illegal.
In such situations, applications must close all open objects in a file before calling H5Fclose. It is generally recommended to do so in all cases.
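To make the resolved case concrete, a minimal sketch assuming the HDF5 Java bindings (hdf.hdf5lib, HDF5 1.10+ where identifiers are long); the file name test.h5 is arbitrary, and the C API behaves the same way:

import hdf.hdf5lib.H5;
import hdf.hdf5lib.HDF5Constants;

public class DelayedCloseDemo {
    public static void main(String[] args) throws Exception {
        long fid = H5.H5Fcreate("test.h5", HDF5Constants.H5F_ACC_TRUNC,
                HDF5Constants.H5P_DEFAULT, HDF5Constants.H5P_DEFAULT);
        long gid = H5.H5Gcreate(fid, "/foo", HDF5Constants.H5P_DEFAULT,
                HDF5Constants.H5P_DEFAULT, HDF5Constants.H5P_DEFAULT);

        // Closing the file first is legal: the file only becomes fully closed
        // once every object inside it (here, the groups) has been closed.
        H5.H5Fclose(fid);

        // The group identifier remains usable after H5Fclose (delayed close).
        long gid2 = H5.H5Gcreate(gid, "bar", HDF5Constants.H5P_DEFAULT,
                HDF5Constants.H5P_DEFAULT, HDF5Constants.H5P_DEFAULT);

        H5.H5Gclose(gid2);
        H5.H5Gclose(gid);   // the file is actually closed only after this call
    }
}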

Is there a way to avoid the target table in Informatica (PowerCenter)?

In an Informatica mapping design there must be a target table, but in my design I only use Informatica to call stored procedures, and once they have been called all the work is done, so I don't need a target table to be inserted or updated.
I used a non-existent table as the target table and one dummy field as the input port (because there must be at least one input port!), then unchecked the options (insert, update, delete) in the session configuration so that Informatica would not generate DML SQL statements, avoiding "no table" errors.
But then Informatica treats the input row as a rejected row and tries to write it into a bad file. And because I unchecked the insert option, the session log showed an error that the row couldn't be inserted into the bad file!
Strangely, this error never showed up in the monitor, and every session run was successful! It only appeared in Informatica's metadata tables.
Is there a better way to avoid this problem, even though it has no effect on my results? Is it possible to use a non-existent table and do nothing with it (including not rejecting the input rows)?
Use a Filter transformation just before the target and set the filter condition to FALSE.
No rows will go to the target.
I had run into this same issue when I wanted to just execute a stored procedure and nothing else.
I solved this by creating a dummy source object that had one port and a dummy target with one port of the same datatype. In the Source Qualifier I added the SQL statement select 1 from dual (since it's Oracle).
I then added a Filter transformation that was set to FALSE. Then I connected the single port from the source/qualifier through the filter and finally to the target.
When the mapping runs, the Source Qualifier returns one row with a single value; it passes through to the filter, but nothing comes out of the filter because the condition is FALSE. This mapping will always be successful and valid because all ports are connected and nothing makes it to the "dummy" target, so there are no bad-file logs, failures, etc.
Let me know if you need any clarification and I can update this answer.
No, you always need a target for the mapping to be valid. But I would rather work with a flat-file target instead of a database table; you'll have much less work to do.
If you're on Linux / Unix, you can even route the file to /dev/null (use folder:/dev/, file:null) so the file is not actually written to the filesystem.
And using one dummy port is the right way. As you have said, you need at least one port, even if you don't really use it.
As odd as this may sound (on Unix systems): neither the source nor the target needs to exist.
Source (flat file): /dev/null, column DUMMY
Target (flat file): /dev/null, column DUMMY
And you don't need to use any databases for the session to succeed, nor use any filters. It runs.

Merging/appending multiple pcap files to an existing one without overwriting

I am using tshark to filter some packets based on Display/Read filters from one file into another.
I want to end up with one final output file, out.pcap, after executing multiple read filters over a number of files and combining everything into out.pcap.
I was trying to use mergecap, but it does not allow appending (combining) two files and storing the result in one of them without overwriting it.
Is there any way to do this? I don't want to keep creating temporary files and merge them all together at the end.
This is not possible with existing tools that I know of, although given the way the capture file format is laid out, it should be possible to write a new tool (or extend mergecap) to do this.
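If you do end up writing such a tool, here is a minimal sketch in Java of what that answer hints at, assuming both files are classic pcap (not pcapng) with the same byte order, snaplen, and link-layer type (the file names are just examples, and no validation is done here):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class PcapAppend {
    // Appends the packet records of src onto dst in place by skipping
    // src's 24-byte global header and copying the rest.
    public static void append(String dst, String src) throws IOException {
        try (FileInputStream in = new FileInputStream(src);
             FileOutputStream out = new FileOutputStream(dst, true)) { // append mode
            long skipped = in.skip(24);          // skip src's global header
            if (skipped != 24) {
                throw new IOException("not a pcap file: " + src);
            }
            in.transferTo(out);                  // copy all packet records
        }
    }

    public static void main(String[] args) throws IOException {
        append("out.pcap", "filtered.pcap");     // file names are just examples
    }
}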
