Can we create multiple datasets in a single TDB directory? - Jena

String directory = "directoryPath";
Dataset dataset = TDBFactory.createDataset(directory);
Dataset datasetnew = TDBFactory.createDataset(directory);
Is the dataset a reference to the TDB directory, or is it like a folder inside the TDB directory? Will datasetnew again create a reference to the directory, or will a separate dataset folder be created?
Basically, my use case is to create one dataset per user, where each user can save all of their models without interfering with the models of other users.
If this is not how it's done, can someone suggest a way?

When you specify the same directory, a new directory will not be created. In fact, both dataset and datasetnew point to the same physical RDF store; if you try to start a transaction on both of them, you will get an org.apache.jena.dboe.transaction.txn.TransactionException: Currently in an active transaction. So no, you cannot create multiple datasets in a single TDB directory.
Another observation is that you seem to be using TDB1 rather than TDB2. TDB2 is the later version and is therefore the better one to use. Your code would then look as follows:
// Build an absolute path for the TDB2 database directory
Path path = Paths.get(".").toAbsolutePath().normalize();
String dbDir = path.toFile().getAbsolutePath() + "/db/";
Location location = Location.create(dbDir);
Dataset dataset = TDB2Factory.connectDataset(location);
// Run a simple SPARQL Update inside a write transaction
String strQuery = "INSERT DATA {<http://dbpedia.org/resource/Grace_Hopper> <http://xmlns.com/foaf/0.1/surname> \"Hopper\" .}";
dataset.begin(ReadWrite.WRITE);
UpdateRequest updateRequest = UpdateFactory.create(strQuery);
UpdateProcessor updateProcessor = UpdateExecutionFactory.create(updateRequest, dataset);
updateProcessor.execute();
dataset.commit();
dataset.close();
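As for the one-dataset-per-user requirement, a common approach is to give each user either their own TDB2 directory, or their own named graph(s) inside a single shared dataset. A minimal sketch of the per-directory variant, reusing dbDir and the classes from the snippet above (the users/<userId> layout and the graph URI are assumptions for illustration only):
String userId = "alice";                                        // hypothetical user id
Location userLocation = Location.create(dbDir + "users/" + userId);
Dataset userDataset = TDB2Factory.connectDataset(userLocation);
userDataset.begin(ReadWrite.WRITE);
try {
    // Store one of the user's models under a graph name of your choosing (hypothetical URI)
    userDataset.addNamedModel("http://example.org/users/" + userId + "/model1",
                              ModelFactory.createDefaultModel());
    userDataset.commit();
} finally {
    userDataset.end();
}
Because each user gets a separate Location, transactions on one user's dataset never conflict with another user's.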

Related

How to get Twitter mentions id using academictwitteR package?

I am trying to create several network analyses from Twitter. To get the data, I used the academictwitteR package and its get_all_tweets function.
get_all_tweets(
  users = c("LegaSalvini"),
  start_tweets = "2007-01-01T00:00:00Z",
  end_tweets = "2022-07-01T00:00:00Z",
  file = "tweets_lega",
  data_path = "tweetslega/",
  bind_tweets = FALSE
)
## Binding JSON files into data.frame objects
tweets_bind_lega <- bind_tweets(data_path = "tweetslega/")
## Tidying
tweets_bind_lega_tidy <- bind_tweets(data_path = "tweetslega/", output_format = "tidy")
With this, I can easily access the ids needed to build a retweet and reply network. However, the tidy format does not provide a column for the mentions; it drops them.
They are present in my untidy df tweets_bind_lega, stored as a list in tweets_bind_afd$entities$mentions. Now I would like to unnest this list and create a tidy df with a column that contains the mentioned Twitter user ids.
Has anyone created a mention network with academictwitteR before and can help me out?
Thanks!
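One possible approach (a sketch, not a tested answer — exact column names such as id and author_id depend on your academictwitteR version) is to pull the mentions list-column out of the raw data frame and expand it with tidyr::unnest():
library(dplyr)
library(tidyr)

# Hypothetical sketch: one row per (tweet, mentioned user)
mentions_df <- tweets_bind_lega %>%
  transmute(tweet_id = id,                      # assumed column name
            author_id,                          # assumed column name
            mentions = entities$mentions) %>%   # list of data frames, one per tweet
  unnest(cols = mentions)                       # expands to columns like username and id of the mentioned user
The resulting edge list (author_id -> mentioned user id) can then be fed into your network-building code.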

Queries make Twitter-stream application too slow in saving data

I have an application which streams Twitter data into a Neo4j database. The data I store covers tweets, users, hashtags and their relationships (user posts tweet, tweet tags hashtag, user retweets tweet).
Now, each time I get a new tweet what I do is:
Check if the database already contains the tweet: if so, I update it with the new information (retweet count, like count), else I save it
Check if the database already contains the user: if so, I update it with the new info, else I save it
Check if the database already contains the hashtag: if it doesn't, I add it
And so on, same process for saving the relationships.
Here are the queries:
static String cqlAddTweet = "merge (n:Tweet{tweet_id: {2}}) on create set n.text={1}, n.location={3}, n.likecount={4}, n.retweetcount={5}, n.topic={6}, n.created_at={7} on match set n.likecount={4}, n.retweetcount={5}";
static String cqlAddHT = "merge (n:Hashtag{text:{1}})";
static String cqlHTToTweet = "match (n:Tweet),(m:Hashtag) where n.tweet_id={1} and m.text={2} merge (n)-[:TAGS]->(m)";
static String cqlAddUser = "merge (n:User{user_id:{3}}) on create set n.name={1}, n.username={2}, n.followers={4}, n.following={5}, n.profilePic={6} on match set n.name={1}, n.username={2}, n.followers={4}, n.following={5}, n.profilePic={6}";
static String cqlUserToTweet = "match (n:User),(m:Tweet) where m.tweet_id={2} and n.user_id={1} merge (n)-[:POSTS]->(m)";
static String cqlUserRetweets = "match (n:Tweet{tweet_id:{1}}), (u:User{user_id:{2}}) create (u)-[:RETWEETS]->(n)";
Since saving the data is very slow, I suppose this system could perform better if I didn't run all those queries, which scan the data each time.
Do you have any suggestion to improve my application?
Thank you and excuse me in advance if this may seem silly.
Make sure you have indexes (or uniqueness constraints, if appropriate) on the following label/property pairs. That will allow your queries to avoid scanning through all nodes with the same label (when starting a query).
:Tweet(tweet_id)
:Hashtag(text)
:User(user_id)
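For example, with uniqueness constraints (which also create the backing indexes; the exact syntax depends on your Neo4j version — this is the pre-4.x form matching the {1}-style parameters used above):
CREATE CONSTRAINT ON (t:Tweet) ASSERT t.tweet_id IS UNIQUE;
CREATE CONSTRAINT ON (h:Hashtag) ASSERT h.text IS UNIQUE;
CREATE CONSTRAINT ON (u:User) ASSERT u.user_id IS UNIQUE;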
By the way, a couple of your queries can be simplified (but this should not affect the performance):
static String cqlAddTweet = "MERGE (n:Tweet{tweet_id: {2}}) ON CREATE SET n.text={1}, n.location={3}, n.topic={6}, n.created_at={7} SET n.likecount={4}, n.retweetcount={5}";
static String cqlAddUser = "MERGE (n:User{user_id:{3}}) SET n.name={1}, n.username={2}, n.followers={4}, n.following={5}, n.profilePic={6}";

How to load table from file

I know how to write table contents to a text file and restore them. But what is the best practice for writing a custom table type to a file?
Here is my situation:
I have a list of tables, e.g. objective1 = {}, objective2 = {}, objective3 = {}, ...
Each objective has its own execute function which checks some conditions and execute some commands depending on the conditions.
Each new game I pick a random set of objectives and store them in an array: Objectives = { [1] = { objective1, objective3 } }
Now I want to save the array of random objectives and load them later from the file.
Is there any possibility to save the name of the tables in the array and to restore them by the name?
Would be great to have a solution without using a factory pattern or using indices for saving.
As long as the data in the table is not userdata (e.g. a SQL or socket connection, or other library objects), you can use the pickle module to write the data to a file and later load it back from there.
I personally use a different pickle library (written by Walter Doekes). The usage is as follows:
local TableToStore = { 'a', 'b', 1, {'nested', 'content', {c = 'inside'}} }
pickle.store( 'backupFileName.bak', {TableToStore = TableToStore} )
and to restore the data, I can simply use
dofile "backupFileName.bak"
and I'll get back a table named TableToStore after this.
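To restore your objectives by name (rather than serializing the objective tables themselves, which contain functions), one possible sketch is to keep a name-to-table lookup and only pickle the names; the registry, file name, and chosen names below are assumptions for illustration:
-- Hypothetical registry mapping names to the actual objective tables
local objectivesByName = {
  objective1 = objective1,
  objective2 = objective2,
  objective3 = objective3,
}

-- Saving: store only the names of the chosen objectives
pickle.store('objectives.bak', { ObjectiveNames = { "objective1", "objective3" } })

-- Loading: resolve the names back to the live tables
dofile('objectives.bak')
local restored = {}
for i, name in ipairs(ObjectiveNames) do
  restored[i] = objectivesByName[name]
end
Objectives = { restored }
This way the file only ever contains plain strings, and the executable logic stays in your code.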

How can you join two or more dictionaries created by Bio.SeqIO.index?

I would like to be able to join the two "dictionaries" stored in "indata" and "pairdata", but this code,
indata = SeqIO.index(infile, infmt)
pairdata = SeqIO.index(pairfile, infmt)
indata.update(pairdata)
produces the following error:
indata.update(pairdata)
TypeError: update() takes exactly 1 argument (2 given)
I have tried using,
indata = SeqIO.to_dict(SeqIO.parse(infile, infmt))
pairdata = SeqIO.to_dict(SeqIO.parse(pairfile, infmt))
indata.update(pairdata)
which does work, but the resulting dictionaries take up too much memory to be practical for the sizes of infile and pairfile I have.
The final option I have explored is:
indata = SeqIO.index_db(indexfile, [infile, pairfile], infmt)
which works perfectly, but is very slow. Does anyone know how/whether I can successfully join the two indexes from the first example above?
SeqIO.index returns a read-only dictionary-like object, so update will not work on it (apologies for the confusing error message; I just checked in a fix for that to the main Biopython repository).
The best approach is to either use index_db, which will be slower but only needs to index the file once, or to define a higher-level object which acts like a dictionary over your multiple files. Here is a simple example:
from Bio import SeqIO

class MultiIndexDict:
    """Read-only dictionary-like view over several SeqIO indexes."""
    def __init__(self, *indexes):
        self._indexes = indexes
    def __getitem__(self, key):
        # Return the record from the first index that contains the key
        for idx in self._indexes:
            try:
                return idx[key]
            except KeyError:
                pass
        raise KeyError("{0} not found".format(key))

indata = SeqIO.index("f001", "fasta")
pairdata = SeqIO.index("f002", "fasta")
combo = MultiIndexDict(indata, pairdata)
print(combo['gi|3318709|pdb|1A91|'].description)
print(combo['gi|1348917|gb|G26685|G26685'].description)
print(combo["key_failure"])  # raises KeyError
If you don't plan to use the index again and memory isn't a limitation (which both appear to be true in your case), you can tell Bio.SeqIO.index_db(...) to use an in-memory SQLite3 index with the special index name ":memory:", like so:
indata = SeqIO.index_db(":memory:", [infile, pairfile], infmt)
where infile and pairfile are filenames, and infmt is their format type as defined in Bio.SeqIO (e.g. "fasta").
This is actually a general trick with Python's SQLite3 library. For a small set of files this should be much faster than building the SQLite index on disk.

DBF Large Char Field

I have a database file that I believe was created with Clipper, but I can't say for sure (I have .ntx files for the indexes, which I understand is what Clipper uses). I am trying to create a C# application that will read this database using the System.Data.OleDb namespace.
For the most part I can successfully read the contents of the tables, but there is one field that I cannot. This field, called CTRLNUMS, is defined as CHAR(750). I have read various articles found through Google searches suggesting that fields larger than 255 chars have to be read through a different process than the normal assignment to a string variable. So far I have not been successful with any approach I have found.
The following is a sample code snippet I am using to read the table; it includes the two options I used to read the CTRLNUMS field. Both options returned 238 characters even though there are 750 characters stored in the field.
Here is my connection string:
Provider=Microsoft.Jet.OLEDB.4.0;Data Source=c:\datadir;Extended Properties=DBASE IV;
Can anyone tell me the secret to reading larger fields from a DBF file?
using (OleDbConnection conn = new OleDbConnection(connectionString))
{
    conn.Open();
    using (OleDbCommand cmd = new OleDbCommand())
    {
        cmd.Connection = conn;
        cmd.CommandType = CommandType.Text;
        cmd.CommandText = string.Format("SELECT ITEM,CTRLNUMS FROM STUFF WHERE ITEM = '{0}'", stuffId);
        using (OleDbDataReader dr = cmd.ExecuteReader())
        {
            if (dr.Read())
            {
                stuff.StuffId = dr["ITEM"].ToString();

                // OPTION 1
                string ctrlNums = dr["CTRLNUMS"].ToString();

                // OPTION 2
                char[] buffer = new char[750];
                int index = 0;
                int readSize = 5;
                while (index < 750)
                {
                    long charsRead = dr.GetChars(dr.GetOrdinal("CTRLNUMS"), index, buffer, index, readSize);
                    index += (int)charsRead;
                    if (charsRead < readSize)
                    {
                        break;
                    }
                }
            }
        }
    }
}
You can find a description of the DBF structure here: http://www.dbf2002.com/dbf-file-format.html
What I think Clipper used to do was modify the Field structure so that, in Character fields, the Decimal Places held the high-order byte of the size, so Character field sizes were really 256*Decimals+Size.
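A rough sketch of how that convention might be applied when parsing the field descriptors by hand (offsets follow the standard dBASE layout linked above; ReadFieldDescriptor is an assumed helper, and this is untested against real Clipper files):
// Each field descriptor in the DBF header is 32 bytes long.
byte[] descriptor = ReadFieldDescriptor(stream);   // assumed helper returning the 32-byte record
char fieldType = (char)descriptor[11];             // 'C' = character field
byte length = descriptor[16];                      // low byte of the size
byte decimals = descriptor[17];                    // Clipper reuses this as the high byte for char fields
int realLength = (fieldType == 'C') ? decimals * 256 + length : length;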
I may have a C# class that reads dbfs (natively, not ADO/DAO), it could be modified to handle this case. Let me know if you're interested.
Are you still looking for an answer? Is this a one-off job or something that needs doing regularly?
I have a Python module that is primarily intended to extract data from all kinds of DBF files ... it doesn't yet handle the length_high_byte = decimal_places hack, but it's a trivial change. I'd be quite happy to (a) share this with you and/or (b) get a copy of such a DBF file for testing.
Added later: Extended-length feature added, and tested against files I've created myself. Offer to share code with anyone who would like to test it still stands. Still interested in getting some "real" files myself for testing.
3 suggestions that might be worth a shot...
1 - use Access to create a linked table to the DBF file, then use .Net to hit the table in the access database instead of going direct to the DBF.
2 - try the FoxPro OLEDB provider
3 - parse the DBF file by hand. Example is here.
My guess is that #1 should work the easiest, and #3 will give you the opportunity to fine tune your cussing skills. :)
