Google Fusion Tables - How to get ROWID in media download

I am querying a fusion table using a request like this:
https://www.googleapis.com/fusiontables/v2/query?alt=media&sql=SELECT ROWID,State,County,Name,Location,Geometry,[... many more ...] FROM <table ID>
The results from this query exceed 10 MB, so I must include the alt=media option (for smaller queries, where I can drop this option, the problem does not exist). The response is in CSV, as promised by the documentation. The first line of the response appears to be a header line which exactly matches my query string (except that it shows rowid instead of ROWID):
rowid,State,County,Name,Location,Geometry,[... many more ...]
The following rows, however, do not include the row ID. Each line begins with the second item requested in the query - it seems as though the row ID was ignored:
WV,Calhoun County,"Calhoun County, WV, USA",38.858 -81.1196,"<Polygon><outerBoundaryIs>...
Is there any way around this? How can I retrieve row ids from a table when the table is large?

Missing ROWIDs are also present for "media" requests made via Google's Python API Client library, e.g.
def doGetQuery(query):
    request = FusionTables.query().sqlGet_media(sql = query);
    response = request.execute();  # media downloads return raw bytes
    ft = {'kind': "fusiontables#sqlresponse"};
    s = response.decode();  # bytestring to string
    data = [];
    for i in s.splitlines():
        data.append(i.split(','));
    ft['columns'] = data.pop(0);  # ['Rowid', 'A', 'B', 'C']
    ft['rows'] = data;  # [['a1', 'b1', 'c1'], ...]
    return ft;
You may at least have one fewer issue than I do - this sqlGet_media method can only be used with a "pure" GET request - a query long enough (2-8k characters) to be sent as an overridden POST generates a 502 Bad Gateway error, even for tiny response sizes such as the result of SELECT COUNT(). The same query as a non-media request works flawlessly (provided the response size is not over 10 MB, of course).
The solution to both your and my issue is to batch the request using OFFSET and LIMIT, such that the 10 MB response limit isn't hit. Estimate the size of an average response row at the call site, pass it into a wrapper function, and let the wrapper handle adding OFFSET and LIMIT to your input SQL, and then collate the multi-query result into a single output of the desired format:
from math import floor

def doGetQuery(query, rowSize = 1.):
    limitValue = floor(9.5 * 1024 / rowSize)  # rowSize in kB; stay safely below the 10 MB cap
    offsetValue = 0;
    ft = {'kind': "fusiontables#sqlresponse"};
    data = [];
    done = False;
    while not done:
        tail = ' '.join(['OFFSET', str(offsetValue), 'LIMIT', str(limitValue)]);
        request = FusionTables.query().sqlGet(sql = query + ' ' + tail);
        response = request.execute();
        offsetValue += limitValue;
        if 'rows' in response.keys():
            data.extend(response['rows']);
        # Check the exit condition: no rows at all, or a short (final) page.
        if 'rows' not in response.keys() or len(response['rows']) < limitValue:
            done = True;
        if 'columns' not in ft.keys() and 'columns' in response.keys():
            ft['columns'] = response['columns'];
    ft['rows'] = data;
    return ft;
This wrapper can be extended to handle actual desired uses of OFFSET and LIMIT. Ideally a REST API resource provides list() and list_next() methods for native pagination, but no such methods exist for FusionTable::Query. Given the horrendously slow rate of Fusion Tables API / functionality updates, I wouldn't expect ROWIDs to magically appear in alt=media downloads, or GET-turned-POST media-format queries to ever work, so writing your own wrapper is going to save you a lot of headache.
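As a minimal usage sketch (the 2 kB-per-row figure below is only an assumed estimate for a geometry-heavy table; tune it for your own data):

sql = 'SELECT ROWID,State,County,Name,Location,Geometry FROM <table ID>'
result = doGetQuery(sql, rowSize = 2.)  # assumed ~2 kB per row
print(result['columns'])
print(len(result['rows']))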

Related

BioPython Entrez article limit

I've been using this classic function, which returns the article IDs for a search string:
from Bio import Entrez, __version__
print('Biopython version : ', __version__)

def article_machine(t):
    Entrez.email = 'email'
    handle = Entrez.esearch(db='pubmed',
                            sort='relevance',
                            retmax='100000',
                            retmode='xml',
                            term=t)
    return list(Entrez.read(handle)["IdList"])

print(len(article_machine('T cell')))
I've now noticed that there's a limit to the number of articles I receive (not the one I set in retmax).
I get 9,999 PMIDs for keywords that used to return 100k PMIDs ('T cell', for example).
I know it's not a bug in the package itself but a limit on NCBI's side.
Has someone managed to solve it?
From The E-utilities In-Depth: Parameters, Syntax and More:
retmax
Total number of UIDs from the retrieved set to be shown in the XML output (default=20). By default, ESearch only includes the first 20 UIDs retrieved in the XML output. If usehistory is set to 'y', the remainder of the retrieved set will be stored on the History server; otherwise these UIDs are lost. Increasing retmax allows more of the retrieved UIDs to be included in the XML output, up to a maximum of 10,000 records.
To retrieve more than 10,000 UIDs from databases other than PubMed, submit multiple esearch requests while incrementing the value of retstart (see Application 3). For PubMed, ESearch can only retrieve the first 10,000 records matching the query. To obtain more than 10,000 PubMed records, consider using EDirect, which contains additional logic to batch PubMed search results automatically so that an arbitrary number can be retrieved.
Unfortunately, the code I devised based on the above information doesn't work:
from Bio import Entrez, __version__
print('Biopython version : ', __version__)

def article_machine(t):
    all_res = []
    Entrez.email = 'email'
    handle = Entrez.esearch(db='pubmed',
                            sort='relevance',
                            rettype='count',
                            term=t)
    number = int(Entrez.read(handle)['Count'])
    print(number)
    retstart = 0
    while retstart < number:
        retstart += 1000
        print('\n retstart now : ', retstart, ' out of : ', number)
        Entrez.email = 'email'
        handle = Entrez.esearch(db='pubmed',
                                sort='relevance',
                                rettype='xml',
                                retstart=retstart,
                                retmax=str(retstart),
                                term=t)
        all_res.extend(list(Entrez.read(handle)["IdList"]))
    return all_res

print(article_machine('T cell'))
Changing while retstart < number: to while retstart < 5000: makes the code work, but as soon as retstart exceeds 9998 (that is, with the original while loop, which is needed to access all the results), I get the following error:
RuntimeError: Search Backend failed: Exception:
'retstart' cannot be larger than 9998. For PubMed,
ESearch can only retrieve the first 9,999 records
matching the query.
To obtain more than 9,999 PubMed records, consider
using EDirect that contains additional logic
to batch PubMed search results automatically
so that an arbitrary number can be retrieved.
For details see https://www.ncbi.nlm.nih.gov/books/NBK25499/
The error message points to https://www.ncbi.nlm.nih.gov/books/NBK25499/, but the relevant page is actually https://www.ncbi.nlm.nih.gov/books/NBK179288/?report=reader
Try having a look at the NCBI APIs to see if something there could solve your problem via a Python interface (I am not an expert on this, sorry): https://www.ncbi.nlm.nih.gov/pmc/tools/developers/
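One workaround (not the EDirect batching the error message suggests, just a sketch under the assumption that no single slice exceeds the 10,000-record cap) is to split the search by publication date using esearch's datetype/mindate/maxdate parameters and collect the PMIDs slice by slice:

from Bio import Entrez

Entrez.email = 'email'

def pmids_by_year(term, start_year, end_year):
    # One esearch per publication year; assumes each year returns fewer
    # than 10,000 hits (split further, e.g. by month, if it does not).
    all_ids = []
    for year in range(start_year, end_year + 1):
        handle = Entrez.esearch(db='pubmed',
                                term=term,
                                retmax='10000',
                                datetype='pdat',
                                mindate=str(year),
                                maxdate=str(year))
        all_ids.extend(Entrez.read(handle)['IdList'])
        handle.close()
    return all_ids

print(len(pmids_by_year('T cell', 2000, 2021)))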

In Factual how to get results with unique field values (similar to GROUP BY in SQL)?

I've just set out to explore the Factual API and I cannot see how to retrieve a selection of entries, each with a unique value in a specified field.
For example, give me 10 results from various cities:
q.limit(10);
q.field("locality").unique(); // no such filter exists
factual.fetch("places", q);
This would be an equivalent query in MySQL:
SELECT * FROM places GROUP BY locality LIMIT 10;
What I want is a little bit similar to facets:
FacetQuery fq = new FacetQuery("locality").maxValuesPerFacet(10);
fq.field("country").isEqual("gb");
FacetResponse resp = factual.fetch("places", fq);
but instead of the total for each facet value I would like to see a random object with all of its information.
Is anything like this possible?

Confused by the number of hits received for a search query through Twitter4j?

Below is the code that I use for searching,
Query query = new Query( queryStr );
query.setCount( 100 );
QueryResult result = twitter.search( query );
List<Status> tweets = result.getTweets();
I issued two queries: Q1 = "twitter text research" and Q2 = "twitter research", and the code above returned 100 and 65 results for the two queries respectively. Since all the results for Q1 should also be valid for Q2 (search results are matched by keyword), why does Q2 return only 65 hits? What's happening here?
I get more than 100 results from the two queries, but just like you I get fewer results with the longer query.
Keep this in mind from the Search API documentation
Before getting involved, it’s important to know that the Search API is focused on relevance and not completeness. This means that some Tweets and users may be missing from search results. If you want to match for completeness you should consider using a Streaming API instead.
For Twitter, adding "text" to your query just gives you fewer of the relevant results. The Search API isn't about completeness.
Edit
If you want to get more than 100 results you need to move to the next "page"
Query query = new Query( queryStr );
query.setCount( 100 );
QueryResult result;
do {
    result = twitter.search(query);
    List<Status> tweets = result.getTweets();
    for (Status tweet : tweets) {
        System.out.println("#" + tweet.getUser().getScreenName() + " - " + tweet.getText());
    }
} while ((query = result.nextQuery()) != null);
If you only want to know how many tweets you receive, you need to call
System.out.println(result.getCount());

ZF2 paginator performance with fetchAll without limit?

In most cases in ZF2 I would build a paginator like this:
public function fetchAll()
{
    $resultSet = $this->tableGateway->select();
    return $resultSet;
}

$iteratorAdapter = new \Zend\Paginator\Adapter\Iterator($this->getSampleTable()->fetchAll());
$paginator = new \Zend\Paginator\Paginator($iteratorAdapter);
The problem with this approach is that it does not limit the query result: fetchAll returns all the rows, producing heavy traffic between the DB and PHP.
In plain PHP I would do pagination like this:
$PAGE_NUM = $_GET['page_num'];
$result_num = mysql_query("SELECT * FROM table WHERE some condition");
$result = mysql_query("SELECT * FROM table WHERE some condition LIMIT $PAGE_NUM, 20");
echo do_paginator($result_num, 20, $PAGE_NUM); // THIS JUST MAKES THE CLASSIC < 1 2 3 4 5 >
The advantage of this is that I fetch from the DB only the data I need for the current page, thanks to MySQL's LIMIT clause. This translates into good performance for queries that would otherwise return too many records.
Can I use the ZF2 paginator with an effective limit on the results, or do I need to write my own paginator?
Use the Zend\Paginator\Adapter\DbSelect adapter instead; it does exactly what you're asking for, applying LIMIT and OFFSET to the query you give it and fetching only those records.
Additionally this adapter does not fetch all records from the database in order to count them. Instead, the adapter manipulates the original query to produce a corresponding COUNT query. Paginator then executes that COUNT query to get the number of rows. This does require an extra round-trip to the database, but this is many times faster than fetching an entire result set and using count(), especially with large collections of data.

ASP.NET MVC3 - Performance improvement through a paging concept, I need an example

I am working on application built on ASP.NET MVC 3.0 and displaying the data in MVC WebGrid.
I am using LINQ to get the records from the Entities into an EntityViewModel; in doing this I have to convert each record from the entity to the EntityViewModel.
I have 30K records to display in the grid, and for each record there are 3 flags that require visiting 3 other tables, checking whether the record exists there, and marking it true or false in the grid.
I am displaying 10 records at a time, but it is very slow because I am fetching all the records and storing them in my application.
Paging is in place (I mean to say only 10 records are displayed in the WebGrid), but all the records are loaded into the application, which takes 15-20 seconds. I have checked where this time is spent by the processor: it's in the "painting" step (where every record is compared against the 3 other tables).
I have converted the LINQ query to SQL and I can see my SQL query executes in under 2 seconds. Because of this I can confidently say I do not want to spend time on SQL indexing, as the speed of the SQL query is good enough.
I have two options to implement:
1) Caching for MVC
2) Paging (where I fetch only the first ten records).
I want to go with the paging technique for the performance improvement.
Now my question is: how do I pass the number 10 (the number of records) to the service method so that it brings back only ten records? And how do I get the next 10 records when the next page is clicked?
I would post the code, but I cannot because it contains some sensitive data.
Any example of how to tackle this situation would be much appreciated, many thanks.
If you're using SQL Server 2005+ you could use ROW_NUMBER() in your stored procedure:
http://msdn.microsoft.com/en-us/library/ms186734(v=SQL.90).aspx
or else if you just want to do it in LINQ try the Skip() and Take() methods.
As simple as:
int page = 2;
int pageSize = 10;
var pagedStuff = query.Skip((page - 1) * pageSize).Take(pageSize);
You should always, always, always be limiting the number of rows you get from the database. Unbounded reads kill applications. 30k turns into 300k and then you are just destroying your SQL Server.
Jfar is on the right track with .Skip and .Take. The Linq2Sql engine (and most entity frameworks) will convert this to SQL that returns a limited result set. However, this doesn't preclude caching the results as well; I recommend doing that too. The fastest trip to SQL Server is the one you don't have to take. :) I do something like this, where my controller method handles paged or un-paged results and caches whatever comes back from SQL:
[AcceptVerbs("GET")]
[OutputCache(Duration = 360, VaryByParam = "*")]
public ActionResult GetRecords(int? page, int? items)
{
int limit = items ?? defaultItemsPerPage;
int pageNum = page ?? 0;
if (pageNum <= 0) { pageNum = 1; }
ViewBag.Paged = (page != null);
var records = null;
if (page != null)
{
records = myEntities.Skip((pageNum - 1) * limit).Take(limit).ToList();
}
else
{
records = myEntities.ToList();
}
return View("GetRecords", records);
}
If you call it with no params, you get the entire result set (/GetRecords). Calling it with params will get you the restricted set (/GetRecords?page=3&items=25).
You could extend this method further by adding .Contains and .StartsWith functionality.
If you do decide to go the custom stored procedure route, I'd recommend using "TOP" and "ROW_NUMBER" to restrict results rather than a temp table.
Personally I would create a custom stored procedure to do this and then call it through Linq to SQL. e.g.
CREATE PROCEDURE [dbo].[SearchData]
(
    @SearchStr NVARCHAR(50),
    @Page int = 1,
    @RecsPerPage int = 50,
    @rc int OUTPUT
)
AS
SET NOCOUNT ON
SET FMTONLY OFF

DECLARE @TempFound TABLE
(
    UID int IDENTITY NOT NULL,
    PersonId UNIQUEIDENTIFIER
)

INSERT INTO @TempFound
(
    PersonId
)
SELECT PersonId FROM People WHERE Surname LIKE '%' + @SearchStr + '%'

SET @rc = @@ROWCOUNT

-- Calculate the final offset for paging --
DECLARE @FirstRec int, @LastRec int
SELECT @FirstRec = (@Page - 1) * @RecsPerPage
SELECT @LastRec = (@Page * @RecsPerPage + 1)

-- Final select --
SELECT p.* FROM People p INNER JOIN @TempFound tf
    ON p.PersonId = tf.PersonId
WHERE (tf.UID > @FirstRec) AND (tf.UID < @LastRec)
The @rc parameter is the total number of records found.
You obviously have to adapt it to your own table, but it should run extremely fast.
To bind it to an object in LINQ to SQL, you just have to make sure that the final select's fields match the fields of the object it is to be bound to.
