I'm trying to test mc-providers searches against the mcweb API. Doing this requires getting the list of domains from lists of source and collection ids.
The following program:
from typing import Dict, Optional

import mediacloud.api

from helpers import TOKEN

NG_NATIONAL = 38376341 #nigeria national (436)

class MyDirectoryApi(mediacloud.api.DirectoryApi):
    def _query(self, endpoint: str, params: Optional[Dict] = None, method: str = 'GET') -> Dict:
        print(method, endpoint, params)
        return super()._query(endpoint, params, method)

mc_dir = MyDirectoryApi(TOKEN)
offset = 0
ids = {}
index = 0
while True:
    srcs = mc_dir.source_list(collection_id = NG_NATIONAL, offset = offset)["results"]
    if not srcs:
        break
    dups = []
    for src in srcs:
        id_ = src["id"]
        if id_ in ids:
            dups.append((id_, ids[id_], index)) # srcid, prev index, new index
        else:
            ids[id_] = index
        index += 1
    if dups:
        print(offset, dups)
    offset += len(srcs)
print("sources returned", offset)
print("unique sources returned", len(ids))
Returns the expected NUMBER of sources, but some of them are duplicates of previous entries!
Here is the output: the numbers are batch offset followed by a list of tuples: (source_id, original_position, new_position)
I originally posited that this could happen if the query being paginated didn't sort the rows in a "stable" way, but this doesn't seem to be the case (the output is identical from run to run).
Adding debug output to the Api object's _query method shows that my code could quit as soon as it gets back a less-than-full-sized result, but it doesn't know the page size!
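As a possible workaround on the client side, here is a minimal sketch of a pager that infers the page size from the first page, stops on a short or empty page, and skips ids it has already seen. The names iter_unique_sources, fetch_page and fake_fetch are hypothetical and not part of mediacloud.api; fake_fetch is only a stand-in for the real endpoint, with made-up rows that repeat one id across a page boundary.

```python
from typing import Callable, Dict, Iterator, List


def iter_unique_sources(fetch_page: Callable[[int], List[Dict]]) -> Iterator[Dict]:
    """Yield each source once, paging by offset until a short or empty page."""
    seen = set()
    offset = 0
    page_size = None  # not known up front; inferred from the first page
    while True:
        page = fetch_page(offset)
        if not page:
            break
        if page_size is None:
            page_size = len(page)
        for src in page:
            if src["id"] not in seen:
                seen.add(src["id"])
                yield src
        offset += len(page)
        if len(page) < page_size:
            break  # short page: assume it was the last one


# Stand-in for the real API (hypothetical data): id 2 shows up on two pages.
ROWS = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 2}, {"id": 4}]
CALLS = []  # offsets requested, to show that no trailing empty request is made


def fake_fetch(offset: int, limit: int = 2) -> List[Dict]:
    CALLS.append(offset)
    return ROWS[offset:offset + limit]


unique_ids = [s["id"] for s in iter_unique_sources(fake_fetch)]
```

With the real client, fetch_page could be something like lambda off: mc_dir.source_list(collection_id=NG_NATIONAL, offset=off)["results"]. Note that this only hides the symptom: if the total count matches but some rows are duplicates, then presumably some sources are never returned at all, and deduplicating on the client cannot recover them.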