
Trouble fetching all sources from a collection with /api/sources/sources/ endpoint #872

Open
philbudne opened this issue Nov 26, 2024 · 0 comments


I'm trying to test mc-providers searches against the mcweb API. Doing this requires getting the list of domains from lists of source and collection ids.

The following program:

from typing import Dict, Optional

import mediacloud.api
from helpers import TOKEN

NG_NATIONAL = 38376341 #nigeria national (436)

class MyDirectoryApi(mediacloud.api.DirectoryApi):

    def _query(self, endpoint: str, params: Optional[Dict] = None, method: str = 'GET') -> Dict:
        print(method, endpoint, params)
        return super()._query(endpoint, params, method)

mc_dir = MyDirectoryApi(TOKEN)

offset = 0
ids = {}    # source id -> index at which it was first seen
index = 0   # running position across all batches
while True:
    srcs = mc_dir.source_list(collection_id=NG_NATIONAL, offset=offset)["results"]
    if not srcs:
        break
    dups = []
    for src in srcs:
        id_ = src["id"]
        if id_ in ids:
            dups.append((id_, ids[id_], index)) # srcid, prev index, new index
        else:
            ids[id_] = index
        index += 1
    if dups:
        print(offset, dups)
    offset += len(srcs)

print("sources returned", offset)
print("unique sources returned", len(ids))

returns the expected NUMBER of sources, but some of them are duplicates of earlier entries!

Here is the output; the numbers are the batch offset followed by a list of tuples (source_id, original_position, new_position):

GET sources/sources/ {'limit': 0, 'offset': 0, 'collection_id': 38376341}
GET sources/sources/ {'limit': 0, 'offset': 100, 'collection_id': 38376341}
100 [(295700, 50, 100), (295699, 60, 101), (295697, 53, 102), (295696, 70, 103), (295695, 63, 104), (143765, 94, 106), (295690, 62, 107), (295689, 69, 108), (143654, 96, 109), (295680, 67, 110), (18028, 97, 111), (271658, 98, 113), (295676, 86, 115), (18013, 57, 116), (295675, 77, 117), (295672, 72, 118), (295671, 93, 119), (295670, 76, 120), (18014, 99, 121), (295668, 89, 122), (295667, 88, 123), (295666, 85, 124), (295665, 80, 125), (295664, 78, 126), (295663, 84, 127), (286811, 83, 128), (295669, 75, 133)]
GET sources/sources/ {'limit': 0, 'offset': 200, 'collection_id': 38376341}
200 [(295771, 137, 200), (295772, 140, 201), (295775, 147, 202), (295777, 172, 203), (295778, 190, 204), (295780, 199, 205), (295782, 173, 207)]
GET sources/sources/ {'limit': 0, 'offset': 300, 'collection_id': 38376341}
GET sources/sources/ {'limit': 0, 'offset': 400, 'collection_id': 38376341}
GET sources/sources/ {'limit': 0, 'offset': 436, 'collection_id': 38376341}
sources returned 436
unique sources returned 402

I originally posited that this could happen if the query being paginated didn't sort the rows in a "stable" way, but this doesn't seem to be the case (the output is identical from run to run).
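For what it's worth, run-to-run determinism doesn't necessarily rule that hypothesis out: if the sort key isn't unique and there is no tiebreaker, each paged query can order the tied rows deterministically but differently per offset, which reproduces exactly this pattern (stable output, overlapping pages). A minimal simulation (hypothetical, not the actual Media Cloud backend) sketching that failure mode:

```python
# Hypothetical sketch: rows sorted on a non-unique key with no tiebreaker.
# Each query is deterministic, but the tie order is not consistent across
# LIMIT/OFFSET windows, so pages overlap while output is identical per run.
ROWS = [{"id": i, "key": i % 2} for i in range(6)]
PAGE = 2

def fetch_page(offset: int, limit: int = PAGE) -> list:
    # Tie order flips with the offset, mimicking a backend whose query
    # plan orders equal-key rows differently for different page queries.
    reverse_ties = (offset // PAGE) % 2 == 1
    rows = sorted(ROWS, key=lambda r: r["id"], reverse=reverse_ties)
    rows.sort(key=lambda r: r["key"])  # stable sort: ties keep prior order
    return rows[offset:offset + limit]

seen, total, offset = set(), 0, 0
while True:
    page = fetch_page(offset)
    if not page:
        break
    total += len(page)
    seen.update(r["id"] for r in page)
    offset += len(page)

print("rows returned", total)    # 6, the expected count
print("unique rows", len(seen))  # 4: ids 0 and 5 repeat, 1 and 4 never appear
```

Like the real output above, the total matches the collection size while some rows are duplicated and others are silently dropped.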

The debug output added to the Api object's _query method shows that my loop could have stopped as soon as it received a less-than-full-sized batch, except that the client has no way of knowing the server's page size in advance!
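Until the server-side ordering is fixed, one client-side mitigation is to deduplicate by id while paginating and stop on the first short batch. This is only a sketch (`fetch_all_sources` is a hypothetical helper, and it assumes the server's page size is constant and can be inferred from the first batch); it masks duplicates but cannot recover any rows the pagination skips:

```python
def fetch_all_sources(api, collection_id):
    """Paginate a source list, deduplicating by id client-side.

    Sketch only: assumes a fixed server page size (inferred from the
    first batch) and that source_list() returns a "results" list.
    """
    sources = {}
    offset = 0
    page_size = None
    while True:
        batch = api.source_list(collection_id=collection_id, offset=offset)["results"]
        if not batch:
            break
        if page_size is None:
            page_size = len(batch)  # infer the page size from the first batch
        for src in batch:
            sources.setdefault(src["id"], src)  # keep first occurrence only
        offset += len(batch)
        if len(batch) < page_size:  # a short batch means the last page
            break
    return list(sources.values())
```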
