
Trouble fetching all sources from a collection with /api/sources/sources/ endpoint #872

Open
philbudne opened this issue Nov 26, 2024 · 0 comments


I'm trying to test mc-providers searches against the mcweb API. Doing this requires getting the list of domains from lists of source and collection ids.

The following program:

from typing import Dict, Optional

import mediacloud.api
from helpers import TOKEN

NG_NATIONAL = 38376341 #nigeria national (436)

class MyDirectoryApi(mediacloud.api.DirectoryApi):

    def _query(self, endpoint: str, params: Optional[Dict] = None, method: str = 'GET') -> Dict:
        print(method, endpoint, params)
        return super()._query(endpoint, params, method)

mc_dir = MyDirectoryApi(TOKEN)

offset = 0
ids = {}    # source id -> index at which it was first seen
index = 0   # running position across all batches
while True:
    srcs = mc_dir.source_list(collection_id=NG_NATIONAL, offset=offset)["results"]
    if not srcs:
        break
    dups = []
    for src in srcs:
        id_ = src["id"]
        if id_ in ids:
            dups.append((id_, ids[id_], index)) # srcid, prev index, new index
        else:
            ids[id_] = index
        index += 1
    if dups:
        print(offset, dups)
    offset += len(srcs)

print("sources returned", offset)
print("unique sources returned", len(ids))

returns the expected NUMBER of sources, but some of them are duplicates of earlier entries!

Here is the output; the numbers are the batch offset followed by a list of tuples (source_id, original_position, new_position):

GET sources/sources/ {'limit': 0, 'offset': 0, 'collection_id': 38376341}
GET sources/sources/ {'limit': 0, 'offset': 100, 'collection_id': 38376341}
100 [(295700, 50, 100), (295699, 60, 101), (295697, 53, 102), (295696, 70, 103), (295695, 63, 104), (143765, 94, 106), (295690, 62, 107), (295689, 69, 108), (143654, 96, 109), (295680, 67, 110), (18028, 97, 111), (271658, 98, 113), (295676, 86, 115), (18013, 57, 116), (295675, 77, 117), (295672, 72, 118), (295671, 93, 119), (295670, 76, 120), (18014, 99, 121), (295668, 89, 122), (295667, 88, 123), (295666, 85, 124), (295665, 80, 125), (295664, 78, 126), (295663, 84, 127), (286811, 83, 128), (295669, 75, 133)]
GET sources/sources/ {'limit': 0, 'offset': 200, 'collection_id': 38376341}
200 [(295771, 137, 200), (295772, 140, 201), (295775, 147, 202), (295777, 172, 203), (295778, 190, 204), (295780, 199, 205), (295782, 173, 207)]
GET sources/sources/ {'limit': 0, 'offset': 300, 'collection_id': 38376341}
GET sources/sources/ {'limit': 0, 'offset': 400, 'collection_id': 38376341}
GET sources/sources/ {'limit': 0, 'offset': 436, 'collection_id': 38376341}
sources returned 436
unique sources returned 402

I originally posited that this could happen if the query being paginated didn't sort the rows in a "stable" way, but this doesn't seem to be the case (the output is identical from run to run).
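For what it's worth, run-to-run determinism doesn't necessarily rule that hypothesis out: if the sort key isn't unique and there is no tiebreaker, each paged query can order the tied rows deterministically but differently per offset, which reproduces exactly this pattern (stable output, overlapping pages). A minimal simulation (hypothetical, not the actual Media Cloud backend) sketching that failure mode:

```python
# Hypothetical sketch: rows sorted on a non-unique key with no tiebreaker.
# Each query is deterministic, but the tie order is not consistent across
# LIMIT/OFFSET windows, so pages overlap while output is identical per run.
ROWS = [{"id": i, "key": i % 2} for i in range(6)]
PAGE = 2

def fetch_page(offset: int, limit: int = PAGE) -> list:
    # Tie order flips with the offset, mimicking a backend whose query
    # plan orders equal-key rows differently for different page queries.
    reverse_ties = (offset // PAGE) % 2 == 1
    rows = sorted(ROWS, key=lambda r: r["id"], reverse=reverse_ties)
    rows.sort(key=lambda r: r["key"])  # stable sort: ties keep prior order
    return rows[offset:offset + limit]

seen, total, offset = set(), 0, 0
while True:
    page = fetch_page(offset)
    if not page:
        break
    total += len(page)
    seen.update(r["id"] for r in page)
    offset += len(page)

print("rows returned", total)    # 6, the expected count
print("unique rows", len(seen))  # 4: ids 0 and 5 repeat, 1 and 4 never appear
```

Like the real output above, the total matches the collection size while some rows are duplicated and others are silently dropped.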

The debug output added to the Api object's _query method shows that my loop could have stopped as soon as it received a less-than-full-sized batch, except that the client has no way of knowing the server's page size in advance!
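Until the server-side ordering is fixed, one client-side mitigation is to deduplicate by id while paginating and stop on the first short batch. This is only a sketch (`fetch_all_sources` is a hypothetical helper, and it assumes the server's page size is constant and can be inferred from the first batch); it masks duplicates but cannot recover any rows the pagination skips:

```python
def fetch_all_sources(api, collection_id):
    """Paginate a source list, deduplicating by id client-side.

    Sketch only: assumes a fixed server page size (inferred from the
    first batch) and that source_list() returns a "results" list.
    """
    sources = {}
    offset = 0
    page_size = None
    while True:
        batch = api.source_list(collection_id=collection_id, offset=offset)["results"]
        if not batch:
            break
        if page_size is None:
            page_size = len(batch)  # infer the page size from the first batch
        for src in batch:
            sources.setdefault(src["id"], src)  # keep first occurrence only
        offset += len(batch)
        if len(batch) < page_size:  # a short batch means the last page
            break
    return list(sources.values())
```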
