Secondary analysis of the scraping results?

Support / help / discussion forum for twitter bot
Post Reply
hacking
Posts: 57
Joined: Sat Oct 07, 2023 8:09 am

Secondary analysis of the scraping results?

Post by hacking »

Hello Martin:
Using a specific filter, I scraped a batch of followers' names (URLs) for a target account. Over time, the status of users in that list may change; for example, some of them may have since followed me. How can I update the list? Is it possible to scrape based on this list, for example to remove users who already follow me or who have been inactive for a long time?
User avatar
martin@rootjazz
Site Admin
Posts: 34712
Joined: Fri Jan 25, 2013 10:06 pm
Location: The Funk
Contact:

Re: Secondary analysis of the scraping results?

Post by martin@rootjazz »

hacking wrote: Sun Nov 19, 2023 3:37 am Hello Martin:
Using a specific filter, I scraped a batch of followers' names (URLs) for a target account. Over time, the status of users in that list may change; for example, some of them may have since followed me. How can I update the list? Is it possible to scrape based on this list, for example to remove users who already follow me or who have been inactive for a long time?
You would need to run a filter on the list.

Custom search step:
USER ID URL

then set your filepath.

The filter will be applied to the list as input, and the new output will be the profiles that meet your filter.
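
Conceptually, the list-filter pass works something like the following rough Python sketch. This is not the bot's actual code: the file names, the lookup stub, and the follower-count predicate are all placeholder examples.

Code: Select all

# filter_list.py - conceptual sketch of a list-filter pass.
# NOT the bot's actual code: file names, the lookup stub and the
# follower-count predicate are all placeholder examples.

def fetch_profile(user_url: str) -> dict:
    """Stub standing in for the per-user lookup the tool performs."""
    return {"url": user_url, "followers": len(user_url) * 10}  # fake data

def passes_filter(profile: dict) -> bool:
    """Example predicate; in the tool this is whatever filter you configure."""
    return profile["followers"] >= 100

def filter_list(src_path: str, out_path: str) -> None:
    # Input: one profile URL per line. Output: only the URLs that match.
    with open(src_path) as src, open(out_path, "w") as out:
        for line in src:
            url = line.strip()
            if not url:
                continue
            if passes_filter(fetch_profile(url)):
                out.write(url + "\n")

if __name__ == "__main__":
    filter_list("scraped_users.txt", "filtered_users.txt")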
hacking
Posts: 57
Joined: Sat Oct 07, 2023 8:09 am

Re: Secondary analysis of the scraping results?

Post by hacking »

Hello Martin,

Following the provided instructions, I attempted to perform the operation, but unfortunately, no new result file was generated. Here is the specific situation:

Initially, I conducted a scrape without any filtering conditions for a particular user, gathering data on 100,000 followers.
Subsequently, following the given instructions, I attempted a secondary scrape. However, the task appears to be continuously running without producing any output, even after waiting for several hours.
I would be grateful for any guidance or solution you can provide to address this matter.

Thank you for your time and assistance.

Best regards
User avatar
martin@rootjazz
Site Admin
Posts: 34712
Joined: Fri Jan 25, 2013 10:06 pm
Location: The Funk
Contact:

Re: Secondary analysis of the scraping results?

Post by martin@rootjazz »

Look at your logs to see what is happening.

Are all the results being ignored? Then your filter is too strict - or set up wrongly (or there is a bug)
hacking
Posts: 57
Joined: Sat Oct 07, 2023 8:09 am

Re: Secondary analysis of the scraping results?

Post by hacking »

The criteria for the secondary scrape are not stringent; it simply keeps users who have been active within the last 100 days.
The log file is: logs_84650
User avatar
martin@rootjazz
Site Admin
Posts: 34712
Joined: Fri Jan 25, 2013 10:06 pm
Location: The Funk
Contact:

Re: Secondary analysis of the scraping results?

Post by martin@rootjazz »

OK, so the filter on a LIST doesn't save as it processes; it saves at the end of the action. From your logs, the action just hadn't completed.
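
In other words, the behaviour is roughly this (an illustrative sketch of the assumed file handling only, not the actual implementation):

Code: Select all

# Why no output file appears mid-run: results of a LIST filter are
# collected in memory and written once, when the whole action finishes.
# (Illustrative sketch of the assumed behaviour, not the bot's code.)

def run_filter_action(urls, predicate, out_path):
    kept = [u for u in urls if predicate(u)]  # can take hours for 100k URLs
    with open(out_path, "w") as out:          # the file is only created here,
        out.write("\n".join(kept))            # after the loop has finished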




Regards,
Martin
hacking
Posts: 57
Joined: Sat Oct 07, 2023 8:09 am

Re: Secondary analysis of the scraping results?

Post by hacking »

Are the two features ("Global unique results for search regardless of account / filters used" and "Resume search position on repeats / restarts") effective for the secondary scrape? I noticed that it is indeed necessary to wait for the scrape to complete before saving the results. In practice, the initial scrape yields a significant amount of data, often tens of thousands of entries. The secondary scrape of this data takes a considerable amount of time, with several days passing without any progress, leading me to believe that there might be an error in the program.

Explanation for why I am doing this: My scraping operation is, in fact, quite simple, and the filters are not complex. I just want to fetch the followers of a specific account who have been active within the last 100 days (or any other specified period). Originally, this task could be completed in one go. However, I found that when scraping accounts with a large number of followers (several hundred thousand), it is challenging to obtain a clean scrape due to Twitter's daily limits, program interruptions, and the impact of restarting the task. Therefore, I decided to scrape all the follower data for the target account first and then analyze it gradually as static data.
User avatar
martin@rootjazz
Site Admin
Posts: 34712
Joined: Fri Jan 25, 2013 10:06 pm
Location: The Funk
Contact:

Re: Secondary analysis of the scraping results?

Post by martin@rootjazz »

hacking wrote: Mon Feb 12, 2024 6:30 am Are the two features ("Global unique results for search regardless of account / filters used" and "Resume search position on repeats / restarts") effective for the secondary scrape?
I'll have to check, I don't remember. I think it is global per search.

Resume search position isn't going to work on files; it works with a Twitter ID for the search page.

hacking wrote: I noticed that it is indeed necessary to wait for the scrape to complete before saving the results.
Did you see the link above? It should save after every 10 results found.

hacking wrote: In practice, the initial scrape yields a significant amount of data, often tens of thousands of entries. The secondary scrape of this data takes a considerable amount of time, with several days passing without any progress, leading me to believe that there might be an error in the program.
The more data there is to filter, the longer it takes. Look at the logs; they will tell you what's going on.

hacking wrote: My scraping operation is, in fact, quite simple, and the filters are not complex. I just want to fetch the followers of a specific account who have been active within the last 100 days (or any other specified period).
To find out the active date of an account, the program must make multiple requests per profile to find the most recent tweet / like / retweet (along with user_details). So if you have 100k results, 400k new requests are made. It takes time.
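
To get a rough sense of the timescale that implies, here is a back-of-envelope sketch. The 1 request/second rate is purely an assumption; real throughput depends on Twitter's limits and your accounts.

Code: Select all

# Back-of-envelope for the request volume described above (assumed numbers).
profiles = 100_000
requests_per_profile = 4          # last tweet, like, retweet + user_details
total_requests = profiles * requests_per_profile    # 400,000

seconds_per_request = 1.0         # assumption; depends on limits/accounts
hours = total_requests * seconds_per_request / 3600
print(f"{total_requests:,} requests ~ {hours:.0f} hours at 1 req/sec")
# -> 400,000 requests ~ 111 hours at 1 req/sec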
Post Reply