Secondary analysis of the scraping results?

Support / help / discussion forum for twitter bot
Post Reply
hacking
Posts: 57
Joined: Sat Oct 07, 2023 8:09 am

Secondary analysis of the scraping results?

Post by hacking »

Hello Martin:
Using a specific filter, I scraped a batch of followers' names (URLs) for a target account. Over time, the status of users in that list may change; for example, some of them may have since followed me. How can I update the list? Is it possible to scrape based on this list, for example to remove users who already follow me or who have been inactive for a long time?
User avatar
martin@rootjazz
Site Admin
Posts: 34712
Joined: Fri Jan 25, 2013 10:06 pm
Location: The Funk
Contact:

Re: Secondary analysis of the scraping results?

Post by martin@rootjazz »

hacking wrote: Sun Nov 19, 2023 3:37 am Hello Martin:
Using a specific filter, I scraped a batch of followers' names (URLs) for a target account. Over time, the status of users in that list may change; for example, some of them may have since followed me. How can I update the list? Is it possible to scrape based on this list, for example to remove users who already follow me or who have been inactive for a long time?
You would need to run a filter on the list.

Custom search step:
USER ID URL

then set your filepath.

The filter will be applied to the list as input, and the new output will be the profiles that meet your filter.
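
Conceptually, the list-filter pass works something like the following rough Python sketch. This is not the bot's actual code: the file names, the lookup stub, and the follower-count predicate are all placeholder examples.

Code: Select all

# filter_list.py - conceptual sketch of a list-filter pass.
# NOT the bot's actual code: file names, the lookup stub and the
# follower-count predicate are all placeholder examples.

def fetch_profile(user_url: str) -> dict:
    """Stub standing in for the per-user lookup the tool performs."""
    return {"url": user_url, "followers": len(user_url) * 10}  # fake data

def passes_filter(profile: dict) -> bool:
    """Example predicate; in the tool this is whatever filter you configure."""
    return profile["followers"] >= 100

def filter_list(src_path: str, out_path: str) -> None:
    # Input: one profile URL per line. Output: only the URLs that match.
    with open(src_path) as src, open(out_path, "w") as out:
        for line in src:
            url = line.strip()
            if not url:
                continue
            if passes_filter(fetch_profile(url)):
                out.write(url + "\n")

if __name__ == "__main__":
    filter_list("scraped_users.txt", "filtered_users.txt")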
hacking
Posts: 57
Joined: Sat Oct 07, 2023 8:09 am

Re: Secondary analysis of the scraping results?

Post by hacking »

Hello Martin,

Following the provided instructions, I attempted to perform the operation, but unfortunately, no new result file was generated. Here is the specific situation:

Initially, I conducted a scrape without any filtering conditions for a particular user, gathering data on 100,000 followers.
Subsequently, following the given instructions, I attempted a secondary scrape. However, the task appears to be continuously running without producing any output, even after waiting for several hours.
I would be grateful for any guidance or solution you can provide to address this matter.

Thank you for your time and assistance.

Best regards
User avatar
martin@rootjazz
Site Admin
Posts: 34712
Joined: Fri Jan 25, 2013 10:06 pm
Location: The Funk
Contact:

Re: Secondary analysis of the scraping results?

Post by martin@rootjazz »

Look at your logs to see what is happening.

Are all the results being ignored? Then your filter is too strict - or set up wrongly (or there is a bug)
hacking
Posts: 57
Joined: Sat Oct 07, 2023 8:09 am

Re: Secondary analysis of the scraping results?

Post by hacking »

The criteria for the secondary scrape are not stringent; it simply keeps users who have been active within the last 100 days.
The log file is: logs_84650
User avatar
martin@rootjazz
Site Admin
Posts: 34712
Joined: Fri Jan 25, 2013 10:06 pm
Location: The Funk
Contact:

Re: Secondary analysis of the scraping results?

Post by martin@rootjazz »

OK, so the filter on a LIST doesn't save as it processes; it saves at the end of the action. From your logs, the action just hadn't completed.
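
In other words, the behaviour is roughly this (an illustrative sketch of the assumed file handling only, not the actual implementation):

Code: Select all

# Why no output file appears mid-run: results of a LIST filter are
# collected in memory and written once, when the whole action finishes.
# (Illustrative sketch of the assumed behaviour, not the bot's code.)

def run_filter_action(urls, predicate, out_path):
    kept = [u for u in urls if predicate(u)]  # can take hours for 100k URLs
    with open(out_path, "w") as out:          # the file is only created here,
        out.write("\n".join(kept))            # after the loop has finished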




Regards,
Martin
hacking
Posts: 57
Joined: Sat Oct 07, 2023 8:09 am

Re: Secondary analysis of the scraping results?

Post by hacking »

Are the two features ("Global unique results for search regardless of account / filters used" and "Resume search position on repeats / restarts") effective for the secondary scrape? I noticed that it is indeed necessary to wait for the scrape to complete before saving the results. In practice, the initial scrape yields a significant amount of data, often tens of thousands of entries. The secondary scrape of this data takes a considerable amount of time, with several days passing without any progress, leading me to believe that there might be an error in the program.

Explanation for why I am doing this: My scraping operation is, in fact, quite simple, and the filters are not complex. I just want to fetch the followers of a specific account who have been active within the last 100 days (or any other specified period). Originally, this task could be completed in one go. However, I found that when scraping accounts with a large number of followers (several hundred thousand), it is challenging to obtain a clean scrape due to Twitter's daily limits, program interruptions, and the impact of restarting the task. Therefore, I decided to scrape all the follower data for the target account first and then analyze it gradually as static data.
User avatar
martin@rootjazz
Site Admin
Posts: 34712
Joined: Fri Jan 25, 2013 10:06 pm
Location: The Funk
Contact:

Re: Secondary analysis of the scraping results?

Post by martin@rootjazz »

hacking wrote: Mon Feb 12, 2024 6:30 am Are the two features ("Global unique results for search regardless of account / filters used" and "Resume search position on repeats / restarts") effective for the secondary scrape?
I'll have to check, I don't remember. I think it is global per search.

Resume search position isn't going to work on files; it works with a Twitter ID for the search page.

hacking wrote: I noticed that it is indeed necessary to wait for the scrape to complete before saving the results.
Did you see the link above? It should save after every 10 results found.

hacking wrote: In practice, the initial scrape yields a significant amount of data, often tens of thousands of entries. The secondary scrape of this data takes a considerable amount of time, with several days passing without any progress, leading me to believe that there might be an error in the program.
The more data there is to filter, the longer it takes. Look at the logs; they will tell you what's going on.

hacking wrote: My scraping operation is, in fact, quite simple, and the filters are not complex. I just want to fetch the followers of a specific account who have been active within the last 100 days (or any other specified period).
To find out the active date of an account, the program must make multiple requests per profile to find the most recent tweet / like / retweet (along with user_details). So if you have 100k results, 400k new requests are made. It takes time.
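
To get a rough sense of the timescale that implies, here is a back-of-envelope sketch. The 1 request/second rate is purely an assumption; real throughput depends on Twitter's limits and your accounts.

Code: Select all

# Back-of-envelope for the request volume described above (assumed numbers).
profiles = 100_000
requests_per_profile = 4          # last tweet, like, retweet + user_details
total_requests = profiles * requests_per_profile    # 400,000

seconds_per_request = 1.0         # assumption; depends on limits/accounts
hours = total_requests * seconds_per_request / 3600
print(f"{total_requests:,} requests ~ {hours:.0f} hours at 1 req/sec")
# -> 400,000 requests ~ 111 hours at 1 req/sec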
Post Reply