Follow action & Bad URLs

Ask any support / help / issues / problem or question related to TumblingJazz
Moses
Posts: 289
Joined: Fri Feb 28, 2014 11:59 pm

Follow action & Bad URLs

Post by Moses »

The Follow action is pausing on bad URLs, and when I tried redoing the process it still paused on the same bad URL.

Is there a way it could skip those bad URLs on next runs?
Last edited by Moses on Fri May 30, 2014 3:50 pm, edited 1 time in total.
martin@rootjazz
Site Admin
Posts: 34712
Joined: Fri Jan 25, 2013 10:06 pm
Location: The Funk
Contact:

Re: Follow action & Bad URLs

Post by martin@rootjazz »

These are bad URLs; the program is pausing because the error is unexpected. The URL has not been skipped due to previous processing, nor filtered out for any other reason.

I don't think it should be skipped, for the safety of the account.

If you want quicker actions, you will need to filter the list via a 404 checker. If you do not have a 404 checker, I can add that to the feature suggestions list.
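For anyone who wants to pre-filter a list themselves, here is a minimal sketch of such a 404 checker in Python. This is not TumblingJazz or Scrapebox code; the function names are made up for illustration. It sends a HEAD request to each URL concurrently and drops only the ones that come back 404, leaving other errors (timeouts, DNS failures) for the main tool to handle.

```python
# Illustrative 404 pre-filter: keep only URLs that do not return 404.
# Function names (is_live, filter_list) are hypothetical, not part of any tool.
import urllib.request
import urllib.error
from concurrent.futures import ThreadPoolExecutor

def is_live(url, timeout=10):
    """Return False only for a confirmed 404; treat other failures as live
    so the main tool can decide what to do with them."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        urllib.request.urlopen(req, timeout=timeout)
        return True
    except urllib.error.HTTPError as e:
        return e.code != 404
    except urllib.error.URLError:
        return True  # DNS error / timeout: leave the URL in the list

def filter_list(urls, checker=is_live, workers=20):
    """Check URLs concurrently and return the ones that pass, in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        alive = list(pool.map(checker, urls))
    return [u for u, ok in zip(urls, alive) if ok]
```

Running the checks in a thread pool is what makes this faster than checking one URL at a time; raise `workers` with care, since too many parallel requests can get your IP rate-limited.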
Moses
Posts: 289
Joined: Fri Feb 28, 2014 11:59 pm

Re: Follow action & Bad URLs

Post by Moses »

Which 404 checker do you use?

I am using the free Scrapebox link checker, and it looks like it will take all day or more to check my list, which has over 116,000 blog URLs and growing.

I think it would be easier if the app could save those bad URLs and not process them on next runs, OR when scraping users in the SCRAPE tab there should be an option to filter out bad URLs.
martin@rootjazz
Site Admin
Posts: 34712
Joined: Fri Jan 25, 2013 10:06 pm
Location: The Funk
Contact:

Re: Follow action & Bad URLs

Post by martin@rootjazz »

Which 404 checker do you use?
I don't; I checked a couple and they were 404s.
I am using the free Scrapebox link checker, and it looks like it will take all day or more to check my list, which has over 116,000 blog URLs and growing.
all day to check 116 URLs isn't so bad.
I think it would be easier if the app could save those bad URLs and not process them on next runs, OR when scraping users in the SCRAPE tab there should be an option to filter out bad URLs.
Your suggestion has been noted and added to the feature suggestions list. How quickly it gets implemented will depend on demand for the change / feature.
Moses
Posts: 289
Joined: Fri Feb 28, 2014 11:59 pm

Re: Follow action & Bad URLs

Post by Moses »

all day to check 116 URLs isn't so bad.
No, I said 116,000 URLs.

The Scrapebox software started checking the URLs this morning, and now it's the afternoon and the current status is 16,860 / 116,744. So I'm guessing it will take the whole day to check all of them.

I scraped these users in the SCRAPER tab and saved them to a list, so I think the app didn't filter out the 404 URLs when scraping them, or perhaps the users deleted their accounts or were banned since the last time I scraped them.

I believe anyone who uses a list of URLs instead of search terms when Following / Liking will have 404 issues, especially with a big list, since users in that list will delete their accounts or get banned.
martin@rootjazz
Site Admin
Posts: 34712
Joined: Fri Jan 25, 2013 10:06 pm
Location: The Funk
Contact:

Re: Follow action & Bad URLs

Post by martin@rootjazz »

No, i said 116,000 URLs.
Yes, a typo from me; obviously 116 checks isn't going to take all day :)
I scraped these users in the SCRAPER tab and saved them to a list, so I think the app didn't filter out the 404 URLs when scraping them, or perhaps the users deleted their accounts or were banned since the last time I scraped them.
The scrape just pulls what it finds. If Tumblr is showing accounts on the page that do not exist, then they will be scraped.
I believe anyone who uses a list of URLs instead of search terms when Following / Liking will have 404 issues, especially with a big list, since users in that list will delete their accounts or get banned.
The problem is, if the routine auto-verifies, it is going to take as long as Scrapebox. If Scrapebox is taking 24 hours to verify your list, then TJ will take just as long. Whenever those checks are made, whether in one go or as required, those 24 hours of checks have to be made.
Moses
Posts: 289
Joined: Fri Feb 28, 2014 11:59 pm

Re: Follow action & Bad URLs

Post by Moses »

These are bad URLs; the program is pausing because the error is unexpected. The URL has not been skipped due to previous processing, nor filtered out for any other reason.

I don't think it should be skipped, for the safety of the account.
What would happen if they were skipped?
The problem is, if the routine auto-verifies, it is going to take as long as Scrapebox. If Scrapebox is taking 24 hours to verify your list, then TJ will take just as long. Whenever those checks are made, whether in one go or as required, those 24 hours of checks have to be made.
That's true, but I think if the bad URLs were saved on each run and then skipped / ignored on later runs, it wouldn't take as much time; they would be like already-processed URLs.
martin@rootjazz
Site Admin
Posts: 34712
Joined: Fri Jan 25, 2013 10:06 pm
Location: The Funk
Contact:

Re: Follow action & Bad URLs

Post by martin@rootjazz »

Moses wrote:
These are bad URLs; the program is pausing because the error is unexpected. The URL has not been skipped due to previous processing, nor filtered out for any other reason.

I don't think it should be skipped, for the safety of the account.
What would happen if they were skipped?
The problem is, if the routine auto-verifies, it is going to take as long as Scrapebox. If Scrapebox is taking 24 hours to verify your list, then TJ will take just as long. Whenever those checks are made, whether in one go or as required, those 24 hours of checks have to be made.
That's true, but I think if the bad URLs were saved on each run and then skipped / ignored on later runs, it wouldn't take as much time; they would be like already-processed URLs.
I believe I added a check so that once an error occurs, if it was a 404 it is skipped, as that is a known error and can be skipped safely.

As for maintaining lists of all processed items across all actions: I will add it to the features list, but the existing code is not designed to store logs of thousands and thousands of items. The whole system would need to be rewritten to use pooling, to avoid loading large files directly into memory, which is no small update.
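To make the suggestion concrete, here is a minimal sketch of the persist-and-skip idea in Python. This is not TumblingJazz internals; the class and file names are hypothetical. Each 404 is appended to a log file the moment it is found, and the log is re-read line by line (streamed, rather than loaded as one large string) at the start of the next run, so earlier bad URLs can be skipped without re-checking them.

```python
# Hypothetical persistent skip-list for bad URLs, illustrating the
# save-then-skip suggestion from this thread. Names are made up.
import os

class BadUrlLog:
    def __init__(self, path="bad_urls.txt"):
        self.path = path
        self.bad = set()
        if os.path.exists(path):
            # Stream the log one line at a time instead of reading it whole,
            # so a very large log never has to sit in memory as one string.
            with open(path) as f:
                for line in f:
                    self.bad.add(line.strip())

    def record(self, url):
        """Append a newly discovered 404 so future runs will skip it."""
        if url not in self.bad:
            self.bad.add(url)
            with open(self.path, "a") as f:
                f.write(url + "\n")

    def should_skip(self, url):
        return url in self.bad
```

The run loop would then call `should_skip()` before processing each URL and `record()` whenever a 404 comes back; the first pass still pays the full checking cost, but later runs skip the known-dead URLs for free.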
Moses
Posts: 289
Joined: Fri Feb 28, 2014 11:59 pm

Re: Follow action & Bad URLs

Post by Moses »

I believe I added a check so that once an error occurs, if it was a 404 it is skipped, as that is a known error and can be skipped safely.
I just ran some Follow / Like runs and noticed there were failures and drop-outs at the beginning that were paused on instead of skipped as you said.

The failures were 404 errors on Follow, and the drop-outs were on Like. When I checked the drop-out URLs they were all RSS pages (not sure if that's bad or not).

ID: 38320
martin@rootjazz
Site Admin
Posts: 34712
Joined: Fri Jan 25, 2013 10:06 pm
Location: The Funk
Contact:

Re: Follow action & Bad URLs

Post by martin@rootjazz »

Ok, re-added.


Today is not my day...