@quekky wrote:
There are some sites that have the links deep inside the site. After calling the main page, I get a list of pages that contain the links. How do I crawl the 2nd page?
I tried this, but it gives me the error "unknown url type: '{{url}}'":
```yaml
tasks:
  feed:
    rss: http://somewordpresssite.com/feed/
    accept_all: yes
    exec: echo "Got wordpress page {{title}} - {{url}}"
    template: crawlpage

templates:
  crawlpage:
    html: "{{url}}"
    regexp:
      accept:
        - sometext
      from: title
    exec:
      - echo "Got link {{title}} - {{url}}"
      - my_own_script.sh "{{url}}"
```
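If I understand the error, the `{{url}}` placeholder reaches the downloader unrendered, so it is treated as a literal URL with no scheme. Plain Python's urllib raises the same message for such a string (this is just my diagnostic sketch, not FlexGet code):

```python
from urllib.request import urlopen

# Passing the unrendered template string as a URL reproduces the message:
try:
    urlopen("{{url}}")
except ValueError as err:
    print(err)  # unknown url type: '{{url}}'
```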
Here is another config I tried:
```yaml
tasks:
  feed:
    html: https://somesite.com/
    accept_all: yes
    exec: echo "Got page {{title}} - {{url}}"
    list_add:
      - entry_list: pages

  crawlpage:
    entry_list: pages
    html: "{{url}}"
    regexp:
      accept:
        - sometext
      from: title
    exec:
      - echo "Got link {{title}} - {{url}}"
      - my_own_script.sh "{{url}}"
```
There are a few sites I want to process that work in a similar way. Some of them have links 3 or 4 levels deep.
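To make clear what I am trying to do, here is a rough Python sketch of the crawl itself (the helper names, the fake pages, and the `fetch` callable are all made up for illustration, not FlexGet internals):

```python
import re

def extract_links(html):
    # Naive href extraction; a real crawler would use an HTML parser.
    return re.findall(r'href="([^"]+)"', html)

def crawl(url, fetch, want, depth):
    """Follow links `depth` levels down, returning those matching `want`."""
    links = extract_links(fetch(url))
    if depth == 1:
        return [link for link in links if re.search(want, link)]
    found = []
    for link in links:
        found += crawl(link, fetch, want, depth - 1)
    return found

# Fake two-level site standing in for the real pages (names invented):
pages = {
    "http://example.test/index": '<a href="http://example.test/page1">list</a>',
    "http://example.test/page1": '<a href="http://example.test/sometext-file">get</a>',
}
print(crawl("http://example.test/index", pages.get, "sometext", depth=2))
# -> ['http://example.test/sometext-file']
```

Replacing `depth=2` with 3 or 4 is the deeper case I mention below.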
Posts: 2
Participants: 2