This page covers the command line options, sample ways to execute TokenCrawler, and instructions about the source code.

Command Line Options:

Options:
   /Token           Specifies the token (can be defined as a regular
                    expression) to be searched for in the crawled files

Input (choose one):
   /File            Specifies the input file that contains
                    the links to crawl (one per line)
   /Url             Specifies the url to be crawled looking for the
                    pattern. If crawling multiple sites, use the /File option.

Other options:
   /Headers         Also search for the pattern in the response headers
                    of the HTTP request (defaults to false)
   /Help            Displays this help text
   /IgnoreCase      Specifies whether the token is matched ignoring
                    case (default is true)
   /MaxResults      The maximum number of result excerpts to show when the
                    pattern is found. 0 means unlimited (defaults to 5)
   /Output          The file to output the results of execution to
   /Verbose         Sets the verbosity level: 0 minimum,
                    1 normal, 2 full (defaults to 1)

Only /Token and either /File or /Url are mandatory.
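
To make the interplay of /Token, /IgnoreCase and /MaxResults more concrete, here is a minimal Python sketch of the behaviour described in the help text. It is not TokenCrawler's actual code: the function name find_excerpts and the context parameter are hypothetical, and Python's re dialect may differ in details from the regex engine the tool uses.

```python
import re


def find_excerpts(text, token, ignore_case=True, max_results=5, context=20):
    """Return excerpts of `text` surrounding each match of the regex `token`.

    Assumed semantics, mirroring the help text above:
      ignore_case  ~ /IgnoreCase (default true)
      max_results  ~ /MaxResults (default 5, 0 means unlimited)
    `context` (characters kept on each side of a match) is a hypothetical
    parameter introduced only for this illustration.
    """
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.compile(token, flags)

    excerpts = []
    for match in pattern.finditer(text):
        # A max_results of 0 is falsy, so the limit is never applied.
        if max_results and len(excerpts) >= max_results:
            break
        start = max(match.start() - context, 0)
        end = min(match.end() + context, len(text))
        excerpts.append(text[start:end])
    return excerpts
```

For instance, find_excerpts(page, "tostring") would report at most five case-insensitive hits, while passing ignore_case=False corresponds to running with /IgnoreCase false.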

Some sample ways to execute:

  • look for "tostring" in http://www.sapo.pt
TokenCrawler.exe /token tostring /url www.sapo.pt
  • search the sites listed in the sites.txt file for those that use the HTML5 doctype (make sure you have a sites.txt file ready)
TokenCrawler.exe /File sites.txt /Token "^<!doctype html>"
  • search for the x-ua-compatible tag in the HTTP headers as well as in the body and js files, for each url listed in the sample .txt file
TokenCrawler.exe /token x-ua-compatible /headers /File TestSites/x-ua-compatible.txt
  • search for sites that use IE9 pinning capabilities.
TokenCrawler.exe /token "(msapplication|pinify)" /File TestSites/ie9pin.txt
  • search for sites that do not define Doctype and start directly with <html> in lower case
TokenCrawler.exe /file sites.txt -token "^<html>" -IgnoreCase false
  • search for the use of the HTML5 canvas element
TokenCrawler.exe /token "(\<canvas|createElement\('canvas'\))" /File TestSites/canvas.txt
  • search for the use of the HTML5 figcaption tag
TokenCrawler.exe /token "<figcaption\b[^>]*>(.*?)</figcaption>" /url www.wikipedia.com
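
Before pointing the tool at a long list of sites, it can help to try a token against a few hand-written snippets. The sketch below does this in Python for some of the sample tokens above. It is only an approximation: the SAMPLE_TOKENS dictionary and the matches helper are hypothetical names, and Python's re module may not behave identically to the regex engine TokenCrawler uses.

```python
import re

# Tokens copied from the sample command lines above.
SAMPLE_TOKENS = {
    "doctype": r"^<!doctype html>",
    "pinning": r"(msapplication|pinify)",
    "canvas": r"(\<canvas|createElement\('canvas'\))",
    "figcaption": r"<figcaption\b[^>]*>(.*?)</figcaption>",
}


def matches(name, text, ignore_case=True):
    """Return True if the named sample token is found anywhere in `text`.

    ignore_case defaults to True, mirroring the documented /IgnoreCase default.
    """
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(SAMPLE_TOKENS[name], text, flags) is not None


if __name__ == "__main__":
    # The doctype token is anchored with ^, so it only matches at the start.
    print(matches("doctype", "<!DOCTYPE html><html></html>"))
    print(matches("doctype", "<html><!doctype html>"))
    # The canvas token covers both markup and script-created canvases.
    print(matches("canvas", "var c = document.createElement('canvas');"))
```

Note that the doctype token relies on case-insensitive matching to catch "<!DOCTYPE html>", which is why the "^<html>" sample turns /IgnoreCase off to find only lower-case tags.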


Source Code

If you download the source code, use NuGet to install the missing references for:
  1. HTMLAgilityPack
  2. Plossum CommandLine

Last edited Dec 21, 2011 at 3:33 PM by tiagonmas, version 31
