httrack - copy a web site to a local directory.
SYNOPSIS
httrack [ url ]... [ -filter ]... [ +filter ]... [ -O, --path ] [ -w, --mirror ] [ -W, --mirror-wizard ]
[ -g, --get-files ] [ -i, --continue ] [ -Y, --mirrorlinks ] [ -P, --proxy ] [ -%f, --httpproxy-ftp[=N] ]
[ -%b, --bind ] [ -rN, --depth[=N] ] [ -%eN, --ext-depth[=N] ] [ -mN, --max-files[=N] ] [ -MN, --max-size[=N] ]
[ -EN, --max-time[=N] ] [ -AN, --max-rate[=N] ] [ -%cN, --connection-per-second[=N] ] [ -GN, --max-pause[=N] ]
[ -cN, --sockets[=N] ] [ -TN, --timeout[=N] ] [ -RN, --retries[=N] ] [ -JN, --min-rate[=N] ]
[ -HN, --host-control[=N] ] [ -%P, --extended-parsing[=N] ] [ -n, --near ] [ -t, --test ] [ -%L, --list ]
[ -%S, --urllist ] [ -NN, --structure[=N] ] [ -%D, --cached-delayed-type-check ] [ -%M, --mime-html ]
[ -LN, --long-names[=N] ] [ -KN, --keep-links[=N] ] [ -x, --replace-external ] [ -%x, --disable-passwords ]
[ -%q, --include-query-string ] [ -o, --generate-errors ] [ -X, --purge-old[=N] ] [ -%p, --preserve ]
[ -%T, --utf8-conversion ] [ -bN, --cookies[=N] ] [ -u, --check-type[=N] ] [ -j, --parse-java[=N] ]
[ -sN, --robots[=N] ] [ -%h, --http-10 ] [ -%k, --keep-alive ] [ -%B, --tolerant ] [ -%s, --updatehack ]
[ -%u, --urlhack ] [ -%A, --assume ] [ -@iN, --protocol[=N] ] [ -%w, --disable-module ] [ -F, --user-agent ]
[ -%R, --referer ] [ -%E, --from ] [ -%F, --footer ] [ -%l, --language ] [ -%a, --accept ] [ -%X, --headers ]
[ -C, --cache[=N] ] [ -k, --store-all-in-cache ] [ -%n, --do-not-recatch ] [ -%v, --display ]
[ -Q, --do-not-log ] [ -q, --quiet ] [ -z, --extra-log ] [ -Z, --debug-log ] [ -v, --verbose ]
[ -f, --file-log ] [ -f2, --single-log ] [ -I, --index ] [ -%i, --build-top-index ] [ -%I, --search-index ]
[ -pN, --priority[=N] ] [ -S, --stay-on-same-dir ] [ -D, --can-go-down ] [ -U, --can-go-up ]
[ -B, --can-go-up-and-down ] [ -a, --stay-on-same-address ] [ -d, --stay-on-same-domain ]
[ -l, --stay-on-same-tld ] [ -e, --go-everywhere ] [ -%H, --debug-headers ] [ -%!, --disable-security-limits ]
[ -V, --userdef-cmd ] [ -%W, --callback ] [ -K, --keep-links[=N] ]
DESCRIPTION
httrack downloads a web site from the Internet to a local directory, recursively building all directories and getting the HTML, images, and other files from the server to your computer.
HTTrack arranges the original site's relative link structure, so you only need to open a page of the "mirrored" website in your browser and you can browse the site from link to link, just as if you were viewing it online.
HTTrack can also update an existing mirrored site and resume interrupted downloads.
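A minimal invocation, in which the site URL and the destination directory are placeholders for your own values, could look like this:
       httrack "http://www.example.com/" -O /tmp/example-mirror -v
This mirrors the site into /tmp/example-mirror and logs progress on screen (-v).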
OPTIONS
General options:
-O path for mirror/logfiles+cache (-O path mirror[,path cache and logfiles]) (--path <param>)
Action options:
-w *mirror web sites (--mirror)
-W mirror web sites, semi-automatic (asks questions) (--mirror-wizard)
-g just get files (saved in the current directory) (--get-files)
-i continue an interrupted mirror using the cache (--continue)
-Y mirror ALL links located in the first level pages (mirror links) (--mirrorlinks)
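For instance, an interrupted mirror can later be resumed by re-running the original command with -i added (the URL and path are placeholders):
       httrack "http://www.example.com/" -O /tmp/example-mirror -i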
Proxy options:
-P proxy use (-P proxy:port or -P user:pass@proxy:port) (--proxy <param>)
-%f *use proxy for ftp (f0 don't use) (--httpproxy-ftp[=N])
-%b use this local hostname to make/send requests (-%b hostname) (--bind <param>)
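As a sketch, a mirror routed through an authenticating HTTP proxy (the proxy host, credentials, URL and path below are placeholders) could be started with:
       httrack "http://www.example.com/" -O /tmp/example-mirror -P user:pass@proxy.example.net:8080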
Limits options:
-rN set the mirror depth to N (* r9999) (--depth[=N])
-%eN set the external links depth to N (* %e0) (--ext-depth[=N])
-mN maximum file length for a non-html file (--max-files[=N])
-mN,N2 maximum file length for non html (N) and html (N2)
-MN maximum overall size that can be downloaded/scanned (--max-size[=N])
-EN maximum mirror time in seconds (60=1 minute, 3600=1 hour) (--max-time[=N])
-AN maximum transfer rate in bytes/second (1000=1KB/s max) (--max-rate[=N])
-%cN maximum number of connections/second (*%c10) (--connection-per-second[=N])
-GN pause transfer if N bytes reached, and wait until lock file is deleted (--max-pause[=N])
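The limits above can be combined; for example (URL and path are placeholders), to mirror at most 3 levels deep, skip external links, cap the total size at roughly 100 MB and throttle to 25 KB/s:
       httrack "http://www.example.com/" -O /tmp/example-mirror -r3 -%e0 -M100000000 -A25000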
Flow control:
-cN number of multiple connections (*c8) (--sockets[=N])
-TN timeout, number of seconds after which a non-responding link is shut down (--timeout[=N])
-RN number of retries, in case of timeout or non-fatal errors (*R1) (--retries[=N])
-JN traffic jam control, minimum transfer rate (bytes/second) tolerated for a link (--min-rate[=N])
-HN host is abandoned if: 0=never, 1=timeout, 2=slow, 3=timeout or slow (--host-control[=N])
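A gentler flow-control profile for a slow or fragile server (the values, URL and path are illustrative) might look like:
       httrack "http://www.example.com/" -O /tmp/example-mirror -c4 -T30 -R2 -H2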
Links options:
-%P *extended parsing, attempt to parse all links, even in unknown tags or Javascript (%P0 don't use) (--extended-parsing[=N])
-n get non-html files 'near' an html file (ex: an image located outside) (--near)
-t test all URLs (even forbidden ones) (--test)
-%L <file> add all URLs located in this text file (one URL per line) (--list <param>)
-%S <file> add all scan rules located in this text file (one scan rule per line) (--urllist <param>)
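For example, a set of start pages could be fed from a text file with -%L, together with -n to also grab images referenced from outside those pages (the file name and paths are placeholders):
       httrack -O /tmp/example-mirror -%L /tmp/start-urls.txt -n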
Build options:
-NN structure type (0 *original structure, 1+: see below) (--structure[=N])
    or user defined structure (-N "%h%p/%n%q.%t")
-%N delayed type check, don't make any link test but wait for files download to start instead (%N0 don't use)
-%D cached delayed type check, don't wait for remote type during updates, to speed up updates (--cached-delayed-type-check)
-%M generate a RFC MIME-encapsulated full-archive (.mht) (--mime-html)
-LN long names (L1 *long names / L0 8-3 conversion / L2 ISO9660 compatible) (--long-names[=N])
-KN keep original links (e.g. http://www.adr/link) (K0 *relative link, K absolute links, K3 absolute URI links, K4 original links, K5 transparent proxy links) (--keep-links[=N])
-x replace external html links by error pages (--replace-external)
-%x do not include any password for external password protected websites (%x0 include) (--disable-passwords)
-%q *include query string for local files (useless, for information purpose only) (%q0 don't include) (--include-query-string)
-o *generate output html file in case of error (404..) (o0 don't generate) (--generate-errors)
-X *purge old files after update (X0 keep delete) (--purge-old[=N])
-%p preserve html files as is (identical to -K4 -%F "") (--preserve)
-%T links conversion to UTF-8 (--utf8-conversion)
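Building on the user-defined structure string shown above, a mirror laid out as host/path/name+query-hash.type, with links converted to UTF-8, could be requested as follows (the URL and destination path are placeholders):
       httrack "http://www.example.com/" -O /tmp/example-mirror -N "%h%p/%n%q.%t" -%T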
Spider options:
-bN accept cookies in cookies.txt (0=do not accept, * 1=accept) (--cookies[=N])
-u check document type if unknown (cgi,asp..) (u0 don't check, * u1 check but /, u2 check always) (--check-type[=N])
-j *parse Java Classes (j0 don't parse, bitmask: |1 parse default, |2 don't parse .class, |4 don't parse .js, |8 don't be aggressive) (--parse-java[=N])
-sN follow robots.txt and meta robots tags (0=never, 1=sometimes, * 2=always, 3=always (even strict rules)) (--robots[=N])
-%h force HTTP/1.0 requests (reduce update features, only for old servers or proxies) (--http-10)
-%k use keep-alive if possible, greatly reducing latency for small files and test requests (%k0 don't use) (--keep-alive)
-%B tolerant requests (accept bogus responses on some servers, but not standard!) (--tolerant)
-%s update hacks: various hacks to limit re-transfers when updating (identical size, bogus response..) (--updatehack)
-%u url hacks: various hacks to limit duplicate URLs (strip //, www.foo.com==foo.com..) (--urlhack)
-%A assume that a type (cgi,asp..) is always linked with a mime type (--assume <param>)
    - can also be used to force a specific file type: --assume foo.cgi=text/html
-@iN internet protocol (0=both ipv6+ipv4, 4=ipv4 only, 6=ipv6 only) (--protocol[=N])
-%w disable a specific external mime module (-%w htsswf -%w htsjava) (--disable-module <param>)
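For example, a crawl that ignores robots.txt and forces .cgi links to be treated as HTML, using the --assume form shown above (URL and path are placeholders):
       httrack "http://www.example.com/" -O /tmp/example-mirror -s0 --assume foo.cgi=text/html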
Browser ID:
-F user-agent field sent in HTTP headers (-F "user-agent name") (--user-agent <param>)
-%R default referer field sent in HTTP headers (--referer <param>)
-%E from email address sent in HTTP headers (--from <param>)
-%F footer string in Html code (-%F "Mirrored [from host %s [file %s [at %s]]]") (--footer <param>)
-%l preferred language (-%l "fr, en, jp, *") (--language <param>)
-%a accepted formats (-%a "text/html,image/png;q=0.9,*/*;q=0.1") (--accept <param>)
-%X additional HTTP header line (-%X "X-Magic: 42") (--headers <param>)
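As an illustration (the user-agent string, URL and path are placeholders), a mirror that identifies itself with a custom user-agent and prefers English pages could be run as:
       httrack "http://www.example.com/" -O /tmp/example-mirror -F "Mozilla/5.0 (compatible; MyMirror)" -%l "en, *"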
Log, index, cache:
-C create/use a cache for updates and retries (C0 no cache, C1 cache is prioritary, * C2 test update before) (--cache[=N])
-k store all files in cache (not useful if files on disk) (--store-all-in-cache)
-%n do not re-download locally erased files (--do-not-recatch)
-%v display on screen filenames downloaded (in realtime) - %v1 short version, %v2 full animation (--display)
-Q no log - quiet mode (--do-not-log)
-q no questions - quiet mode (--quiet)
-z log - extra infos (--extra-log)
-Z log - debug (--debug-log)
-v log on screen (--verbose)
-f *log in files (--file-log)
-f2 one single log file (--single-log)
-I *make an index (I0 don't make) (--index)
-%i make a top index for a project folder (* %i0 don't make) (--build-top-index)
-%I make a searchable index for this mirror (* %I0 don't make) (--search-index)
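For example, an update run that reuses the existing cache, suppresses questions and writes extra information to the log files (URL and path are placeholders):
       httrack "http://www.example.com/" -O /tmp/example-mirror -i -C2 -q -z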
Expert options:
-pN priority mode: (* p3) (--priority[=N])
-p0 just scan, don't save anything (for checking links)
-p1 save only html files
-p2 save only non html files
-*p3 save all files
-p7 get html files before, then treat other files
-S stay on the same directory (--stay-on-same-dir)
-D *can only go down into subdirs (--can-go-down)
-U can only go to upper directories (--can-go-up)
-B can both go up&down into the directory structure (--can-go-up-and-down)
-a *stay on the same address (--stay-on-same-address)
-d stay on the same principal domain (--stay-on-same-domain)
-l stay on the same TLD (eg: .com) (--stay-on-same-tld)
-e go everywhere on the web (--go-everywhere)
-%H debug HTTP headers in logfile (--debug-headers)
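A link-checking pass that saves nothing but follows links across the whole principal domain (URL and path are placeholders) could look like:
       httrack "http://www.example.com/" -O /tmp/example-check -p0 -d -v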
Guru options: (do NOT use if possible)
-#X *use optimized engine (limited memory boundary checks) (--fast-engine)
-#0 filter test (-#0 *.gif www.bar.com/foo.gif) (--debug-testfilters <param>)
-#1 simplify test (-#1 ./foo/bar/../foobar)
-#2 type test (-#2 /foo/bar.php)
-#C cache list (-#C *.com/spider*.gif) (--debug-cache <param>)
-#R cache repair (damaged cache) (--repair-cache)
-#d debug parser (--debug-parsing)
-#E extract new.zip cache meta-data in meta.zip
-#f always flush log files (--advanced-flushlogs)
-#FN maximum number of filters (--advanced-maxfilters[=N])
-#h version info (--version)
-#K scan stdin (debug) (--debug-scanstdin)
-#L maximum number of links (-#L1000000) (--advanced-maxlinks[=N])
-#p display ugly progress information (--advanced-progressinfo)
-#P catch URL (--catch-url)
-#R old FTP routines (debug) (--repair-cache)
-#T generate transfer ops. log every minute (--debug-xfrstats)
-#u wait time (--advanced-wait)
-#Z generate transfer rate statistics every minute (--debug-ratestats)
Dangerous options: (do NOT use unless you exactly know what you are doing)
-%! bypass built-in security limits aimed to avoid bandwidth abuses (bandwidth, simultaneous connections) (--disable-security-limits)
    IMPORTANT: USE IT WITH EXTREME CARE
Command-line specific options:
-V execute system command after each file ($0 is the filename: -V "rm \$0") (--userdef-cmd <param>)
-%W use an external library function as a wrapper (-%W myfoo.so[,myparameters]) (--callback <param>)
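A sketch of -V in practice (the checksum command and paths are illustrative, not part of HTTrack itself), recording an MD5 sum of every file as it is saved:
       httrack "http://www.example.com/" -O /tmp/example-mirror -V "md5sum \$0 >> /tmp/mirror-md5.txt"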
Details: Option N
-N0 Site-structure (default)
-N1 HTML in web/, images/other files in web/images/
-N2 HTML in web/HTML, images/other in web/images
-N3 HTML in web/, images/other in web/
-N4 HTML in web/, images/other in web/xxx, where xxx is the file extension (all gif files placed in web/gif, for instance)
-N5 Images/other in web/xxx and HTML in web/HTML
-N99 All files in web/, with random names (gadget !)
-N100 Site-structure, without www.domain.xxx/
-N101 Identical to N1 except that "web" is replaced by the site's name
-N102 Identical to N2 except that "web" is replaced by the site's name
-N103 Identical to N3 except that "web" is replaced by the site's name
-N104 Identical to N4 except that "web" is replaced by the site's name
-N105 Identical to N5 except that "web" is replaced by the site's name
-N199 Identical to N99 except that "web" is replaced by the site's name
-N1001 Identical to N1 except that there is no "web" directory
-N1002 Identical to N2 except that there is no "web" directory
-N1003 Identical to N3 except that there is no "web" directory (option set for g option)
-N1004 Identical to N4 except that there is no "web" directory
-N1005 Identical to N5 except that there is no "web" directory
-N1099 Identical to N99 except that there is no "web" directory
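For instance, structure N1004 (HTML at the top level, other files in per-extension directories, with no "web" directory) could be selected like this (URL and path are placeholders):
       httrack "http://www.example.com/" -O /tmp/example-mirror -N1004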
Details: User-defined option N
%n Name of file without file type (ex: image)
%N Name of file, including file type (ex: image.gif)
%t File type (ex: gif)
%p Path [without ending /] (ex: /someimages)
%h Host name (ex: www.someweb.com)
%M URL MD5 (128 bits, 32 ascii bytes)
%Q query string MD5 (128 bits, 32 ascii bytes)
%k full query string
%r protocol name (ex: http)
%q small query string MD5 (16 bits, 4 ascii bytes)
%[param] param variable in query string
%[param:before:after:empty:notfound] advanced variable extraction
Details: User-defined option N and advanced variable extraction
%[param:before:after:empty:notfound]
-param : parameter name
-before : string to prepend if the parameter was found
-after : string to append if the parameter was found
-notfound : string replacement if the parameter could not be found
-empty : string replacement if the parameter was empty
-all fields, except the first one (the parameter name), can be empty
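A hedged sketch of this mechanism, assuming the site addresses pages with an id= query parameter (the URL, path and parameter name are placeholders): the structure string below inserts a dash and the value of id into the saved file name when that parameter is present, and adds nothing otherwise:
       httrack "http://www.example.com/" -O /tmp/example-mirror -N "%h%p/%n%[id:-::].%t"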
Details: Option K
-K0 foo.cgi?q=45 -> foo4B54.html?q=45 (relative URI, default)
-K -> http://www.foobar.com/folder/foo.cgi?q=45 (absolute URL) (--keep-links[=N])
-K3 -> /folder/foo.cgi?q=45 (absolute URI)
-K4 -> foo.cgi?q=45 (original URL)
-K5 -> http://www.foobar.com/folder/foo4B54.html?q=45 (transparent proxy URL)
Details: Option %W: External callbacks prototypes
see htsdefines.h
FILES
/etc/httrack.conf
The system wide configuration file.
ENVIRONMENT
HOME
Used if you defined in /etc/httrack.conf the line: path ~/websites/#
DIAGNOSTICS
Errors/Warnings
are reported to hts-log.txt by default, or to stderr if the -v option was specified.
LIMITS
These are the principal limits of HTTrack at the moment. Note that we have not heard of any other utility that has solved them.
- Several scripts generating complex filenames may not find them (ex: img.src='image'+a+Mobj.dst+'.gif')
- Some java classes may not find some of the files referenced in them (class files included)
- Cgi-bin links may not work properly in some cases (parameters needed). To avoid them, use filters like -*cgi-bin*
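For example (the URL and path are placeholders), such links can be excluded with a scan-rule filter on the command line, quoted to protect it from shell globbing:
       httrack "http://www.example.com/" -O /tmp/example-mirror "-*cgi-bin*"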
REFERENCES
- man 1 httrack, version 3.49-2
- FAQ: http://www.httrack.com/html/faq.html
- HTML documentation: http://www.httrack.com/html