wget with recursive, overwrite & erobots off

About writing shell scripts and making the most of your shell
Jator
Level 2
Posts: 80
Joined: Sat Mar 13, 2021 10:58 am

wget with recursive, overwrite & erobots off

Post by Jator »

Sorry if this is a redundant question. I've googled many phrases trying to find out how to do the following:

use wget to download a series of pages off my local server. I know the following options:

- -r = recursive download of the web pages behind the index.html file
- -O = overwrite the previous files; if not present, wget errors out of the download
- -e robots=off = ignore robots.txt from my website so it can download the files

Here's the variations of the command I use:

Code: Select all

wget -r O -erobots=off http://192.168.0.2/stats
wget -rO -erobots=off http://192.168.0.2/stats
wget -r -O -erobots=off http://192.168.0.2/stats


None of them work, however. After the initial download, I typically get a "192.168.0.2/stats: Is a directory" error.

Not sure if the inxi output is needed for this type of request, but after several wrist slaps, I'll include it anyway.

Code: Select all

System:
  Kernel: 5.11.0-22-generic x86_64 bits: 64 compiler: N/A 
  Desktop: Cinnamon 5.0.4 wm: muffin 5.0.1 dm: LightDM 1.30.0 
  Distro: Linux Mint 20.2 Uma base: Ubuntu 20.04 focal 
Machine:
  Type: Convertible System: LENOVO product: 81X2 v: IdeaPad Flex 5 14ARE05 
  serial: <filter> Chassis: type: 31 v: IdeaPad Flex 5 14ARE05 
  serial: <filter> 
  Mobo: LENOVO model: LNVNB161216 v: SDK0J40709 WIN serial: <filter> 
  UEFI: LENOVO v: EECN35WW date: 04/16/2021 
Battery:
  ID-1: BAT0 charge: 40.5 Wh condition: 53.6/52.5 Wh (102%) volts: 12.2/11.6 
  model: LGC L19L3PD6 type: Li-poly serial: <filter> status: Discharging 
  cycles: 11 
CPU:
  Topology: 6-Core model: AMD Ryzen 5 4500U with Radeon Graphics bits: 64 
  type: MCP arch: Zen rev: 1 L2 cache: 3072 KiB 
  flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm 
  bogomips: 28446 
  Speed: 1315 MHz min/max: 1400/2375 MHz boost: enabled Core speeds (MHz): 
  1: 1262 2: 1680 3: 1386 4: 1397 5: 1446 6: 1397 
Graphics:
  Device-1: AMD Renoir vendor: Lenovo driver: amdgpu v: kernel 
  bus ID: 04:00.0 chip ID: 1002:1636 
  Display: x11 server: X.Org 1.20.9 driver: amdgpu,ati 
  unloaded: fbdev,modesetting,vesa resolution: 1920x1080~60Hz 
  OpenGL: renderer: AMD RENOIR (DRM 3.40.0 5.11.0-22-generic LLVM 11.0.0) 
  v: 4.6 Mesa 20.2.6 direct render: Yes 
Audio:
  Device-1: AMD driver: snd_hda_intel v: kernel bus ID: 04:00.1 
  chip ID: 1002:1637 
  Device-2: AMD Raven/Raven2/FireFlight/Renoir Audio Processor 
  vendor: Lenovo driver: N/A bus ID: 04:00.5 chip ID: 1022:15e2 
  Device-3: AMD Family 17h HD Audio vendor: Lenovo driver: snd_hda_intel 
  v: kernel bus ID: 04:00.6 chip ID: 1022:15e3 
  Sound Server: ALSA v: k5.11.0-22-generic 
Network:
  Device-1: Realtek RTL8822CE 802.11ac PCIe Wireless Network Adapter 
  vendor: Lenovo driver: rtw_8822ce v: N/A port: 2000 bus ID: 02:00.0 
  chip ID: 10ec:c822 
  IF: wlp2s0 state: up mac: <filter> 
  IF-ID-1: ppp0 state: unknown speed: N/A duplex: N/A mac: N/A 
Drives:
  Local Storage: total: 465.76 GiB used: 14.03 GiB (3.0%) 
  ID-1: /dev/nvme0n1 vendor: Crucial model: CT500P5SSD8 size: 465.76 GiB 
  speed: 31.6 Gb/s lanes: 4 serial: <filter> rev: P4CR311 scheme: GPT 
Partition:
  ID-1: / size: 456.96 GiB used: 14.02 GiB (3.1%) fs: ext4 
  dev: /dev/nvme0n1p2 
Sensors:
  System Temperatures: cpu: 38.9 C mobo: 38.0 C gpu: amdgpu temp: 38 C 
  Fan Speeds (RPM): N/A 
Repos:
  No active apt repos in: /etc/apt/sources.list 
  Active apt repos in: /etc/apt/sources.list.d/google-chrome.list 
  1: deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main
  Active apt repos in: /etc/apt/sources.list.d/nm-l2tp-network-manager-l2tp-focal.list 
  1: deb http://ppa.launchpad.net/nm-l2tp/network-manager-l2tp/ubuntu focal main
  Active apt repos in: /etc/apt/sources.list.d/official-package-repositories.list 
  1: deb http://packages.linuxmint.com uma main upstream import backport #id:linuxmint_main
  2: deb http://archive.ubuntu.com/ubuntu focal main restricted universe multiverse
  3: deb http://archive.ubuntu.com/ubuntu focal-updates main restricted universe multiverse
  4: deb http://archive.ubuntu.com/ubuntu focal-backports main restricted universe multiverse
  5: deb http://security.ubuntu.com/ubuntu/ focal-security main restricted universe multiverse
  6: deb http://archive.canonical.com/ubuntu/ focal partner
Info:
  Processes: 265 Uptime: 1d 42m Memory: 15.08 GiB used: 2.97 GiB (19.7%) 
  Init: systemd v: 245 runlevel: 5 Compilers: gcc: 9.3.0 alt: 9 Shell: bash 
  v: 5.0.17 running in: gnome-terminal inxi: 3.0.38 
t42
Level 11
Posts: 3717
Joined: Mon Jan 20, 2014 6:48 pm

Re: wget with recursive, overwrite & erobots off

Post by t42 »

Did you try wget -r http://192.168.0.2/stats/* ?
-=t42=-
Jator
Level 2
Posts: 80
Joined: Sat Mar 13, 2021 10:58 am

Re: wget with recursive, overwrite & erobots off

Post by Jator »

Code: Select all

jay@flex5:~/.conky/wget$ wget -r http://192.168.0.2/stats/*
Warning: wildcards not supported in HTTP.
--2021-07-12 10:00:46--  http://192.168.0.2/stats/*
Connecting to 192.168.0.2:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2021-07-12 10:00:46 ERROR 404: Not Found.
I confirmed I'm able to see the web pages via a browser. I also deleted the downloaded files to see whether it was an overwrite permission issue, but I got the same error response.
FreedomTruth
Level 4
Posts: 443
Joined: Fri Sep 23, 2016 10:19 am

Re: wget with recursive, overwrite & erobots off

Post by FreedomTruth »

Jator wrote: Mon Jul 12, 2021 8:54 am Sorry if this is a redundant question. I've googled many phrases trying to find out how to do the following:

use wget to download a series of pages off my local server. I know the following options:

- -r = recursive download of the web pages behind the index.html file
- -O = overwrite the previous files; if not present, wget errors out of the download
- -e robots=off = ignore robots.txt from my website so it can download the files

Here's the variations of the command I use:

Code: Select all

wget -r O -erobots=off http://192.168.0.2/stats
wget -rO -erobots=off http://192.168.0.2/stats
wget -r -O -erobots=off http://192.168.0.2/stats

Code: Select all

wget --help
  -O,  --output-document=FILE      write documents to FILE

man wget

Code: Select all

       -O file
       --output-document=file
           The documents will not be written to the appropriate files, but all will be concatenated together and
           written to file.  If - is used as file, documents will be printed to standard output, disabling link
           conversion.  (Use ./- to print to a file literally named -.)

           Use of -O is not intended to mean simply "use the name file instead of the one in the URL;" rather, it is
           analogous to shell redirection: wget -O file http://foo is intended to work like wget -O - http://foo >
           file; file will be truncated immediately, and all downloaded content will be written there.

           For this reason, -N (for timestamp-checking) is not supported in combination with -O: since file is always
           newly created, it will always have a very new timestamp. A warning will be issued if this combination is
           used.

           Similarly, using -r or -p with -O may not work as you expect: Wget won't just download the first file to
           file and then download the rest to their normal names: all downloaded content will be placed in file. This
           was disabled in version 1.11, but has been reinstated (with a warning) in 1.11.2, as there are some cases
           where this behavior can actually have some use.

           A combination with -nc is only accepted if the given output file does not exist.

           Note that a combination with -k is only permitted when downloading a single document, as in that case it
           will just convert all relative URIs to external ones; -k makes no sense for multiple URIs when they're all
           being downloaded to a single file; -k can be used only when the output is a regular file.
Note the paragraph about using -r with -O. I don't think this is what you intended to do.
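To make that man-page paragraph concrete, here is an illustrative sketch (the output file name is made up for the example): combining -r with a single -O target concatenates every fetched page into one file instead of building a directory tree.

Code: Select all

# Illustrative sketch: every recursively fetched page is appended to
# one-page.html (a made-up name); nothing is saved under 192.168.0.2/stats/.
wget -r -O one-page.html -e robots=off http://192.168.0.2/stats/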
1000
Level 6
Posts: 1034
Joined: Wed Jul 29, 2020 2:14 am

Re: wget with recursive, overwrite & erobots off

Post by 1000 »

Do you read the man pages in the terminal?

Code: Select all

man wget
https://linux.die.net/man/1/wget
Similarly, using -r or -p with -O may not work as you expect: .... all downloaded content will be placed in file.
Jator
Level 2
Posts: 80
Joined: Sat Mar 13, 2021 10:58 am

Re: wget with recursive, overwrite & erobots off

Post by Jator »

Thanks for all the responses. I can download the numerous files (index, subfolders, sub web pages) using:

Code: Select all

wget -r -erobots=off http://192.168.0.2/stats
I can also overwrite a single page if I download it on its own:

Code: Select all

wget -O index.html http://192.168.0.2/stats/e1/index.html
What I want to do is the combination of the above: pull all the files periodically, overwriting the entire set, say every 30 minutes, so I can pull information out into various scripts without having to repeatedly pull from the website. I sometimes connect via VPN, and it's much slower when I pull the information individually through curl or other lookup scripts.
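For the 30-minute cadence, a scheduled job could drive the download. A hypothetical crontab entry (added via crontab -e; the script path is an assumption for illustration, not something from this thread):

Code: Select all

# Hypothetical schedule: run the mirror script every 30 minutes.
# /home/jay/bin/pull-stats.sh is an assumed path, not from this thread.
*/30 * * * * /home/jay/bin/pull-stats.sh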

If I'm using it in a manner it wasn't designed for, I may have to stay with my other method.

Thanks,

Jay
Jator
Level 2
Posts: 80
Joined: Sat Mar 13, 2021 10:58 am

Re: wget with recursive, overwrite & erobots off

Post by Jator »

I guess I figured out a workaround. The only issue is that if my VPN isn't connected, the script below will error out after it has already removed the files. I was hoping the original approach would error out before overwriting anything and thus preserve those files, since wget wouldn't be able to connect to the website.

Code: Select all

#!/bin/bash

# Delete the previous download, then fetch a fresh copy of the site.
# Caveat: if wget cannot connect (e.g. the VPN is down), the old files
# are already gone by the time the download fails.
rm -r 192.168.0.2
wget -r -erobots=off http://192.168.0.2/stats
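One way to avoid that (an untested sketch, not from the thread): download into a temporary directory first, and only replace the existing copy when wget succeeds, so a failed connection leaves the previous files intact.

Code: Select all

#!/bin/bash
# Sketch: fetch into a temp dir; swap it in only if wget succeeds, so a
# failed connection (e.g. VPN down) leaves the previous files untouched.
tmpdir=$(mktemp -d)
if wget -r -e robots=off -P "$tmpdir" http://192.168.0.2/stats/; then
    rm -rf 192.168.0.2
    mv "$tmpdir/192.168.0.2" .
fi
rm -rf "$tmpdir"   # remove the temp dir and any partial download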
FreedomTruth
Level 4
Posts: 443
Joined: Fri Sep 23, 2016 10:19 am

Re: wget with recursive, overwrite & erobots off

Post by FreedomTruth »

The "problem" is that -O is not overwrite. It specifies the output file name.
Since your source is apparently a directory, try putting a / at the end of it. Then -r seems to not complain about it being a directory. In your particular case:

Code: Select all

wget -r -erobots=off http://192.168.0.2/stats/
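For the repeated 30-minute pulls, one option worth testing (a suggestion beyond what was posted in this thread) is wget's timestamping mode:

Code: Select all

# Suggestion, untested here: -N (timestamping) makes repeat runs fetch
# only files whose server copy is newer than the local one, which suits
# a periodic polling loop.
wget -r -N -e robots=off http://192.168.0.2/stats/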