Python Script for archive.org

Post Reply
thistimeforsure (imported)
Articles: 0
Posts: 2
Joined: Wed Feb 28, 2024 4:00 pm

Posting Rank

Python Script for archive.org

Post by thistimeforsure (imported) »

Hey,

Thought I would share. I wrote this script a while ago to save all the stories from a archive.org snapshot.

It don't spend a lot of time making it clean/less hacky but if you want to save all of archive.org snapshot to a file it will do it for you.

It takes a while and you can add a url to the last_url variable if you have to run it several times.

Just change the p variable to the output file path.

import requests

from bs4 import BeautifulSoup

dl = '-----------------------------------------------'

p = r"C:\...PUT YOUR PATH HERE\out.txt"

f = open(p, 'w', encoding="utf-8")

#can be whatever snapshot url you want

r = requests.get(r"https://web.archive.org/web/20110107120 ... al=stories")

b = r.text

s = BeautifulSoup(b)

a = []

st = []

l1 = s.find_all('td',{'class':['cellodd','celleven']})

last_url = None

capture = False

for l in l1:

try:

c = l.contents[0]

a.append(c)

except:

pass

for l in a:

c = l.attrs['href']

r2 = requests.get(c)

b2 = r2.text

s2 = BeautifulSoup(b2)

ls = s2.find_all('a', href=True)

for e in ls:

if e.attrs['href'] == last_url or last_url == None:

capture = True

t = None

if capture == True:

try:

rt = requests.get(e.attrs['href'])

rtb = rt.text

rtbs = BeautifulSoup(rtb)

t = rtbs.text

st.append(t)

except:

pass

if t != None:

f.write(t)

f.close

f = open(p, 'a', encoding="utf-8")

alls = dl.join(st)

f.write(alls)

f.close

print('d')
fhunter
Site Admin
Articles: 0
Posts: 1634
Joined: Wed Nov 27, 2024 9:57 am
Location: Serbia
Has thanked: 57 times
Been thanked: 18 times

Posting Rank

Re: Python Script for archive.org

Post by fhunter »

one small comment here - to reduce load on archive.org and to speed things up - you should use session feature from requests module

session = requests.Session()

....

r = session.get(...)

kind of like that - so all calls are done not with requests.get(), but with session.get()

this way: requests module uses persistent connections, if enabled (saves DNS lookups, saves connection time), and keeps cookies from previous connections if set (not sure if it is beneficial here).
thistimeforsure (imported)
Articles: 0
Posts: 2
Joined: Wed Feb 28, 2024 4:00 pm

Posting Rank

Re: Python Script for archive.org

Post by thistimeforsure (imported) »

Man, you learn something new everyday. I had tried using a sessions object for another script I did because it needed OAuth, but I honestly, don't really know the underlying logic behind a lot of this stuff. I'll keep studying.
Post Reply

Return to “Favorite Stories & New Ideas”