Web Crawler Advice Sought

Technology, computers, sciences, mysteries and phenomena of all kinds, etc., etc. all here at The Loresraat!!

Moderator: Vraith

ussusimiel
The Gap Into Spam
Posts: 5346
Joined: Tue May 31, 2011 12:34 am
Location: Waterford (milking cows), and sometimes still Dublin, Ireland

Post by ussusimiel »

I'm looking for info about web crawlers and web spiders. What kind of information can they provide? Are there any good free ones? Is building the database behind them difficult?

I've done some study on the Web, but I'm still a bit at sea, so any advice or direction will be appreciated.

u.
Tho' all the maps of blood and flesh
Are posted on the door,
There's no one who has told us yet
What Boogie Street is for.
wayfriend
.
Posts: 20957
Joined: Wed Apr 21, 2004 12:34 am
Has thanked: 2 times
Been thanked: 6 times

Post by wayfriend »

I've never worked on a Web Crawler specifically.

In addition to what is on web pages, a crawler could develop a map showing which page links to which. From there, you can build models that show what kinds of things are on the pages that link to a given page. That's sort of what Google does: it doesn't care if a page says "Fruit of the Land", it cares what other pages say that page is about. It doesn't analyze an image to discover it's a piece of fruit, it looks at pages that reference it and notices that they say "here's a piece of fruit".
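To make the link-map idea concrete, here is a minimal sketch in Python using only the standard library. It is not how any real search engine does it; the `LinkExtractor` and `build_link_map` names and the example.com pages are made up for illustration. It records, for each link target, which page pointed at it and what the link text said.

```python
from collections import defaultdict
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects (href, anchor text) pairs from one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def build_link_map(pages):
    """pages: dict of url -> html. Returns target url -> [(source url, anchor text)]."""
    inbound = defaultdict(list)
    for source, html in pages.items():
        parser = LinkExtractor()
        parser.feed(html)
        parser.close()
        for href, text in parser.links:
            inbound[href].append((source, text))
    return dict(inbound)


# Hypothetical in-memory pages standing in for documents a crawler has fetched.
PAGES = {
    "http://example.com/a": '<p>See <a href="http://example.com/fruit">Fruit of the Land</a></p>',
    "http://example.com/b": '<a href="http://example.com/fruit">here\'s a piece of fruit</a>',
}
link_map = build_link_map(PAGES)
```

The map is keyed by the page being linked to, so what the rest of the web says about a page can be read off without looking at that page's own content.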

Unless you want to crawl a very small number of pages, I would say that the database would need to be astronomical in size and highly indexed. It's certainly a "big data" class of problem. So: very difficult.

Probably you already know all this. I could maybe help with more specific questions.
.
Fist and Faith
Magister Vitae
Posts: 25488
Joined: Sun Dec 01, 2002 8:14 pm
Has thanked: 9 times
Been thanked: 57 times

Post by Fist and Faith »

Interesting that I'd never heard of this stuff until a couple of weeks ago, when I read Robert Sawyer's WWW trilogy. My understanding (for the correctness of which I make no claims) is that they are what gather the information for search engines. They are sorta the things that build a database? And I have absolutely no idea how they're made.
All lies and jest
Still a man hears what he wants to hear
And disregards the rest
-Paul Simon

ussusimiel
The Gap Into Spam
Posts: 5346
Joined: Tue May 31, 2011 12:34 am
Location: Waterford (milking cows), and sometimes still Dublin, Ireland

Post by ussusimiel »

I wouldn't be looking to crawl a huge number of pages, nor use a huge amount of computing power. It would be more of a steady build-up of information over time. I downloaded a free one but wasn't able to get the database behind it hooked up. (I'm not including the links as I don't want anyone to go to any trouble. I asked just in case anyone had a nifty solution that might save me some of the inevitable hard slog :lol: )

u.
Avatar
Immanentizing The Eschaton
Posts: 62038
Joined: Mon Aug 02, 2004 9:17 am
Location: Johannesburg, South Africa
Has thanked: 25 times
Been thanked: 32 times

Post by Avatar »

They can collect anything publicly available for the most part. Anything secure is inaccessible unless you have rights.

Depending on the tool, it might be affected by a site's robots.txt file (which specifies the parts of the site that crawlers are asked not to visit). Or it might not, since honouring the file is voluntary.
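For a crawler that does want to honour robots.txt, Python's standard library has `urllib.robotparser`. The sketch below parses a made-up robots.txt from a string rather than fetching one; the `my-crawler` user agent, the `make_checker` helper, and the example.com URLs are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_checker(robots_text):
    """Build a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


checker = make_checker(ROBOTS_TXT)

# can_fetch(useragent, url) reports whether the rules permit that agent to fetch the URL.
allowed_public = checker.can_fetch("my-crawler", "http://example.com/catalogue.html")
allowed_private = checker.can_fetch("my-crawler", "http://example.com/private/list.html")
```

Nothing here enforces anything: the check only tells the crawler what the site has asked for, and the crawler decides whether to respect it.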

Afraid I don't know anything about using them to build a db though, except for the fact that technically, anything they crawl and copy becomes a db.

--A
wayfriend
.
Posts: 20957
Joined: Wed Apr 21, 2004 12:34 am
Has thanked: 2 times
Been thanked: 6 times

Post by wayfriend »

u, if you tried an off-the-shelf crawler, it should come with instructions on how to build the database it requires. If you can't make it work, then there's a problem with their instructions. I am not sure if it requires something very specific (for example, a postgres database), or something described more abstractly (a JDBC resource), or if it expects you to write the adapter code on top of the database and link it into the crawler. Without more information, that's about all I can say. If you are encountering specific database errors, those would be a clue, too.
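As a rough picture of what "adapter code on top of the database" could look like, here is a minimal sketch in Python. It uses the standard-library `sqlite3` module as a stand-in for whatever server a given crawler actually wants, and the `PageStore` class, its methods, and the table layout are hypothetical, not taken from any real crawler.

```python
import sqlite3


class PageStore:
    """Hypothetical storage adapter a crawler could call; not any real crawler's API."""

    def __init__(self, path=":memory:"):
        # ":memory:" keeps the sketch self-contained; pass a file path to persist.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            " url TEXT PRIMARY KEY,"
            " body TEXT NOT NULL)"
        )

    def save(self, url, body):
        # INSERT OR REPLACE keeps one row per URL when a page is crawled again.
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)",
                (url, body),
            )

    def load(self, url):
        row = self.conn.execute(
            "SELECT body FROM pages WHERE url = ?", (url,)
        ).fetchone()
        return row[0] if row else None

    def count(self):
        return self.conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]


store = PageStore()
store.save("http://example.com/a", "<p>first version</p>")
store.save("http://example.com/a", "<p>second version</p>")  # same URL, newer copy
```

The crawler only ever talks to `save` and `load`, so swapping SQLite for MySQL or postgres would mean rewriting this one class rather than the crawler itself.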
.
ussusimiel
The Gap Into Spam
Posts: 5346
Joined: Tue May 31, 2011 12:34 am
Location: Waterford (milking cows), and sometimes still Dublin, Ireland

Post by ussusimiel »

Just a bit of an update on my crawling experience: I got a bit further with the free crawler that I'd downloaded. I needed to install MySQL so that I'd have the right server for the application. I managed to get some connectivity between the two, but then as I was digging around the web I realised that it wasn't a web crawler/spider that I was looking for but rather a web scraper (new name for me!).
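For anyone else new to the distinction: a crawler follows links from page to page, while a scraper pulls particular fields out of pages whose layout it already knows. Below is a minimal scraper sketch in Python using only the standard library. The `TableScraper` class and the sample table are invented for illustration; it simply turns an HTML table into rows of cell text.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Pulls the cell text out of every <tr> in a document."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the <tr> currently open
        self._cell = None   # text fragments of the <td>/<th> currently open

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (such as whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    scraper = TableScraper()
    scraper.feed(html)
    scraper.close()
    return scraper.rows


# Made-up sample page; a real scraper would fetch the HTML first.
SAMPLE = """
<table>
  <tr><th>Part</th><th>Condition</th></tr>
  <tr><td>Bucket teeth</td><td>Used</td></tr>
  <tr><td>Track roller</td><td>Refurbished</td></tr>
</table>
"""
rows = scrape_table(SAMPLE)
```

Unlike the crawler, this needs no database at all to be useful: the rows can go straight into a spreadsheet or CSV file.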

That's given me a new direction to go in and I'm currently looking at some free(ish) online ones. Thanks for the input. I'll get back to you if I get stuck or make progress :lol:

u.
Avatar
Immanentizing The Eschaton
Posts: 62038
Joined: Mon Aug 02, 2004 9:17 am
Location: Johannesburg, South Africa
Has thanked: 25 times
Been thanked: 32 times

Post by Avatar »

:lol: And just what are you scraping...

...Never mind...I don't want to know. ;)

--A
ussusimiel
The Gap Into Spam
Posts: 5346
Joined: Tue May 31, 2011 12:34 am
Location: Waterford (milking cows), and sometimes still Dublin, Ireland

Post by ussusimiel »

Avatar wrote: :lol: And just what are you scraping...

...Never mind...I don't want to know. ;)

I don't think that you could even begin to imagine how banal the information I'm looking for is. I've started working for my brother who owns a construction machinery sales company. Part of my job is to maintain and build his mailing list. Basically I've become a part-time spam artist :oops:

If you get an odd email with a catalogue of used machinery parts, just unsubscribe and forgive me. If you get another one, you have permission to flame me :rocket: (I've finally found out why unsubscribing sometimes doesn't seem to work: they just put you back on the list :-x )

u.
Avatar
Immanentizing The Eschaton
Posts: 62038
Joined: Mon Aug 02, 2004 9:17 am
Location: Johannesburg, South Africa
Has thanked: 25 times
Been thanked: 32 times

Post by Avatar »

Actually, that is by far the most common use (and original purpose) of "scrapers": to build a mailing list. Watch out, the opt-in / opt-out requirements are getting stricter all the time. ;)

--A