Web Crawler Advice Sought
Moderator: Vraith
- ussusimiel
- The Gap Into Spam
- Posts: 5346
- Joined: Tue May 31, 2011 12:34 am
- Location: Waterford (milking cows), and sometimes still Dublin, Ireland
I'm looking for info about web crawlers and web spiders. What kind of information can they provide? Are there any good free ones? Is building the database behind them difficult?
I've done some study on the Web, but I'm still a bit at sea, so any advice or direction will be appreciated.
u.
Tho' all the maps of blood and flesh
Are posted on the door,
There's no one who has told us yet
What Boogie Street is for.
- wayfriend
- .
- Posts: 20957
- Joined: Wed Apr 21, 2004 12:34 am
- Has thanked: 2 times
- Been thanked: 6 times
I've never worked on a Web Crawler specifically.
In addition to what is on web pages, a crawler could develop a map showing which page links to which. From there, you can build models that show what kinds of things are on the pages that link to a given page. That's sort of what Google does: it doesn't care if a page says "Fruit of the Land", it cares whether other pages say that the page is about "Fruit of the Land". It doesn't analyze an image to discover it's a piece of fruit; it looks at pages that reference it and notices that they say "here's a piece of fruit".
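A minimal sketch of that kind of link map, using only the Python standard library. It is an illustration under stated assumptions, not how any real search engine works: the `pages` dictionary, the example.com URLs, and the names `LinkExtractor` and `build_link_map` are all made up for the example, and the HTML is supplied as strings rather than fetched.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects (href, anchor text) pairs from the <a> tags of one page."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def build_link_map(pages):
    """pages maps URL -> HTML. Returns target URL -> [(source URL, anchor text)]."""
    inbound = defaultdict(list)
    for source_url, html in pages.items():
        parser = LinkExtractor()
        parser.feed(html)
        parser.close()
        for href, text in parser.links:
            # Resolve relative links against the page they appear on.
            inbound[urljoin(source_url, href)].append((source_url, text))
    return dict(inbound)


# Hypothetical in-memory pages standing in for fetched HTML.
pages = {
    "http://example.com/a": '<p><a href="/fruit.html">here is a piece of fruit</a></p>',
    "http://example.com/b": '<a href="http://example.com/fruit.html">Fruit of the Land</a>',
}
link_map = build_link_map(pages)
```

Looking up `link_map["http://example.com/fruit.html"]` gives the pages that point at the fruit page along with the words they used to describe it, which is the "what other pages say" signal.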
Unless you want to crawl a very small number of pages, I would say that the database would need to be astronomical in size and highly indexed. It's certainly a "big data" class of problem. So: very difficult.
Probably you already know all this. I could maybe help with more specific questions.
.
- Fist and Faith
- Magister Vitae
- Posts: 25487
- Joined: Sun Dec 01, 2002 8:14 pm
- Has thanked: 9 times
- Been thanked: 57 times
Interesting that I'd never heard of this stuff until a couple of weeks ago, when I read Robert Sawyer's WWW trilogy. My understanding (for the correctness of which I make no claims) is that they are what gather the information for search engines. They are sorta the things that build a database? And I have absolutely no idea how they're made.
All lies and jest
Still a man hears what he wants to hear
And disregards the rest -Paul Simon


- ussusimiel
- The Gap Into Spam
- Posts: 5346
- Joined: Tue May 31, 2011 12:34 am
- Location: Waterford (milking cows), and sometimes still Dublin, Ireland
I wouldn't be looking to crawl a huge number of pages, nor use a huge amount of computing power. It would be more of a steady build-up of information over time. I downloaded a free one but wasn't able to get the database behind it hooked up. (I'm not including the links as I don't want anyone to go to any trouble. I asked just in case anyone had a nifty solution that might save me some of the inevitable hard slog.)
u.

Tho' all the maps of blood and flesh
Are posted on the door,
There's no one who has told us yet
What Boogie Street is for.
- Avatar
- Immanentizing The Eschaton
- Posts: 62038
- Joined: Mon Aug 02, 2004 9:17 am
- Location: Johannesburg, South Africa
- Has thanked: 25 times
- Been thanked: 32 times
They can collect anything publicly available for the most part. Anything secure is inaccessible unless you have rights.
Depending on the tool, it might be affected by a site's robots.txt file (which specifies parts of the site that should not be indexed). Or it might not.
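For a tool that does honour robots.txt, the check can look roughly like the sketch below, which uses Python's standard `urllib.robotparser`. The robots.txt contents, the `example-crawler` user agent, and the `make_checker` helper are assumptions made up for illustration; a real crawler would download the file from the site instead of using a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; normally fetched from <site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_checker(robots_text, user_agent="example-crawler"):
    """Return a function that says whether user_agent may fetch a given URL."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed


allowed = make_checker(ROBOTS_TXT)
print(allowed("http://example.com/catalogue.html"))
print(allowed("http://example.com/private/notes.html"))
```

Nothing forces a crawler to run a check like this, which is why some tools are affected by robots.txt and others are not.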
Afraid I don't know anything about using them to build a db though, except for the fact that technically, anything they crawl and copy becomes a db.
--A
- wayfriend
- .
- Posts: 20957
- Joined: Wed Apr 21, 2004 12:34 am
- Has thanked: 2 times
- Been thanked: 6 times
u, if you tried an off-the-shelf crawler, it must come with instructions on how to build the database it requires. If you can't make it work, then there's a problem with their instructions. I am not sure if it requires something very specific (for example, a Postgres database), or something described more abstractly (a JDBC resource), or if it expects you to write the adapter code on top of the database and link it into the code. Without more information, that's about all I can say. If you are encountering specific database errors, those would be a clue, too.
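To make the third option concrete, here is a rough sketch of what "adapter code on top of the database" can mean. It is not taken from any particular crawler: the `PageStore` interface, the `SqlitePageStore` class, the `record_page` function, and the table layout are all hypothetical, and SQLite is used only because it ships with Python and needs no server.

```python
import sqlite3
from typing import Optional, Protocol


class PageStore(Protocol):
    """What the (hypothetical) crawler needs from storage, and nothing more."""

    def save(self, url: str, body: str) -> None: ...
    def load(self, url: str) -> Optional[str]: ...


class SqlitePageStore:
    """Adapter that provides the PageStore methods on top of SQLite."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT NOT NULL)"
        )

    def save(self, url: str, body: str) -> None:
        # INSERT OR REPLACE keeps one row per URL, so a re-crawl overwrites it.
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)", (url, body)
            )

    def load(self, url: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT body FROM pages WHERE url = ?", (url,)
        ).fetchone()
        return row[0] if row else None


def record_page(store: PageStore, url: str, body: str) -> None:
    """Crawler-side code: it talks only to the PageStore interface."""
    store.save(url, body)


store = SqlitePageStore()
record_page(store, "http://example.com/", "<html>first</html>")
record_page(store, "http://example.com/", "<html>second</html>")
```

The crawler side never mentions SQLite, so swapping in MySQL or Postgres would mean writing another class with the same two methods.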
.
- ussusimiel
- The Gap Into Spam
- Posts: 5346
- Joined: Tue May 31, 2011 12:34 am
- Location: Waterford (milking cows), and sometimes still Dublin, Ireland
Just a bit of an update on my crawling experience: I got a bit further with the free crawler that I'd downloaded. I needed to install MySQL so that I'd have the right server for the application. I managed to get some connectivity between the two, but then as I was digging around the web I realised that it wasn't a web crawler/spider that I was looking for but rather a web scraper (new name for me!).
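The difference in a nutshell: a crawler follows links from page to page, while a scraper pulls specific fields out of a page whose layout is already known. A minimal sketch of the scraping half, using only Python's standard `html.parser`; the `TableScraper` class, the `scrape_table` helper, and the HTML fragment are made up for illustration, and a real page would have to be fetched first.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collects the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None outside <tr>
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the table in html as a list of rows, each a list of cell strings."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical page fragment standing in for a downloaded page.
HTML = """
<table>
  <tr><th>Item</th><th>Year</th></tr>
  <tr><td>Excavator</td><td>2009</td></tr>
</table>
"""
rows = scrape_table(HTML)
```

Each scraped page yields a handful of rows like these, so the storage needed is far smaller than what a general crawler would require.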
That's given me a new direction to go in, and I'm currently looking at some free(ish) online ones. Thanks for the input. I'll get back to you if I get stuck or make progress.
u.
Tho' all the maps of blood and flesh
Are posted on the door,
There's no one who has told us yet
What Boogie Street is for.
- ussusimiel
- The Gap Into Spam
- Posts: 5346
- Joined: Tue May 31, 2011 12:34 am
- Location: Waterford (milking cows), and sometimes still Dublin, Ireland
Avatar wrote:
And just what are you scraping...
...Never mind...I don't want to know.

I don't think that you could even begin to imagine how banal the information I'm looking for is. I've started working for my brother, who owns a construction machinery sales company. Part of my job is to maintain and build his mailing list. Basically I've become a part-time spam artist.

If you get an odd email with a catalogue of used machinery parts, just unsubscribe and forgive me. If you get another one (I've finally found out why sometimes unsubscribing doesn't seem to work (they just put you back on the list))


u.
Tho' all the maps of blood and flesh
Are posted on the door,
There's no one who has told us yet
What Boogie Street is for.