Posts Tagged ‘spinn3r’

This is pretty nice. Google released Snappy (formerly known internally as Zippy) as open source:

Snappy is a compression/decompression library. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. For instance, compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.

Snappy is widely used inside Google, in everything from BigTable and MapReduce to our internal RPC systems. (Snappy has previously been referred to as “Zippy” in some presentations and the likes.)

This means that, along with open-vcdiff, it is possible to use the full Google compression toolchain.
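
If you want to kick the tires from Java, the xerial snappy-java bindings wrap the native library with a tiny API. A minimal round-trip sketch (assuming org.xerial.snappy is on your classpath):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import org.xerial.snappy.Snappy;   // xerial snappy-java bindings

    public class SnappyRoundTrip {
        public static void main(String[] args) throws IOException {
            byte[] original = "some highly repetitive crawler payload, repeated repeated repeated"
                    .getBytes(StandardCharsets.UTF_8);

            // Block compression of an in-memory byte[] and the reverse trip.
            byte[] compressed = Snappy.compress(original);
            byte[] restored   = Snappy.uncompress(compressed);

            System.out.println("original=" + original.length
                    + " bytes, compressed=" + compressed.length + " bytes");
            System.out.println("round trip ok: "
                    + new String(restored, StandardCharsets.UTF_8)
                          .equals(new String(original, StandardCharsets.UTF_8)));
        }
    }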

There’s a middle path here. You can go with someone like Softlayer or Rackspace and have your cake and eat it too.

Softlayer is a bit closer to being the cloud. We love them. Great company. Major partner for us… we’re going to be doubling down on servers this year and they’re going to get another big order from us.

This was the best decision I've made regarding Spinn3r, I think. We gave it a lot of thought and were going to colo, but at the last minute I felt that colo was just a bad call and we went with Softlayer instead.

Win!


Facebook CTO Bret Taylor says buying servers was a mistake. A very big mistake. At the time, he was chief executive at FriendFeed, which eventually was sold to Facebook for the tidy sum of a reported $50 million. But these were the early days. He and his team needed to decide between buying servers or using Amazon Web Services. They bought the servers.

[From Facebook CTO Bret Taylor’s Biggest Mistake? Buying Servers – ReadWriteCloud]


We’re hiring an API Software Engineer to join the team over at Spinn3r.

We’re probably going to be hiring 2-3 engineers in the next month or so, but we don’t want to grow too fast. We want to focus on one position at a time so we can bring in the best potential hires.

This is a fun time to work in a startup though!

Job Description

Responsibilities:
* Interact with customers, both in the early sales cycle and in a support role, to answer technical questions about our technology (crawling, ranking, etc).
* Work with our API to understand throughput issues and protocol challenges, and optimize it for new issues as they arise.
* Develop new versions of our API as it evolves (more throughput, additional features, etc).
* Monitor our crawler stats to understand its operation and detect anomalies, monitor statistics, implement new features, etc.
* Work on Java implementations of various new Spinn3r features as well as fix bugs in our current product. You will also be working on infrastructure in this position and be responsible for various backend Java components of our architecture.
* General passion and interest in technology (distributed systems, open content, Web 2.0, etc).

I should stress that while you’ll be interacting with customers and providing support, our customers are exceedingly bright and amazingly knowledgeable about our space. They’re a major asset, and staying in sync with them is very important for the company.

[From API Software Engineer at Spinn3r in San Francisco | LinkedIn]

At Spinn3r we frequently deal with the chaos revolving around robots.txt, so I thought I would throw a few thoughts out there about the complexity of the issues involved.

REP is not a EULA

This is part of the confusion around robots.txt. Just because you can fetch a URL doesn’t mean the website will be happy with your use of that content.

It also isn’t the case that just because you fetched a URL, you AGREED to a EULA limiting your rights.

In fact, I would argue that it does not limit your rights.

There is no forced click-through, and simply posting your EULA somewhere on your website, where a robot wouldn’t be able to read it, doesn’t mean that the company that originated the request has agreed to it.

Limited amount of content

Just because you’re allowed to fetch pages via robots.txt doesn’t mean that the website owner will be happy with you spidering their ENTIRE website and using EVERY single page.

Various social networks have routinely become upset, and have threatened lawsuits, when bulk numbers of pages were downloaded from their sites, even with a permissive robots.txt in place.

This seems like a reasonable restriction.

An extension proposing a ‘limit’ on the number of URLs used for indexing purposes might make sense.
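
As a strawman, such an extension might look something like the following. To be clear, ‘Limit’ is not part of the REP today; the directive name and semantics here are purely hypothetical:

    # Hypothetical syntax -- 'Limit' does not exist in the REP today.
    User-agent: *
    Disallow: /private/
    Limit: 1000   # at most 1000 URLs from this site may be retained for indexing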

However, it’s a bit more complicated than that. What if you just use the URLs to compute rank for the top pages on that site and then discard the old ones?

The website would have no way to verify that you in fact discarded them.

Second, how long does the limit apply for? What if the limit is 1000 pages, and you index them, build your inverted index for full text search, then discard them entirely, and fetch another 1000? At any point in time you only have 1000 documents stored, but you clearly have secondary indexes built from those documents, including link graphs, inverted indexes, etc.

Throttling

Throttling is another complicated problem, one which robots.txt tries to solve but doesn’t go far enough on.

The specification includes a ‘Crawl-delay’ option for delays between requests, but this doesn’t actually solve the problem.

Serializing requests isn’t necessarily required to throttle access to a website.

Fetching a maximum of 10 requests per second, even if they’re overlapping, should be fine for most websites.

Google implements a latency based throttle which measures HTTP response time and backs off when it starts to rise.
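
Here’s a rough sketch of that idea in Java. The class name, thresholds, and back-off factors are my own invention, not Google’s actual algorithm; it just grows a per-host delay when response times climb above a baseline and shrinks it again as the server recovers:

    import java.time.Duration;

    // Hypothetical latency-based throttle: raise the per-host delay when HTTP
    // response times rise above a baseline, relax it when the server recovers.
    public class LatencyThrottle {

        private static final Duration MIN_DELAY = Duration.ofMillis(100); // roughly 10 requests/sec
        private static final Duration MAX_DELAY = Duration.ofSeconds(30);

        private final long baselineMillis; // "healthy" response time for this host
        private Duration delay = MIN_DELAY;

        public LatencyThrottle(long baselineMillis) {
            this.baselineMillis = baselineMillis;
        }

        // Call after each fetch with the observed HTTP response time.
        public synchronized void record(long responseMillis) {
            if (responseMillis > 2 * baselineMillis) {
                delay = min(delay.multipliedBy(2), MAX_DELAY); // back off
            } else {
                delay = max(delay.dividedBy(2), MIN_DELAY);    // recover
            }
        }

        // How long the crawler should wait before its next request to this host.
        public synchronized Duration currentDelay() {
            return delay;
        }

        private static Duration min(Duration a, Duration b) { return a.compareTo(b) <= 0 ? a : b; }
        private static Duration max(Duration a, Duration b) { return a.compareTo(b) >= 0 ? a : b; }
    }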

Destruction of URLs and expensive URLs

One problem with REP is that it’s unclear WHY a URL was disallowed. What happens if someone implements a GET URL that actually deletes resources or otherwise mangles databases?

These things have happened before and they will happen again.
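
For example, a link like the following (the URL is made up, but the pattern shows up on plenty of admin pages) will be followed by any crawler that’s allowed to fetch it, and the side effect is permanent:

    GET /admin/posts/delete?id=1234 HTTP/1.1
    Host: example.com

Destructive operations really belong behind POST/DELETE and authentication, not behind robots.txt.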

Also, what about URLs that are just expensive to load and use a lot of database resources?

Implementing a better throttle vocabulary would help fix this problem, of course, but having ALL the robots fetch such a URL, even though each is throttled, could still DoS your website because the robots don’t actually coordinate their throttling.

Google, Bing, and Ask aren’t the only crawlers

This is another problem we frequently see… because website owners feel uncomfortable sharing their content with “just anybody”, they will often mark their content as available only to the major search engines.

The problem is that Google, Bing, and Ask aren’t the only search engines in town.

This has a chilling effect on sites like Blekko (and a number of Spinn3r customers) because they have to go to every website that blocks them and beg for access.

This makes it harder for search engine startups because acquiring access site by site is very expensive.

One potential solution is to have profile-based user agents supported in robots.txt, with strict definitions in the REP.

If you’re a public search engine (Google, Microsoft, or any small or stealth startup), then you can be allowed (or disallowed) access to the content without the rules benefiting just the big guys.
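
Something along these lines, for instance. Again, ‘User-profile’ and the profile names are hypothetical; nothing like them exists in the REP today. The idea is to grant access by crawler class rather than by naming each individual bot:

    # Hypothetical profile-based rules -- 'User-profile' is not part of the REP.
    User-profile: public-search-engine
    Allow: /

    User-profile: marketing-analytics
    Disallow: /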

What’s a robot? What about RSS?

There is also the issue of what exactly a robot is.

We’ve seen sites that have RSS feeds and a public API available, but then have a Disallow for all user agents.

How does that make sense? So if you’re a robot you can’t index their RSS feed?

Is NetNewsWire a robot? What about Firefox with their RSS bookmarks feature?

What about Google Reader? Is that a robot?

It would seem that an RSS feed or a documented API is an “open for business” sign inviting anyone to use the website (under the terms of use of your API, of course), but if your robots.txt blocks them, your intentions are unclear.
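
We see configurations like this all the time (the hostname is invented, but the shape is real):

    # http://example.com/robots.txt -- blocks every crawler...
    User-agent: *
    Disallow: /

    # ...while the same site advertises http://example.com/feed.xml in its HTML
    # and documents a public API at http://example.com/api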

Privacy is complicated

One problem that various social networks have run into is that they allow users to be ‘private’ by requiring the user to have an account on the website before showing their profile.

However, they then turn around and publish a snippet of their profile via unauthenticated HTTP.

If you’re clever you can build a search engine around this data and publish it in aggregate.

This can often frighten the user because they never intended their profile data to be used in this manner.

With facial recognition software it’s then possible to tie profiles together or use other textual signals to merge data from various social networks.

It’s fair to say that this could really alarm some users who are not up to date with the power of the Internet for de-anonymizing users.

This is a fair concern, but some of the pressure is on the social networks to clarify privacy for their users and not to enable these awkward situations in the first place.

Politics

… and at the end of the day, REP can be used against you by a website JUST because they don’t like you or don’t understand what you’re up to.

This is the major problem as I see it. It will require websites to clearly explain why and how their content can be used.

It doesn’t help the situation that some websites are only allowing the big boys by default and blocking everyone else.

This is a dangerous situation because startups now need to beg for permission across thousands of websites (which is very expensive).

The way forward

I think the way forward here is to not attempt to solve all the problems with REP right now. Focus on a few things that are clearly broken and try to fix them.

Perfection always seems to be the enemy of progress, and maybe REP will never be perfect, but for now it’s all we have.

Spinn3r is hiring a cool Java engineer here in our San Francisco office.

It’s a great position with a super smart bunch of guys. We’re centrally located right in SOMA (2nd and Howard), and we have an AWESOME office (it’s 102 years old!).

Responsibilities:

* Maintain our current crawler.
* Monitor and implement statistics behind the current crawler to detect anomalies.
* Implement new features for customers
* Work on backend architecture to improve performance and stability.
* Implement custom protocol extension for enhanced metadata and site specific social media support.
* Work on new products and features using large datasets.

Requirements and Experience:

* Deep understanding of Java (Threads, IO, tuning, etc)
* Internet standards (HTTP, HTML, RSS, DNS, etc)
* SQL
* Basic understanding of distributed systems (load balancers, job control, batch processing, TCP, etc).
* Version control (preferably hg or git)
* Comfortable in a UNIX environment (ssh, bash, file manipulation, etc)
* Optional:
Debian, Python, Linux kernel, MySQL, Crawler Design.