You should not put 4PB of image files into a SQL database, and probably not on a traditional filesystem either. You should be using an object-storage system for that workload: something software-defined, to make scalability and replication easier on top of whatever hardware you choose to back it.
Depending on whether you want to purchase a product or roll your own... some options to check out: EMC ScaleIO, Ceph, OpenStack Swift.
ZFS is not a clustered/distributed file system.
Vlinux may be interested in Ceph (recommended) or Lustre.
Please note that distributed file systems need multiple nodes, and with all the hardware that requires, you may not come out ahead.
>It looks like most virtualization solutions are either single node or requires a SAN. We are currently looking at ganeti but there does not look to be any automatic HA.
If you don't need to meet strict RTO/RPO targets, you don't need HA or the other machinery used for uptime improvements.
If you're looking at Ganeti, that means you're OK with Xen, so give Remus a try.
http://wiki.xen.org/wiki/Remus
Ceph is also getting to the point where it could be considered "good enough".
http://ceph.com/ceph-storage/block-storage/
If you prefer VMware then Google for Virtual SAN and various free clones of it.
The biggest question for you guys is how you plan on storing those pictures. If you're just planning on storing them as unstructured data, object storage would be your best bet.
Something like Ceph would probably work well for you guys if you need to roll out your own hardware.
With that said, if you're a startup, why are you not using Amazon S3 / Azure Blob Storage / etc.? That would be much better for your cash flow / burn rate, since you won't need to purchase hardware upfront; you only pay AWS or Azure for the storage you use.
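If you go that route, getting data in is about as simple as it gets. A minimal sketch with the AWS CLI (the bucket name is made up, and this assumes `aws configure` has already been run with your credentials):

    # bulk-upload a local image directory to S3
    aws s3 cp ./images/ s3://my-image-bucket/images/ --recursive

Azure's CLI has an equivalent blob-upload command, and both vendors' SDKs let you do the same from application code.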
Bluestore is set to be a stable feature in the Kraken release. Their release schedule seems to be "every six months plus however long it takes to get it right", so there's a little drift, but I'd anticipate an RC or stable release for Kraken by Jan/Feb at the latest. In fact, a developer preview for Kraken was released at the end of October.
It sounds like you're pretty familiar, but if you haven't seen it yet, Sage's tech talk on Bluestore from June covers the state of the Bluestore project, what its intended features are, and when it'll get mainlined.
Oh, and if you swing into the datahoarder IRC channel, you'll find a couple of us already testing that developer preview. Bluestore seems to work really well, though memory usage is a touch on the high side at the moment.
Not the same thing of course, but Western Digital and Seagate are both moving the (Ceph) daemon as close as possible to the drive by attaching an outboard Linux computer to the drive itself. In Ceph each drive has a daemon, but now it can also have a dedicated system. http://ceph.com/community/500-osd-ceph-cluster/
>The WDLabs Converged Microserver He8
>The Converged Microserver He8 is a microcomputer built on the existing production Ultrastar® He8 platform. The host used in the Ceph cluster is a Dual-Core Cortex-A9 ARM Processor running at 1.3 GHz with 1 GB of Memory, soldered directly onto the drive’s PCB (pictured). Options include 2 GB of memory and ECC protection. It contains the ARM NEON coprocessor to help with erasure code computations and XOR and crypto engines
Well, I don't think there's really a "max" in the sense that 12 drives are okay and 14 drives suddenly aren't, but you have to wonder if classic RAID is "the way to go" once the number of drives really starts increasing. Filesystems like ZFS and Btrfs have RAID functionality built in and do lots of other stuff for you, like snapshotting and checksumming, but even those are still a single point of failure. Ceph, for example, is a distributed storage platform: it has no RAID functionality, but uses single drives and simply replicates your data 1-4 times depending on your paranoia level. http://ceph.com
ASRock is not unique in its reliability. All the motherboard manufacturers have their lemons, at rates that vary over time. ASRock's boards have been increasing in quality, closing the gap. We're talking about a difference of single percentage points in overall failure rate between the companies. I've had bad MSI boards, I've had bad Gigabyte boards, I've had bad Asus boards.
These organizations are dynamic, constantly balancing the necessary economies of being in the consumer space against reliability. You can play the odds or you can design for failure. Back up regardless of what it's running on.
Be objective and rational. And build a Ceph cluster.
I can give you my Ceph buildout Excel sheet, with parts and all; you just have to change the number of chassis. I have the systems priced down to $10k for 144TB raw.
Some good notes:
>move /var to SSD on the monitors
>kernel.pid_max = 4194303 on the OSDs
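If it helps, a sketch of making that pid_max setting persistent across reboots (the drop-in filename is arbitrary):

    # apply now
    sysctl -w kernel.pid_max=4194303
    # persist across reboots
    echo 'kernel.pid_max = 4194303' | tee /etc/sysctl.d/90-ceph.conf
    sysctl --system

The reason for the bump: each OSD spawns a large number of threads, especially during recovery, so dense OSD boxes can exhaust the default pid/thread limit.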
Here is a nice prep script for Ceph systems on Debian:

    #!/bin/bash
    # point DNS at Google and disable the firewall for the install
    echo nameserver 8.8.8.8 | tee /etc/resolvconf/resolv.conf.d/base
    ufw disable
    # add the Ceph release key and the Hammer repo
    wget -q -O- 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc' | apt-key add -
    echo deb http://ceph.com/debian-hammer/ $(lsb_release -sc) main | tee /etc/apt/sources.list.d/ceph.list
    # create the ceph deploy user with passwordless sudo
    useradd -d /home/ceph -s /bin/bash -m ceph
    echo "ceph ALL = (root) NOPASSWD:ALL" | tee /etc/sudoers.d/ceph
    echo "ceph:CHANGeME123!!!" | chpasswd
    apt-get update
    sysctl -w kernel.pid_max=4194303
Well.... RAID5 is erasure coding. RAID6 adds a second parity for increased redundancy.
There are other erasure coding schemes. There are other architectures to use erasure coding schemes in besides RAID.
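To make that concrete with the simplest case: RAID5's parity is just XOR, P = D1 ⊕ D2 ⊕ ... ⊕ Dn, so any single lost block is recoverable as the XOR of everything that survived (e.g. D1 = P ⊕ D2 ⊕ ... ⊕ Dn). Reed-Solomon-style schemes generalize this with m parity blocks so you can lose any m of the n+m blocks, which is the form Ceph's erasure-coded pools use.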
There is some interesting work in that area: https://www.tahoe-lafs.org/trac/tahoe-lafs
ceph-deploy can be pointed at local repos or you can configure through a proxy. I would probably copy the repo to the deploy server if you aren't going to have access to any other boxes.
You should have an odd number of monitors (3). You can use the deploy box, imo; it won't be doing much most of the time.
There are also very good ansible playbooks to consider instead of ceph-deploy. https://github.com/ceph/ceph-ansible
I would house RADOS gateways on separate hosts. Same for MDS. mon/mgr are really the only daemons recommended for colocation.
You should have one OSD per disk, not per server. So if you have 30 4TB disks in a host, you would have 30 OSDs on that host.
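For illustration, creating them is one command per disk. A sketch assuming a Luminous-or-newer cluster and blank data disks (the device names here are made up; older releases used ceph-disk instead of ceph-volume):

    # one OSD per data disk
    for dev in /dev/sd{b,c,d}; do
        sudo ceph-volume lvm create --data "$dev"
    done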
Two networks, a public one and a private (cluster) one, is the recommended setup; it's not necessary though.
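If you do want the split, it's just two lines in ceph.conf (the subnets here are placeholders): the public network carries client and monitor traffic, the cluster network carries OSD replication and heartbeat traffic.

    [global]
    public network = 192.168.1.0/24
    cluster network = 192.168.2.0/24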
Ceph with the default settings is basically RAID1 across a failure domain you specify (host, rack, etc.).
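As a hedged example of turning those knobs (the pool name is made up; the crush rule syntax is Luminous and newer):

    # 3 copies, allow I/O with 2 still available
    ceph osd pool set mypool size 3
    ceph osd pool set mypool min_size 2
    # replicate across racks instead of the default hosts
    ceph osd crush rule create-replicated by-rack default rack
    ceph osd pool set mypool crush_rule by-rack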
A lot of good information is at http://ceph.com/docs/master/start
For me, none other than curiosity and the intrinsic urge to always improve. As for DO and Linode, they claim that KVM is considerably faster than Xen. See http://www.cio.com/article/2937714/cloud-computing/why-linode-moved-to-kvm.html
Choice quote:
>What clear advantages are there with KVM that justify this move?
>Feliciano: In a word, “performance.” Our internal UnixBench scores increased three-fold (~800 SBIS on Xen to ~2400 SBIS on KVM).
One thing I failed to mention in replying to your question about production usage is that both GlusterFS and Ceph are backed by pretty big companies with a vested interest in making both systems succeed and both are being used in production in large scale deployments in various scenarios. See http://ceph.com and https://www.gluster.org.
> You can upgrade daemons in your Ceph cluster while the cluster is online and in service
You can find some good info here: http://docs.ceph.com/docs/kraken/install/upgrading-ceph/
For tips and questions, you can try the ceph-users mailing list or #ceph on IRC.
I don't know if you have considered it or not, but if you are mainly interested in object storage, perhaps take a look at Ceph as well.
Ceph is an object store with built-in replication, self-healing, fault detection, etc. It has an object storage gateway that is Swift- and S3-compatible, so interfacing with it is just as easy as with those.
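Getting credentials out of the gateway is one command. A quick sketch (the uid and display name are arbitrary):

    radosgw-admin user create --uid=demo --display-name="Demo User"
    # the JSON output includes an access_key/secret_key pair you can
    # drop straight into s3cmd, boto, or any other S3 client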
Can you provide a little more information on what your use case is for your object storage?
The reason I ask is that Swift is designed to mirror S3's functionality, which means it does "lazy replication": objects are eventually replicated to all nodes. Ceph, on the other hand, has an immediate replication requirement: when you write an object to a Ceph object store, that object is immediately replicated to the minimum number of replicas right from the start (with the replica count eventually matching your "ideal" replication level). So Swift's replication is designed for a more geographical distribution, whereas Ceph is designed for a more localized distribution (although it does provide the ability to do async replication to another Ceph cluster for backup/disaster-recovery purposes).
I don't know how mature MDS is at this point, so I can't say anything about 1).
If you want to do the VM approach, use libvirt's built-in support for RBD rather than using the kernel RBD client and mapping inside the guest. That's what I would have done. (So 2, essentially.)
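For reference, this is roughly what the disk element in the domain XML looks like (the pool/image name, monitor host, and secret UUID are all placeholders; the ceph secret has to be defined in libvirt beforehand):

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw'/>
      <auth username='libvirt'>
        <secret type='ceph' uuid='REPLACE-WITH-SECRET-UUID'/>
      </auth>
      <source protocol='rbd' name='libvirt-pool/vm01-disk'>
        <host name='mon1.example.com' port='6789'/>
      </source>
      <target dev='vda' bus='virtio'/>
    </disk>

This way QEMU talks to the cluster through librbd directly and nothing has to be mapped on the host.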
http://ceph.com
http://zeromq.org
It's likely that Google's Bigtable was written in C++, considering how many of the related open-sourced libraries are.
Your operating system and its network stack are written in C, as is the case for the routers and switches that power the internet. (This one raises some interesting philosophical questions, I think. If almost every part of DNS is written in C, is DNS? If almost all the things that actually send, receive, switch and route the packets on the internet are written in C, is the internet? Probably not, but it's a thought.)
It's hard to write safe, fast, networked software in C; I can't disagree with you there.
I do think it's worth mentioning that no one advertises using C, though. That makes it harder to see when it's being used.
I have a test system set up running Ceph. I am working on the tiering at the moment (so highly used data is on SSD, and less active data gets pushed to slower HDDs). The storage gets exported as an iSCSI target using Ceph RBD (it's also possible to have one or more front-end Linux servers exporting NFS, CIFS or whatever else you need).
My only concern is that the speed of my WAN links to my secondary data centre will cause I/O blocking issues as per the wiki: https://wiki.ceph.com/FAQs/Can_Ceph_Support_Multiple_Data_Centers%3F
Which would lead me to have to duplicate my Ceph infrastructure at the remote site and use snapshots to move the data across the WAN. I'm not amazingly thrilled by this solution and would prefer to be able to specify a topology for replication so Ceph was aware of my WAN.
XenServer also has live migrations. Honestly, I love all the major hypervisors in their own regard, but
If you're using QEMU/KVM virtualization, you should look into the Ceph distributed object store. It's a self-healing storage system with block-level and file-level capabilities. The RBD (block-level) protocol is like iSCSI, but built with redundancy and high availability from the start.
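As a quick sketch of what that looks like in practice (assumes the default 'rbd' pool exists; the size argument is in MB):

    rbd create vm-disk --size 10240     # 10 GiB image
    sudo rbd map vm-disk                # kernel client exposes it as /dev/rbd0

From there it behaves like any other block device: partition it, mkfs it, or hand it to a VM.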
It's very real; I'm running it in production right now on 3 servers. All my VMs are stored on it. Start with this link to install it. You need at least 3 servers to run it without doing anything hacky: http://ceph.com/docs/master/install/get-packages/
I just setup a 2 node cluster using ceph as the storage backend. I'm mainly using it for block storage for kvm virtual machines, but it works with regular filesystems too.
Seems like the site got restructured, and other things are also no longer where they used to be [0].
Luckily, the PG calculator runs completely in the browser, and accessing the last snapshot in the web archive [1] works, though it is definitely not nice that the old link doesn't work anymore.
[0] https://lists.ceph.io/hyperkitty/list/[email protected]/message/J2OH7LX2TKLK5IWQBRLZG6ISXCINL2Y2/
[1] https://web.archive.org/web/20210301111112/http://ceph.com/pgcalc/
You can have all four drives as independent OSDs. Shouldn't be an issue. The system will see the different capacities, and place roughly twice as much data on the large drives, as compared to the smaller. Performance will be limited by the spinners, of course.
For PGs, there's a formula based on the total number of OSDs, and your use case. It used to be relatively simple, but now there's a calculator that somehow makes it seem more confusing. Go figure.
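For what it's worth, the old rule of thumb behind the calculator was roughly (number of OSDs × 100) / replica count, rounded to a nearby power of two. So the four-OSD setup above with 3× replication works out to (4 × 100) / 3 ≈ 133, which you'd round to 128 PGs. Treat that as a starting point rather than gospel.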
Actually, Ceph now has its own block store, "BlueStore", instead of FileStore, for better speed. But you need at least 3 machines today. I wish they could make a home Ceph for a single machine. http://ceph.com/ The obvious reason to love Ceph is that you can combine different sizes of disks and SSD/spinning drives. No RAID, but files spread out across many disks.
>I'm aware of how CRUSH works, and also aware how it's not suggested to be used as a georeplication mechanism unless you want to deal with poorly performing synchronous writes, and poor recovery performance. That or you have dedicated, low latency links and are willing to deal with that. http://docs.ceph.com/docs/master/radosgw/multisite/
That's radosgw you idiot... It's for object storage like s3...
http://ceph.com/geen-categorie/manage-a-multi-datacenter-crush-map-with-the-command-line/
Wow just wow.
Serious answer: truncate is a command for creating sparse files, i.e. files that exist on your local drive and take up 0 bytes on disk, but claim to be whatever size you want. These are commonly used for VM disk image files, so you can "thin-provision" and over-allocate your storage.
If you have backup or AV software that doesn't understand sparse files, it will read the entire file. That 8EB file I mentioned (which I have created before for demonstration purposes) would be read for ~2500 years before whatever started scanning it finished.
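If you want to see it for yourself, a harmless demo (the filename is arbitrary; this needs a filesystem with a big enough maximum file size, e.g. XFS, since ext4 caps out far lower):

    truncate -s 7E sparse.img            # ~7 exabytes of claimed size
    du -h --apparent-size sparse.img     # shows the huge apparent size
    du -h sparse.img                     # shows ~0 bytes actually allocated

Naive tools that read by apparent size will happily try to stream the whole thing.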
You can't really chop off a section of the cloud with truncate. You can build your own cloud with a system like Ceph, and mount it on your machine using its tools and kernel code. It then costs as much as you're willing to spend on hardware and electricity.
1/ Ceph has official images on the Hub.
2/ Resource-wise you may fall short (depending on the hardware you are using and the use case). Maybe some IOPS-related problems as well.
3/ You mean having the Ceph cluster deployed, then running another container that uses it? Yes (see the Ceph blog).
4/ If I were you, I would use another Docker node with Contiv set up to use your Ceph cluster.
Anyhow, that's an interesting project :)
>FLOSS is self-supported
http://ceph.com/help/professional/ - non-free
irc://irc.oftc.net/#ceph - FREE!
next!
I'm going to say this one: support for EMC... it's there, but all I ever heard was "you need to upgrade firmware", and then comes the cost.
EVERY
SINGLE
DAMN
FEATURE
COSTS SOOOOOOO MUCH MONEY! Thanks but no thanks. I'd rather pay 2 devops guys $150,000 a year to keep up Ceph than pay EMC $1.2 mil for licensing on my 1.1PB cluster and then pay for hardware on top. Incidentally, the Ceph hardware costs $80k... 1.1PB for $80k.
For the amount of money it would take to true up my ScaleIO licensing, I could purchase enough equipment to double the speed and quintuple the space. I would laugh, but I actually find EMC to be tragic. When SDX breaks out like virtualization did, you will get left behind.
I work at a major research institution; we use a program called Lustre to store 1+ PB of data. It is a distributed file system that uses ~30 file servers to create a single namespace.
Unfortunately Intel bought it recently, and is no longer providing maintenance releases to the community.
Other large single-namespace systems I'm aware of are Gluster and Ceph.
Preface: I will admit this is not exactly in my "wheelhouse" or whatever... but I still think it might be worth your time to look into the scale-out NAS software solutions available. (I don't know if they are iSCSI-capable though.)
GlusterFS
Ceph
I also see Nutanix compared with Gluster/Ceph, but I am not familiar with it and I don't know if there is a community-based release.
Maybe it's the OpenStack engineer in me talking, but http://ceph.com/ is something you should seriously take a look at for object storage.
It's crazy easy in basic setups and has a great way to expand if you need more capacity.
> Keep in mind there are 3 major design goals to any distributed file system:
>
> Speed
>
> Consistency
>
> Distributedness
I would rather say Consistency, Availability and Partition tolerance as depicted in the CAP theorem. The way you pick your design goals has a significant impact on your system.
For instance, if you pick data reliability, it is usually achieved by distributing replicated data chunks among the storage nodes. Some recent file systems are based on erasure coding (e.g. RozoFS), or provide it as an option (e.g. Ceph), to significantly reduce the cost of this redundancy.
A better solution than even ZFS would be a distributed software RAID system. For example: Gluster, CephFS (still under development, but fairly close), and pNFS (in NFS 4.1; in development for years, finally coming in Linux 4.0).
In that setup you have like 3 servers (well, 3 monitors; you could probably have just 2 storage nodes and an RPi) and you split up the files (or drives) into chunks and mirror them to any 2 (configurable) of the systems. That way you can turn off one of the systems and still access your data. Lose a drive, and so on.
Also reads are way faster as you can read from multiple systems, and even multiple hard-drives in those systems simultaneously.
Rebuilds can be much faster than RAID because you can pull from multiple systems/drives and write to multiple systems/drives simultaneously. With a large number of drives, it could be the time it takes to copy 1 chunk (since it can do them all in parallel). Even small setups should see a massive reduction in time.
If we're talking about getting something going in a very short amount of time and just getting it to work, then stuffing the entire XML file contents into a DB field would get you the central access with some of the lock safety you will require. It definitely reeks of technical debt.
After reading over your post again, however, I need some clarification. If I understand this correctly, the origin server generates the XML file and you need to serve the XML file directly to customers? Or do you need to consume the XML file in the web instances? Is there only a single origin server? If there is only one data source writing the XML file then it would be okay to host them in S3. This is definitely the cheapest method. Any more than one and you could risk some data loss depending on your architecture. If you are serving the XML files directly to customers, you can even generate an expiring URL for them to directly download from S3 instead of putting load on your web servers.
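On that last point, a quick sketch with the AWS CLI (the bucket and key names are placeholders):

    # generate a URL that lets the customer download the file
    # directly from S3 for the next hour
    aws s3 presign s3://my-feeds-bucket/feed.xml --expires-in 3600

The SDKs expose the same thing programmatically, so your app can mint a fresh URL per customer request.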
If you have a bunch of old PCs to stuff them into, you could create a GlusterFS or Ceph cluster.
Alternatively, if you want it all in one machine, a RAID/HBA card and a JBOD enclosure would be the optimal option.
tl;dr: that's a lot of drives, do something crazy.
http://ceph.com/docs/master/start/quick-ceph-deploy/
That's the quick way to get a cluster off the ground.
If you need to ask questions, PM me. I'm more than willing to help, at a time-zone-appropriate delay.
Also have a squiz at the rest of the documentation so you can wrap your head around the what/where/why and how of Ceph. Once you get an idea of what it's trying to achieve, it's easier to build.
Hi! I just found this subreddit and am excited to see what kind of answers you get.
The only thing I can think of that fits all of your requirements is Tahoe-LAFS, other than rolling your own object storage cloud with something like Ceph or OpenStack.
You can also go to Google and search for "list of digital asset management software". There certainly exist solutions; it's just a matter of finding something that fits your bill.
I hope you get some good answers.
Edit: More links - List of backup software on Wikipedia.