Saturday, September 12, 2015

Data Proliferation – Cause, Effect and Control



Data proliferation refers to the massive amount of structured and unstructured data that is growing day by day and challenges businesses and government bodies to store and manage it. The issue covers both physical storage (paper files) and logical storage (primary and secondary storage). The challenges include additional storage space, network speed, hardware cost, data access speed etc. The average computer user has much more storage than he needs; however, for businesses, governments and other entities that collect massive amounts of data daily, the problem of data proliferation may manifest. Data proliferation has been documented as a problem for the U.S. military since August 1971, in particular regarding the excessive documentation submitted during the acquisition of major weapon systems. Efforts to mitigate data proliferation and the problems associated with it are ongoing. Let us see what industry statistics say about the massive amount of data produced on a routine basis.

  • Approx. 72% of all internet users are active on social media sites, with the majority of these users on Facebook, where they spend 12-16 minutes each day.
  • Over 2.7 billion likes are clicked each day; liking is the easiest thing to do on Facebook.
  • 300 million images are uploaded every day, which constitute about 75% of posts and add more than 500 terabytes of data to Facebook.
  • Amazon sells more than 200 million products every year throughout the world.
  • A single product on Amazon carries many artefacts beyond its basic information, such as product images, associated reviews, alternative products, and up-sell and cross-sell information. This includes both structured and unstructured data and requires massive storage space.
  • A single query on Google travels more than 1,500 miles to fetch the information from a data centre and return it to the user, taking 0.2 seconds on average.
  • Approx. half a million tickets are booked every day on IRCTC (the Indian Railways ticket booking site). The site serves more than 0.1 million (100,000) concurrent connections from users booking train tickets.
  • Of approx. 900 million mobile subscribers in India, 790 million were active connections as of March 2014. A huge amount of data about customers, calls, SMS etc. is stored on servers.
  • The global smartphone audience has surpassed the 1.5 billion mark. Smartphones are resource hungry and, in some cases, need additional storage to download and run applications such as Facebook, WhatsApp and Twitter.

Apart from the above stats, government organisations hold massive amounts of sensitive data that cannot be shared on the internet; its volume is assumed to run into petabytes. It is evident from these facts that a lot of data is sitting on servers and adding to proliferation. The driving factors include, but are not limited to, good internet speeds, an affordable cost per GB, and ever more resource-hungry devices and applications. If data continues to be generated at this rate, it will have many implications. Some of the major impacts include:

  • More storage requires more cooling, which in turn requires additional water, energy and space, all of which negatively impacts the environment.
  • Although search engines return results faster, people spend more time finding relevant material. This is because of the huge amount of data that a web crawler has to look through to get the desired outcome, which eventually increases search cycle time.
  • Over three decades, the cost of storage has dropped drastically, from thousands of dollars to mere cents per gigabyte; however, the cost of infrastructure, software, associated maintenance, floor space and the IT staff needed to keep systems scalable and performant is increasing manyfold.
  • A larger data footprint means more security threats. All the data stored on company networks requires additional tools, applications and expert staff to enforce security measures. Even a small security lapse may seriously damage the company’s credibility and goodwill and may impact the business.

Analysis of this obsession with data shows that the popularity of rich media on the internet, free cloud storage, e-commerce sites and social networking are some of the driving factors behind the massive storage of structured and unstructured data on the internet. Unstructured data includes Excel sheets, presentations, PDF and Word documents, pictures, videos and audio files. On the other side, heavy use of website registration and new entrants in the e-commerce space lead to structured data. Corporates and government authorities need to define strict policies for data retention. A good information management system is the backbone of any organisation: it helps in drafting policies, making effective business decisions and taking timely action, to name a few. So let us see how we can minimize the adverse impacts of data proliferation.


As an individual 

  • Store only relevant data on secondary storage media. People often collect a lot of raw data while preparing a final document, and this raw data, which might never be needed again, gets copied along with everything else during a system backup. Remove duplicate data. When short of time, most people simply copy data to secondary storage into new folders, which is not ideal. Instead, store the data in the same folders by overwriting, or use a good version control tool if you really need to keep control of the data.
  • Use backup utilities, which are an effective way of reducing waste data in the form of redundant files. Wherever possible, use incremental backups instead of simply creating copies of the same file prefixed or suffixed with the date. This ensures that only the changed contents are replicated in the destination folder.
  • Limit uploads of pictures and videos to the internet, especially to social media sites. The market is flooded with high-resolution digital cameras and smartphones targeted at people who are obsessed with taking snaps. These cameras and phones produce sizeable photos, sometimes as big as 3-4 MB each, and not all of these pictures are worth keeping. By keeping unwanted pictures, we move towards data proliferation.
  • Use designated tools to back up your smartphone so that you can keep track of your contacts, emails and media. These tools are very efficient and help take incremental backups of all the data on your phone.
  • Avoid backing up all the forwarded pictures and movies received on Facebook, WhatsApp, Viber, Telegram etc. These forwards are neither useful nor likely to serve any purpose in the years to come.
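
The first two suggestions above (removing duplicate data and backing up incrementally) can be sketched in a few lines of Python. This is only an illustration, not a replacement for a proper backup utility: the function names find_duplicates and incremental_backup are made up for this example, files are compared by their SHA-256 digest, and the destination folder is assumed to lie outside the source folder.

```python
import hashlib
import shutil
from collections import defaultdict
from pathlib import Path
from typing import List


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> List[List[Path]]:
    """Group the files under root that have identical content."""
    by_digest = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    # Only digests shared by two or more files are duplicates.
    return [paths for paths in by_digest.values() if len(paths) > 1]


def incremental_backup(source: Path, destination: Path) -> List[Path]:
    """Copy only new or changed files; return the relative paths copied.

    Assumes destination is not located inside source.
    """
    copied = []
    for path in sorted(source.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(source)
        target = destination / relative
        if target.exists() and file_digest(target) == file_digest(path):
            continue  # content unchanged since the last backup
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)  # copy content and timestamps
        copied.append(relative)
    return copied
```

Running incremental_backup twice in a row on an unchanged folder copies nothing the second time, which is the behaviour that avoids dated copies of the same file. Hashing every file is slow on large folders; real backup tools usually compare sizes and timestamps first.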

As an organisation

  • Use file systems with effective metadata structures/tags so that fast and relevant search results can be obtained. To a large extent, this is already available in most file systems; users just need to use these metadata tags extensively.
  • Educate employees about the negative impacts of data proliferation so that a self-motivated environment is created to control the data.
  • Implement effective information lifecycle management so that information is kept for a defined period and then discarded.
  • Define and implement data retention policies for effective governance across departments so that data footprints can be reduced. This will help organisations keep track of their storage needs.
  • Use network drives when more than one user works on the same data or when data needs to be shared between users. This helps retain a single source of truth and reduces data proliferation.
  • Define and implement a data archiving policy. Data that is used very infrequently must be compressed before archiving.
  • Integrate applications to minimize rekeying, manual errors and query cycle time wherever possible.
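
As a rough illustration of the archiving and retention points above, the sketch below compresses files that have not been modified for a given number of days and deletes archived files once a retention period has passed. Everything here is an assumption made for the example: the function names, the archive_after_days and retention_days parameters, the use of gzip, and the use of the file's last-modified time as the measure of how infrequently it is used. A real policy would be set by the organisation's governance rules.

```python
import gzip
import shutil
import time
from pathlib import Path
from typing import List, Optional

SECONDS_PER_DAY = 86_400


def archive_stale_files(active_dir: Path, archive_dir: Path,
                        archive_after_days: float,
                        now: Optional[float] = None) -> List[Path]:
    """Compress files not modified for archive_after_days into archive_dir."""
    now = time.time() if now is None else now
    cutoff = now - archive_after_days * SECONDS_PER_DAY
    archive_dir.mkdir(parents=True, exist_ok=True)
    archived = []
    for path in sorted(active_dir.iterdir()):
        if path.is_file() and path.stat().st_mtime < cutoff:
            target = archive_dir / (path.name + ".gz")
            with path.open("rb") as src, gzip.open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            # Keep the original modification time so retention is measured
            # from when the data was last changed, not when it was archived.
            shutil.copystat(path, target)
            path.unlink()
            archived.append(target)
    return archived


def purge_expired(archive_dir: Path, retention_days: float,
                  now: Optional[float] = None) -> List[Path]:
    """Delete archived files older than the retention period."""
    now = time.time() if now is None else now
    cutoff = now - retention_days * SECONDS_PER_DAY
    removed = []
    for path in sorted(archive_dir.iterdir()):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

For example, archive_stale_files(Path("shared"), Path("archive"), archive_after_days=180) followed by purge_expired(Path("archive"), retention_days=2555) would model a hypothetical "archive after six months, delete after seven years" policy; the folder names and periods are placeholders.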


Conclusion
Let’s consider the above points and contribute to a better environment, better data in hand, quick and relevant search results and better utilization of secondary storage space. By implementing the above suggestions, we can discard low- or no-value information at an early stage and keep only the relevant, high-value information in long-term secondary storage, which is really quick and cheap to access. Delphix and Actifio are among the tools that provide DaaS services to help control data proliferation.

References:
http://www.searchenginejournal.com/growth-social-media-2-0-infographic/77055/  
http://export-x.com/2013/12/many-products-amazon-sell/ 
http://expandedramblings.com/index.php/by-the-numbers-17-amazing-facebook-stats/  
http://gizmodo.com/5937143/what-facebook-deals-with-everyday-27-billion-likes-300-million-photos-uploaded-and-500-terabytes-of-data   
http://www.mkomo.com/cost-per-gigabyte 
http://www.treehugger.com/clean-technology/crazy-e-waste-statistics-explored-in-infographic.html   
http://www.emarketer.com/Article/Smartphone-Users-Worldwide-Will-Total-175-Billion-2014/1010536 
http://www.jeffbullas.com/2012/05/23/35-mind-numbing-youtube-facts-figures-and-statistics-infographic/   
http://storageioblog.com/data-footprint-reduction-part-1-life-beyond-dedupe-and-changing-data-lifecycles/