Munging and cleanup of a CSV really needs to be done in a script. If not, your whole process is vulnerable to human error, and how would you audit your work? Also, once you've figured it out, you can throw data into data.frames much faster since it's automated (there's a quick sketch of the idea after the list below).
1) R is great for cleaning up data. (Packages: reshape2, dplyr, gdata, stringr, and lubridate; package tm if working with text.)
2) If you're struggling to learn them, I would suggest OpenRefine. It will help get the job done, is scriptable, and will help you think in a programmatic way about cleaning up datasets.
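To make the scripting point concrete, here's a minimal Python/pandas sketch of the kind of automated cleanup I mean (the file name and columns are made up; the same idea works in R with the packages above):

```python
import pandas as pd

# Load the raw export; keep everything as strings so nothing is
# silently coerced before we've inspected it.
df = pd.read_csv("raw_data.csv", dtype=str)  # hypothetical file

# Normalize column names and strip stray whitespace from values.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
df = df.apply(lambda col: col.str.strip())

# Collapse obvious variants of the same category into one label
# ("status" and its values are placeholders for your own data).
df["status"] = df["status"].str.lower().replace({"freshmen": "freshman"})

df.to_csv("clean_data.csv", index=False)
```

Because it's a script, you can rerun it on every new export and diff the output - that's your audit trail.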
Don't use Excel. I actually find it fun to clean up data; it's a nice little challenge.
ggplot2 isn't everything. I use it, BUT you should learn the other tools too. Learn base plotting first. Find a good book and read it and you'll be happy you did. I suggest R for Everyone.
OpenRefine (previously Google's) is extremely good at clustering text, among other things. It has the only publicly available implementation of Metaphone 3 (basically, the creator is a little particular about its intellectual property).
Cons: it's a GUI. You can run it as a server, but you can't just import it into outside code as a library.
Here's a list of software that I've used and come across pretty regularly:
I'm not sure how useful it would be to really delve into learning the specifics of how to use these, but knowing what these are and what alternative options are out there for things like content management and digital preservation can't hurt. Most of the list I made is open-source software because the archives I've worked in preferred to go that route. The plus side of that is most of that list is free so you can play around with it...the downside being it can be very frustrating to install and actually get working.
The one exception might be OpenRefine. I have found that program to be super helpful with lots of different projects and I tell anyone who's even half interested that it's worth learning. It's a data cleanup tool and it made my life so much easier once I got the hang of it. I've used it for database projects but also just cleaning up messy box lists.
You might be interested in Open Refine for cleaning up datasets.
Among other features, it can cluster similar values (McDonalds Inc, MCDONALDS, McDonalds, INC.) and then replace these instances with a single uniform value.
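If you're curious how that clustering works under the hood, here's a rough Python sketch of OpenRefine's "fingerprint" keying method (simplified): values whose fingerprints collide get grouped as one cluster.

```python
import re
from collections import defaultdict

def fingerprint(value: str) -> str:
    """Rough version of OpenRefine's 'fingerprint' key:
    lowercase, strip punctuation, then sort and de-duplicate tokens."""
    tokens = re.sub(r"[^\w\s]", " ", value.lower()).split()
    return " ".join(sorted(set(tokens)))

names = ["McDonalds Inc", "MCDONALDS INC", "McDonalds, INC."]

clusters = defaultdict(list)
for name in names:
    clusters[fingerprint(name)].append(name)

# All three variants share the fingerprint "inc mcdonalds".
# Variants that drop a token entirely (e.g. plain "MCDONALDS")
# need the fuzzier methods OpenRefine also offers, like Metaphone.
for key, members in clusters.items():
    print(key, "->", members)
```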
Man, I'd really like to help you but I can't right now. A few tips: 1) http://openrefine.org is for cleaning data, for example stripping digits or spaces, or formatting the phone-number column. 2) It also does some phonetic comparisons.
I have not done it with R, but your problem with messy data reminds me of Google Refine, now called OpenRefine. It's made to deal with messy data.
EDIT: similar problem in https://www.r-bloggers.com/merging-data-sets-based-on-partially-matched-data-elements/
Depending on what you would like to do with this data, this may be helpful: http://openrefine.org/
(I've used it for condensing datasets where people misspell things like "freshman" 10 different ways - it lets you easily change all the different versions of text into one category)
Google Drive is a good free alternative to Excel and Access. You won't be able to do as many things with your data, but you'll do most of what you need. You can also try OpenRefine for free. http://openrefine.org
There's a really powerful app called Open Refine that might be relevant to you. I would suggest going over to their website and watching the short introduction videos.
After watching the videos, if it seems like this might hold some value for you, then I'd try installing it (it is free).
If you hit difficulties, just ask (either here, or in Open Refine's own communities). Small, specific requests are easiest on the people offering support. The difference between "help completing this task" vs. "help planning this project" is significant.
You'll have to figure out how to identify which topic each item belongs to - make a list of keywords, for example - and then write a script to divide everything up that way. Right now it sounds like the topics are fairly distinct, so it shouldn't be too hard to do yourself.
You might also want to look into OpenRefine if you don't want to write a script, but personally I find it faster to write a quick Python script to split the info in a CSV into different categories if it's easy to tell what's what.
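Something like this is what I mean by a quick script (the file name, column, and keyword lists are all hypothetical; adapt them to your data):

```python
import csv

# Hypothetical keyword lists; adjust to whatever distinguishes your topics.
TOPICS = {
    "finance": ["invoice", "budget", "payment"],
    "hr": ["hiring", "payroll", "benefits"],
}

def topic_for(text: str) -> str:
    """Return the first topic whose keywords appear in the text."""
    text = text.lower()
    for topic, keywords in TOPICS.items():
        if any(kw in text for kw in keywords):
            return topic
    return "other"

with open("records.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Group rows by topic, then write one output file per topic.
by_topic = {}
for row in rows:
    by_topic.setdefault(topic_for(row["description"]), []).append(row)

for topic, topic_rows in by_topic.items():
    with open(f"{topic}.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=topic_rows[0].keys())
        writer.writeheader()
        writer.writerows(topic_rows)
```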
OpenRefine is almost exactly what you want. It's not specific to R, but there is an R package for its API, rrefine.
This comment is right: if the "algorithm" for the merge is trivial, you can just INSERT INTO another_table SELECT ... GROUP BY id, with the appropriate aggregate function.
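If you'd rather drive that from Python than run raw SQL, the same statement works through sqlite3 (the database, tables, and columns below are placeholders, not from the original question):

```python
import sqlite3

conn = sqlite3.connect("example.db")  # hypothetical database
cur = conn.cursor()

# Merge duplicate rows into a summary table: one row per id, with an
# aggregate function choosing a single value per column.
cur.execute("""
    INSERT INTO another_table (id, name, total)
    SELECT id, MAX(name), SUM(amount)
    FROM raw_records
    GROUP BY id
""")
conn.commit()
conn.close()
```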
However, if you want to do complex cleanups without programming you will need a higher level tool such as http://openrefine.org/
First time posting on /r/python!
I'm knocking the rust off my programming skills by attempting to write a pure-Python interpreter for OpenRefine "scripts". OpenRefine is a great tool for exploring and cleaning datasets prior to analysing them. It also records an undo history of all actions that you can export as a sort of script in JSON format.
My concept is this: you use OR to explore the data interactively and design a cleaning process, but then export the process to JSON and integrate it into your analysis in Python. That way it can be repeated ad nauseam without having to fire up a full Java stack.
There are already ways of doing this by having a long-running OR process and throwing data at it through HTTP. I'm mostly doing this as an experiment, to try a different approach, see if it can be done, and explore what interesting challenges come up. I have no idea if I'll ever get round to a useful 1.0 release! Comments welcome, code here: jezcope/pyrefine
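For a flavour of the approach, here's a stripped-down sketch of the dispatch idea (this is not pyrefine's actual code; only two operation types are handled, and the parameter names are from OpenRefine's JSON export as I remember them - check a real export):

```python
import json
import pandas as pd

# Each handler applies one OpenRefine operation to a DataFrame.
def column_rename(df, op):
    return df.rename(columns={op["oldColumnName"]: op["newColumnName"]})

def column_removal(df, op):
    return df.drop(columns=[op["columnName"]])

HANDLERS = {
    "core/column-rename": column_rename,
    "core/column-removal": column_removal,
}

def apply_history(df, history_path):
    with open(history_path) as f:
        operations = json.load(f)  # exported history is a JSON array of ops
    for op in operations:
        handler = HANDLERS.get(op["op"])
        if handler is None:
            raise NotImplementedError(op["op"])
        df = handler(df, op)
    return df

df = apply_history(pd.read_csv("data.csv"), "history.json")
```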
You can try using the http://openrefine.org/ software: import the Excel file, cluster and edit the cells together to the same name, and export back to Excel.
I can walk you through it if you need - I have had to do this on LOTS of sets
Depending on the volume of your data, I would suggest OpenRefine (http://openrefine.org/), or if you have tons of data, Pentaho may be a more robust solution. If you have doubts, send me a message about either; I have worked extensively with both.
You could try using OpenRefine to explore and clean the data and see where it is inconsistent. From that you can get an answer for this dataset. You can even use the OpenRefine API programmatically.
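As a sketch of what "programmatically" can look like, here's a minimal Python call against a locally running OpenRefine instance (the endpoint path is from the versions I've used; check the current API docs before relying on it):

```python
import requests

BASE = "http://127.0.0.1:3333"  # OpenRefine's default local server address

# List the projects on the running OpenRefine instance.
resp = requests.get(f"{BASE}/command/core/get-all-project-metadata")
resp.raise_for_status()
for project_id, meta in resp.json()["projects"].items():
    print(project_id, meta["name"])
```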
It would have to be something that cannot be easily automated using tools like OpenRefine. I am sure a lot of people would prefer to get a script which cleans individual records so that you do not have to send all of your data.
I think your best shot is to use OpenRefine (formerly known as Google Refine) with its Cluster feature. Go to http://openrefine.org/ and watch the first demo video. The Cluster feature is shown at around 3 minutes.
You can import your Excel file into OpenRefine, find the duplicates, fix them, and export it again to Excel, or just use it to find the duplicate names and fix them in the original file manually.
Oh, I forgot Open Refine (free, open source: http://openrefine.org/). It's not a true web-scraping tool (it can't crawl a website), but when you feed it a list of URLs in a column, it can fetch the (static) source code, then parse HTML, XML, or JSON elements with functions very similar to Python's Beautiful Soup.
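If you'd rather do the same thing directly in Python, a rough equivalent with requests and Beautiful Soup looks like this (the file and column names are made up):

```python
import csv
import requests
from bs4 import BeautifulSoup

# Read a CSV with a column of URLs (hypothetical file/column names).
with open("urls.csv", newline="") as f:
    urls = [row["url"] for row in csv.DictReader(f)]

for url in urls:
    # Fetch the static source, then parse whatever elements you need.
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string if soup.title else ""
    print(url, title)
```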
It might be worth having a look at using openrefine; it is a good tool for this kind of data cleaning, and maybe a bit easier to get other people in your office to use vs. running Python scripts.
I'm almost certain it's the format of the two state fields. For any matches to occur, the structure needs to be identical between the standalone table and the shapefile. This means checking for variations in spelling, abbreviations, and casing before attempting the join; this is called normalizing the table. If your join table is small, manually inspect the data. If it is large, explore data-refining tools like OpenRefine.
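As a rough illustration of normalizing a join key in Python with pandas (the abbreviation map and file/column names here are hypothetical; extend them with whatever actually appears in your data):

```python
import pandas as pd

# Hypothetical abbreviation map; add every variant you find in the data.
STATE_ABBREV = {"ariz": "arizona", "az": "arizona", "calif": "california"}

def normalize(value: str) -> str:
    # Lowercase, trim whitespace, and drop trailing periods before lookup.
    key = value.strip().lower().rstrip(".")
    return STATE_ABBREV.get(key, key)

table = pd.read_csv("join_table.csv")
table["state_key"] = table["state"].map(normalize)
# Build the same normalized key on the shapefile's attribute table,
# then join the two on state_key.
```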
Even simpler than scrapy is selenium webdriver: http://selenium-python.readthedocs.org/en/latest/getting-started.html
Then you could use XPath or BeautifulSoup to get the page data.
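A minimal getting-started sketch combining the two (the URL is a placeholder, and you'll need the matching browser driver installed):

```python
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()  # or webdriver.Chrome()
try:
    # Load the page in a real browser so JavaScript-rendered
    # content is present, then hand the HTML to BeautifulSoup.
    driver.get("https://example.com")  # placeholder URL
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for link in soup.find_all("a"):
        print(link.get("href"))
finally:
    driver.quit()
```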
Not Python, but Refine (http://openrefine.org/) can download/parse MediaWiki.
Edit: assuming you're using MediaWiki, there IS an API: https://www.mediawiki.org/wiki/API:Querypage
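For reference, a minimal MediaWiki API call from Python looks like this (Wikipedia's endpoint is shown as an example, and the search term is a placeholder; any MediaWiki wiki exposes the same api.php):

```python
import requests

# Ask the MediaWiki API for pages matching a search term.
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "list": "search",
        "srsearch": "data cleaning",  # placeholder search term
        "format": "json",
    },
)
resp.raise_for_status()
for hit in resp.json()["query"]["search"]:
    print(hit["title"])
```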
The visualizations map out deaths of foreign nationals along the Southern Arizona border as reported by the Pima County Office of the Medical Examiner. Red indicates a high density of deaths, fading into yellow and green indicating a low density of deaths. Hyperthermia (overheating) is the number one cause of death of foreign nationals in the Southern Arizona desert. Filtering out other deaths allows us to see the literal "hot spots" of the desert. This information can be used by organizations like Humane Borders, which provide emergency water stations. Unfortunately, the largest hotspot of deaths occurs on the Tohono O’odham Nation Indian Reservation. The tribe does not allow Humane Borders to leave water stations on its land.
For more information, please see HumaneBorders.com
OpenRefine, formerly a Google project, was used to clean up the GPS coordinate data into a uniform XML file format. This data was then batch-processed using Google Earth Pro, recently released for free, to convert the XML file into a KML file, which gives each GPS coordinate a point geometry. This KML point-geometry data was then uploaded to Google Fusion Tables, another experimental Google project, which is used to filter data and create embeddable heatmaps.
edit : formatting